* cpu hotplug oops on 2.6.15-rc5
@ 2005-12-19 5:16 Sonny Rao
2005-12-19 6:41 ` Benjamin Herrenschmidt
2005-12-22 9:27 ` Ravikiran G Thirumalai
0 siblings, 2 replies; 16+ messages in thread
From: Sonny Rao @ 2005-12-19 5:16 UTC (permalink / raw)
To: linux-kernel; +Cc: manfred, clameter, anton, sonnyrao
(apologies if this is a dup)
Hi, I'm crashing 2.6.15-rc5 when I try to offline the last and only CPU in a node on a ppc64 Power5 with SMT disabled.
Here's the backtrace:
0:mon> t
[c0000001ad033820] c000000000096a7c .kfree+0x250/0x280
[c0000001ad0338d0] c00000000009a544 .cpuup_callback+0x238/0x5fc
[c0000001ad0339c0] c000000000068114 .notifier_call_chain+0x68/0x9c
[c0000001ad033a50] c0000000000789fc .cpu_down+0x1fc/0x368
[c0000001ad033b40] c0000000002ac658 .store_online+0x88/0xe8
[c0000001ad033bd0] c0000000002a6f14 .sysdev_store+0x4c/0x68
[c0000001ad033c50] c000000000110368 .sysfs_write_file+0x100/0x1a0
[c0000001ad033cf0] c0000000000be854 .vfs_write+0x100/0x200
[c0000001ad033d90] c0000000000bea64 .sys_write+0x54/0x9c
[c0000001ad033e30] c000000000008600 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000fe5ec10
SP (ffc4c4f0) is in userspace
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c0000001ad033520]
pc: c00000000048bd30: ._spin_lock+0x18/0x80
lr: c000000000096a7c: .kfree+0x250/0x280
sp: c0000001ad0337a0
msr: 8000000000001032
dar: 48
dsisr: 40000000
current = 0xc0000001aff12040
paca = 0xc0000000005c1000
pid = 17376, comm = bash
Should I try this with CONFIG_DEBUG_SLAB ?
Sonny
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-19 5:16 cpu hotplug oops on 2.6.15-rc5 Sonny Rao
@ 2005-12-19 6:41 ` Benjamin Herrenschmidt
2005-12-19 7:08 ` Sonny Rao
2005-12-22 9:27 ` Ravikiran G Thirumalai
1 sibling, 1 reply; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2005-12-19 6:41 UTC (permalink / raw)
To: Sonny Rao; +Cc: linux-kernel, manfred, clameter, anton, sonnyrao
On Mon, 2005-12-19 at 00:16 -0500, Sonny Rao wrote:
> (apologies if this is a dup)
>
> Hi, I'm crashing 2.6.15-rc5 when I try and offline the last and only CPU in a node on a ppc64 Power5, SMT was disabled.
First try on -rc6 just in case it's related to the SCSI fix (the bug was
corrupting the SLAB) that got merged just after rc5 iirc.
Ben.
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-19 6:41 ` Benjamin Herrenschmidt
@ 2005-12-19 7:08 ` Sonny Rao
2005-12-19 21:17 ` Manfred Spraul
0 siblings, 1 reply; 16+ messages in thread
From: Sonny Rao @ 2005-12-19 7:08 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: linux-kernel, manfred, clameter, anton, sonnyrao
On Mon, Dec 19, 2005 at 05:41:57PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2005-12-19 at 00:16 -0500, Sonny Rao wrote:
> > (apologies if this is a dup)
> >
> > Hi, I'm crashing 2.6.15-rc5 when I try and offline the last and only CPU in a node on a ppc64 Power5, SMT was disabled.
>
> First try on -rc6 just in case it's related to the SCSI fix (the bug was
> corrupting the SLAB) that got merged just after rc5 iirc.
Ok, tried it: same crash on -rc6
2:mon> t
[c000000d9f33b820] c000000000097cd0 .kfree+0x29c/0x2cc
[c000000d9f33b8d0] c00000000009c3a8 .cpuup_callback+0x4f8/0x5fc
[c000000d9f33b9c0] c00000000048ff4c .notifier_call_chain+0x68/0x9c
[c000000d9f33ba50] c000000000078da8 .cpu_down+0x1fc/0x368
[c000000d9f33bb40] c0000000002ae514 .store_online+0x88/0xe8
[c000000d9f33bbd0] c0000000002a8dd0 .sysdev_store+0x4c/0x68
[c000000d9f33bc50] c000000000111e70 .sysfs_write_file+0x100/0x1a0
[c000000d9f33bcf0] c0000000000c0360 .vfs_write+0x100/0x200
[c000000d9f33bd90] c0000000000c0570 .sys_write+0x54/0x9c
[c000000d9f33be30] c000000000008600 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000fe5ec10
SP (ffa204f0) is in userspace
2:mon>
Sonny
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-19 7:08 ` Sonny Rao
@ 2005-12-19 21:17 ` Manfred Spraul
2005-12-19 23:16 ` SPAMHAUS-Re: " Sonny Rao
2005-12-19 23:40 ` Anton Blanchard
0 siblings, 2 replies; 16+ messages in thread
From: Manfred Spraul @ 2005-12-19 21:17 UTC (permalink / raw)
To: Sonny Rao; +Cc: Benjamin Herrenschmidt, linux-kernel, clameter, anton, sonnyrao
Sonny Rao wrote:
>Ok, tried it: same crash on -rc6
>
>2:mon> t
>[c000000d9f33b820] c000000000097cd0 .kfree+0x29c/0x2cc
>[c000000d9f33b8d0] c00000000009c3a8 .cpuup_callback+0x4f8/0x5fc
>[c000000d9f33b9c0] c00000000048ff4c .notifier_call_chain+0x68/0x9c
>[c000000d9f33ba50] c000000000078da8 .cpu_down+0x1fc/0x368
>[c000000d9f33bb40] c0000000002ae514 .store_online+0x88/0xe8
>[c000000d9f33bbd0] c0000000002a8dd0 .sysdev_store+0x4c/0x68
>[c000000d9f33bc50] c000000000111e70 .sysfs_write_file+0x100/0x1a0
>[c000000d9f33bcf0] c0000000000c0360 .vfs_write+0x100/0x200
>[c000000d9f33bd90] c0000000000c0570 .sys_write+0x54/0x9c
>[c000000d9f33be30] c000000000008600 syscall_exit+0x0/0x18
>
>
Very odd call chain.
Could you enable slab debugging?
--
Manfred
* Re: SPAMHAUS-Re: cpu hotplug oops on 2.6.15-rc5
2005-12-19 21:17 ` Manfred Spraul
@ 2005-12-19 23:16 ` Sonny Rao
2005-12-19 23:40 ` Anton Blanchard
1 sibling, 0 replies; 16+ messages in thread
From: Sonny Rao @ 2005-12-19 23:16 UTC (permalink / raw)
To: Manfred Spraul
Cc: Benjamin Herrenschmidt, linux-kernel, clameter, anton, sonnyrao
On Mon, Dec 19, 2005 at 10:17:04PM +0100, Manfred Spraul wrote:
> Sonny Rao wrote:
>
> >Ok, tried it: same crash on -rc6
> >
> >2:mon> t
> >[c000000d9f33b820] c000000000097cd0 .kfree+0x29c/0x2cc
> >[c000000d9f33b8d0] c00000000009c3a8 .cpuup_callback+0x4f8/0x5fc
> >[c000000d9f33b9c0] c00000000048ff4c .notifier_call_chain+0x68/0x9c
> >[c000000d9f33ba50] c000000000078da8 .cpu_down+0x1fc/0x368
> >[c000000d9f33bb40] c0000000002ae514 .store_online+0x88/0xe8
> >[c000000d9f33bbd0] c0000000002a8dd0 .sysdev_store+0x4c/0x68
> >[c000000d9f33bc50] c000000000111e70 .sysfs_write_file+0x100/0x1a0
> >[c000000d9f33bcf0] c0000000000c0360 .vfs_write+0x100/0x200
> >[c000000d9f33bd90] c0000000000c0570 .sys_write+0x54/0x9c
> >[c000000d9f33be30] c000000000008600 syscall_exit+0x0/0x18
> >
> >
> Very odd call chain.
> Could you enable slab debugging?
Actually, I did turn on slab debugging on -rc6, but it did not seem to
make any difference.
Sonny
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-19 21:17 ` Manfred Spraul
2005-12-19 23:16 ` SPAMHAUS-Re: " Sonny Rao
@ 2005-12-19 23:40 ` Anton Blanchard
1 sibling, 0 replies; 16+ messages in thread
From: Anton Blanchard @ 2005-12-19 23:40 UTC (permalink / raw)
To: Manfred Spraul
Cc: Sonny Rao, Benjamin Herrenschmidt, linux-kernel, clameter,
sonnyrao
Hi Manfred,
> Very odd call chain.
> Could you enable slab debugging?
Sonny and I had a look around; the crash seems to be in the
cpuup_callback() / CPU_DEAD case:
	if (!cpus_empty(mask)) {
		spin_unlock(&l3->list_lock);
		goto unlock_cache;
	}
	if (l3->shared) {
		free_block(cachep, l3->shared->entry,
			   l3->shared->avail, node);
		kfree(l3->shared);   <-------- HERE
		l3->shared = NULL;
	}
So we are removing the last cpu in a node, and tearing down the
node-related structures. We looked at kfree() -> __cache_free() and we couldn't
convince ourselves that all the CONFIG_NUMA stuff in there wouldn't trip
over itself (since we would be doing the free on an alien node).
Anton
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-19 5:16 cpu hotplug oops on 2.6.15-rc5 Sonny Rao
2005-12-19 6:41 ` Benjamin Herrenschmidt
@ 2005-12-22 9:27 ` Ravikiran G Thirumalai
[not found] ` <20051222173700.GA5723@localhost.localdomain>
1 sibling, 1 reply; 16+ messages in thread
From: Ravikiran G Thirumalai @ 2005-12-22 9:27 UTC (permalink / raw)
To: Sonny Rao; +Cc: linux-kernel, manfred, clameter, anton, sonnyrao, shai
On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> (apologies if this is a dup)
>
> Hi, I'm crashing 2.6.15-rc5 when I try and offline the last and only CPU in a node on a ppc64 Power5, SMT was disabled.
>
> Here's the backtrace:
>
> 0:mon> t
> [c0000001ad033820] c000000000096a7c .kfree+0x250/0x280
> [c0000001ad0338d0] c00000000009a544 .cpuup_callback+0x238/0x5fc
> [c0000001ad0339c0] c000000000068114 .notifier_call_chain+0x68/0x9c
> [c0000001ad033a50] c0000000000789fc .cpu_down+0x1fc/0x368
> [c0000001ad033b40] c0000000002ac658 .store_online+0x88/0xe8
> [c0000001ad033bd0] c0000000002a6f14 .sysdev_store+0x4c/0x68
> [c0000001ad033c50] c000000000110368 .sysfs_write_file+0x100/0x1a0
> [c0000001ad033cf0] c0000000000be854 .vfs_write+0x100/0x200
> [c0000001ad033d90] c0000000000bea64 .sys_write+0x54/0x9c
> [c0000001ad033e30] c000000000008600 syscall_exit+0x0/0x18
> --- Exception: c01 (System Call) at 000000000fe5ec10
> SP (ffc4c4f0) is in userspace
>
> 0:mon> e
> cpu 0x0: Vector: 300 (Data Access) at [c0000001ad033520]
> pc: c00000000048bd30: ._spin_lock+0x18/0x80
> lr: c000000000096a7c: .kfree+0x250/0x280
> sp: c0000001ad0337a0
> msr: 8000000000001032
> dar: 48
> dsisr: 40000000
> current = 0xc0000001aff12040
> paca = 0xc0000000005c1000
> pid = 17376, comm = bash
>
>
Sonny,
Does this patch fix the issue? This one applies cleanly on 2.6.15-rc6
unlike the one that was sent to you earlier.
Thanks,
Kiran
From: Alok N Kataria <alokk@calsoftinc.com>
Fixes a bug in the CPU_DOWN call path: we shouldn't call kfree while
holding kmem_list3's list lock, nor should drain_alien_cache be called
with l3's list lock held.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Index: linux-2.6.15-rc6/mm/slab.c
===================================================================
--- linux-2.6.15-rc6.orig/mm/slab.c 2005-12-21 22:32:14.000000000 -0800
+++ linux-2.6.15-rc6/mm/slab.c 2005-12-21 22:32:58.000000000 -0800
@@ -824,14 +824,14 @@ static inline void __drain_alien_cache(k
}
}
-static void drain_alien_cache(kmem_cache_t *cachep, struct kmem_list3 *l3)
+static void drain_alien_cache(kmem_cache_t *cachep, struct array_cache **alien)
{
int i=0;
struct array_cache *ac;
unsigned long flags;
for_each_online_node(i) {
- ac = l3->alien[i];
+ ac = alien[i];
if (ac) {
spin_lock_irqsave(&ac->lock, flags);
__drain_alien_cache(cachep, ac, i);
@@ -842,7 +842,7 @@ static void drain_alien_cache(kmem_cache
#else
#define alloc_alien_cache(node, limit) do { } while (0)
#define free_alien_cache(ac_ptr) do { } while (0)
-#define drain_alien_cache(cachep, l3) do { } while (0)
+#define drain_alien_cache(cachep, alien) do { } while (0)
#endif
static int __devinit cpuup_callback(struct notifier_block *nfb,
@@ -921,7 +921,7 @@ static int __devinit cpuup_callback(stru
down(&cache_chain_sem);
list_for_each_entry(cachep, &cache_chain, next) {
- struct array_cache *nc;
+ struct array_cache *nc, *shared, **alien;
cpumask_t mask;
mask = node_to_cpumask(node);
@@ -932,7 +932,7 @@ static int __devinit cpuup_callback(stru
l3 = cachep->nodelists[node];
if (!l3)
- goto unlock_cache;
+ goto free_array_cache;
spin_lock(&l3->list_lock);
@@ -943,32 +943,40 @@ static int __devinit cpuup_callback(stru
if (!cpus_empty(mask)) {
spin_unlock(&l3->list_lock);
- goto unlock_cache;
+ goto free_array_cache;
}
- if (l3->shared) {
+ if ((shared = l3->shared)) {
free_block(cachep, l3->shared->entry,
l3->shared->avail, node);
kfree(l3->shared);
l3->shared = NULL;
}
- if (l3->alien) {
- drain_alien_cache(cachep, l3);
- free_alien_cache(l3->alien);
- l3->alien = NULL;
+
+ alien = l3->alien;
+ l3->alien = NULL;
+
+ spin_unlock(&l3->list_lock);
+
+ kfree(nc);
+ kfree(shared);
+ if (alien) {
+ drain_alien_cache(cachep, alien);
+ free_alien_cache(alien);
}
/* free slabs belonging to this node */
if (__node_shrink(cachep, node)) {
+ spin_lock(&l3->list_lock);
cachep->nodelists[node] = NULL;
spin_unlock(&l3->list_lock);
kfree(l3);
- } else {
- spin_unlock(&l3->list_lock);
}
+ goto unlock_cache;
+free_array_cache:
+ kfree(nc);
unlock_cache:
spin_unlock_irq(&cachep->spinlock);
- kfree(nc);
}
up(&cache_chain_sem);
break;
@@ -1918,7 +1926,7 @@ static void drain_cpu_caches(kmem_cache_
drain_array_locked(cachep, l3->shared, 1, node);
spin_unlock(&l3->list_lock);
if (l3->alien)
- drain_alien_cache(cachep, l3);
+ drain_alien_cache(cachep, l3->alien);
}
}
spin_unlock_irq(&cachep->spinlock);
@@ -3310,7 +3318,7 @@ static void cache_reap(void *unused)
l3 = searchp->nodelists[numa_node_id()];
if (l3->alien)
- drain_alien_cache(searchp, l3);
+ drain_alien_cache(searchp, l3->alien);
spin_lock_irq(&l3->list_lock);
drain_array_locked(searchp, ac_data(searchp), 0,
* Re: cpu hotplug oops on 2.6.15-rc5
[not found] ` <20051222173700.GA5723@localhost.localdomain>
@ 2005-12-22 17:53 ` Sonny Rao
2005-12-22 18:37 ` Ravikiran G Thirumalai
0 siblings, 1 reply; 16+ messages in thread
From: Sonny Rao @ 2005-12-22 17:53 UTC (permalink / raw)
To: Ravikiran G Thirumalai
Cc: linux-kernel, manfred, clameter, anton, shai, sonnyrao
On Thu, Dec 22, 2005 at 11:37:00AM -0600, Sonny Rao wrote:
> On Thu, Dec 22, 2005 at 01:27:43AM -0800, Ravikiran G Thirumalai wrote:
> > On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> > > (apologies if this is a dup)
> > >
> > > Hi, I'm crashing 2.6.15-rc5 when I try and offline the last and only CPU in a node on a ppc64 Power5, SMT was disabled.
> > >
> > > Here's the backtrace:
> > >
> > > 0:mon> t
> > > [c0000001ad033820] c000000000096a7c .kfree+0x250/0x280
> > > [c0000001ad0338d0] c00000000009a544 .cpuup_callback+0x238/0x5fc
> > > [c0000001ad0339c0] c000000000068114 .notifier_call_chain+0x68/0x9c
> > > [c0000001ad033a50] c0000000000789fc .cpu_down+0x1fc/0x368
> > > [c0000001ad033b40] c0000000002ac658 .store_online+0x88/0xe8
> > > [c0000001ad033bd0] c0000000002a6f14 .sysdev_store+0x4c/0x68
> > > [c0000001ad033c50] c000000000110368 .sysfs_write_file+0x100/0x1a0
> > > [c0000001ad033cf0] c0000000000be854 .vfs_write+0x100/0x200
> > > [c0000001ad033d90] c0000000000bea64 .sys_write+0x54/0x9c
> > > [c0000001ad033e30] c000000000008600 syscall_exit+0x0/0x18
> > > --- Exception: c01 (System Call) at 000000000fe5ec10
> > > SP (ffc4c4f0) is in userspace
> > >
> > > 0:mon> e
> > > cpu 0x0: Vector: 300 (Data Access) at [c0000001ad033520]
> > > pc: c00000000048bd30: ._spin_lock+0x18/0x80
> > > lr: c000000000096a7c: .kfree+0x250/0x280
> > > sp: c0000001ad0337a0
> > > msr: 8000000000001032
> > > dar: 48
> > > dsisr: 40000000
> > > current = 0xc0000001aff12040
> > > paca = 0xc0000000005c1000
> > > pid = 17376, comm = bash
> > >
> > >
> >
> > Sonny,
> > Does this patch fix the issue? This one applies cleanly on 2.6.15-rc6
> > unlike the one that was sent to you earlier.
>
> Hi, thanks, now I'm getting a slightly different error,
> hitting a BUG in the slab debug code:
>
> ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
> cpu 0x4: Vector: 700 (Program Check) at [c0000003a8c233f0]
> pc: c00000000009bb2c: .check_slabp+0x130/0x188
> lr: c00000000009bb28: .check_slabp+0x12c/0x188
> sp: c0000003a8c23670
> msr: 8000000000021032
> current = 0xc0000001b95297f0
> paca = 0xc0000000005d7000
> pid = 11116, comm = bash
> kernel BUG in check_slabp at mm/slab.c:2368!
> enter ? for help
>
>
> 4:mon> t
> [c0000003a8c23700] c00000000009d918 .free_block+0x168/0x294
> [c0000003a8c237e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> [c0000003a8c238a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> [c0000003a8c239b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> [c0000003a8c23a40] c00000000007d608 .cpu_down+0x1fc/0x358
> [c0000003a8c23b30] c0000000002bb4ec .store_online+0x88/0xe8
> [c0000003a8c23bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> [c0000003a8c23c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> [c0000003a8c23cf0] c0000000000c6078 .vfs_write+0x100/0x200
> [c0000003a8c23d90] c0000000000c6288 .sys_write+0x54/0x9c
> [c0000003a8c23e30] c000000000008600 syscall_exit+0x0/0x18
> --- Exception: c01 (System Call) at 000000000fe5ec10
> SP (ff865560) is in userspace
More details:
The above crash was with SMT on, and I had already off-lined the SMT
sibling thread.
When I boot with SMT off, I get a slightly different crash:
ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
cpu 0x0: Vector: 700 (Program Check) at [c0000003afa13480]
pc: c00000000009d960: .free_block+0x1b0/0x294
lr: c00000000009d95c: .free_block+0x1ac/0x294
sp: c0000003afa13700
msr: 8000000000021032
current = 0xc0000003afe04000
paca = 0xc0000000005d5000
pid = 10998, comm = bash
kernel BUG in free_block at mm/slab.c:2664!
enter ? for help
0:mon> t
[c0000003afa137e0] c00000000009d1dc .kfree+0x2b8/0x2d4
[c0000003afa138a0] c0000000000a1644 .cpuup_callback+0x144/0x618
[c0000003afa139b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
[c0000003afa13a40] c00000000007d608 .cpu_down+0x1fc/0x358
[c0000003afa13b30] c0000000002bb4ec .store_online+0x88/0xe8
[c0000003afa13bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
[c0000003afa13c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
[c0000003afa13cf0] c0000000000c6078 .vfs_write+0x100/0x200
[c0000003afa13d90] c0000000000c6288 .sys_write+0x54/0x9c
[c0000003afa13e30] c000000000008600 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000fe5ec10
SP (ff8b4560) is in userspace
This one points to a double free somewhere
Sonny
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-22 17:53 ` Sonny Rao
@ 2005-12-22 18:37 ` Ravikiran G Thirumalai
2005-12-22 18:39 ` Sonny Rao
2005-12-22 19:45 ` Sonny Rao
0 siblings, 2 replies; 16+ messages in thread
From: Ravikiran G Thirumalai @ 2005-12-22 18:37 UTC (permalink / raw)
To: Sonny Rao; +Cc: linux-kernel, manfred, clameter, anton, shai, sonnyrao, alokk
On Thu, Dec 22, 2005 at 12:53:11PM -0500, Sonny Rao wrote:
> On Thu, Dec 22, 2005 at 11:37:00AM -0600, Sonny Rao wrote:
> > On Thu, Dec 22, 2005 at 01:27:43AM -0800, Ravikiran G Thirumalai wrote:
> > > On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> > > > (apologies if this is a dup)
> > > ...
> > > Sonny,
> > > Does this patch fix the issue? This one applies cleanly on 2.6.15-rc6
> > > unlike the one that was sent to you earlier.
> >
> > Hi, thanks, now I'm getting a slightly different error,
> > hitting a BUG in the slab debug code:
> >
> > ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
> > cpu 0x4: Vector: 700 (Program Check) at [c0000003a8c233f0]
> > pc: c00000000009bb2c: .check_slabp+0x130/0x188
> > lr: c00000000009bb28: .check_slabp+0x12c/0x188
> > sp: c0000003a8c23670
> > msr: 8000000000021032
> > current = 0xc0000001b95297f0
> > paca = 0xc0000000005d7000
> > pid = 11116, comm = bash
> > kernel BUG in check_slabp at mm/slab.c:2368!
> > enter ? for help
> >
> >
> > 4:mon> t
> > [c0000003a8c23700] c00000000009d918 .free_block+0x168/0x294
> > [c0000003a8c237e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> > [c0000003a8c238a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> > [c0000003a8c239b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> > [c0000003a8c23a40] c00000000007d608 .cpu_down+0x1fc/0x358
> > [c0000003a8c23b30] c0000000002bb4ec .store_online+0x88/0xe8
> > [c0000003a8c23bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> > [c0000003a8c23c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> > [c0000003a8c23cf0] c0000000000c6078 .vfs_write+0x100/0x200
> > [c0000003a8c23d90] c0000000000c6288 .sys_write+0x54/0x9c
> > [c0000003a8c23e30] c000000000008600 syscall_exit+0x0/0x18
> > --- Exception: c01 (System Call) at 000000000fe5ec10
> > SP (ff865560) is in userspace
>
> More details:
>
> The above crash was with SMT on, and I had already off-lined the SMT
> sibling thread.
>
> When I boot with SMT off, I get a slightly different crash:
I think I missed the first reply above. (I can't seem to find it on lkml
either.) So just to confirm, both these crashes are with the new patch on
top of rc6?
Thanks,
Kiran
>
> ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
> cpu 0x0: Vector: 700 (Program Check) at [c0000003afa13480]
> pc: c00000000009d960: .free_block+0x1b0/0x294
> lr: c00000000009d95c: .free_block+0x1ac/0x294
> sp: c0000003afa13700
> msr: 8000000000021032
> current = 0xc0000003afe04000
> paca = 0xc0000000005d5000
> pid = 10998, comm = bash
> kernel BUG in free_block at mm/slab.c:2664!
> enter ? for help
>
> 0:mon> t
> [c0000003afa137e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> [c0000003afa138a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> [c0000003afa139b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> [c0000003afa13a40] c00000000007d608 .cpu_down+0x1fc/0x358
> [c0000003afa13b30] c0000000002bb4ec .store_online+0x88/0xe8
> [c0000003afa13bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> [c0000003afa13c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> [c0000003afa13cf0] c0000000000c6078 .vfs_write+0x100/0x200
> [c0000003afa13d90] c0000000000c6288 .sys_write+0x54/0x9c
> [c0000003afa13e30] c000000000008600 syscall_exit+0x0/0x18
> --- Exception: c01 (System Call) at 000000000fe5ec10
> SP (ff8b4560) is in userspace
>
> This one points to a double free somewhere
>
>
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-22 18:37 ` Ravikiran G Thirumalai
@ 2005-12-22 18:39 ` Sonny Rao
2005-12-22 18:54 ` Christoph Lameter
2005-12-22 19:45 ` Sonny Rao
1 sibling, 1 reply; 16+ messages in thread
From: Sonny Rao @ 2005-12-22 18:39 UTC (permalink / raw)
To: Ravikiran G Thirumalai
Cc: linux-kernel, manfred, clameter, anton, shai, sonnyrao, alokk
On Thu, Dec 22, 2005 at 10:37:50AM -0800, Ravikiran G Thirumalai wrote:
> On Thu, Dec 22, 2005 at 12:53:11PM -0500, Sonny Rao wrote:
> > On Thu, Dec 22, 2005 at 11:37:00AM -0600, Sonny Rao wrote:
> > > On Thu, Dec 22, 2005 at 01:27:43AM -0800, Ravikiran G Thirumalai wrote:
> > > > On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> > > > > (apologies if this is a dup)
> > > > ...
> > > > Sonny,
> > > > Does this patch fix the issue? This one applies cleanly on 2.6.15-rc6
> > > > unlike the one that was sent to you earlier.
> > >
> > > Hi, thanks, now I'm getting a slightly different error,
> > > hitting a BUG in the slab debug code:
> > >
> > > ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
> > > cpu 0x4: Vector: 700 (Program Check) at [c0000003a8c233f0]
> > > pc: c00000000009bb2c: .check_slabp+0x130/0x188
> > > lr: c00000000009bb28: .check_slabp+0x12c/0x188
> > > sp: c0000003a8c23670
> > > msr: 8000000000021032
> > > current = 0xc0000001b95297f0
> > > paca = 0xc0000000005d7000
> > > pid = 11116, comm = bash
> > > kernel BUG in check_slabp at mm/slab.c:2368!
> > > enter ? for help
> > >
> > >
> > > 4:mon> t
> > > [c0000003a8c23700] c00000000009d918 .free_block+0x168/0x294
> > > [c0000003a8c237e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> > > [c0000003a8c238a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> > > [c0000003a8c239b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> > > [c0000003a8c23a40] c00000000007d608 .cpu_down+0x1fc/0x358
> > > [c0000003a8c23b30] c0000000002bb4ec .store_online+0x88/0xe8
> > > [c0000003a8c23bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> > > [c0000003a8c23c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> > > [c0000003a8c23cf0] c0000000000c6078 .vfs_write+0x100/0x200
> > > [c0000003a8c23d90] c0000000000c6288 .sys_write+0x54/0x9c
> > > [c0000003a8c23e30] c000000000008600 syscall_exit+0x0/0x18
> > > --- Exception: c01 (System Call) at 000000000fe5ec10
> > > SP (ff865560) is in userspace
> >
> > More details:
> >
> > The above crash was with SMT on, and I had already off-lined the SMT
> > sibling thread.
> >
> > When I boot with SMT off, I get a slightly different crash:
>
> I think i missed the first reply above. (I can't seem to find it on lkml
> either). So just to confirm, both these crashes are with the new patch on
> top of rc6?
Yes, rc6 + the patch you provided.
The stupid mail relay server I'm using for my ibm account seems to be very
lethargic, sorry about that.
Sonny
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-22 18:39 ` Sonny Rao
@ 2005-12-22 18:54 ` Christoph Lameter
2005-12-22 19:09 ` Sonny Rao
0 siblings, 1 reply; 16+ messages in thread
From: Christoph Lameter @ 2005-12-22 18:54 UTC (permalink / raw)
To: Sonny Rao
Cc: Ravikiran G Thirumalai, linux-kernel, manfred, anton, shai,
sonnyrao, alokk
On Thu, 22 Dec 2005, Sonny Rao wrote:
> Yes, rc6 + the patch you provided.
We may be going down the wrong path here. Has anyone other than Sonny
reproduced the problem?
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-22 18:54 ` Christoph Lameter
@ 2005-12-22 19:09 ` Sonny Rao
0 siblings, 0 replies; 16+ messages in thread
From: Sonny Rao @ 2005-12-22 19:09 UTC (permalink / raw)
To: Christoph Lameter
Cc: Ravikiran G Thirumalai, linux-kernel, manfred, anton, shai,
sonnyrao, alokk
On Thu, Dec 22, 2005 at 10:54:08AM -0800, Christoph Lameter wrote:
> On Thu, 22 Dec 2005, Sonny Rao wrote:
>
> > Yes, rc6 + the patch you provided.
>
> We may be going down the wrong path here. Has someone else than Sonny
> reproduced the problem?
Hi, I've also just reproduced the problem on another machine, which
has multiple CPUs per node rather than just one. The crash
occurs at the same place when I attempt to offline the last CPU in a
node.
But I agree that someone else should reproduce this. I only have ppc64
machines available to me right now.
Sonny
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-22 18:37 ` Ravikiran G Thirumalai
2005-12-22 18:39 ` Sonny Rao
@ 2005-12-22 19:45 ` Sonny Rao
2005-12-28 19:30 ` Nathan Lynch
1 sibling, 1 reply; 16+ messages in thread
From: Sonny Rao @ 2005-12-22 19:45 UTC (permalink / raw)
To: Ravikiran G Thirumalai
Cc: linux-kernel, manfred, clameter, anton, shai, sonnyrao, alokk
On Thu, Dec 22, 2005 at 10:37:50AM -0800, Ravikiran G Thirumalai wrote:
> On Thu, Dec 22, 2005 at 12:53:11PM -0500, Sonny Rao wrote:
> > On Thu, Dec 22, 2005 at 11:37:00AM -0600, Sonny Rao wrote:
> > > On Thu, Dec 22, 2005 at 01:27:43AM -0800, Ravikiran G Thirumalai wrote:
> > > > On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> > > > > (apologies if this is a dup)
> > > > ...
> > > > Sonny,
> > > > Does this patch fix the issue? This one applies cleanly on 2.6.15-rc6
> > > > unlike the one that was sent to you earlier.
> > >
> > > Hi, thanks, now I'm getting a slightly different error,
> > > hitting a BUG in the slab debug code:
> > >
> > > ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
> > > cpu 0x4: Vector: 700 (Program Check) at [c0000003a8c233f0]
> > > pc: c00000000009bb2c: .check_slabp+0x130/0x188
> > > lr: c00000000009bb28: .check_slabp+0x12c/0x188
> > > sp: c0000003a8c23670
> > > msr: 8000000000021032
> > > current = 0xc0000001b95297f0
> > > paca = 0xc0000000005d7000
> > > pid = 11116, comm = bash
> > > kernel BUG in check_slabp at mm/slab.c:2368!
> > > enter ? for help
> > >
> > >
> > > 4:mon> t
> > > [c0000003a8c23700] c00000000009d918 .free_block+0x168/0x294
> > > [c0000003a8c237e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> > > [c0000003a8c238a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> > > [c0000003a8c239b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> > > [c0000003a8c23a40] c00000000007d608 .cpu_down+0x1fc/0x358
> > > [c0000003a8c23b30] c0000000002bb4ec .store_online+0x88/0xe8
> > > [c0000003a8c23bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> > > [c0000003a8c23c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> > > [c0000003a8c23cf0] c0000000000c6078 .vfs_write+0x100/0x200
> > > [c0000003a8c23d90] c0000000000c6288 .sys_write+0x54/0x9c
> > > [c0000003a8c23e30] c000000000008600 syscall_exit+0x0/0x18
> > > --- Exception: c01 (System Call) at 000000000fe5ec10
> > > SP (ff865560) is in userspace
> >
> > More details:
> >
> > The above crash was with SMT on, and I had already off-lined the SMT
> > sibling thread.
> >
> > When I boot with SMT off, I get a slightly different crash:
>
> I think i missed the first reply above. (I can't seem to find it on lkml
> either). So just to confirm, both these crashes are with the new patch on
> top of rc6?
>
> Thanks,
> Kiran
>
> >
> > ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
> > cpu 0x0: Vector: 700 (Program Check) at [c0000003afa13480]
> > pc: c00000000009d960: .free_block+0x1b0/0x294
> > lr: c00000000009d95c: .free_block+0x1ac/0x294
> > sp: c0000003afa13700
> > msr: 8000000000021032
> > current = 0xc0000003afe04000
> > paca = 0xc0000000005d5000
> > pid = 10998, comm = bash
> > kernel BUG in free_block at mm/slab.c:2664!
> > enter ? for help
> >
> > 0:mon> t
> > [c0000003afa137e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> > [c0000003afa138a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> > [c0000003afa139b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> > [c0000003afa13a40] c00000000007d608 .cpu_down+0x1fc/0x358
> > [c0000003afa13b30] c0000000002bb4ec .store_online+0x88/0xe8
> > [c0000003afa13bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> > [c0000003afa13c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> > [c0000003afa13cf0] c0000000000c6078 .vfs_write+0x100/0x200
> > [c0000003afa13d90] c0000000000c6288 .sys_write+0x54/0x9c
> > [c0000003afa13e30] c000000000008600 syscall_exit+0x0/0x18
> > --- Exception: c01 (System Call) at 000000000fe5ec10
> > SP (ff8b4560) is in userspace
> >
> > This one points to a double free somewhere
Hi, I think I've found the double free in the rc6 kernel + your patch,
starting on line 949 of the patched slab.c:
	if ((shared = l3->shared)) {
		free_block(cachep, l3->shared->entry,
			   l3->shared->avail, node);
		kfree(l3->shared);
		l3->shared = NULL;
	}
	alien = l3->alien;
	l3->alien = NULL;
	spin_unlock(&l3->list_lock);
	kfree(nc);
	kfree(shared);
You conditionally free l3->shared after assigning it to the auto var "shared",
then below that you call kfree on "shared" again == double free.
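To make the pattern concrete, here is a minimal userspace sketch of "detach under the lock, free exactly once after dropping it". It is not the kernel code: struct node_lists, checked_free() and teardown_shared() are hypothetical stand-ins, the lock is a plain flag rather than a spinlock, and free() stands in for kfree().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct kmem_list3, reduced to what the sketch needs. */
struct node_lists {
	int lock_held;	/* toy lock: 1 while "list_lock" is held */
	int *shared;	/* stand-in for l3->shared */
};

static int free_calls;	/* number of non-NULL pointers released so far */

/* Release p, refusing to run while the toy lock is held. */
static void checked_free(struct node_lists *l3, void *p)
{
	assert(!l3->lock_held);
	if (p) {
		free_calls++;
		free(p);
	}
}

/* Detach the shared array under the lock, then free it once outside the lock. */
static void teardown_shared(struct node_lists *l3)
{
	int *shared;

	l3->lock_held = 1;		/* spin_lock(&l3->list_lock) in the kernel */
	shared = l3->shared;		/* take ownership of the array */
	l3->shared = NULL;		/* unpublish it so nobody else can reach it */
	l3->lock_held = 0;		/* spin_unlock(&l3->list_lock) */

	checked_free(l3, shared);	/* the only free, done outside the lock */
}
```

Because the pointer is cleared while the lock is held and released through the local copy only, calling teardown_shared() a second time on the same structure frees nothing.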
So, I got rid of the extra free. I don't know if this was correct but
I tried it anyway. Unfortunately this still does not work correctly.
The system hangs for a period of time and then drops into the debugger
again:
0:mon> t
[c00000000f71f890] c00000000049e5ec ._spin_lock+0x10/0x24
[c00000000f71f910] c00000000009d550 .kmem_cache_free+0x270/0x2a4
[c00000000f71f9d0] c0000000003f35e8 .kfree_skbmem+0xa0/0xfc
[c00000000f71fa50] c00000000044d01c .udp_rcv+0x7ac/0x818
[c00000000f71fb60] c000000000420b14 .ip_local_deliver+0xf8/0x3f0
[c00000000f71fbf0] c000000000420328 .ip_rcv+0x3a8/0x724
[c00000000f71fc90] c0000000003fa054 .netif_receive_skb+0x378/0x3d0
[c00000000f71fd30] c0000000003fa1c4 .process_backlog+0x118/0x254
[c00000000f71fe10] c0000000003f7d3c .net_rx_action+0x188/0x2b8
[c00000000f71fed0] c000000000060f18 .__do_softirq+0xd4/0x1b8
[c00000000f71ff90] c00000000002c78c .call_do_softirq+0x14/0x24
[c0000000005ab870] c00000000000bd30 .do_softirq+0x8c/0x9c
[c0000000005ab900] c00000000006143c .irq_exit+0x6c/0x84
[c0000000005ab980] c00000000000c060 .do_IRQ+0xe8/0x194
[c0000000005aba10] c000000000004134 hardware_interrupt_entry+0x8/0x54
--- Exception: 501 (Hardware Interrupt) at c000000000040670
.pseries_dedicated_idle+0x114/0x268
[c0000000005abde0] c000000000021048 .cpu_idle+0x4c/0x60
[c0000000005abe50] c0000000000091f4 .rest_init+0x44/0x5c
[c0000000005abed0] c00000000054e7f4 .start_kernel+0x29c/0x318
[c0000000005abf90] c000000000008494 .hmt_init+0x0/0x6c
0:mon>
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000000f71f580]
pc: c000000000238db4: ._raw_spin_lock+0x2c/0x1d0
lr: c00000000049e5ec: ._spin_lock+0x10/0x24
sp: c00000000f71f800
msr: 8000000000001032
dar: 4c
dsisr: 40000000
current = 0xc00000000061b2f0
paca = 0xc0000000005d5000
pid = 0, comm = swapper
0:mon>
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-22 19:45 ` Sonny Rao
@ 2005-12-28 19:30 ` Nathan Lynch
2005-12-29 0:30 ` Sonny Rao
0 siblings, 1 reply; 16+ messages in thread
From: Nathan Lynch @ 2005-12-28 19:30 UTC (permalink / raw)
To: Sonny Rao
Cc: Ravikiran G Thirumalai, linux-kernel, manfred, clameter, anton,
shai, sonnyrao, alokk
I wonder if this is related to the problem Sonny is seeing -- powerpc's
definitions of cpu_to_node et al. are not being used. The culprit is
some too-clever preprocessor usage in asm-generic/topology.h, for
example:
#ifndef cpu_to_node
#define cpu_to_node(cpu) (0)
#endif
But asm-powerpc/topology.h has cpu_to_node defined as a static inline
(which does not make it a preprocessor symbol), so we get the generic
- and incorrect - definition.
Does removing the #include of asm-generic/topology.h from the bottom
of asm-powerpc/topology.h have any effect?
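For illustration, the mechanism can be reproduced in a single userspace file. This is only a sketch under assumptions: the "two CPUs per node" mapping, cpu_to_node_guarded, and the helper functions are hypothetical and are not taken from either header. It shows that an #ifndef guard cannot see a static inline, and that one common way around it is to also define a same-named macro.

```c
#include <assert.h>

/* Case 1: the "arch" side provides only a function; #ifndef cannot see it. */
static inline int cpu_to_node(int cpu)
{
	return cpu / 2;		/* hypothetical topology: two CPUs per node */
}

#ifndef cpu_to_node
#define cpu_to_node(cpu) (0)	/* the "generic" fallback silently wins */
#endif

/* Case 2: the "arch" side also defines a same-named macro, so the guard sees it. */
static inline int cpu_to_node_guarded(int cpu)
{
	return cpu / 2;
}
#define cpu_to_node_guarded cpu_to_node_guarded

#ifndef cpu_to_node_guarded
#define cpu_to_node_guarded(cpu) (0)	/* skipped: the symbol is already defined */
#endif

/* Expands to (0): the function above is shadowed by the fallback macro. */
static int shadowed_result(void)
{
	return cpu_to_node(3);
}

/* Calls the real function: the self-referential macro is not expanded again. */
static int guarded_result(void)
{
	return cpu_to_node_guarded(3);
}
```

In this sketch shadowed_result() returns 0 for every CPU, while guarded_result() returns the node computed by the function.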
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-28 19:30 ` Nathan Lynch
@ 2005-12-29 0:30 ` Sonny Rao
2005-12-29 4:18 ` Nathan Lynch
0 siblings, 1 reply; 16+ messages in thread
From: Sonny Rao @ 2005-12-29 0:30 UTC (permalink / raw)
To: Nathan Lynch
Cc: Ravikiran G Thirumalai, linux-kernel, manfred, clameter, anton,
shai, sonnyrao, alokk
On Wed, Dec 28, 2005 at 01:30:12PM -0600, Nathan Lynch wrote:
> I wonder if this is related to the problem Sonny is seeing -- powerpc's
> definitions of cpu_to_node et al. are not being used. The culprit is
> some too-clever preprocessor usage in asm-generic/topology.h, for
> example:
>
>
> #ifndef cpu_to_node
> #define cpu_to_node(cpu) (0)
> #endif
>
> But asm-powerpc/topology.h has cpu_to_node defined as a static inline
> (which does not make it a preprocessor symbol), so we get the generic
> - and incorrect - definition.
>
> Does removing the #include of asm-generic/topology.h from the bottom
> of asm-powerpc/topology.h have any effect?
Hi, no it doesn't make a difference. That include is protected by
CONFIG_NUMA as well, so it never gets hit. At Anton's suggestion I
even put in an #error into asm-generic/topology.h to make sure it
wasn't an issue -- it didn't hit.
Sonny
* Re: cpu hotplug oops on 2.6.15-rc5
2005-12-29 0:30 ` Sonny Rao
@ 2005-12-29 4:18 ` Nathan Lynch
0 siblings, 0 replies; 16+ messages in thread
From: Nathan Lynch @ 2005-12-29 4:18 UTC (permalink / raw)
To: Sonny Rao
Cc: Ravikiran G Thirumalai, linux-kernel, manfred, clameter, anton,
shai, sonnyrao, alokk
Sonny Rao wrote:
> On Wed, Dec 28, 2005 at 01:30:12PM -0600, Nathan Lynch wrote:
> >
> > Does removing the #include of asm-generic/topology.h from the bottom
> > of asm-powerpc/topology.h have any effect?
>
> Hi, no it doesn't make a difference. That include is protected by
> CONFIG_NUMA as well, so it never gets hit. At Anton's suggestion I
> even put in an #error into asm-generic/topology.h to make sure it
> wasn't an issue -- it didn't hit.
Gah, sorry, forgot Anton fixed this a while back.