cpu hotplug oops on 2.6.15-rc5

public inbox for linux-kernel@vger.kernel.org

* cpu hotplug oops on 2.6.15-rc5
@ 2005-12-19  5:16 Sonny Rao
  2005-12-19  6:41 ` Benjamin Herrenschmidt
  2005-12-22  9:27 ` Ravikiran G Thirumalai
  0 siblings, 2 replies; 16+ messages in thread
From: Sonny Rao @ 2005-12-19  5:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: manfred, clameter, anton, sonnyrao

(apologies if this is a dup)

Hi, I'm crashing 2.6.15-rc5 when I try and offline the last and only CPU in a node on a ppc64 Power5, SMT was disabled.

Here's the backtrace:

0:mon> t
[c0000001ad033820] c000000000096a7c .kfree+0x250/0x280
[c0000001ad0338d0] c00000000009a544 .cpuup_callback+0x238/0x5fc
[c0000001ad0339c0] c000000000068114 .notifier_call_chain+0x68/0x9c
[c0000001ad033a50] c0000000000789fc .cpu_down+0x1fc/0x368
[c0000001ad033b40] c0000000002ac658 .store_online+0x88/0xe8
[c0000001ad033bd0] c0000000002a6f14 .sysdev_store+0x4c/0x68
[c0000001ad033c50] c000000000110368 .sysfs_write_file+0x100/0x1a0
[c0000001ad033cf0] c0000000000be854 .vfs_write+0x100/0x200
[c0000001ad033d90] c0000000000bea64 .sys_write+0x54/0x9c
[c0000001ad033e30] c000000000008600 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000fe5ec10
SP (ffc4c4f0) is in userspace

0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c0000001ad033520]
    pc: c00000000048bd30: ._spin_lock+0x18/0x80
    lr: c000000000096a7c: .kfree+0x250/0x280
    sp: c0000001ad0337a0
   msr: 8000000000001032
   dar: 48
 dsisr: 40000000
  current = 0xc0000001aff12040
  paca    = 0xc0000000005c1000
    pid   = 17376, comm = bash


Should I try this with CONFIG_DEBUG_SLAB ?

Sonny

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-19  5:16 cpu hotplug oops on 2.6.15-rc5 Sonny Rao
@ 2005-12-19  6:41 ` Benjamin Herrenschmidt
  2005-12-19  7:08   ` Sonny Rao
  2005-12-22  9:27 ` Ravikiran G Thirumalai
  1 sibling, 1 reply; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2005-12-19  6:41 UTC (permalink / raw)
  To: Sonny Rao; +Cc: linux-kernel, manfred, clameter, anton, sonnyrao

On Mon, 2005-12-19 at 00:16 -0500, Sonny Rao wrote:
> (apologies if this is a dup)
> 
> Hi, I'm crashing 2.6.15-rc5 when I try and offline the last and only CPU in a node on a ppc64 Power5, SMT was disabled.

First try on -rc6 just in case it's related to the SCSI fix (the bug was
corrupting the SLAB) that got merged just after rc5 iirc.

Ben.

* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-19  6:41 ` Benjamin Herrenschmidt
@ 2005-12-19  7:08   ` Sonny Rao
  2005-12-19 21:17     ` Manfred Spraul
  0 siblings, 1 reply; 16+ messages in thread
From: Sonny Rao @ 2005-12-19  7:08 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-kernel, manfred, clameter, anton, sonnyrao

On Mon, Dec 19, 2005 at 05:41:57PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2005-12-19 at 00:16 -0500, Sonny Rao wrote:
> > (apologies if this is a dup)
> > 
> > Hi, I'm crashing 2.6.15-rc5 when I try and offline the last and only CPU in a node on a ppc64 Power5, SMT was disabled.
> 
> First try on -rc6 just in case it's related to the SCSI fix (the bug was
> corrupting the SLAB) that got merged just after rc5 iirc.

Ok, tried it: same crash on -rc6

2:mon> t
[c000000d9f33b820] c000000000097cd0 .kfree+0x29c/0x2cc
[c000000d9f33b8d0] c00000000009c3a8 .cpuup_callback+0x4f8/0x5fc
[c000000d9f33b9c0] c00000000048ff4c .notifier_call_chain+0x68/0x9c
[c000000d9f33ba50] c000000000078da8 .cpu_down+0x1fc/0x368
[c000000d9f33bb40] c0000000002ae514 .store_online+0x88/0xe8
[c000000d9f33bbd0] c0000000002a8dd0 .sysdev_store+0x4c/0x68
[c000000d9f33bc50] c000000000111e70 .sysfs_write_file+0x100/0x1a0
[c000000d9f33bcf0] c0000000000c0360 .vfs_write+0x100/0x200
[c000000d9f33bd90] c0000000000c0570 .sys_write+0x54/0x9c
[c000000d9f33be30] c000000000008600 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000fe5ec10
SP (ffa204f0) is in userspace
2:mon> 

Sonny

* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-19  7:08   ` Sonny Rao
@ 2005-12-19 21:17     ` Manfred Spraul
  2005-12-19 23:16       ` SPAMHAUS-Re: " Sonny Rao
  2005-12-19 23:40       ` Anton Blanchard
  0 siblings, 2 replies; 16+ messages in thread
From: Manfred Spraul @ 2005-12-19 21:17 UTC (permalink / raw)
  To: Sonny Rao; +Cc: Benjamin Herrenschmidt, linux-kernel, clameter, anton, sonnyrao

Sonny Rao wrote:

>Ok, tried it: same crash on -rc6
>
>2:mon> t
>[c000000d9f33b820] c000000000097cd0 .kfree+0x29c/0x2cc
>[c000000d9f33b8d0] c00000000009c3a8 .cpuup_callback+0x4f8/0x5fc
>[c000000d9f33b9c0] c00000000048ff4c .notifier_call_chain+0x68/0x9c
>[c000000d9f33ba50] c000000000078da8 .cpu_down+0x1fc/0x368
>[c000000d9f33bb40] c0000000002ae514 .store_online+0x88/0xe8
>[c000000d9f33bbd0] c0000000002a8dd0 .sysdev_store+0x4c/0x68
>[c000000d9f33bc50] c000000000111e70 .sysfs_write_file+0x100/0x1a0
>[c000000d9f33bcf0] c0000000000c0360 .vfs_write+0x100/0x200
>[c000000d9f33bd90] c0000000000c0570 .sys_write+0x54/0x9c
>[c000000d9f33be30] c000000000008600 syscall_exit+0x0/0x18
>  
>
Very odd call chain.
Could you enable slab debugging?

--
    Manfred

* Re: SPAMHAUS-Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-19 21:17     ` Manfred Spraul
@ 2005-12-19 23:16       ` Sonny Rao
  2005-12-19 23:40       ` Anton Blanchard
  1 sibling, 0 replies; 16+ messages in thread
From: Sonny Rao @ 2005-12-19 23:16 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Benjamin Herrenschmidt, linux-kernel, clameter, anton, sonnyrao

On Mon, Dec 19, 2005 at 10:17:04PM +0100, Manfred Spraul wrote:
> Sonny Rao wrote:
> 
> >Ok, tried it: same crash on -rc6
> >
> >2:mon> t
> >[c000000d9f33b820] c000000000097cd0 .kfree+0x29c/0x2cc
> >[c000000d9f33b8d0] c00000000009c3a8 .cpuup_callback+0x4f8/0x5fc
> >[c000000d9f33b9c0] c00000000048ff4c .notifier_call_chain+0x68/0x9c
> >[c000000d9f33ba50] c000000000078da8 .cpu_down+0x1fc/0x368
> >[c000000d9f33bb40] c0000000002ae514 .store_online+0x88/0xe8
> >[c000000d9f33bbd0] c0000000002a8dd0 .sysdev_store+0x4c/0x68
> >[c000000d9f33bc50] c000000000111e70 .sysfs_write_file+0x100/0x1a0
> >[c000000d9f33bcf0] c0000000000c0360 .vfs_write+0x100/0x200
> >[c000000d9f33bd90] c0000000000c0570 .sys_write+0x54/0x9c
> >[c000000d9f33be30] c000000000008600 syscall_exit+0x0/0x18
> > 
> >
> Very odd call chain.
> Could you enable slab debugging?

Actually, I did turn on slab debugging on -rc6, but it did not seem to
make any difference.

Sonny

* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-19 21:17     ` Manfred Spraul
  2005-12-19 23:16       ` SPAMHAUS-Re: " Sonny Rao
@ 2005-12-19 23:40       ` Anton Blanchard
  1 sibling, 0 replies; 16+ messages in thread
From: Anton Blanchard @ 2005-12-19 23:40 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Sonny Rao, Benjamin Herrenschmidt, linux-kernel, clameter,
	sonnyrao

Hi Manfred,

> Very odd call chain.
> Could you enable slab debugging?

Sonny and I had a look around, it seems to be in the
cpuup_callback() / CPU_DEAD case:

      if (!cpus_empty(mask)) {
              spin_unlock(&l3->list_lock);
              goto unlock_cache;
      }

      if (l3->shared) {
              free_block(cachep, l3->shared->entry,
                              l3->shared->avail, node);
              kfree(l3->shared);                <-------- HERE
              l3->shared = NULL;
      }

So we are removing the last cpu in a node and tearing down the
node-related structures. We looked at kfree() -> __cache_free() and we
couldn't convince ourselves that all the CONFIG_NUMA stuff in there wouldn't
trip over itself (since we would be doing the free on an alien node).

Anton
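
A note on the register dump in the original report: a dar of 0x48 inside
._spin_lock is what one would expect if the lock being taken sits at offset
0x48 in a structure reached through a NULL pointer. The sketch below is a
user-space illustration only. The struct and function names are hypothetical,
and the field order is an assumption modeled loosely on the per-node list
structure in mm/slab.c of that era; it should be checked against the real
source before drawing any conclusion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a doubly linked list head on a 64-bit build. */
struct list_head_model {
	uint64_t next;
	uint64_t prev;
};

/*
 * Hypothetical per-node structure. The field order is an ASSUMPTION made
 * for illustration; it is not copied from the kernel source.
 */
struct node_lists_model {
	struct list_head_model slabs_partial;  /* 0x00 */
	struct list_head_model slabs_full;     /* 0x10 */
	struct list_head_model slabs_free;     /* 0x20 */
	uint64_t free_objects;                 /* 0x30 */
	uint64_t next_reap;                    /* 0x38 */
	int32_t  free_touched;                 /* 0x40 */
	uint32_t free_limit;                   /* 0x44 */
	uint32_t list_lock;                    /* 0x48 */
};

/* Sanity check on the model itself: each list head must be 16 bytes. */
static void check_model_layout(void)
{
	assert(sizeof(struct list_head_model) == 16);
}

/* Address a load or store would touch for a field at field_offset. */
static uintptr_t fault_address(uintptr_t base, size_t field_offset)
{
	return base + field_offset;
}

/* Address touched when taking list_lock through a NULL node pointer. */
static uintptr_t lock_address_for_null_node(void)
{
	return fault_address(0, offsetof(struct node_lists_model, list_lock));
}
```

If the real layout matches, the fault address would be consistent with
kfree() trying to take list_lock through a NULL per-node pointer, but nothing
in the thread so far confirms that.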

* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-19  5:16 cpu hotplug oops on 2.6.15-rc5 Sonny Rao
  2005-12-19  6:41 ` Benjamin Herrenschmidt
@ 2005-12-22  9:27 ` Ravikiran G Thirumalai
       [not found]   ` <20051222173700.GA5723@localhost.localdomain>
  1 sibling, 1 reply; 16+ messages in thread
From: Ravikiran G Thirumalai @ 2005-12-22  9:27 UTC (permalink / raw)
  To: Sonny Rao; +Cc: linux-kernel, manfred, clameter, anton, sonnyrao, shai

On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> (apologies if this is a dup)
> 
> Hi, I'm crashing 2.6.15-rc5 when I try and offline the last and only CPU in a node on a ppc64 Power5, SMT was disabled.
> 
> Here's the backtrace:
> 
> 0:mon> t
> [c0000001ad033820] c000000000096a7c .kfree+0x250/0x280
> [c0000001ad0338d0] c00000000009a544 .cpuup_callback+0x238/0x5fc
> [c0000001ad0339c0] c000000000068114 .notifier_call_chain+0x68/0x9c
> [c0000001ad033a50] c0000000000789fc .cpu_down+0x1fc/0x368
> [c0000001ad033b40] c0000000002ac658 .store_online+0x88/0xe8
> [c0000001ad033bd0] c0000000002a6f14 .sysdev_store+0x4c/0x68
> [c0000001ad033c50] c000000000110368 .sysfs_write_file+0x100/0x1a0
> [c0000001ad033cf0] c0000000000be854 .vfs_write+0x100/0x200
> [c0000001ad033d90] c0000000000bea64 .sys_write+0x54/0x9c
> [c0000001ad033e30] c000000000008600 syscall_exit+0x0/0x18
> --- Exception: c01 (System Call) at 000000000fe5ec10
> SP (ffc4c4f0) is in userspace
> 
> 0:mon> e
> cpu 0x0: Vector: 300 (Data Access) at [c0000001ad033520]
>     pc: c00000000048bd30: ._spin_lock+0x18/0x80
>     lr: c000000000096a7c: .kfree+0x250/0x280
>     sp: c0000001ad0337a0
>    msr: 8000000000001032
>    dar: 48
>  dsisr: 40000000
>   current = 0xc0000001aff12040
>   paca    = 0xc0000000005c1000
>     pid   = 17376, comm = bash
> 
> 

Sonny,
Does this patch fix the issue?   This one applies cleanly on 2.6.15-rc6
unlike the one that was sent to you earlier.

Thanks,
Kiran

From: Alok N Kataria <alokk@calsoftinc.com>

Fixes a bug in the CPU_DOWN call path, we shouldn't call kfree while
holding kmem_list3's list lock, nor should drain_alien_cache be called
with l3's list lock.

Signed-off-by : Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by : Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by : Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.15-rc6/mm/slab.c
===================================================================
--- linux-2.6.15-rc6.orig/mm/slab.c	2005-12-21 22:32:14.000000000 -0800
+++ linux-2.6.15-rc6/mm/slab.c	2005-12-21 22:32:58.000000000 -0800
@@ -824,14 +824,14 @@ static inline void __drain_alien_cache(k
 	}
 }
 
-static void drain_alien_cache(kmem_cache_t *cachep, struct kmem_list3 *l3)
+static void drain_alien_cache(kmem_cache_t *cachep, struct array_cache **alien)
 {
 	int i=0;
 	struct array_cache *ac;
 	unsigned long flags;
 
 	for_each_online_node(i) {
-		ac = l3->alien[i];
+		ac = alien[i];
 		if (ac) {
 			spin_lock_irqsave(&ac->lock, flags);
 			__drain_alien_cache(cachep, ac, i);
@@ -842,7 +842,7 @@ static void drain_alien_cache(kmem_cache
 #else
 #define alloc_alien_cache(node, limit) do { } while (0)
 #define free_alien_cache(ac_ptr) do { } while (0)
-#define drain_alien_cache(cachep, l3) do { } while (0)
+#define drain_alien_cache(cachep, alien) do { } while (0)
 #endif
 
 static int __devinit cpuup_callback(struct notifier_block *nfb,
@@ -921,7 +921,7 @@ static int __devinit cpuup_callback(stru
 		down(&cache_chain_sem);
 
 		list_for_each_entry(cachep, &cache_chain, next) {
-			struct array_cache *nc;
+			struct array_cache *nc, *shared, **alien;
 			cpumask_t mask;
 
 			mask = node_to_cpumask(node);
@@ -932,7 +932,7 @@ static int __devinit cpuup_callback(stru
 			l3 = cachep->nodelists[node];
 
 			if (!l3)
-				goto unlock_cache;
+				goto free_array_cache;
 
 			spin_lock(&l3->list_lock);
 
@@ -943,32 +943,40 @@ static int __devinit cpuup_callback(stru
 
 			if (!cpus_empty(mask)) {
                                 spin_unlock(&l3->list_lock);
-                                goto unlock_cache;
+                                goto free_array_cache;
                         }
 
-			if (l3->shared) {
+			if ((shared = l3->shared)) {
 				free_block(cachep, l3->shared->entry,
 						l3->shared->avail, node);
 				kfree(l3->shared);
 				l3->shared = NULL;
 			}
-			if (l3->alien) {
-				drain_alien_cache(cachep, l3);
-				free_alien_cache(l3->alien);
-				l3->alien = NULL;
+
+			alien = l3->alien;
+			l3->alien = NULL;
+
+			spin_unlock(&l3->list_lock);
+
+			kfree(nc);
+			kfree(shared);
+			if (alien) {
+				drain_alien_cache(cachep, alien);
+				free_alien_cache(alien);
 			}
 
 			/* free slabs belonging to this node */
 			if (__node_shrink(cachep, node)) {
+				spin_lock(&l3->list_lock);
 				cachep->nodelists[node] = NULL;
 				spin_unlock(&l3->list_lock);
 				kfree(l3);
-			} else {
-				spin_unlock(&l3->list_lock);
 			}
+			goto unlock_cache;
+free_array_cache:
+			kfree(nc);
 unlock_cache:
 			spin_unlock_irq(&cachep->spinlock);
-			kfree(nc);
 		}
 		up(&cache_chain_sem);
 		break;
@@ -1918,7 +1926,7 @@ static void drain_cpu_caches(kmem_cache_
 			drain_array_locked(cachep, l3->shared, 1, node);
 			spin_unlock(&l3->list_lock);
 			if (l3->alien)
-				drain_alien_cache(cachep, l3);
+				drain_alien_cache(cachep, l3->alien);
 		}
 	}
 	spin_unlock_irq(&cachep->spinlock);
@@ -3310,7 +3318,7 @@ static void cache_reap(void *unused)
 
 		l3 = searchp->nodelists[numa_node_id()];
 		if (l3->alien)
-			drain_alien_cache(searchp, l3);
+			drain_alien_cache(searchp, l3->alien);
 		spin_lock_irq(&l3->list_lock);
 
 		drain_array_locked(searchp, ac_data(searchp), 0,

* Re: cpu hotplug oops on 2.6.15-rc5
       [not found]   ` <20051222173700.GA5723@localhost.localdomain>
@ 2005-12-22 17:53     ` Sonny Rao
  2005-12-22 18:37       ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 16+ messages in thread
From: Sonny Rao @ 2005-12-22 17:53 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: linux-kernel, manfred, clameter, anton, shai, sonnyrao

On Thu, Dec 22, 2005 at 11:37:00AM -0600, Sonny Rao wrote:
> On Thu, Dec 22, 2005 at 01:27:43AM -0800, Ravikiran G Thirumalai wrote:
> > On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> > > (apologies if this is a dup)
> > > 
> > > Hi, I'm crashing 2.6.15-rc5 when I try and offline the last and only CPU in a node on a ppc64 Power5, SMT was disabled.
> > > 
> > > Here's the backtrace:
> > > 
> > > 0:mon> t
> > > [c0000001ad033820] c000000000096a7c .kfree+0x250/0x280
> > > [c0000001ad0338d0] c00000000009a544 .cpuup_callback+0x238/0x5fc
> > > [c0000001ad0339c0] c000000000068114 .notifier_call_chain+0x68/0x9c
> > > [c0000001ad033a50] c0000000000789fc .cpu_down+0x1fc/0x368
> > > [c0000001ad033b40] c0000000002ac658 .store_online+0x88/0xe8
> > > [c0000001ad033bd0] c0000000002a6f14 .sysdev_store+0x4c/0x68
> > > [c0000001ad033c50] c000000000110368 .sysfs_write_file+0x100/0x1a0
> > > [c0000001ad033cf0] c0000000000be854 .vfs_write+0x100/0x200
> > > [c0000001ad033d90] c0000000000bea64 .sys_write+0x54/0x9c
> > > [c0000001ad033e30] c000000000008600 syscall_exit+0x0/0x18
> > > --- Exception: c01 (System Call) at 000000000fe5ec10
> > > SP (ffc4c4f0) is in userspace
> > > 
> > > 0:mon> e
> > > cpu 0x0: Vector: 300 (Data Access) at [c0000001ad033520]
> > >     pc: c00000000048bd30: ._spin_lock+0x18/0x80
> > >     lr: c000000000096a7c: .kfree+0x250/0x280
> > >     sp: c0000001ad0337a0
> > >    msr: 8000000000001032
> > >    dar: 48
> > >  dsisr: 40000000
> > >   current = 0xc0000001aff12040
> > >   paca    = 0xc0000000005c1000
> > >     pid   = 17376, comm = bash
> > > 
> > > 
> > 
> > Sonny,
> > Does this patch fix the issue?   This one applies cleanly on 2.6.15-rc6
> > unlike the one that was sent to you earlier.
> 
> Hi, thanks, now I'm getting a slightly different error, 
> hitting a BUG in the slab debug code:
> 
> ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online 
> cpu 0x4: Vector: 700 (Program Check) at [c0000003a8c233f0]
>     pc: c00000000009bb2c: .check_slabp+0x130/0x188
>     lr: c00000000009bb28: .check_slabp+0x12c/0x188
>     sp: c0000003a8c23670
>    msr: 8000000000021032
>   current = 0xc0000001b95297f0
>   paca    = 0xc0000000005d7000
>     pid   = 11116, comm = bash
> kernel BUG in check_slabp at mm/slab.c:2368!
> enter ? for help
> 
> 
> 4:mon> t
> [c0000003a8c23700] c00000000009d918 .free_block+0x168/0x294
> [c0000003a8c237e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> [c0000003a8c238a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> [c0000003a8c239b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> [c0000003a8c23a40] c00000000007d608 .cpu_down+0x1fc/0x358
> [c0000003a8c23b30] c0000000002bb4ec .store_online+0x88/0xe8
> [c0000003a8c23bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> [c0000003a8c23c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> [c0000003a8c23cf0] c0000000000c6078 .vfs_write+0x100/0x200
> [c0000003a8c23d90] c0000000000c6288 .sys_write+0x54/0x9c
> [c0000003a8c23e30] c000000000008600 syscall_exit+0x0/0x18
> --- Exception: c01 (System Call) at 000000000fe5ec10
> SP (ff865560) is in userspace

More details: 

The above crash was with SMT on, and I had already off-lined the SMT
sibling thread.  

When I boot with SMT off, I get a slightly different crash:

ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online 
cpu 0x0: Vector: 700 (Program Check) at [c0000003afa13480]
    pc: c00000000009d960: .free_block+0x1b0/0x294
    lr: c00000000009d95c: .free_block+0x1ac/0x294
    sp: c0000003afa13700
   msr: 8000000000021032
  current = 0xc0000003afe04000
  paca    = 0xc0000000005d5000
    pid   = 10998, comm = bash
kernel BUG in free_block at mm/slab.c:2664!
enter ? for help

0:mon> t
[c0000003afa137e0] c00000000009d1dc .kfree+0x2b8/0x2d4
[c0000003afa138a0] c0000000000a1644 .cpuup_callback+0x144/0x618
[c0000003afa139b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
[c0000003afa13a40] c00000000007d608 .cpu_down+0x1fc/0x358
[c0000003afa13b30] c0000000002bb4ec .store_online+0x88/0xe8
[c0000003afa13bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
[c0000003afa13c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
[c0000003afa13cf0] c0000000000c6078 .vfs_write+0x100/0x200
[c0000003afa13d90] c0000000000c6288 .sys_write+0x54/0x9c
[c0000003afa13e30] c000000000008600 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000fe5ec10
SP (ff8b4560) is in userspace

This one points to a double free somewhere

Sonny

* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-22 17:53     ` Sonny Rao
@ 2005-12-22 18:37       ` Ravikiran G Thirumalai
  2005-12-22 18:39         ` Sonny Rao
  2005-12-22 19:45         ` Sonny Rao
  0 siblings, 2 replies; 16+ messages in thread
From: Ravikiran G Thirumalai @ 2005-12-22 18:37 UTC (permalink / raw)
  To: Sonny Rao; +Cc: linux-kernel, manfred, clameter, anton, shai, sonnyrao, alokk

On Thu, Dec 22, 2005 at 12:53:11PM -0500, Sonny Rao wrote:
> On Thu, Dec 22, 2005 at 11:37:00AM -0600, Sonny Rao wrote:
> > On Thu, Dec 22, 2005 at 01:27:43AM -0800, Ravikiran G Thirumalai wrote:
> > > On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> > > > (apologies if this is a dup)
> > > ...
> > > Sonny,
> > > Does this patch fix the issue?   This one applies cleanly on 2.6.15-rc6
> > > unlike the one that was sent to you earlier.
> > 
> > Hi, thanks, now I'm getting a slightly different error, 
> > hitting a BUG in the slab debug code:
> > 
> > ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online 
> > cpu 0x4: Vector: 700 (Program Check) at [c0000003a8c233f0]
> >     pc: c00000000009bb2c: .check_slabp+0x130/0x188
> >     lr: c00000000009bb28: .check_slabp+0x12c/0x188
> >     sp: c0000003a8c23670
> >    msr: 8000000000021032
> >   current = 0xc0000001b95297f0
> >   paca    = 0xc0000000005d7000
> >     pid   = 11116, comm = bash
> > kernel BUG in check_slabp at mm/slab.c:2368!
> > enter ? for help
> > 
> > 
> > 4:mon> t
> > [c0000003a8c23700] c00000000009d918 .free_block+0x168/0x294
> > [c0000003a8c237e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> > [c0000003a8c238a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> > [c0000003a8c239b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> > [c0000003a8c23a40] c00000000007d608 .cpu_down+0x1fc/0x358
> > [c0000003a8c23b30] c0000000002bb4ec .store_online+0x88/0xe8
> > [c0000003a8c23bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> > [c0000003a8c23c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> > [c0000003a8c23cf0] c0000000000c6078 .vfs_write+0x100/0x200
> > [c0000003a8c23d90] c0000000000c6288 .sys_write+0x54/0x9c
> > [c0000003a8c23e30] c000000000008600 syscall_exit+0x0/0x18
> > --- Exception: c01 (System Call) at 000000000fe5ec10
> > SP (ff865560) is in userspace
> 
> More details: 
> 
> The above crash was with SMT on, and I had already off-lined the SMT
> sibling thread.  
> 
> When I boot with SMT off, I get a slightly different crash:

I think I missed the first reply above. (I can't seem to find it on lkml
either).  So just to confirm, both these crashes are with the new patch on
top of rc6?

Thanks,
Kiran
 
> 
> ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online 
> cpu 0x0: Vector: 700 (Program Check) at [c0000003afa13480]
>     pc: c00000000009d960: .free_block+0x1b0/0x294
>     lr: c00000000009d95c: .free_block+0x1ac/0x294
>     sp: c0000003afa13700
>    msr: 8000000000021032
>   current = 0xc0000003afe04000
>   paca    = 0xc0000000005d5000
>     pid   = 10998, comm = bash
> kernel BUG in free_block at mm/slab.c:2664!
> enter ? for help
> 
> 0:mon> t
> [c0000003afa137e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> [c0000003afa138a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> [c0000003afa139b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> [c0000003afa13a40] c00000000007d608 .cpu_down+0x1fc/0x358
> [c0000003afa13b30] c0000000002bb4ec .store_online+0x88/0xe8
> [c0000003afa13bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> [c0000003afa13c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> [c0000003afa13cf0] c0000000000c6078 .vfs_write+0x100/0x200
> [c0000003afa13d90] c0000000000c6288 .sys_write+0x54/0x9c
> [c0000003afa13e30] c000000000008600 syscall_exit+0x0/0x18
> --- Exception: c01 (System Call) at 000000000fe5ec10
> SP (ff8b4560) is in userspace
> 
> This one points to a double free somewhere
> 
> 

* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-22 18:37       ` Ravikiran G Thirumalai
@ 2005-12-22 18:39         ` Sonny Rao
  2005-12-22 18:54           ` Christoph Lameter
  2005-12-22 19:45         ` Sonny Rao
  1 sibling, 1 reply; 16+ messages in thread
From: Sonny Rao @ 2005-12-22 18:39 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: linux-kernel, manfred, clameter, anton, shai, sonnyrao, alokk

On Thu, Dec 22, 2005 at 10:37:50AM -0800, Ravikiran G Thirumalai wrote:
> On Thu, Dec 22, 2005 at 12:53:11PM -0500, Sonny Rao wrote:
> > On Thu, Dec 22, 2005 at 11:37:00AM -0600, Sonny Rao wrote:
> > > On Thu, Dec 22, 2005 at 01:27:43AM -0800, Ravikiran G Thirumalai wrote:
> > > > On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> > > > > (apologies if this is a dup)
> > > > ...
> > > > Sonny,
> > > > Does this patch fix the issue?   This one applies cleanly on 2.6.15-rc6
> > > > unlike the one that was sent to you earlier.
> > > 
> > > Hi, thanks, now I'm getting a slightly different error, 
> > > hitting a BUG in the slab debug code:
> > > 
> > > ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online 
> > > cpu 0x4: Vector: 700 (Program Check) at [c0000003a8c233f0]
> > >     pc: c00000000009bb2c: .check_slabp+0x130/0x188
> > >     lr: c00000000009bb28: .check_slabp+0x12c/0x188
> > >     sp: c0000003a8c23670
> > >    msr: 8000000000021032
> > >   current = 0xc0000001b95297f0
> > >   paca    = 0xc0000000005d7000
> > >     pid   = 11116, comm = bash
> > > kernel BUG in check_slabp at mm/slab.c:2368!
> > > enter ? for help
> > > 
> > > 
> > > 4:mon> t
> > > [c0000003a8c23700] c00000000009d918 .free_block+0x168/0x294
> > > [c0000003a8c237e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> > > [c0000003a8c238a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> > > [c0000003a8c239b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> > > [c0000003a8c23a40] c00000000007d608 .cpu_down+0x1fc/0x358
> > > [c0000003a8c23b30] c0000000002bb4ec .store_online+0x88/0xe8
> > > [c0000003a8c23bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> > > [c0000003a8c23c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> > > [c0000003a8c23cf0] c0000000000c6078 .vfs_write+0x100/0x200
> > > [c0000003a8c23d90] c0000000000c6288 .sys_write+0x54/0x9c
> > > [c0000003a8c23e30] c000000000008600 syscall_exit+0x0/0x18
> > > --- Exception: c01 (System Call) at 000000000fe5ec10
> > > SP (ff865560) is in userspace
> > 
> > More details: 
> > 
> > The above crash was with SMT on, and I had already off-lined the SMT
> > sibling thread.  
> > 
> > When I boot with SMT off, I get a slightly different crash:
> 
> I think I missed the first reply above. (I can't seem to find it on lkml
> either).  So just to confirm, both these crashes are with the new patch on
> top of rc6?

Yes, rc6 + the patch you provided.

The stupid mail relay server I'm using for my ibm account seems to be very
lethargic, sorry about that. 

Sonny

* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-22 18:39         ` Sonny Rao
@ 2005-12-22 18:54           ` Christoph Lameter
  2005-12-22 19:09             ` Sonny Rao
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Lameter @ 2005-12-22 18:54 UTC (permalink / raw)
  To: Sonny Rao
  Cc: Ravikiran G Thirumalai, linux-kernel, manfred, anton, shai,
	sonnyrao, alokk

On Thu, 22 Dec 2005, Sonny Rao wrote:

> Yes, rc6 + the patch you provided.

We may be going down the wrong path here. Has anyone other than Sonny
reproduced the problem?


* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-22 18:54           ` Christoph Lameter
@ 2005-12-22 19:09             ` Sonny Rao
  0 siblings, 0 replies; 16+ messages in thread
From: Sonny Rao @ 2005-12-22 19:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ravikiran G Thirumalai, linux-kernel, manfred, anton, shai,
	sonnyrao, alokk

On Thu, Dec 22, 2005 at 10:54:08AM -0800, Christoph Lameter wrote:
> On Thu, 22 Dec 2005, Sonny Rao wrote:
> 
> > Yes, rc6 + the patch you provided.
> 
> We may be going down the wrong path here. Has someone else than Sonny 
> reproduced the problem?

Hi, I've also just reproduced the problem on another machine which does
have multiple cpus/node rather than just one cpu/node. The crash
occurs at the same place when I attempt to offline the last cpu in a
node.

But I agree that someone else should repro this.  I only have ppc64
machines available to me right now.

Sonny

* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-22 18:37       ` Ravikiran G Thirumalai
  2005-12-22 18:39         ` Sonny Rao
@ 2005-12-22 19:45         ` Sonny Rao
  2005-12-28 19:30           ` Nathan Lynch
  1 sibling, 1 reply; 16+ messages in thread
From: Sonny Rao @ 2005-12-22 19:45 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: linux-kernel, manfred, clameter, anton, shai, sonnyrao, alokk

On Thu, Dec 22, 2005 at 10:37:50AM -0800, Ravikiran G Thirumalai wrote:
> On Thu, Dec 22, 2005 at 12:53:11PM -0500, Sonny Rao wrote:
> > On Thu, Dec 22, 2005 at 11:37:00AM -0600, Sonny Rao wrote:
> > > On Thu, Dec 22, 2005 at 01:27:43AM -0800, Ravikiran G Thirumalai wrote:
> > > > On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> > > > > (apologies if this is a dup)
> > > > ...
> > > > Sonny,
> > > > Does this patch fix the issue?   This one applies cleanly on 2.6.15-rc6
> > > > unlike the one that was sent to you earlier.
> > > 
> > > Hi, thanks, now I'm getting a slightly different error, 
> > > hitting a BUG in the slab debug code:
> > > 
> > > ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online 
> > > cpu 0x4: Vector: 700 (Program Check) at [c0000003a8c233f0]
> > >     pc: c00000000009bb2c: .check_slabp+0x130/0x188
> > >     lr: c00000000009bb28: .check_slabp+0x12c/0x188
> > >     sp: c0000003a8c23670
> > >    msr: 8000000000021032
> > >   current = 0xc0000001b95297f0
> > >   paca    = 0xc0000000005d7000
> > >     pid   = 11116, comm = bash
> > > kernel BUG in check_slabp at mm/slab.c:2368!
> > > enter ? for help
> > > 
> > > 
> > > 4:mon> t
> > > [c0000003a8c23700] c00000000009d918 .free_block+0x168/0x294
> > > [c0000003a8c237e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> > > [c0000003a8c238a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> > > [c0000003a8c239b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> > > [c0000003a8c23a40] c00000000007d608 .cpu_down+0x1fc/0x358
> > > [c0000003a8c23b30] c0000000002bb4ec .store_online+0x88/0xe8
> > > [c0000003a8c23bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> > > [c0000003a8c23c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> > > [c0000003a8c23cf0] c0000000000c6078 .vfs_write+0x100/0x200
> > > [c0000003a8c23d90] c0000000000c6288 .sys_write+0x54/0x9c
> > > [c0000003a8c23e30] c000000000008600 syscall_exit+0x0/0x18
> > > --- Exception: c01 (System Call) at 000000000fe5ec10
> > > SP (ff865560) is in userspace
> > 
> > More details: 
> > 
> > The above crash was with SMT on, and I had already off-lined the SMT
> > sibling thread.  
> > 
> > When I boot with SMT off, I get a slightly different crash:
> 
> I think I missed the first reply above. (I can't seem to find it on lkml
> either).  So just to confirm, both these crashes are with the new patch on
> top of rc6?
> 
> Thanks,
> Kiran
>  
> > 
> > ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online 
> > cpu 0x0: Vector: 700 (Program Check) at [c0000003afa13480]
> >     pc: c00000000009d960: .free_block+0x1b0/0x294
> >     lr: c00000000009d95c: .free_block+0x1ac/0x294
> >     sp: c0000003afa13700
> >    msr: 8000000000021032
> >   current = 0xc0000003afe04000
> >   paca    = 0xc0000000005d5000
> >     pid   = 10998, comm = bash
> > kernel BUG in free_block at mm/slab.c:2664!
> > enter ? for help
> > 
> > 0:mon> t
> > [c0000003afa137e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> > [c0000003afa138a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> > [c0000003afa139b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> > [c0000003afa13a40] c00000000007d608 .cpu_down+0x1fc/0x358
> > [c0000003afa13b30] c0000000002bb4ec .store_online+0x88/0xe8
> > [c0000003afa13bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> > [c0000003afa13c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> > [c0000003afa13cf0] c0000000000c6078 .vfs_write+0x100/0x200
> > [c0000003afa13d90] c0000000000c6288 .sys_write+0x54/0x9c
> > [c0000003afa13e30] c000000000008600 syscall_exit+0x0/0x18
> > --- Exception: c01 (System Call) at 000000000fe5ec10
> > SP (ff8b4560) is in userspace
> > 
> > This one points to a double free somewhere

Hi, I think I've found the double free in the rc6 kernel + your patch :

starting on line 949 of the patched slab.c

                        if ((shared = l3->shared)) {
                                free_block(cachep, l3->shared->entry,
                                                l3->shared->avail, node);
                                kfree(l3->shared);
                                l3->shared = NULL;
                        }

                        alien = l3->alien;
                        l3->alien = NULL;

                        spin_unlock(&l3->list_lock);

                        kfree(nc);
                        kfree(shared);


The patch frees l3->shared inside the "if" block, right after copying the
pointer into the auto var "shared". After the unlock it then calls kfree()
on "shared" again, so the same object is freed twice == double free.
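
One way to avoid the extra free can be sketched in user space as follows. This
is a simplified model, not the kernel code and not necessarily what ends up
being merged: the struct names, the "locked" flag standing in for
l3->list_lock, and counted_free() are all hypothetical. It only models the
pointer handling (detach the pointer while the lock is held, clear the field,
free exactly once after the lock is dropped) and leaves out the free_block()
call on the cached entries.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct array_cache. */
struct array_cache_model {
	int avail;
};

/* Hypothetical stand-in for the per-node structure (l3 in the patch). */
struct node_model {
	int locked;                        /* models l3->list_lock */
	struct array_cache_model *shared;  /* models l3->shared */
};

/* Counts how many objects were actually released. */
static int free_calls;

/* Like kfree(), a NULL pointer is accepted and ignored. */
static void counted_free(void *p)
{
	if (p) {
		free_calls++;
		free(p);
	}
}

/*
 * Detach the shared cache under the lock, then free it once after the
 * lock has been dropped. No free happens while the lock is held.
 */
static void teardown_shared(struct node_model *n)
{
	struct array_cache_model *shared;

	n->locked = 1;        /* models spin_lock(&l3->list_lock) */
	shared = n->shared;   /* take ownership of the pointer */
	n->shared = NULL;     /* nobody else can reach it now */
	n->locked = 0;        /* models spin_unlock(&l3->list_lock) */

	assert(!n->locked);   /* freeing must happen outside the lock */
	counted_free(shared); /* the single free */
}
```

Calling teardown_shared() a second time on the same node releases nothing,
because the field was cleared while the lock was held.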

So I removed the extra kfree(shared).  I don't know if that is the
correct fix, but I tried it anyway.  Unfortunately it still does not
work: the system hangs for a while and then drops into the debugger
again:

0:mon> t
[c00000000f71f890] c00000000049e5ec ._spin_lock+0x10/0x24
[c00000000f71f910] c00000000009d550 .kmem_cache_free+0x270/0x2a4
[c00000000f71f9d0] c0000000003f35e8 .kfree_skbmem+0xa0/0xfc
[c00000000f71fa50] c00000000044d01c .udp_rcv+0x7ac/0x818
[c00000000f71fb60] c000000000420b14 .ip_local_deliver+0xf8/0x3f0
[c00000000f71fbf0] c000000000420328 .ip_rcv+0x3a8/0x724
[c00000000f71fc90] c0000000003fa054 .netif_receive_skb+0x378/0x3d0
[c00000000f71fd30] c0000000003fa1c4 .process_backlog+0x118/0x254
[c00000000f71fe10] c0000000003f7d3c .net_rx_action+0x188/0x2b8
[c00000000f71fed0] c000000000060f18 .__do_softirq+0xd4/0x1b8
[c00000000f71ff90] c00000000002c78c .call_do_softirq+0x14/0x24
[c0000000005ab870] c00000000000bd30 .do_softirq+0x8c/0x9c
[c0000000005ab900] c00000000006143c .irq_exit+0x6c/0x84
[c0000000005ab980] c00000000000c060 .do_IRQ+0xe8/0x194
[c0000000005aba10] c000000000004134 hardware_interrupt_entry+0x8/0x54
--- Exception: 501 (Hardware Interrupt) at c000000000040670
.pseries_dedicated_idle+0x114/0x268
[c0000000005abde0] c000000000021048 .cpu_idle+0x4c/0x60
[c0000000005abe50] c0000000000091f4 .rest_init+0x44/0x5c
[c0000000005abed0] c00000000054e7f4 .start_kernel+0x29c/0x318
[c0000000005abf90] c000000000008494 .hmt_init+0x0/0x6c
0:mon> 

0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000000f71f580]
    pc: c000000000238db4: ._raw_spin_lock+0x2c/0x1d0
    lr: c00000000049e5ec: ._spin_lock+0x10/0x24
    sp: c00000000f71f800
   msr: 8000000000001032
   dar: 4c
 dsisr: 40000000
  current = 0xc00000000061b2f0
  paca    = 0xc0000000005d5000
    pid   = 0, comm = swapper
0:mon> 



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-22 19:45         ` Sonny Rao
@ 2005-12-28 19:30           ` Nathan Lynch
  2005-12-29  0:30             ` Sonny Rao
  0 siblings, 1 reply; 16+ messages in thread
From: Nathan Lynch @ 2005-12-28 19:30 UTC (permalink / raw)
  To: Sonny Rao
  Cc: Ravikiran G Thirumalai, linux-kernel, manfred, clameter, anton,
	shai, sonnyrao, alokk

I wonder if this is related to the problem Sonny is seeing -- powerpc's
definitions of cpu_to_node et al. are not being used.  The culprit is
some too-clever preprocessor usage in asm-generic/topology.h, for
example:

#ifndef cpu_to_node
#define cpu_to_node(cpu)	(0)
#endif

But asm-powerpc/topology.h has cpu_to_node defined as a static inline
(which does not make it a preprocessor symbol), so we get the generic
- and incorrect - definition.

Does removing the #include of asm-generic/topology.h from the bottom
of asm-powerpc/topology.h have any effect?
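
The preprocessor behavior described above can be reproduced in a small
standalone sketch. The arch-style function below uses a made-up mapping
of four CPUs per node, and node_via_name() and node_via_function() are
hypothetical helpers added only for illustration; none of this is the
actual powerpc header.

```c
#include <assert.h>

/* Stand-in for an arch header: a real function, not a preprocessor symbol. */
static inline int cpu_to_node(int cpu)
{
	return cpu / 4;		/* made-up topology: 4 CPUs per node */
}

/*
 * Stand-in for the generic fallback header. #ifndef only tests for macros,
 * so the static inline above is invisible to it and the fallback is defined.
 */
#ifndef cpu_to_node
#define cpu_to_node(cpu)	(0)
#endif

/* An ordinary call now expands the macro and always yields node 0. */
int node_via_name(int cpu)
{
	(void)cpu;		/* the macro discards its argument */
	return cpu_to_node(cpu);
}

/* Parenthesizing the name suppresses macro expansion and calls the function. */
int node_via_function(int cpu)
{
	assert(cpu >= 0);
	return (cpu_to_node)(cpu);
}
```

With this in place node_via_name(5) returns 0 while node_via_function(5)
returns 1. A common way to avoid the mismatch is for the arch header to
also define a macro of the same name so that the #ifndef test sees it.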

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-28 19:30           ` Nathan Lynch
@ 2005-12-29  0:30             ` Sonny Rao
  2005-12-29  4:18               ` Nathan Lynch
  0 siblings, 1 reply; 16+ messages in thread
From: Sonny Rao @ 2005-12-29  0:30 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Ravikiran G Thirumalai, linux-kernel, manfred, clameter, anton,
	shai, sonnyrao, alokk

On Wed, Dec 28, 2005 at 01:30:12PM -0600, Nathan Lynch wrote:
> I wonder if this is related to the problem Sonny is seeing -- powerpc's
> definitions of cpu_to_node et al. are not being used.  The culprit is
> some too-clever preprocessor usage in asm-generic/topology.h, for
> example:
> 
> 
> #ifndef cpu_to_node
> #define cpu_to_node(cpu)	(0)
> #endif
> 
> But asm-powerpc/topology.h has cpu_to_node defined as a static inline
> (which does not make it a preprocessor symbol), so we get the generic
> - and incorrect - definition.
> 
> Does removing the #include of asm-generic/topology.h from the bottom
> of asm-powerpc/topology.h have any effect?

Hi, no, it doesn't make a difference.  That include is protected by
CONFIG_NUMA as well, so it never gets hit.  At Anton's suggestion I
even put an #error into asm-generic/topology.h to make sure it
wasn't an issue -- it didn't hit.

Sonny

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: cpu hotplug oops on 2.6.15-rc5
  2005-12-29  0:30             ` Sonny Rao
@ 2005-12-29  4:18               ` Nathan Lynch
  0 siblings, 0 replies; 16+ messages in thread
From: Nathan Lynch @ 2005-12-29  4:18 UTC (permalink / raw)
  To: Sonny Rao
  Cc: Ravikiran G Thirumalai, linux-kernel, manfred, clameter, anton,
	shai, sonnyrao, alokk

Sonny Rao wrote:
> On Wed, Dec 28, 2005 at 01:30:12PM -0600, Nathan Lynch wrote:
> > 
> > Does removing the #include of asm-generic/topology.h from the bottom
> > of asm-powerpc/topology.h have any effect?
> 
> Hi, no it doesn't make a difference.  That include is protected by
> CONFIG_NUMA as well, so it never gets hit.  At Anton's suggestion I
> even put in an #error into asm-generic/topology.h to make sure it
> wasn't an issue -- it didn't hit.

Gah, sorry, forgot Anton fixed this a while back.


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2005-12-29  4:18 UTC | newest]

Thread overview: 16+ messages
2005-12-19  5:16 cpu hotplug oops on 2.6.15-rc5 Sonny Rao
2005-12-19  6:41 ` Benjamin Herrenschmidt
2005-12-19  7:08   ` Sonny Rao
2005-12-19 21:17     ` Manfred Spraul
2005-12-19 23:16       ` SPAMHAUS-Re: " Sonny Rao
2005-12-19 23:40       ` Anton Blanchard
2005-12-22  9:27 ` Ravikiran G Thirumalai
     [not found]   ` <20051222173700.GA5723@localhost.localdomain>
2005-12-22 17:53     ` Sonny Rao
2005-12-22 18:37       ` Ravikiran G Thirumalai
2005-12-22 18:39         ` Sonny Rao
2005-12-22 18:54           ` Christoph Lameter
2005-12-22 19:09             ` Sonny Rao
2005-12-22 19:45         ` Sonny Rao
2005-12-28 19:30           ` Nathan Lynch
2005-12-29  0:30             ` Sonny Rao
2005-12-29  4:18               ` Nathan Lynch
