CPU hotplug broken in 2.6.8-rc2 ?

public inbox for linux-kernel@vger.kernel.org

* CPU hotplug broken in 2.6.8-rc2 ?
@ 2004-08-02  9:49 Dipankar Sarma
  2004-08-02  9:57 ` Dipankar Sarma
  2004-08-02 16:00 ` Zwane Mwaikambo
  0 siblings, 2 replies; 12+ messages in thread
From: Dipankar Sarma @ 2004-08-02  9:49 UTC (permalink / raw)
  To: V Srivatsa, nathanl; +Cc: Joel Schopp, Rusty Russell, linux-kernel, nickp

Could it be that recent sched domain stuff broke CPU hotplug ?
While testing cpu hotplug with some RCU changes, I got the following
panic (while onlining).

Thanks
Dipankar

cpu 0x2: Vector: 380 (Data SLB Access) at [c00000000152f4a0]
    pc: c00000000004b1b0: .find_busiest_group+0x274/0x464
    lr: c00000000004b0e4: .find_busiest_group+0x1a8/0x464
    sp: c00000000152f720
   msr: 8000000000001032
   dar: 10
  current = 0xc000000001520040
  paca    = 0xc000000000535200
    pid   = 0, comm = swapper
enter ? for help
2:mon>

2:mon> t
[c00000000152f720] c000000000654f30 (unreliable)
[c00000000152f830] c00000000004b4cc .rebalance_tick+0x12c/0x2d4
[c00000000152f920] c00000000005b954 .update_process_times+0xc4/0x154
[c00000000152f9c0] c0000000000385e8 .smp_local_timer_interrupt+0x3c/0x58
[c00000000152fa30] c000000000015088 .timer_interrupt+0x11c/0x3fc
[c00000000152fb10] c00000000000a2b4 Decrementer_common+0xb4/0x100
--- Exception: 901 (Decrementer) at c000000000013bc0 .default_idle+0x70/0x110
[c00000000152fe90] c0000000000139e4 .cpu_idle+0x38/0x50
[c00000000152ff00] c000000000038e18 .start_secondary+0xfc/0x150
[c00000000152ff90] c00000000000bf20 .enable_64b_mode+0x0/0x28
                                                                                
2:mon> r
R00 = 000000000000002b   R16 = 0000000000000040
R01 = c00000000152f720   R17 = 0000000000000180
R02 = c0000000006d6d40   R18 = 0000000000000040
R03 = 0000000000000020   R19 = c000000000828e08
R04 = 0000000000000020   R20 = 0000000000000002
R05 = 0000000000000002   R21 = 0000000000000000
R06 = c00000000073f9b0   R22 = 0000000000000000
R07 = 000000000000000b   R23 = c00000000152f790
R08 = c0000000006fc728   R24 = c0000000006d5008
R09 = 0000000000000015   R25 = c000000000529c38
R10 = 0000000000000000   R26 = c000000000529c38
R11 = 0000000000000080   R27 = c00000000073f9b0
R12 = 0000000028282482   R28 = 0000000000000001
R13 = c000000000535200   R29 = 0000000000000015
R14 = c00000000073f980   R30 = c0000000005bbe00
R15 = 0000000000000000   R31 = c00000000152f720
pc  = c00000000004b1b0 .find_busiest_group+0x274/0x464
lr  = c00000000004b0e4 .find_busiest_group+0x1a8/0x464
msr = 8000000000001032   cr  = 28282488
ctr = c000000000013b50   xer = 0000000000000000   trap =      380

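A note on reading the dump above: on PowerPC the DAR holds the data address that faulted. A value as small as 0x10 is what a load of a field at offset 0x10 through a NULL structure pointer would produce. The sketch below only illustrates that arithmetic. The struct layout and helper names are hypothetical, and the trace alone does not establish which pointer in find_busiest_group was NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, NOT a real scheduler structure: it exists only
 * to show a field that sits 0x10 bytes into an object. */
struct example_obj {
	uint64_t first;   /* offset 0x00 */
	uint64_t second;  /* offset 0x08 */
	uint64_t third;   /* offset 0x10 */
};

static_assert(offsetof(struct example_obj, third) == 0x10,
	      "third must sit at offset 0x10");

/* Address a load touches when reading a field at `offset` through `base`. */
static uint64_t effective_address(uint64_t base, size_t offset)
{
	return base + offset;
}

/* A fault address inside the first page is the usual sign of a NULL base. */
static int looks_like_null_deref(uint64_t dar)
{
	return dar < 4096;
}
```

With a NULL base, effective_address(0, offsetof(struct example_obj, third)) gives 0x10, the same value as the DAR reported here.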


* Re: CPU hotplug broken in 2.6.8-rc2 ?
  2004-08-02  9:49 CPU hotplug broken in 2.6.8-rc2 ? Dipankar Sarma
@ 2004-08-02  9:57 ` Dipankar Sarma
  2004-08-02 13:46   ` Anton Blanchard
  2004-08-02 19:38   ` Nathan Lynch
  2004-08-02 16:00 ` Zwane Mwaikambo
  1 sibling, 2 replies; 12+ messages in thread
From: Dipankar Sarma @ 2004-08-02  9:57 UTC (permalink / raw)
  To: V Srivatsa, Nathan Lynch
  Cc: Joel Schopp, Rusty Russell, linux-kernel, Nick Piggin

Copied to the right email ids to avoid bouncing emails on replies.

Thanks
Dipankar

On Mon, Aug 02, 2004 at 03:19:07PM +0530, Dipankar Sarma wrote:
> Could it be that recent sched domain stuff broke CPU hotplug ?
> While testing cpu hotplug with some RCU changes, I got the following
> panic (while onlining).
> 
> Thanks
> Dipankar


* Re: CPU hotplug broken in 2.6.8-rc2 ?
  2004-08-02  9:57 ` Dipankar Sarma
@ 2004-08-02 13:46   ` Anton Blanchard
  2004-08-02 19:38   ` Nathan Lynch
  1 sibling, 0 replies; 12+ messages in thread
From: Anton Blanchard @ 2004-08-02 13:46 UTC (permalink / raw)
  To: Dipankar Sarma
  Cc: V Srivatsa, Nathan Lynch, Joel Schopp, Rusty Russell,
	linux-kernel, Nick Piggin

 
> Could it be that recent sched domain stuff broke CPU hotplug ?
> While testing cpu hotplug with some RCU changes, I got the following
> panic (while onlining).

Yeah, I'm seeing the same thing.

Anton


* Re: CPU hotplug broken in 2.6.8-rc2 ?
  2004-08-02  9:57 ` Dipankar Sarma
  2004-08-02 13:46   ` Anton Blanchard
@ 2004-08-02 19:38   ` Nathan Lynch
  2004-08-02 20:26     ` Nathan Lynch
  2004-08-03  0:13     ` Rusty Russell
  1 sibling, 2 replies; 12+ messages in thread
From: Nathan Lynch @ 2004-08-02 19:38 UTC (permalink / raw)
  To: dipankar; +Cc: V Srivatsa, Joel Schopp, Rusty Russell, lkml, Nick Piggin, zwane

On Mon, 2004-08-02 at 04:57, Dipankar Sarma wrote:
> Copied to the right email ids to avoid bouncing emails on replies.
> 
> Thanks
> Dipankar
> 
> On Mon, Aug 02, 2004 at 03:19:07PM +0530, Dipankar Sarma wrote:
> > Could it be that recent sched domain stuff broke CPU hotplug ?
> > While testing cpu hotplug with some RCU changes, I got the following
> > panic (while onlining).

Could you try on 2.6.8-rc2-mm2 along with this patch?  Vatsa had a patch
go in that should prevent the crash you are seeing -- the patch below is
needed to prevent the same crash in the offline case.  This check used
to be in load_balance and some other scheduler functions, iirc; does
anyone know why they were removed?

Nathan


---


diff -puN kernel/sched.c~check-for-cpu-offline-in-load_balance kernel/sched.c
--- 2.6.8-rc2-mm2/kernel/sched.c~check-for-cpu-offline-in-load_balance	2004-08-02 13:12:04.000000000 -0500
+++ 2.6.8-rc2-mm2-nathanl/kernel/sched.c	2004-08-02 13:12:58.000000000 -0500
@@ -1405,6 +1405,9 @@ static int load_balance(int this_cpu, ru
 
 	spin_lock(&this_rq->lock);
 
+	if (unlikely(cpu_is_offline(this_cpu)))
+		goto out_balanced;
+
 	group = find_busiest_group(sd, this_cpu, &imbalance, idle);
 	if (!group)
 		goto out_balanced;

_




* Re: CPU hotplug broken in 2.6.8-rc2 ?
  2004-08-02 19:38   ` Nathan Lynch
@ 2004-08-02 20:26     ` Nathan Lynch
  2004-08-03 21:07       ` Nathan Lynch
  2004-08-03  0:13     ` Rusty Russell
  1 sibling, 1 reply; 12+ messages in thread
From: Nathan Lynch @ 2004-08-02 20:26 UTC (permalink / raw)
  To: dipankar; +Cc: V Srivatsa, Joel Schopp, Rusty Russell, lkml, Nick Piggin, zwane

On Mon, 2004-08-02 at 14:38, Nathan Lynch wrote:
> Could you try on 2.6.8-rc2-mm2 along with this patch?  Vatsa had a patch
> go in that should prevent the crash you are seeing -- the patch below is
> needed to prevent the same crash in the offline case.  This check used
> to be in load_balance and some other scheduler functions, iirc; does
> anyone know why they were removed?

Er, I meant to put the check in rebalance_tick, not load_balance.

However, after a few minutes with this, I hit the BUG_ON in the CPU_DEAD
case in migration_call; not sure whether this is a separate issue.

Nathan

---

diff -puN kernel/sched.c~check-for-cpu-offline-in-rebalance_tick kernel/sched.c
--- 2.6.8-rc2-mm2/kernel/sched.c~check-for-cpu-offline-in-rebalance_tick	2004-08-02 15:18:24.000000000 -0500
+++ 2.6.8-rc2-mm2-nathanl/kernel/sched.c	2004-08-02 15:18:47.000000000 -0500
@@ -1616,6 +1616,9 @@ static void rebalance_tick(int this_cpu,
 	unsigned long j = jiffies + CPU_OFFSET(this_cpu);
 	struct sched_domain *sd;
 
+	if (cpu_is_offline(this_cpu))
+		return;
+
 	/* Update our load */
 	old_load = this_rq->cpu_load;
 	this_load = this_rq->nr_running * SCHED_LOAD_SCALE;

_
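The effect of the guard in the patch above can be sketched in userspace. This is a toy model, not kernel code: every model_* name is hypothetical, the online map is a plain bitmask, and the balancing work is reduced to a counter.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8

/* Stand-in for the kernel's online map: bit n set means CPU n is online. */
static uint64_t model_online_map = 0x1;

static int model_cpu_is_offline(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return !((model_online_map >> cpu) & 1);
}

static void model_set_online(int cpu, int online)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (online)
		model_online_map |= (uint64_t)1 << cpu;
	else
		model_online_map &= ~((uint64_t)1 << cpu);
}

/* Counts how many ticks actually reached the balancing code. */
static unsigned long model_balance_runs;

/* Same shape as the patched rebalance_tick(): bail out before any work. */
static void model_rebalance_tick(int this_cpu)
{
	if (model_cpu_is_offline(this_cpu))
		return;
	model_balance_runs++;	/* the real function walks the sched domains here */
}
```

A tick delivered to a CPU whose bit is clear leaves model_balance_runs unchanged, which is the behaviour the patch is after for a timer interrupt that arrives on a CPU that is not (or no longer) online.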




* Re: CPU hotplug broken in 2.6.8-rc2 ?
  2004-08-02 20:26     ` Nathan Lynch
@ 2004-08-03 21:07       ` Nathan Lynch
  2004-08-04 10:06         ` Srivatsa Vaddagiri
  2004-08-04 14:50         ` Zwane Mwaikambo
  0 siblings, 2 replies; 12+ messages in thread
From: Nathan Lynch @ 2004-08-03 21:07 UTC (permalink / raw)
  To: dipankar; +Cc: V Srivatsa, Joel Schopp, Rusty Russell, lkml, Nick Piggin, zwane

On Mon, 2004-08-02 at 15:26, Nathan Lynch wrote:
> On Mon, 2004-08-02 at 14:38, Nathan Lynch wrote:
> > Could you try on 2.6.8-rc2-mm2 along with this patch?  Vatsa had a patch
> > go in that should prevent the crash you are seeing -- the patch below is
> > needed to prevent the same crash in the offline case.  This check used
> > to be in load_balance and some other scheduler functions, iirc; does
> > anyone know why they were removed?
> 
> Er, I meant to put the check in rebalance_tick, not load_balance.
> 
> However, after a few minutes with this, I hit the BUG_ON in the CPU_DEAD
> case in migration_call; not sure whether this is a separate issue.

So, with the cpu_is_offline check in rebalance_tick on top of
2.6.8-rc2-mm2, this is the BUG_ON in migration_call I tend to hit while
hotplugging cpus as quickly as possible while running make -j 40:

        case CPU_DEAD:
                migrate_all_tasks(cpu);
                rq = cpu_rq(cpu);
                kthread_stop(rq->migration_thread);
                rq->migration_thread = NULL;
                /* Idle task back to normal (off runqueue, low prio) */
                rq = task_rq_lock(rq->idle, &flags);
                deactivate_task(rq->idle, rq);
                rq->idle->static_prio = MAX_PRIO;
                __setscheduler(rq->idle, SCHED_NORMAL, 0);
                task_rq_unlock(rq, &flags);
                BUG_ON(rq->nr_running != 0);

I can reproduce this on both ppc64 and i386.  Does anyone know why this
is happening?

If I remove the BUG_ON, things seem to go ok, but I doubt that's the
right thing to do.

Nathan




* Re: CPU hotplug broken in 2.6.8-rc2 ?
  2004-08-03 21:07       ` Nathan Lynch
@ 2004-08-04 10:06         ` Srivatsa Vaddagiri
  2004-08-04 13:12           ` Nathan Lynch
  2004-08-04 14:50         ` Zwane Mwaikambo
  1 sibling, 1 reply; 12+ messages in thread
From: Srivatsa Vaddagiri @ 2004-08-04 10:06 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: dipankar, Joel Schopp, Rusty Russell, lkml, Nick Piggin, zwane

On Tue, Aug 03, 2004 at 04:07:20PM -0500, Nathan Lynch wrote:
>                 BUG_ON(rq->nr_running != 0);
> 
> I can reproduce this on both ppc64 and i386.  Does anyone know why this
> is happening?

I guess some task is still stuck on the dead CPU. Can you put a breakpoint on the BUG_ON
and check the ps output (in kdb) to see which task it is when you hit the breakpoint?

I will also try debugging the 2.6.8-rc2 CPU Hotplug woes as soon as I can.
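To make the guess above concrete, here is a toy model of the invariant that the BUG_ON checks. It is only a sketch: the model_* names are hypothetical, the runqueue is reduced to its nr_running counter, the idle task is ignored, and the "late enqueue" is an assumed failure mode rather than a confirmed cause.

```c
#include <assert.h>

/* Toy runqueue: only the counter that the BUG_ON inspects. */
struct model_rq {
	unsigned int nr_running;
};

/* Move every runnable task from the dead CPU's queue to a live one. */
static void model_migrate_all_tasks(struct model_rq *dead, struct model_rq *live)
{
	assert(dead != live);
	live->nr_running += dead->nr_running;
	dead->nr_running = 0;
}

/* Hypothetical late arrival: a task lands on the dead queue after migration. */
static void model_enqueue(struct model_rq *rq)
{
	rq->nr_running++;
}

/* Returns 1 when the CPU_DEAD invariant holds, 0 when BUG_ON would fire. */
static int model_dead_cpu_invariant_holds(const struct model_rq *dead)
{
	return dead->nr_running == 0;
}
```

If anything enqueues a task on the dead queue between the migration and the check, the invariant fails, which is why identifying the leftover task at the breakpoint would help narrow things down.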


-- 


Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017


* Re: CPU hotplug broken in 2.6.8-rc2 ?
  2004-08-04 10:06         ` Srivatsa Vaddagiri
@ 2004-08-04 13:12           ` Nathan Lynch
  0 siblings, 0 replies; 12+ messages in thread
From: Nathan Lynch @ 2004-08-04 13:12 UTC (permalink / raw)
  To: vatsa; +Cc: dipankar, Joel Schopp, Rusty Russell, lkml, Nick Piggin, zwane

On Wed, 2004-08-04 at 05:06, Srivatsa Vaddagiri wrote:
> On Tue, Aug 03, 2004 at 04:07:20PM -0500, Nathan Lynch wrote:
> >                 BUG_ON(rq->nr_running != 0);
> > 
> > I can reproduce this on both ppc64 and i386.  Does anyone know why this
> > is happening?
> 
> I guess some task is still stuck on the dead CPU. Can you put a breakpoint on the BUG_ON
> and check the ps output (in kdb) to see which task it is when you hit the breakpoint?

The task is always something like cc1 or sh from the build which is
running.

> 
> I will also try debugging the 2.6.8-rc2 CPU Hotplug woes as soon as I can.
> 

Well, I am seeing this with 2.6.8-rc2-mm2 -- with 2.6.8-rc2-bk13 (plus
the same patch) I cannot reproduce it; I have run the test for 12 hours
without problem.

Nathan



* Re: CPU hotplug broken in 2.6.8-rc2 ?
  2004-08-03 21:07       ` Nathan Lynch
  2004-08-04 10:06         ` Srivatsa Vaddagiri
@ 2004-08-04 14:50         ` Zwane Mwaikambo
  2004-08-04 21:07           ` Con Kolivas
  1 sibling, 1 reply; 12+ messages in thread
From: Zwane Mwaikambo @ 2004-08-04 14:50 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Dipankar Sarma, V Srivatsa, Joel Schopp, Rusty Russell, lkml,
	Nick Piggin, Con Kolivas

On Tue, 3 Aug 2004, Nathan Lynch wrote:

>                 __setscheduler(rq->idle, SCHED_NORMAL, 0);
>                 task_rq_unlock(rq, &flags);
>                 BUG_ON(rq->nr_running != 0);
>
> I can reproduce this on both ppc64 and i386.  Does anyone know why this
> is happening?
>
> If I remove the BUG_ON, things seem to go ok, but I doubt that's the
> right thing to do.

It could have something to do with the staircase scheduler. Con, got any
wise words?



* Re: CPU hotplug broken in 2.6.8-rc2 ?
  2004-08-04 14:50         ` Zwane Mwaikambo
@ 2004-08-04 21:07           ` Con Kolivas
  0 siblings, 0 replies; 12+ messages in thread
From: Con Kolivas @ 2004-08-04 21:07 UTC (permalink / raw)
  To: Zwane Mwaikambo
  Cc: Nathan Lynch, Dipankar Sarma, V Srivatsa, Joel Schopp,
	Rusty Russell, lkml, Nick Piggin


Zwane Mwaikambo wrote:
> On Tue, 3 Aug 2004, Nathan Lynch wrote:
> 
> 
>>                __setscheduler(rq->idle, SCHED_NORMAL, 0);
>>                task_rq_unlock(rq, &flags);
>>                BUG_ON(rq->nr_running != 0);
>>
>>I can reproduce this on both ppc64 and i386.  Does anyone know why this
>>is happening?
>>
>>If I remove the BUG_ON, things seem to go ok, but I doubt that's the
>>right thing to do.
> 
> 
> It could have something to do with the staircase scheduler. Con, got any
> wise words?

Doesn't this bug report say 2.6.8-rc2? It's mm2 that has staircase.

Con



* Re: CPU hotplug broken in 2.6.8-rc2 ?
  2004-08-02 19:38   ` Nathan Lynch
  2004-08-02 20:26     ` Nathan Lynch
@ 2004-08-03  0:13     ` Rusty Russell
  1 sibling, 0 replies; 12+ messages in thread
From: Rusty Russell @ 2004-08-03  0:13 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Dipankar Sarma, V Srivatsa, Joel Schopp,
	lkml - Kernel Mailing List, Nick Piggin, Zwane Mwaikambo

On Tue, 2004-08-03 at 05:38, Nathan Lynch wrote:
> diff -puN kernel/sched.c~check-for-cpu-offline-in-load_balance kernel/sched.c
> --- 2.6.8-rc2-mm2/kernel/sched.c~check-for-cpu-offline-in-load_balance	2004-08-02 13:12:04.000000000 -0500
> +++ 2.6.8-rc2-mm2-nathanl/kernel/sched.c	2004-08-02 13:12:58.000000000 -0500
> @@ -1405,6 +1405,9 @@ static int load_balance(int this_cpu, ru
>  
>  	spin_lock(&this_rq->lock);
>  
> +	if (unlikely(cpu_is_offline(this_cpu)))
> +		goto out_balanced;
> +

cpu_is_offline() is "unlikely" already.  Please just use "if
(cpu_is_offline(this_cpu))"
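The point about the redundant hint can be shown with a small userspace sketch. It assumes, as the comment above implies, that the branch hint is already part of the cpu_is_offline() definition. All hint_* names are hypothetical and __builtin_expect is the GCC/Clang builtin that such hints are normally built on.

```c
#include <assert.h>

/* Branch hint in the style of the kernel's unlikely(); GCC/Clang builtin. */
#define hint_unlikely(x) __builtin_expect(!!(x), 0)

static unsigned long hint_online_mask = 0x5;	/* CPUs 0 and 2 online */

static int hint_cpu_online(int cpu)
{
	assert(cpu >= 0 && cpu < (int)(8 * sizeof(hint_online_mask)));
	return (int)((hint_online_mask >> cpu) & 1UL);
}

/* Assumed shape: the hint lives inside the macro, so callers need none. */
#define hint_cpu_is_offline(cpu) hint_unlikely(!hint_cpu_online(cpu))

/* Call site as requested in the review: no extra annotation. */
static int skip_plain(int cpu)
{
	return hint_cpu_is_offline(cpu) ? 1 : 0;
}

/* Call site as in the posted patch: the hint is applied a second time. */
static int skip_wrapped(int cpu)
{
	return hint_unlikely(hint_cpu_is_offline(cpu)) ? 1 : 0;
}
```

Both call sites return the same value for every CPU. The outer wrapper changes nothing about the result and only repeats a hint the macro already carries.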

Thanks,
Rusty.
-- 
Anyone who quotes me in their signature is an idiot -- Rusty Russell



* Re: CPU hotplug broken in 2.6.8-rc2 ?
  2004-08-02  9:49 CPU hotplug broken in 2.6.8-rc2 ? Dipankar Sarma
  2004-08-02  9:57 ` Dipankar Sarma
@ 2004-08-02 16:00 ` Zwane Mwaikambo
  1 sibling, 0 replies; 12+ messages in thread
From: Zwane Mwaikambo @ 2004-08-02 16:00 UTC (permalink / raw)
  To: Dipankar Sarma
  Cc: V Srivatsa, nathanl, Joel Schopp, Rusty Russell, linux-kernel,
	nickp

On Mon, 2 Aug 2004, Dipankar Sarma wrote:

> Could it be that recent sched domain stuff broke CPU hotplug ?
> While testing cpu hotplug with some RCU changes, I got the following
> panic (while onlining).

This may be related. I bumped into a similar backtrace on i386 when a timer
interrupt snuck in whilst the cpu was offline, so I ended up enabling
timer interrupts only after the processor was on the map. This setup
managed to survive 12 hours with a kernel compile load over the weekend.

