* CPU hotplug broken in 2.6.8-rc2 ?
@ 2004-08-02 9:49 Dipankar Sarma
2004-08-02 9:57 ` Dipankar Sarma
2004-08-02 16:00 ` Zwane Mwaikambo
0 siblings, 2 replies; 12+ messages in thread
From: Dipankar Sarma @ 2004-08-02 9:49 UTC (permalink / raw)
To: V Srivatsa, nathanl; +Cc: Joel Schopp, Rusty Russell, linux-kernel, nickp
Could it be that recent sched domain stuff broke CPU hotplug ?
While testing cpu hotplug with some RCU changes, I got the following
panic (while onlining).
Thanks
Dipankar
cpu 0x2: Vector: 380 (Data SLB Access) at [c00000000152f4a0]
pc: c00000000004b1b0: .find_busiest_group+0x274/0x464
lr: c00000000004b0e4: .find_busiest_group+0x1a8/0x464
sp: c00000000152f720
msr: 8000000000001032
dar: 10
current = 0xc000000001520040
paca = 0xc000000000535200
pid = 0, comm = swapper
enter ? for help
2:mon>
2:mon> t
[c00000000152f720] c000000000654f30 (unreliable)
[c00000000152f830] c00000000004b4cc .rebalance_tick+0x12c/0x2d4
[c00000000152f920] c00000000005b954 .update_process_times+0xc4/0x154
[c00000000152f9c0] c0000000000385e8 .smp_local_timer_interrupt+0x3c/0x58
[c00000000152fa30] c000000000015088 .timer_interrupt+0x11c/0x3fc
[c00000000152fb10] c00000000000a2b4 Decrementer_common+0xb4/0x100
--- Exception: 901 (Decrementer) at c000000000013bc0 .default_idle+0x70/0x110
[c00000000152fe90] c0000000000139e4 .cpu_idle+0x38/0x50
[c00000000152ff00] c000000000038e18 .start_secondary+0xfc/0x150
[c00000000152ff90] c00000000000bf20 .enable_64b_mode+0x0/0x28
2:mon> r
R00 = 000000000000002b R16 = 0000000000000040
R01 = c00000000152f720 R17 = 0000000000000180
R02 = c0000000006d6d40 R18 = 0000000000000040
R03 = 0000000000000020 R19 = c000000000828e08
R04 = 0000000000000020 R20 = 0000000000000002
R05 = 0000000000000002 R21 = 0000000000000000
R06 = c00000000073f9b0 R22 = 0000000000000000
R07 = 000000000000000b R23 = c00000000152f790
R08 = c0000000006fc728 R24 = c0000000006d5008
R09 = 0000000000000015 R25 = c000000000529c38
R10 = 0000000000000000 R26 = c000000000529c38
R11 = 0000000000000080 R27 = c00000000073f9b0
R12 = 0000000028282482 R28 = 0000000000000001
R13 = c000000000535200 R29 = 0000000000000015
R14 = c00000000073f980 R30 = c0000000005bbe00
R15 = 0000000000000000 R31 = c00000000152f720
pc = c00000000004b1b0 .find_busiest_group+0x274/0x464
lr = c00000000004b0e4 .find_busiest_group+0x1a8/0x464
msr = 8000000000001032 cr = 28282488
ctr = c000000000013b50 xer = 0000000000000000 trap = 380
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CPU hotplug broken in 2.6.8-rc2 ?
2004-08-02 9:49 CPU hotplug broken in 2.6.8-rc2 ? Dipankar Sarma
@ 2004-08-02 9:57 ` Dipankar Sarma
2004-08-02 13:46 ` Anton Blanchard
2004-08-02 19:38 ` Nathan Lynch
2004-08-02 16:00 ` Zwane Mwaikambo
1 sibling, 2 replies; 12+ messages in thread
From: Dipankar Sarma @ 2004-08-02 9:57 UTC (permalink / raw)
To: V Srivatsa, Nathan Lynch
Cc: Joel Schopp, Rusty Russell, linux-kernel, Nick Piggin

Copied to the right email ids to avoid bouncing emails on replies.

Thanks
Dipankar

On Mon, Aug 02, 2004 at 03:19:07PM +0530, Dipankar Sarma wrote:
> Could it be that recent sched domain stuff broke CPU hotplug ?
> While testing cpu hotplug with some RCU changes, I got the following
> panic (while onlining).
>
> Thanks
> Dipankar
>
> cpu 0x2: Vector: 380 (Data SLB Access) at [c00000000152f4a0]
> pc: c00000000004b1b0: .find_busiest_group+0x274/0x464
> lr: c00000000004b0e4: .find_busiest_group+0x1a8/0x464
> sp: c00000000152f720
> msr: 8000000000001032
> dar: 10
> current = 0xc000000001520040
> paca = 0xc000000000535200
> pid = 0, comm = swapper
> enter ? for help
> 2:mon>
>
> 2:mon> t
> [c00000000152f720] c000000000654f30 (unreliable)
> [c00000000152f830] c00000000004b4cc .rebalance_tick+0x12c/0x2d4
> [c00000000152f920] c00000000005b954 .update_process_times+0xc4/0x154
> [c00000000152f9c0] c0000000000385e8 .smp_local_timer_interrupt+0x3c/0x58
> [c00000000152fa30] c000000000015088 .timer_interrupt+0x11c/0x3fc
> [c00000000152fb10] c00000000000a2b4 Decrementer_common+0xb4/0x100
> --- Exception: 901 (Decrementer) at c000000000013bc0 .default_idle+0x70/0x110
> [c00000000152fe90] c0000000000139e4 .cpu_idle+0x38/0x50
> [c00000000152ff00] c000000000038e18 .start_secondary+0xfc/0x150
> [c00000000152ff90] c00000000000bf20 .enable_64b_mode+0x0/0x28
>
> 2:mon> r
> R00 = 000000000000002b R16 = 0000000000000040
> R01 = c00000000152f720 R17 = 0000000000000180
> R02 = c0000000006d6d40 R18 = 0000000000000040
> R03 = 0000000000000020 R19 = c000000000828e08
> R04 = 0000000000000020 R20 = 0000000000000002
> R05 = 0000000000000002 R21 = 0000000000000000
> R06 = c00000000073f9b0 R22 = 0000000000000000
> R07 = 000000000000000b R23 = c00000000152f790
> R08 = c0000000006fc728 R24 = c0000000006d5008
> R09 = 0000000000000015 R25 = c000000000529c38
> R10 = 0000000000000000 R26 = c000000000529c38
> R11 = 0000000000000080 R27 = c00000000073f9b0
> R12 = 0000000028282482 R28 = 0000000000000001
> R13 = c000000000535200 R29 = 0000000000000015
> R14 = c00000000073f980 R30 = c0000000005bbe00
> R15 = 0000000000000000 R31 = c00000000152f720
> pc = c00000000004b1b0 .find_busiest_group+0x274/0x464
> lr = c00000000004b0e4 .find_busiest_group+0x1a8/0x464
> msr = 8000000000001032 cr = 28282488
> ctr = c000000000013b50 xer = 0000000000000000 trap = 380

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CPU hotplug broken in 2.6.8-rc2 ?
2004-08-02 9:57 ` Dipankar Sarma
@ 2004-08-02 13:46 ` Anton Blanchard
2004-08-02 19:38 ` Nathan Lynch
1 sibling, 0 replies; 12+ messages in thread
From: Anton Blanchard @ 2004-08-02 13:46 UTC (permalink / raw)
To: Dipankar Sarma
Cc: V Srivatsa, Nathan Lynch, Joel Schopp, Rusty Russell, linux-kernel, Nick Piggin

> Could it be that recent sched domain stuff broke CPU hotplug ?
> While testing cpu hotplug with some RCU changes, I got the following
> panic (while onlining).

Yeah, I'm seeing the same thing.

Anton

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CPU hotplug broken in 2.6.8-rc2 ?
2004-08-02 9:57 ` Dipankar Sarma
2004-08-02 13:46 ` Anton Blanchard
@ 2004-08-02 19:38 ` Nathan Lynch
2004-08-02 20:26 ` Nathan Lynch
2004-08-03 0:13 ` Rusty Russell
1 sibling, 2 replies; 12+ messages in thread
From: Nathan Lynch @ 2004-08-02 19:38 UTC (permalink / raw)
To: dipankar; +Cc: V Srivatsa, Joel Schopp, Rusty Russell, lkml, Nick Piggin, zwane

On Mon, 2004-08-02 at 04:57, Dipankar Sarma wrote:
> Copied to the right email ids to avoid bouncing emails on replies.
>
> Thanks
> Dipankar
>
> On Mon, Aug 02, 2004 at 03:19:07PM +0530, Dipankar Sarma wrote:
> > Could it be that recent sched domain stuff broke CPU hotplug ?
> > While testing cpu hotplug with some RCU changes, I got the following
> > panic (while onlining).

Could you try on 2.6.8-rc2-mm2 along with this patch? Vatsa had a patch
go in that should prevent the crash you are seeing -- the patch below is
needed to prevent the same crash in the offline case. This check used
to be in load_balance and some other scheduler functions, iirc; does
anyone know why they were removed?

Nathan

---

diff -puN kernel/sched.c~check-for-cpu-offline-in-load_balance kernel/sched.c
--- 2.6.8-rc2-mm2/kernel/sched.c~check-for-cpu-offline-in-load_balance	2004-08-02 13:12:04.000000000 -0500
+++ 2.6.8-rc2-mm2-nathanl/kernel/sched.c	2004-08-02 13:12:58.000000000 -0500
@@ -1405,6 +1405,9 @@ static int load_balance(int this_cpu, ru
 
 	spin_lock(&this_rq->lock);
 
+	if (unlikely(cpu_is_offline(this_cpu)))
+		goto out_balanced;
+
 	group = find_busiest_group(sd, this_cpu, &imbalance, idle);
 	if (!group)
 		goto out_balanced;

_

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CPU hotplug broken in 2.6.8-rc2 ?
2004-08-02 19:38 ` Nathan Lynch
@ 2004-08-02 20:26 ` Nathan Lynch
2004-08-03 21:07 ` Nathan Lynch
2004-08-03 0:13 ` Rusty Russell
1 sibling, 1 reply; 12+ messages in thread
From: Nathan Lynch @ 2004-08-02 20:26 UTC (permalink / raw)
To: dipankar; +Cc: V Srivatsa, Joel Schopp, Rusty Russell, lkml, Nick Piggin, zwane

On Mon, 2004-08-02 at 14:38, Nathan Lynch wrote:
> Could you try on 2.6.8-rc2-mm2 along with this patch? Vatsa had a patch
> go in that should prevent the crash you are seeing -- the patch below is
> needed to prevent the same crash in the offline case. This check used
> to be in load_balance and some other scheduler functions, iirc; does
> anyone know why they were removed?

Er, I meant to put the check in rebalance_tick, not load_balance.

However, after a few minutes with this, I hit the BUG_ON in the CPU_DEAD
case in migration_call; not sure whether this is a separate issue.

Nathan

---

diff -puN kernel/sched.c~check-for-cpu-offline-in-rebalance_tick kernel/sched.c
--- 2.6.8-rc2-mm2/kernel/sched.c~check-for-cpu-offline-in-rebalance_tick	2004-08-02 15:18:24.000000000 -0500
+++ 2.6.8-rc2-mm2-nathanl/kernel/sched.c	2004-08-02 15:18:47.000000000 -0500
@@ -1616,6 +1616,9 @@ static void rebalance_tick(int this_cpu,
 	unsigned long j = jiffies + CPU_OFFSET(this_cpu);
 	struct sched_domain *sd;
 
+	if (cpu_is_offline(this_cpu))
+		return;
+
 	/* Update our load */
 	old_load = this_rq->cpu_load;
 	this_load = this_rq->nr_running * SCHED_LOAD_SCALE;

_

^ permalink raw reply [flat|nested] 12+ messages in thread
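The guard in the rebalance_tick patch above can be sketched in userspace C. This is only a model under stated assumptions: the thread does not say what find_busiest_group dereferenced, so the sketch assumes, hypothetically, that an offline CPU has no scheduler domain attached and that a timer tick arriving on it would walk a NULL pointer. Every name here (sched_domain_model, cpu_online_map, cpu_domain, the model_* functions) is invented for illustration and is not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4 /* hypothetical small machine */

/* Hypothetical stand-in for a per-CPU scheduler domain. */
struct sched_domain_model {
	unsigned long balance_interval;
};

static bool cpu_online_map[NR_CPUS];                   /* models the online mask */
static struct sched_domain_model *cpu_domain[NR_CPUS]; /* NULL until attached */

static bool model_cpu_is_offline(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return !cpu_online_map[cpu];
}

/* Attach a domain first, then mark the CPU online. */
static void model_cpu_up(int cpu, struct sched_domain_model *sd)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpu_domain[cpu] = sd;
	cpu_online_map[cpu] = true;
}

/*
 * Models the guarded tick: returns 0 if the tick bailed out because the
 * CPU is offline, 1 if balancing was attempted.
 */
static int model_rebalance_tick(int this_cpu)
{
	if (model_cpu_is_offline(this_cpu))
		return 0; /* never touches cpu_domain[] for an offline CPU */

	/* Reached only for an online CPU, whose domain is attached. */
	return cpu_domain[this_cpu]->balance_interval > 0;
}
```

Calling model_rebalance_tick(2) before model_cpu_up(2, &sd) returns early without dereferencing anything; without the guard, the same call would read through a NULL cpu_domain[2] in this model.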
* Re: CPU hotplug broken in 2.6.8-rc2 ?
2004-08-02 20:26 ` Nathan Lynch
@ 2004-08-03 21:07 ` Nathan Lynch
2004-08-04 10:06 ` Srivatsa Vaddagiri
2004-08-04 14:50 ` Zwane Mwaikambo
0 siblings, 2 replies; 12+ messages in thread
From: Nathan Lynch @ 2004-08-03 21:07 UTC (permalink / raw)
To: dipankar; +Cc: V Srivatsa, Joel Schopp, Rusty Russell, lkml, Nick Piggin, zwane

On Mon, 2004-08-02 at 15:26, Nathan Lynch wrote:
> On Mon, 2004-08-02 at 14:38, Nathan Lynch wrote:
> > Could you try on 2.6.8-rc2-mm2 along with this patch? Vatsa had a patch
> > go in that should prevent the crash you are seeing -- the patch below is
> > needed to prevent the same crash in the offline case. This check used
> > to be in load_balance and some other scheduler functions, iirc; does
> > anyone know why they were removed?
>
> Er, I meant to put the check in rebalance_tick, not load_balance.
>
> However, after a few minutes with this, I hit the BUG_ON in the CPU_DEAD
> case in migration_call; not sure whether this is a separate issue.

So, with the cpu_is_offline check in rebalance_tick on top of
2.6.8-rc2-mm2, this is the BUG_ON in migration_call I tend to hit while
hotplugging cpus as quickly as possible while running make -j 40:

	case CPU_DEAD:
		migrate_all_tasks(cpu);
		rq = cpu_rq(cpu);
		kthread_stop(rq->migration_thread);
		rq->migration_thread = NULL;
		/* Idle task back to normal (off runqueue, low prio) */
		rq = task_rq_lock(rq->idle, &flags);
		deactivate_task(rq->idle, rq);
		rq->idle->static_prio = MAX_PRIO;
		__setscheduler(rq->idle, SCHED_NORMAL, 0);
		task_rq_unlock(rq, &flags);
		BUG_ON(rq->nr_running != 0);

I can reproduce this on both ppc64 and i386. Does anyone know why this
is happening?

If I remove the BUG_ON, things seem to go ok, but I doubt that's the
right thing to do.

Nathan

^ permalink raw reply [flat|nested] 12+ messages in thread
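The invariant behind the BUG_ON quoted above can be modelled in a few lines of userspace C. This is a sketch, not the kernel's logic: it assumes only that the runqueue of a dead CPU should hold no runnable tasks once everything has been migrated away, and it shows how a task enqueued on the dead CPU afterwards would break that. The thread does not establish the real cause. The types and functions (task_model, rq_model, model_enqueue, model_migrate_all_tasks, model_dead_rq_is_empty) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 8
#define MODEL_CPUS 2

struct task_model {
	int cpu;       /* runqueue the task currently sits on */
	bool runnable;
};

struct rq_model {
	int nr_running; /* count of runnable tasks on this runqueue */
};

static struct rq_model rqs[MODEL_CPUS];
static struct task_model tasks[MAX_TASKS];
static int nr_tasks;

/* Enqueue a new runnable task on `cpu`; returns the task's index. */
static int model_enqueue(int cpu)
{
	assert(nr_tasks < MAX_TASKS && cpu >= 0 && cpu < MODEL_CPUS);
	tasks[nr_tasks].cpu = cpu;
	tasks[nr_tasks].runnable = true;
	rqs[cpu].nr_running++;
	return nr_tasks++;
}

/* Move every runnable task from dead_cpu to dest_cpu; returns how many moved. */
static int model_migrate_all_tasks(int dead_cpu, int dest_cpu)
{
	int moved = 0;

	for (int i = 0; i < nr_tasks; i++) {
		if (tasks[i].runnable && tasks[i].cpu == dead_cpu) {
			tasks[i].cpu = dest_cpu;
			rqs[dead_cpu].nr_running--;
			rqs[dest_cpu].nr_running++;
			moved++;
		}
	}
	return moved;
}

/* The condition the quoted BUG_ON insists on, written as a predicate. */
static bool model_dead_rq_is_empty(int dead_cpu)
{
	return rqs[dead_cpu].nr_running == 0;
}
```

In this model the predicate holds right after model_migrate_all_tasks(1, 0) and fails as soon as another model_enqueue(1) lands on the dead CPU, which is the kind of leftover task the BUG_ON would catch.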
* Re: CPU hotplug broken in 2.6.8-rc2 ?
2004-08-03 21:07 ` Nathan Lynch
@ 2004-08-04 10:06 ` Srivatsa Vaddagiri
2004-08-04 13:12 ` Nathan Lynch
2004-08-04 14:50 ` Zwane Mwaikambo
1 sibling, 1 reply; 12+ messages in thread
From: Srivatsa Vaddagiri @ 2004-08-04 10:06 UTC (permalink / raw)
To: Nathan Lynch
Cc: dipankar, Joel Schopp, Rusty Russell, lkml, Nick Piggin, zwane

On Tue, Aug 03, 2004 at 04:07:20PM -0500, Nathan Lynch wrote:
> BUG_ON(rq->nr_running != 0);
>
> I can reproduce this on both ppc64 and i386. Does anyone know why this
> is happening?

I guess some task is still stuck with the dead CPU. Can you put a breakpoint on the BUG_ON
and see the ps output (in kdb) to see which task is that when you hit the breakpoint?

I will also try debugging the 2.6.8-rc2 CPU Hotplug woes as soon as I can.

--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CPU hotplug broken in 2.6.8-rc2 ?
2004-08-04 10:06 ` Srivatsa Vaddagiri
@ 2004-08-04 13:12 ` Nathan Lynch
0 siblings, 0 replies; 12+ messages in thread
From: Nathan Lynch @ 2004-08-04 13:12 UTC (permalink / raw)
To: vatsa; +Cc: dipankar, Joel Schopp, Rusty Russell, lkml, Nick Piggin, zwane

On Wed, 2004-08-04 at 05:06, Srivatsa Vaddagiri wrote:
> On Tue, Aug 03, 2004 at 04:07:20PM -0500, Nathan Lynch wrote:
> > BUG_ON(rq->nr_running != 0);
> >
> > I can reproduce this on both ppc64 and i386. Does anyone know why this
> > is happening?
>
> I guess some task is still stuck with the dead CPU. Can you put a breakpoint on the BUG_ON
> and see the ps output (in kdb) to see which task is that when you hit the breakpoint?

The task is always something like cc1 or sh from the build which is
running.

>
> I will also try debugging the 2.6.8-rc2 CPU Hotplug woes as soon as I can.
>

Well, I am seeing this with 2.6.8-rc2-mm2 -- with 2.6.8-rc2-bk13 (plus
the same patch) I cannot reproduce it; I have run the test for 12 hours
without problem.

Nathan

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CPU hotplug broken in 2.6.8-rc2 ?
2004-08-03 21:07 ` Nathan Lynch
2004-08-04 10:06 ` Srivatsa Vaddagiri
@ 2004-08-04 14:50 ` Zwane Mwaikambo
2004-08-04 21:07 ` Con Kolivas
1 sibling, 1 reply; 12+ messages in thread
From: Zwane Mwaikambo @ 2004-08-04 14:50 UTC (permalink / raw)
To: Nathan Lynch
Cc: Dipankar Sarma, V Srivatsa, Joel Schopp, Rusty Russell, lkml, Nick Piggin, Con Kolivas

On Tue, 3 Aug 2004, Nathan Lynch wrote:
> 		__setscheduler(rq->idle, SCHED_NORMAL, 0);
> 		task_rq_unlock(rq, &flags);
> 		BUG_ON(rq->nr_running != 0);
>
> I can reproduce this on both ppc64 and i386. Does anyone know why this
> is happening?
>
> If I remove the BUG_ON, things seem to go ok, but I doubt that's the
> right thing to do.

It could have something to do with the staircase scheduler, Con, got any
wise words?

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CPU hotplug broken in 2.6.8-rc2 ?
2004-08-04 14:50 ` Zwane Mwaikambo
@ 2004-08-04 21:07 ` Con Kolivas
0 siblings, 0 replies; 12+ messages in thread
From: Con Kolivas @ 2004-08-04 21:07 UTC (permalink / raw)
To: Zwane Mwaikambo
Cc: Nathan Lynch, Dipankar Sarma, V Srivatsa, Joel Schopp, Rusty Russell, lkml, Nick Piggin

Zwane Mwaikambo wrote:
> On Tue, 3 Aug 2004, Nathan Lynch wrote:
>
> > 		__setscheduler(rq->idle, SCHED_NORMAL, 0);
> > 		task_rq_unlock(rq, &flags);
> > 		BUG_ON(rq->nr_running != 0);
> >
> > I can reproduce this on both ppc64 and i386. Does anyone know why this
> > is happening?
> >
> > If I remove the BUG_ON, things seem to go ok, but I doubt that's the
> > right thing to do.
>
> It could have something to do with the staircase scheduler, Con, got any
> wise words?

Doesn't this bug report say 2.6.8-rc2? It's mm2 that has staircase.

Con

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CPU hotplug broken in 2.6.8-rc2 ?
2004-08-02 19:38 ` Nathan Lynch
2004-08-02 20:26 ` Nathan Lynch
@ 2004-08-03 0:13 ` Rusty Russell
1 sibling, 0 replies; 12+ messages in thread
From: Rusty Russell @ 2004-08-03 0:13 UTC (permalink / raw)
To: Nathan Lynch
Cc: Dipankar Sarma, V Srivatsa, Joel Schopp, lkml - Kernel Mailing List, Nick Piggin, Zwane Mwaikambo

On Tue, 2004-08-03 at 05:38, Nathan Lynch wrote:
> diff -puN kernel/sched.c~check-for-cpu-offline-in-load_balance kernel/sched.c
> --- 2.6.8-rc2-mm2/kernel/sched.c~check-for-cpu-offline-in-load_balance	2004-08-02 13:12:04.000000000 -0500
> +++ 2.6.8-rc2-mm2-nathanl/kernel/sched.c	2004-08-02 13:12:58.000000000 -0500
> @@ -1405,6 +1405,9 @@ static int load_balance(int this_cpu, ru
>
> 	spin_lock(&this_rq->lock);
>
> +	if (unlikely(cpu_is_offline(this_cpu)))
> +		goto out_balanced;
> +

cpu_is_offline() is "unlikely" already. Please just use
"if (cpu_is_offline(this_cpu))"

Thanks,
Rusty.
--
Anyone who quotes me in their signature is an idiot -- Rusty Russell

^ permalink raw reply [flat|nested] 12+ messages in thread
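Rusty's point, that the branch hint already lives inside cpu_is_offline() so callers should not wrap it again, can be illustrated with a small userspace sketch. It assumes a GCC or Clang compiler for __builtin_expect. The model_* names and the cpu_online_map array are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Branch hint in the style of the kernel's unlikely(); GCC/Clang builtin. */
#define model_unlikely(x) __builtin_expect(!!(x), 0)

#define NR_CPUS 4 /* hypothetical small machine */

static bool cpu_online_map[NR_CPUS]; /* models the online mask */

static bool model_cpu_online(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_online_map[cpu];
}

/* The hint is applied once here, so call sites stay plain. */
#define model_cpu_is_offline(cpu) model_unlikely(!model_cpu_online(cpu))

/* Returns 0 if balancing was skipped for an offline CPU, 1 otherwise. */
static int model_load_balance(int this_cpu)
{
	if (model_cpu_is_offline(this_cpu)) /* no extra unlikely() needed */
		return 0;
	return 1;
}
```

Wrapping the call site in a second model_unlikely() would not change the result; it would only repeat a hint the macro already carries.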
* Re: CPU hotplug broken in 2.6.8-rc2 ?
2004-08-02 9:49 CPU hotplug broken in 2.6.8-rc2 ? Dipankar Sarma
2004-08-02 9:57 ` Dipankar Sarma
@ 2004-08-02 16:00 ` Zwane Mwaikambo
1 sibling, 0 replies; 12+ messages in thread
From: Zwane Mwaikambo @ 2004-08-02 16:00 UTC (permalink / raw)
To: Dipankar Sarma
Cc: V Srivatsa, nathanl, Joel Schopp, Rusty Russell, linux-kernel, nickp

On Mon, 2 Aug 2004, Dipankar Sarma wrote:
> Could it be that recent sched domain stuff broke CPU hotplug ?
> While testing cpu hotplug with some RCU changes, I got the following
> panic (while onlining).

This may be related: I bumped into a similar backtrace on i386 when a timer
interrupt snuck in whilst the cpu was offline, so I ended up enabling
timer interrupts only after the processor was on the map. This setup
managed to survive 12 hours with a kernel compile load over the weekend.

^ permalink raw reply [flat|nested] 12+ messages in thread
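The ordering Zwane describes can be sketched as a tiny userspace model. It reads "on the map" as "marked in the online CPU map", which is an assumption, and it compresses bring-up into two hypothetical steps with a timer tick attempted after each one. None of the names below come from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

struct cpu_model {
	bool online;        /* CPU is marked in the (modelled) online map */
	bool timer_enabled; /* timer interrupts may be delivered */
	int stray_ticks;    /* ticks handled while the CPU was not online */
	int handled_ticks;  /* ticks handled while the CPU was online */
};

/* Deliver one timer tick; it is dropped if the timer is not enabled. */
static void model_deliver_tick(struct cpu_model *c)
{
	assert(c);
	if (!c->timer_enabled)
		return;
	if (!c->online)
		c->stray_ticks++; /* the problematic case from the message above */
	else
		c->handled_ticks++;
}

typedef void (*step_fn)(struct cpu_model *);

static void step_enable_timer(struct cpu_model *c) { c->timer_enabled = true; }
static void step_set_online(struct cpu_model *c) { c->online = true; }

/*
 * Run two bring-up steps in the given order, attempting a tick after each,
 * and return how many ticks ran while the CPU was still offline.
 */
static int model_bringup(step_fn first, step_fn second)
{
	struct cpu_model c = { false, false, 0, 0 };

	first(&c);
	model_deliver_tick(&c);
	second(&c);
	model_deliver_tick(&c);
	return c.stray_ticks;
}
```

Enabling the timer first lets one tick run against an offline CPU in this model; setting the CPU online first leaves no such window.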
end of thread, other threads:[~2004-08-04 21:09 UTC]

Thread overview: 12+ messages
2004-08-02  9:49 CPU hotplug broken in 2.6.8-rc2 ? Dipankar Sarma
2004-08-02  9:57 ` Dipankar Sarma
2004-08-02 13:46   ` Anton Blanchard
2004-08-02 19:38   ` Nathan Lynch
2004-08-02 20:26     ` Nathan Lynch
2004-08-03 21:07       ` Nathan Lynch
2004-08-04 10:06         ` Srivatsa Vaddagiri
2004-08-04 13:12           ` Nathan Lynch
2004-08-04 14:50         ` Zwane Mwaikambo
2004-08-04 21:07           ` Con Kolivas
2004-08-03  0:13     ` Rusty Russell
2004-08-02 16:00 ` Zwane Mwaikambo