linuxppc-dev.lists.ozlabs.org archive mirror
* [Next] CPU Hotplug test failures on powerpc
@ 2009-12-11 10:53 Sachin Sant
  2009-12-14  2:48 ` Benjamin Herrenschmidt
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Sachin Sant @ 2009-12-11 10:53 UTC (permalink / raw)
  To: Linux/PPC Development, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, linux-next

While executing the cpu_hotplug tests (from autotest) against the latest
linux-next on a POWER6 box, the machine locks up. A soft reset shows
the following trace:

cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
    pc: c0000000003433d8: .find_next_bit+0x54/0xc4
    lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
    sp: c00000000c933650
   msr: 8000000000089032
  current = 0xc00000000c173840
  paca    = 0xc000000000bc2600
    pid   = 2602, comm = hotplug06.top.s
enter ? for help
[link register   ] c000000000342f10 .cpumask_next_and+0x4c/0x94
[c00000000c933650] c0000000000e9f34 .cpuset_cpus_allowed_locked+0x38/0x74 (unreliable)
[c00000000c9336e0] c000000000090074 .move_task_off_dead_cpu+0xc4/0x1ac
[c00000000c9337a0] c0000000005e4e5c .migration_call+0x304/0x830
[c00000000c933880] c0000000005e0880 .notifier_call_chain+0x68/0xe0
[c00000000c933920] c00000000012a92c ._cpu_down+0x210/0x34c
[c00000000c933a90] c00000000012aad8 .cpu_down+0x70/0xa8
[c00000000c933b20] c000000000525940 .store_online+0x54/0x894
[c00000000c933bb0] c000000000463430 .sysdev_store+0x3c/0x50
[c00000000c933c20] c0000000001f8320 .sysfs_write_file+0x124/0x18c
[c00000000c933ce0] c00000000017edac .vfs_write+0xd4/0x1fc
[c00000000c933d80] c00000000017efdc .SyS_write+0x58/0xa0
[c00000000c933e30] c0000000000085b4 syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 00000fff9fa8a8f8
SP (fffe7aef200) is in userspace
0:mon> e
cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
    pc: c0000000003433d8: .find_next_bit+0x54/0xc4
    lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
    sp: c00000000c933650
   msr: 8000000000089032
  current = 0xc00000000c173840
  paca    = 0xc000000000bc2600
    pid   = 2602, comm = hotplug06.top.s

The last few messages from the dmesg log show:

0:mon> 
<4>IRQ 17 affinity broken off cpu 0
<4>IRQ 18 affinity broken off cpu 0
<4>IRQ 19 affinity broken off cpu 0
<4>IRQ 264 affinity broken off cpu 0
<4>cpu 0 (hwid 0) Ready to die...
<7>clockevent: decrementer mult[83126e97] shift[32] cpu[0]
<4>Processor 0 found.
<4>IRQ 17 affinity broken off cpu 1
<4>IRQ 18 affinity broken off cpu 1
<4>IRQ 19 affinity broken off cpu 1
<4>IRQ 264 affinity broken off cpu 1
<4>cpu 1 (hwid 1) Ready to die...
<7>clockevent: decrementer mult[83126e97] shift[32] cpu[1]
<4>Processor 1 found.
<4>cpu 1 (hwid 1) Ready to die...
<7>clockevent: decrementer mult[83126e97] shift[32] cpu[1]
<4>Processor 1 found.
<4>cpu 1 (hwid 1) Ready to die...
<6>process 2423 (bash) no longer affine to cpu1
<7>clockevent: decrementer mult[83126e97] shift[32] cpu[1]
<4>Processor 1 found.
<4>cpu 1 (hwid 1) Ready to die...
<7>clockevent: decrementer mult[83126e97] shift[32] cpu[1]
<4>Processor 1 found.
<4>cpu 1 (hwid 1) Ready to die...
<7>clockevent: decrementer mult[83126e97] shift[32] cpu[1]
<4>Processor 1 found.
<4>cpu 1 (hwid 1) Ready to die...
<3>INFO: RCU detected CPU 0 stall (t=1000 jiffies)
<3>INFO: RCU detected CPU 0 stall (t=4000 jiffies)
0:mon>

After some debugging, a possible suspect seems to be commit
6ad4c18.. : sched: Fix balance vs hotplug race

If I revert this patch, I am able to run the tests on this
POWER6 box without any issues.

At the same time, the above patch is required to fix the
CPU hotplug race on x86_64 that I reported here (as a side note,
the same x86_64 issue can be recreated against the latest Linus
git as well):

http://marc.info/?l=linux-kernel&m=125802682922299&w=2

I will try a few more iterations with and without the above
patch to make sure my results are correct.

If someone has a suggestion, let me know.

Thanks
-Sachin


-- 

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------


* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-11 10:53 [Next] CPU Hotplug test failures on powerpc Sachin Sant
@ 2009-12-14  2:48 ` Benjamin Herrenschmidt
  2009-12-14  4:37   ` Sachin Sant
  2009-12-14 10:22 ` Peter Zijlstra
  2009-12-16  6:25 ` Xiaotian Feng
  2 siblings, 1 reply; 22+ messages in thread
From: Benjamin Herrenschmidt @ 2009-12-14  2:48 UTC (permalink / raw)
  To: Sachin Sant
  Cc: Peter Zijlstra, Linux/PPC Development, Ingo Molnar, linux-next,
	linux-kernel

On Fri, 2009-12-11 at 16:23 +0530, Sachin Sant wrote:
> While executing cpu_hotplug(from autotest) tests against latest
> next on a power6 box, the machine locks up. A soft reset shows
> the following trace

Have you heard anything about that one yet, or is it still to be
debugged? It has probably hit upstream by now.

Cheers,
Ben.



* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-14  2:48 ` Benjamin Herrenschmidt
@ 2009-12-14  4:37   ` Sachin Sant
  0 siblings, 0 replies; 22+ messages in thread
From: Sachin Sant @ 2009-12-14  4:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Peter Zijlstra, Linux/PPC Development, Ingo Molnar, linux-next,
	linux-kernel

Benjamin Herrenschmidt wrote:
> On Fri, 2009-12-11 at 16:23 +0530, Sachin Sant wrote:
>   
>> While executing cpu_hotplug(from autotest) tests against latest
>> next on a power6 box, the machine locks up. A soft reset shows
>> the following trace
>>     
>
> Have you heard anything about that one yet or it's still to be
> debugged ? It probably hit upstream by now.
>   
Haven't received any response yet.

As you mentioned, that patch went upstream, and so did the problem.

thanks
-Sachin



-- 

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------


* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-11 10:53 [Next] CPU Hotplug test failures on powerpc Sachin Sant
  2009-12-14  2:48 ` Benjamin Herrenschmidt
@ 2009-12-14 10:22 ` Peter Zijlstra
  2009-12-14 11:11   ` Sachin Sant
  2009-12-16  6:25 ` Xiaotian Feng
  2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-12-14 10:22 UTC (permalink / raw)
  To: Sachin Sant; +Cc: Linux/PPC Development, Ingo Molnar, linux-next, linux-kernel

On Fri, 2009-12-11 at 16:23 +0530, Sachin Sant wrote:
> While executing cpu_hotplug(from autotest) tests against latest
> next on a power6 box, the machine locks up. A soft reset shows
> the following trace
>
> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
>     sp: c00000000c933650
>    msr: 8000000000089032
>   current = 0xc00000000c173840
>   paca    = 0xc000000000bc2600
>     pid   = 2602, comm = hotplug06.top.s
> enter ? for help
> [link register   ] c000000000342f10 .cpumask_next_and+0x4c/0x94
> [c00000000c933650] c0000000000e9f34 .cpuset_cpus_allowed_locked+0x38/0x74 (unreliable)
> [c00000000c9336e0] c000000000090074 .move_task_off_dead_cpu+0xc4/0x1ac
> [c00000000c9337a0] c0000000005e4e5c .migration_call+0x304/0x830
> [c00000000c933880] c0000000005e0880 .notifier_call_chain+0x68/0xe0
> [c00000000c933920] c00000000012a92c ._cpu_down+0x210/0x34c
> [c00000000c933a90] c00000000012aad8 .cpu_down+0x70/0xa8
> [c00000000c933b20] c000000000525940 .store_online+0x54/0x894
> [c00000000c933bb0] c000000000463430 .sysdev_store+0x3c/0x50
> [c00000000c933c20] c0000000001f8320 .sysfs_write_file+0x124/0x18c
> [c00000000c933ce0] c00000000017edac .vfs_write+0xd4/0x1fc
> [c00000000c933d80] c00000000017efdc .SyS_write+0x58/0xa0
> [c00000000c933e30] c0000000000085b4 syscall_exit+0x0/0x40
> --- Exception: c01 (System Call) at 00000fff9fa8a8f8
> SP (fffe7aef200) is in userspace
> 0:mon> e
> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
>     sp: c00000000c933650
>    msr: 8000000000089032
>   current = 0xc00000000c173840
>   paca    = 0xc000000000bc2600
>     pid   = 2602, comm = hotplug06.top.s
>
> Last few messages from the dmesg log shows


> After some debugging a possible suspect seems to be commit
> 6ad4c18.. : sched: Fix balance vs hotplug race


Oh, wonderful :-/

So what is that thing whining about? Not being able to read a cpumask or
something?

Does your .config have cpusets enabled (there's a different
cpuset_cpus_allowed_locked implementation depending on that)?

I know of at least one remaining race and am working on closing that,
but I'm not sure I can explain this crash with that.


* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-14 10:22 ` Peter Zijlstra
@ 2009-12-14 11:11   ` Sachin Sant
  2009-12-14 12:19     ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Sachin Sant @ 2009-12-14 11:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux/PPC Development, Ingo Molnar, linux-next, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2723 bytes --]

Peter Zijlstra wrote:
> On Fri, 2009-12-11 at 16:23 +0530, Sachin Sant wrote:
>   
>> While executing cpu_hotplug(from autotest) tests against latest
>> next on a power6 box, the machine locks up. A soft reset shows
>> the following trace
>>
>> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
>>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
>>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
>>     sp: c00000000c933650
>>    msr: 8000000000089032
>>   current = 0xc00000000c173840
>>   paca    = 0xc000000000bc2600
>>     pid   = 2602, comm = hotplug06.top.s
>> enter ? for help
>> [link register   ] c000000000342f10 .cpumask_next_and+0x4c/0x94
>> [c00000000c933650] c0000000000e9f34 .cpuset_cpus_allowed_locked+0x38/0x74 (unreliable)
>> [c00000000c9336e0] c000000000090074 .move_task_off_dead_cpu+0xc4/0x1ac
>> [c00000000c9337a0] c0000000005e4e5c .migration_call+0x304/0x830
>> [c00000000c933880] c0000000005e0880 .notifier_call_chain+0x68/0xe0
>> [c00000000c933920] c00000000012a92c ._cpu_down+0x210/0x34c
>> [c00000000c933a90] c00000000012aad8 .cpu_down+0x70/0xa8
>> [c00000000c933b20] c000000000525940 .store_online+0x54/0x894
>> [c00000000c933bb0] c000000000463430 .sysdev_store+0x3c/0x50
>> [c00000000c933c20] c0000000001f8320 .sysfs_write_file+0x124/0x18c
>> [c00000000c933ce0] c00000000017edac .vfs_write+0xd4/0x1fc
>> [c00000000c933d80] c00000000017efdc .SyS_write+0x58/0xa0
>> [c00000000c933e30] c0000000000085b4 syscall_exit+0x0/0x40
>> --- Exception: c01 (System Call) at 00000fff9fa8a8f8
>> SP (fffe7aef200) is in userspace
>> 0:mon> e
>> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
>>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
>>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
>>     sp: c00000000c933650
>>    msr: 8000000000089032
>>   current = 0xc00000000c173840
>>   paca    = 0xc000000000bc2600
>>     pid   = 2602, comm = hotplug06.top.s
>>
>> Last few messages from the dmesg log shows
>>     
>
>
>   
>> After some debugging a possible suspect seems to be commit
>> 6ad4c18.. : sched: Fix balance vs hotplug race
>>     
>
>
> Oh, wonderful :-/
>
> So what is that thing whining about? Not being able to read a cpumask or
> something?
>
> Does your .config have cpusets enabled (there's a different
> cpuset_cpus_allowed_locked implementation depending on that)?
>   
Yes, CONFIG_CPUSETS is enabled. I have attached the config.

Thanks
-Sachin

> I know of at least one remaining race and am working on closing that,
> but I'm not sure I can explain this crash with that.
>
>   


-- 

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------


[-- Attachment #2: config_cpu_hotplug.gz --]
[-- Type: application/x-gzip, Size: 19832 bytes --]


* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-14 11:11   ` Sachin Sant
@ 2009-12-14 12:19     ` Peter Zijlstra
  2009-12-14 21:17       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-12-14 12:19 UTC (permalink / raw)
  To: Sachin Sant; +Cc: Linux/PPC Development, Ingo Molnar, linux-next, linux-kernel

On Mon, 2009-12-14 at 16:41 +0530, Sachin Sant wrote:
> Peter Zijlstra wrote:
> > On Fri, 2009-12-11 at 16:23 +0530, Sachin Sant wrote:
> >
> >> While executing cpu_hotplug(from autotest) tests against latest
> >> next on a power6 box, the machine locks up. A soft reset shows
> >> the following trace
> >>
> >> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
> >>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
> >>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
> >>     sp: c00000000c933650
> >>    msr: 8000000000089032
> >>   current = 0xc00000000c173840
> >>   paca    = 0xc000000000bc2600
> >>     pid   = 2602, comm = hotplug06.top.s
> >> enter ? for help
> >> [link register   ] c000000000342f10 .cpumask_next_and+0x4c/0x94
> >> [c00000000c933650] c0000000000e9f34 .cpuset_cpus_allowed_locked+0x38/0x74 (unreliable)
> >> [c00000000c9336e0] c000000000090074 .move_task_off_dead_cpu+0xc4/0x1ac
> >> [c00000000c9337a0] c0000000005e4e5c .migration_call+0x304/0x830
> >> [c00000000c933880] c0000000005e0880 .notifier_call_chain+0x68/0xe0
> >> [c00000000c933920] c00000000012a92c ._cpu_down+0x210/0x34c
> >> [c00000000c933a90] c00000000012aad8 .cpu_down+0x70/0xa8
> >> [c00000000c933b20] c000000000525940 .store_online+0x54/0x894
> >> [c00000000c933bb0] c000000000463430 .sysdev_store+0x3c/0x50
> >> [c00000000c933c20] c0000000001f8320 .sysfs_write_file+0x124/0x18c
> >> [c00000000c933ce0] c00000000017edac .vfs_write+0xd4/0x1fc
> >> [c00000000c933d80] c00000000017efdc .SyS_write+0x58/0xa0
> >> [c00000000c933e30] c0000000000085b4 syscall_exit+0x0/0x40
> >> --- Exception: c01 (System Call) at 00000fff9fa8a8f8
> >> SP (fffe7aef200) is in userspace
> >> 0:mon> e
> >> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
> >>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
> >>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
> >>     sp: c00000000c933650
> >>    msr: 8000000000089032
> >>   current = 0xc00000000c173840
> >>   paca    = 0xc000000000bc2600
> >>     pid   = 2602, comm = hotplug06.top.s
> >>

OK so how do I read that above thing? What's a System Reset? Is that
like the x86 triple fault thing?

From what I can make of it, it's in move_task_off_dead_cpu(), right after
having called cpuset_cpus_allowed_locked(), doing that cpumask_any_and()
call.

static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
{
        int dest_cpu;
        const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(dead_cpu));

again:
        /* Look for allowed, online CPU in same node. */
        for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
                if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
                        goto move;

        /* Any allowed, online CPU? */
        dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
        if (dest_cpu < nr_cpu_ids)
                goto move;

        /* No more Mr. Nice Guy. */
        if (dest_cpu >= nr_cpu_ids) {
                cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
====>           dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);

                /*
                 * Don't tell them about moving exiting tasks or
                 * kernel threads (both mm NULL), since they never
                 * leave kernel.
                 */
                if (p->mm && printk_ratelimit()) {
                        pr_info("process %d (%s) no longer affine to cpu%d\n",
                                task_pid_nr(p), p->comm, dead_cpu);
                }
        }

move:
        /* It can have affinity changed while we were choosing. */
        if (unlikely(!__migrate_task_irq(p, dead_cpu, dest_cpu)))
                goto again;
}

Both masks, p->cpus_allowed and cpu_active_mask are stable in that p
won't go away since we hold the tasklist_lock (in migrate_list_tasks),
and cpu_active_mask is static storage, so WTH is it going funny on?
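
The return-value convention of cpumask_any_and() matters for the marked
line, so here is a minimal userspace sketch of it. This is a simplified
model, not the kernel implementation: it assumes each mask fits in one
unsigned long, and the name model_cpumask_any_and and its nr_ids parameter
(standing in for nr_cpu_ids) are made up for illustration. It shows that an
empty intersection produces a value that is not a usable CPU number.

```c
#include <assert.h>
#include <limits.h>

/*
 * Simplified userspace model of cpumask_any_and(); not the kernel code.
 * Each mask is assumed to fit in one unsigned long, and nr_ids stands in
 * for nr_cpu_ids.
 *
 * Returns the lowest CPU number set in both masks, or nr_ids when the
 * intersection is empty (the kernel only promises a value >= nr_cpu_ids).
 */
static unsigned int model_cpumask_any_and(unsigned long a, unsigned long b,
					  unsigned int nr_ids)
{
	unsigned long both = a & b;
	unsigned int cpu;

	/* The single-word model cannot represent more CPUs than bits. */
	assert(nr_ids <= sizeof(unsigned long) * CHAR_BIT);

	for (cpu = 0; cpu < nr_ids; cpu++)
		if (both & (1UL << cpu))
			return cpu;

	return nr_ids;	/* no CPU is present in both masks */
}
```

With allowed = 0x2 (CPU 1 only) and active = 0x1 (CPU 0 only), the model
returns a value >= nr_ids, which is the case the "No more Mr. Nice Guy"
branch is meant to handle.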


* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-14 12:19     ` Peter Zijlstra
@ 2009-12-14 21:17       ` Benjamin Herrenschmidt
  2009-12-15  9:44         ` Sachin Sant
  0 siblings, 1 reply; 22+ messages in thread
From: Benjamin Herrenschmidt @ 2009-12-14 21:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux/PPC Development, Ingo Molnar, linux-next, linux-kernel

On Mon, 2009-12-14 at 13:19 +0100, Peter Zijlstra wrote:

> > >> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
> > >>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
> > >>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
> > >>     sp: c00000000c933650
> > >>    msr: 8000000000089032
> > >>   current = 0xc00000000c173840
> > >>   paca    = 0xc000000000bc2600
> > >>     pid   = 2602, comm = hotplug06.top.s
> > >> enter ? for help
> > >> [link register   ] c000000000342f10 .cpumask_next_and+0x4c/0x94
> > >> [c00000000c933650] c0000000000e9f34 .cpuset_cpus_allowed_locked+0x38/0x74 (unreliable)
> > >> [c00000000c9336e0] c000000000090074 .move_task_off_dead_cpu+0xc4/0x1ac
> > >> [c00000000c9337a0] c0000000005e4e5c .migration_call+0x304/0x830
> > >> [c00000000c933880] c0000000005e0880 .notifier_call_chain+0x68/0xe0
> > >> [c00000000c933920] c00000000012a92c ._cpu_down+0x210/0x34c
> > >> [c00000000c933a90] c00000000012aad8 .cpu_down+0x70/0xa8
> > >> [c00000000c933b20] c000000000525940 .store_online+0x54/0x894
> > >> [c00000000c933bb0] c000000000463430 .sysdev_store+0x3c/0x50
> > >> [c00000000c933c20] c0000000001f8320 .sysfs_write_file+0x124/0x18c
> > >> [c00000000c933ce0] c00000000017edac .vfs_write+0xd4/0x1fc
> > >> [c00000000c933d80] c00000000017efdc .SyS_write+0x58/0xa0
> > >> [c00000000c933e30] c0000000000085b4 syscall_exit+0x0/0x40
> > >> --- Exception: c01 (System Call) at 00000fff9fa8a8f8
> > >> SP (fffe7aef200) is in userspace
> > >> 0:mon> e
> > >> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
> > >>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
> > >>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
> > >>     sp: c00000000c933650
> > >>    msr: 8000000000089032
> > >>   current = 0xc00000000c173840
> > >>   paca    = 0xc000000000bc2600
> > >>     pid   = 2602, comm = hotplug06.top.s
> > >>
> 
> OK so how do I read that above thing? What's a System Reset? Is that
> like the x86 triple fault thing?

Nah, it's an NMI that throws you into xmon. Basically, the machine was
hung and Sachin interrupted it with an NMI to see what was going on. The
above is the backtrace. It was at the moment of the NMI inside
find_next_bit() called from cpumask_next_and() etc... 

> From what I can make of it, its in move_task_off_dead_cpu(), right after
> having called cpuset_cpus_allowed_locked(), doing that cpumask_any_and()
> call.

Yes, it looks like it.

> static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
> {
>         int dest_cpu;
>         const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(dead_cpu));
> 
> again:
>         /* Look for allowed, online CPU in same node. */
>         for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
>                 if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
>                         goto move;
> 
>         /* Any allowed, online CPU? */
>         dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
>         if (dest_cpu < nr_cpu_ids)
>                 goto move;
> 
>         /* No more Mr. Nice Guy. */
>         if (dest_cpu >= nr_cpu_ids) {
>                 cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> ====>           dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
> 
>                 /*
>                  * Don't tell them about moving exiting tasks or
>                  * kernel threads (both mm NULL), since they never
>                  * leave kernel.
>                  */
>                 if (p->mm && printk_ratelimit()) {
>                         pr_info("process %d (%s) no longer affine to cpu%d\n",
>                                 task_pid_nr(p), p->comm, dead_cpu);
>                 }
>         }
> 
> move:
>         /* It can have affinity changed while we were choosing. */
>         if (unlikely(!__migrate_task_irq(p, dead_cpu, dest_cpu)))
>                 goto again;
> }
> 
> Both masks, p->cpus_allowed and cpu_active_mask are stable in that p
> won't go away since we hold the tasklist_lock (in migrate_list_tasks),
> and cpu_active_mask is static storage, so WTH is it going funny on?

Sachin, this is 100% reproducible, right? You should be able to
sprinkle it with some xmon_printf() calls rather than printk. Just add a
prototype, extern void xmon_printf(const char *fmt,...);, somewhere. This
has the advantage of being fully synchronous, and it will print even if
the printk sem is held.
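
A rough sketch of what such instrumentation could look like follows. It is
a userspace stand-in so that it compiles outside the kernel: the body of
xmon_printf() here is only an assumption that mimics synchronous output,
and stub_calls and trace_dest_cpu() are hypothetical names. In a powerpc
kernel tree only the extern prototype would be added, and xmon would
supply the function.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Counts calls so the stand-in can be checked; hypothetical, not in xmon. */
static unsigned int stub_calls;

/*
 * Userspace stand-in for xmon_printf(), for illustration only. In the
 * kernel you would instead add just the prototype
 *     extern void xmon_printf(const char *fmt, ...);
 * and let xmon provide the implementation.
 */
void xmon_printf(const char *fmt, ...)
{
	va_list ap;

	assert(fmt != NULL);

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fflush(stderr);		/* mimic fully synchronous output */
	stub_calls++;
}

/* Hypothetical helper: dump the values of interest at one trace point. */
static void trace_dest_cpu(unsigned int dest_cpu, int dead_cpu,
			   unsigned int nr_ids)
{
	xmon_printf("dest_cpu = %u . dead_cpu = %d . nr_cpu_ids = %u\n",
		    dest_cpu, dead_cpu, nr_ids);
}
```

One such call after each assignment to dest_cpu in
move_task_off_dead_cpu() would be enough to see which branch is taken.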

Cheers,
Ben.



* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-14 21:17       ` Benjamin Herrenschmidt
@ 2009-12-15  9:44         ` Sachin Sant
  2009-12-15 10:43           ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Sachin Sant @ 2009-12-15  9:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Peter Zijlstra
  Cc: Linux/PPC Development, Ingo Molnar, linux-next, linux-kernel

Benjamin Herrenschmidt wrote:
>> static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
>> {
>>         int dest_cpu;
>>         const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(dead_cpu));
>>
>> again:
>>         /* Look for allowed, online CPU in same node. */
>>         for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
>>                 if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
>>                         goto move;
>>
>>         /* Any allowed, online CPU? */
>>         dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
>>         if (dest_cpu < nr_cpu_ids)
>>                 goto move;
>>
>>         /* No more Mr. Nice Guy. */
>>         if (dest_cpu >= nr_cpu_ids) {
>>                 cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
>> ====>           dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
>>
>>                 /*
>>                  * Don't tell them about moving exiting tasks or
>>                  * kernel threads (both mm NULL), since they never
>>                  * leave kernel.
>>                  */
>>                 if (p->mm && printk_ratelimit()) {
>>                         pr_info("process %d (%s) no longer affine to cpu%d\n",
>>                                 task_pid_nr(p), p->comm, dead_cpu);
>>                 }
>>         }
>>
>> move:
>>         /* It can have affinity changed while we were choosing. */
>>         if (unlikely(!__migrate_task_irq(p, dead_cpu, dest_cpu)))
>>                 goto again;
>> }
>>
>> Both masks, p->cpus_allowed and cpu_active_mask are stable in that p
>> won't go away since we hold the tasklist_lock (in migrate_list_tasks),
>> and cpu_active_mask is static storage, so WTH is it going funny on?
>>     
I added some debug statements within the above code. 
This is a 2 cpu machine.

XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
XMON dest_cpu = 1024 
XMON dest_cpu = 1024 . dead_cpu = 1
XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
XMON dest_cpu = 1024 
XMON dest_cpu = 1024 . dead_cpu = 1
XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
XMON dest_cpu = 1024 
XMON dest_cpu = 1024 . dead_cpu = 1

It seems to me that control is stuck in an infinite loop, and hence the
machine appears to be hung. The dest_cpu value is always 1024
and never changes, which results in an infinite loop.

In the working scenario the output is along the following lines:

XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
XMON dest_cpu = 0 
XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
XMON dest_cpu = 0 
XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
XMON dest_cpu = 0 

Let me know if I should try to record any specific value.

Thanks
-Sachin

-- 

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-15  9:44         ` Sachin Sant
@ 2009-12-15 10:43           ` Peter Zijlstra
  2009-12-15 13:47             ` Sachin Sant
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-12-15 10:43 UTC (permalink / raw)
  To: Sachin Sant; +Cc: Ingo Molnar, linux-next, linux-kernel, Linux/PPC Development

On Tue, 2009-12-15 at 15:14 +0530, Sachin Sant wrote:
> Benjamin Herrenschmidt wrote:
> >> static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
> >> {
> >>         int dest_cpu;
> >>         const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(dead_cpu));
> >>
> >> again:
> >>         /* Look for allowed, online CPU in same node. */
> >>         for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
> >>                 if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
> >>                         goto move;
> >>
> >>         /* Any allowed, online CPU? */
> >>         dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
> >>         if (dest_cpu < nr_cpu_ids)
> >>                 goto move;
> >>
> >>         /* No more Mr. Nice Guy. */
> >>         if (dest_cpu >= nr_cpu_ids) {
> >>                 cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> >> ====>           dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
> >>
> >>                 /*
> >>                  * Don't tell them about moving exiting tasks or
> >>                  * kernel threads (both mm NULL), since they never
> >>                  * leave kernel.
> >>                  */
> >>                 if (p->mm && printk_ratelimit()) {
> >>                         pr_info("process %d (%s) no longer affine to cpu%d\n",
> >>                                 task_pid_nr(p), p->comm, dead_cpu);
> >>                 }
> >>         }
> >>
> >> move:
> >>         /* It can have affinity changed while we were choosing. */
> >>         if (unlikely(!__migrate_task_irq(p, dead_cpu, dest_cpu)))
> >>                 goto again;
> >> }
> >>
> >> Both masks, p->cpus_allowed and cpu_active_mask are stable in that p
> >> won't go away since we hold the tasklist_lock (in migrate_list_tasks),
> >> and cpu_active_mask is static storage, so WTH is it going funny on?
> >>
> I added some debug statements within the above code.
> This is a 2 cpu machine.
>
> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
> XMON dest_cpu = 1024
> XMON dest_cpu = 1024 . dead_cpu = 1
> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
> XMON dest_cpu = 1024
> XMON dest_cpu = 1024 . dead_cpu = 1
> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
> XMON dest_cpu = 1024
> XMON dest_cpu = 1024 . dead_cpu = 1
>
> Seems to me that the control is stuck in an infinite loop and hence the
> machine appears to be in hung state. The dest_cpu value is always 1024
> and never changes, which result in an infinite loop.
>
> In working scenario the o/p is something on the following lines
>
> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
> XMON dest_cpu = 0
> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
> XMON dest_cpu = 0
> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
> XMON dest_cpu = 0
>
> Let me know if i should try to record any specific value ?

Could you possibly print the two masks themselves? cpumask_scnprintf()
and friends come in handy for this.

The dest_cpu=1024 thing seems to suggest the intersection between
p->cpus_allowed and cpu_active_mask is empty for some reason, even
though we forcefully reset p->cpus_allowed to the full set using
cpuset_cpus_allowed_locked().

/me goes re-read the cpu_active_map code, this really shouldn't happen.
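
For readers following along, the empty-intersection case can be reproduced in user space. The sketch below is not the kernel implementation: struct sim_cpumask, sim_set_cpu() and sim_any_and() are hypothetical stand-ins, and NR_CPUS = 1024 is an assumption taken from the dest_cpu value in the debug output. It only illustrates why a search over two masks with no common bit yields a value >= nr_cpu_ids.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumption: 1024 matches the NR_CPUS implied by the debug output. */
#define NR_CPUS 1024
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS ((NR_CPUS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical stand-in for struct cpumask; word 0 holds the lowest CPUs. */
struct sim_cpumask {
	unsigned long bits[MASK_WORDS];
};

/* Mark one CPU as set in the mask. */
static void sim_set_cpu(struct sim_cpumask *m, unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	m->bits[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
}

/* Return the first CPU set in both masks, or NR_CPUS if there is none. */
static unsigned int sim_any_and(const struct sim_cpumask *a,
				const struct sim_cpumask *b)
{
	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
		size_t w = cpu / BITS_PER_WORD;
		unsigned long bit = 1UL << (cpu % BITS_PER_WORD);

		if (a->bits[w] & b->bits[w] & bit)
			return cpu;
	}
	return NR_CPUS;
}
```

With an empty active mask and an allowed mask containing only CPU 0, sim_any_and() returns 1024, which is consistent with the repeated "dest_cpu = 1024" lines in the debug output.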

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-15 10:43           ` Peter Zijlstra
@ 2009-12-15 13:47             ` Sachin Sant
  2009-12-15 15:03               ` Peter Zijlstra
  2009-12-16  6:56               ` Xiaotian Feng
  0 siblings, 2 replies; 22+ messages in thread
From: Sachin Sant @ 2009-12-15 13:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-next, linux-kernel, Linux/PPC Development

Peter Zijlstra wrote:
>> I added some debug statements within the above code. 
>> This is a 2 cpu machine.
>>
>> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
>> XMON dest_cpu = 1024 
>> XMON dest_cpu = 1024 . dead_cpu = 1
>> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
>> XMON dest_cpu = 1024 
>> XMON dest_cpu = 1024 . dead_cpu = 1
>> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
>> XMON dest_cpu = 1024 
>> XMON dest_cpu = 1024 . dead_cpu = 1
>>
>> Seems to me that the control is stuck in an infinite loop and hence the
>> machine appears to be in hung state. The dest_cpu value is always 1024
>> and never changes, which result in an infinite loop.
>>
>> In working scenario the o/p is something on the following lines
>>
>> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
>> XMON dest_cpu = 0 
>> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
>> XMON dest_cpu = 0 
>> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
>> XMON dest_cpu = 0 
>>
>> Let me know if i should try to record any specific value ?
>>     
>
> Could you possibly print the two masks themselves? cpumask_scnprintf()
> and friend come in handy for this.
>
> The dest_cpu=1024 thing seem to suggest the intersection between
> p->cpus_allowed and cpu_active_mask is empty for some reason, even
> though we forcefully reset p->cpus_allowed to the full set using
> cpuset_cpus_allowed_locked().
>   
So here is the data related to the two masks.

cpu_active_mask = 00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000
XMON dest_cpu = 1024

while p->cpus_allowed =  00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000001
XMON dest_cpu = 1024

In working scenario the above data looks like

cpu_active_mask = 00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000002
XMON dest_cpu = 1

while p->cpus_allowed =  00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000002
XMON dest_cpu = 1


hope i got the data correct.
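
The mask dumps above are comma-separated 32-bit hex words with the most significant word first, so CPU 0 is the lowest bit of the last word. As a reading aid, here is a small user-space sketch that decodes such a string; lowest_cpu_in_mask() is a hypothetical helper written for this note, not a kernel function.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Return the lowest CPU number set in a mask printed as comma-separated
 * hex digits, most significant digit first. Returns -1 if no bit is set.
 */
static int lowest_cpu_in_mask(const char *s)
{
	size_t ndigits = 0;
	long last_nonzero = -1;	/* index, from the left, of the last nonzero digit */
	int last_val = 0;

	for (const char *p = s; *p; p++) {
		unsigned char c = (unsigned char)*p;
		int v;

		if (!isxdigit(c))
			continue;	/* skip commas, spaces and newlines */
		v = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
		if (v) {
			last_nonzero = (long)ndigits;
			last_val = v;
		}
		ndigits++;
	}
	if (last_nonzero < 0)
		return -1;

	/* Each hex digit covers 4 CPUs; the rightmost digit covers CPUs 0..3. */
	int cpu = (int)(ndigits - 1 - (size_t)last_nonzero) * 4;

	while (!(last_val & 1)) {
		last_val >>= 1;
		cpu++;
	}
	return cpu;
}
```

Read this way, the failing dump shows cpu_active_mask with no bits set and p->cpus_allowed containing only CPU 0, while the working dump shows both masks containing only CPU 1.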

Thanks
-Sachin


-- 

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-15 13:47             ` Sachin Sant
@ 2009-12-15 15:03               ` Peter Zijlstra
  2009-12-16  5:38                 ` Sachin Sant
  2009-12-16  6:56               ` Xiaotian Feng
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-12-15 15:03 UTC (permalink / raw)
  To: Sachin Sant; +Cc: Ingo Molnar, linux-next, linux-kernel, Linux/PPC Development


Could you try the below?

---
 init/main.c |    7 +------
 1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/init/main.c b/init/main.c
index 4051d75..4be7de2 100644
--- a/init/main.c
+++ b/init/main.c
@@ -369,12 +369,6 @@ static void __init smp_init(void)
 {
 	unsigned int cpu;
 
-	/*
-	 * Set up the current CPU as possible to migrate to.
-	 * The other ones will be done by cpu_up/cpu_down()
-	 */
-	set_cpu_active(smp_processor_id(), true);
-
 	/* FIXME: This should be done in userspace --RR */
 	for_each_present_cpu(cpu) {
 		if (num_online_cpus() >= setup_max_cpus)
@@ -486,6 +480,7 @@ static void __init boot_cpu_init(void)
 	int cpu = smp_processor_id();
 	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
 	set_cpu_online(cpu, true);
+	set_cpu_active(cpu, true);
 	set_cpu_present(cpu, true);
 	set_cpu_possible(cpu, true);
 }

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-15 15:03               ` Peter Zijlstra
@ 2009-12-16  5:38                 ` Sachin Sant
  2009-12-16  7:14                   ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Sachin Sant @ 2009-12-16  5:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-next, linux-kernel, Linux/PPC Development

Peter Zijlstra wrote:
> Could you try the below?
>   
No luck. Still the same issue. The mask values don't change.

Thanks
-Sachin

> ---
>  init/main.c |    7 +------
>  1 files changed, 1 insertions(+), 6 deletions(-)
>
> diff --git a/init/main.c b/init/main.c
> index 4051d75..4be7de2 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -369,12 +369,6 @@ static void __init smp_init(void)
>  {
>  	unsigned int cpu;
>  
> -	/*
> -	 * Set up the current CPU as possible to migrate to.
> -	 * The other ones will be done by cpu_up/cpu_down()
> -	 */
> -	set_cpu_active(smp_processor_id(), true);
> -
>  	/* FIXME: This should be done in userspace --RR */
>  	for_each_present_cpu(cpu) {
>  		if (num_online_cpus() >= setup_max_cpus)
> @@ -486,6 +480,7 @@ static void __init boot_cpu_init(void)
>  	int cpu = smp_processor_id();
>  	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
>  	set_cpu_online(cpu, true);
> +	set_cpu_active(cpu, true);
>  	set_cpu_present(cpu, true);
>  	set_cpu_possible(cpu, true);
>  }
>
>
>   


-- 

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-11 10:53 [Next] CPU Hotplug test failures on powerpc Sachin Sant
  2009-12-14  2:48 ` Benjamin Herrenschmidt
  2009-12-14 10:22 ` Peter Zijlstra
@ 2009-12-16  6:25 ` Xiaotian Feng
  2009-12-16  6:41   ` Sachin Sant
  2 siblings, 1 reply; 22+ messages in thread
From: Xiaotian Feng @ 2009-12-16  6:25 UTC (permalink / raw)
  To: Sachin Sant
  Cc: Peter Zijlstra, linux-kernel, Linux/PPC Development, linux-next,
	Ingo Molnar

On Fri, Dec 11, 2009 at 6:53 PM, Sachin Sant <sachinp@in.ibm.com> wrote:
> While executing cpu_hotplug(from autotest) tests against latest
> next on a power6 box, the machine locks up. A soft reset shows
> the following trace
>
> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
>     sp: c00000000c933650
>    msr: 8000000000089032
>   current = 0xc00000000c173840
>   paca    = 0xc000000000bc2600
>     pid   = 2602, comm = hotplug06.top.s
> enter ? for help
> [link register   ] c000000000342f10 .cpumask_next_and+0x4c/0x94
> [c00000000c933650] c0000000000e9f34 .cpuset_cpus_allowed_locked+0x38/0x74 (unreliable)
> [c00000000c9336e0] c000000000090074 .move_task_off_dead_cpu+0xc4/0x1ac
> [c00000000c9337a0] c0000000005e4e5c .migration_call+0x304/0x830
> [c00000000c933880] c0000000005e0880 .notifier_call_chain+0x68/0xe0
> [c00000000c933920] c00000000012a92c ._cpu_down+0x210/0x34c
> [c00000000c933a90] c00000000012aad8 .cpu_down+0x70/0xa8
> [c00000000c933b20] c000000000525940 .store_online+0x54/0x894
> [c00000000c933bb0] c000000000463430 .sysdev_store+0x3c/0x50
> [c00000000c933c20] c0000000001f8320 .sysfs_write_file+0x124/0x18c
> [c00000000c933ce0] c00000000017edac .vfs_write+0xd4/0x1fc
> [c00000000c933d80] c00000000017efdc .SyS_write+0x58/0xa0
> [c00000000c933e30] c0000000000085b4 syscall_exit+0x0/0x40
> --- Exception: c01 (System Call) at 00000fff9fa8a8f8
> SP (fffe7aef200) is in userspace
> 0:mon> e
> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
>     sp: c00000000c933650
>    msr: 8000000000089032
>   current = 0xc00000000c173840
>   paca    = 0xc000000000bc2600
>     pid   = 2602, comm = hotplug06.top.s
>

Does this testcase hotplug cpu 0 off?

> Last few messages from the dmesg log shows
>
> 0:mon> <4>IRQ 17 affinity broken off cpu 0
> <4>IRQ 18 affinity broken off cpu 0
> <4>IRQ 19 affinity broken off cpu 0
> <4>IRQ 264 affinity broken off cpu 0
> <4>cpu 0 (hwid 0) Ready to die...
> <7>clockevent: decrementer mult[83126e97] shift[32] cpu[0]
> <4>Processor 0 found.
> <4>IRQ 17 affinity broken off cpu 1
> <4>IRQ 18 affinity broken off cpu 1
> <4>IRQ 19 affinity broken off cpu 1
> <4>IRQ 264 affinity broken off cpu 1
> <4>cpu 1 (hwid 1) Ready to die...
> <7>clockevent: decrementer mult[83126e97] shift[32] cpu[1]
> <4>Processor 1 found.
> <4>cpu 1 (hwid 1) Ready to die...
> <7>clockevent: decrementer mult[83126e97] shift[32] cpu[1]
> <4>Processor 1 found.
> <4>cpu 1 (hwid 1) Ready to die...
> <6>process 2423 (bash) no longer affine to cpu1
> <7>clockevent: decrementer mult[83126e97] shift[32] cpu[1]
> <4>Processor 1 found.
> <4>cpu 1 (hwid 1) Ready to die...
> <7>clockevent: decrementer mult[83126e97] shift[32] cpu[1]
> <4>Processor 1 found.
> <4>cpu 1 (hwid 1) Ready to die...
> <7>clockevent: decrementer mult[83126e97] shift[32] cpu[1]
> <4>Processor 1 found.
> <4>cpu 1 (hwid 1) Ready to die...
> <3>INFO: RCU detected CPU 0 stall (t=1000 jiffies)
> <3>INFO: RCU detected CPU 0 stall (t=4000 jiffies)
> 0:mon>
>
> After some debugging a possible suspect seems to be commit
> 6ad4c18.. : sched: Fix balance vs hotplug race
>
> If i revert this patch i am able to execute the tests on this
> power6 without any issues.
> But at the same time the above patch is required to solve the
> cpu hotplug related race on x86_64(as a side note this same
> x86_64 issue can be recreated against latest Linus git as well)
> that i reported here :
>
> http://marc.info/?l=linux-kernel&m=125802682922299&w=2
>
> I will try few more iterations with and without the above
> patch just to make sure i have the correct results.
>
> If someone has a suggestion let me know.
>
> Thanks
> -Sachin
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-16  6:25 ` Xiaotian Feng
@ 2009-12-16  6:41   ` Sachin Sant
  2009-12-16  6:45     ` Xiaotian Feng
  0 siblings, 1 reply; 22+ messages in thread
From: Sachin Sant @ 2009-12-16  6:41 UTC (permalink / raw)
  To: Xiaotian Feng
  Cc: Peter Zijlstra, linux-kernel, Linux/PPC Development, linux-next,
	Ingo Molnar

Xiaotian Feng wrote:
> Does this testcase hotplug cpu 0 off?
>   
No, i don't think so. It skips cpu0 during online/offline
process.

thanks
-Sachin

-- 

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-16  6:41   ` Sachin Sant
@ 2009-12-16  6:45     ` Xiaotian Feng
  2009-12-16  6:54       ` Sachin Sant
  0 siblings, 1 reply; 22+ messages in thread
From: Xiaotian Feng @ 2009-12-16  6:45 UTC (permalink / raw)
  To: Sachin Sant
  Cc: Peter Zijlstra, linux-kernel, Linux/PPC Development, linux-next,
	Ingo Molnar

On Wed, Dec 16, 2009 at 2:41 PM, Sachin Sant <sachinp@in.ibm.com> wrote:
> Xiaotian Feng wrote:
>>
>> Does this testcase hotplug cpu 0 off?
>>
>
> No, i don't think so. It skips cpu0 during online/offline
> process.

Then how could this happen ? Looks like cpu 0 is offline ....
0:mon> <4>IRQ 17 affinity broken off cpu 0
<4>IRQ 18 affinity broken off cpu 0
<4>IRQ 19 affinity broken off cpu 0
<4>IRQ 264 affinity broken off cpu 0
<4>cpu 0 (hwid 0) Ready to die...
<7>clockevent: decrementer mult[83126e97] shift[32] cpu[0]


>
> thanks
> -Sachin
>
> --
>
> ---------------------------------
> Sachin Sant
> IBM Linux Technology Center
> India Systems and Technology Labs
> Bangalore, India
> ---------------------------------
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-16  6:45     ` Xiaotian Feng
@ 2009-12-16  6:54       ` Sachin Sant
  2009-12-16  7:18         ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Sachin Sant @ 2009-12-16  6:54 UTC (permalink / raw)
  To: Xiaotian Feng
  Cc: Peter Zijlstra, linux-kernel, Linux/PPC Development, linux-next,
	Ingo Molnar

Xiaotian Feng wrote:
> On Wed, Dec 16, 2009 at 2:41 PM, Sachin Sant <sachinp@in.ibm.com> wrote:
>   
>> Xiaotian Feng wrote:
>>     
>>> Does this testcase hotplug cpu 0 off?
>>>
>>>       
>> No, i don't think so. It skips cpu0 during online/offline
>> process.
>>     
>
> Then how could this happen ? Looks like cpu 0 is offline ....
> 0:mon> <4>IRQ 17 affinity broken off cpu 0
> <4>IRQ 18 affinity broken off cpu 0
> <4>IRQ 19 affinity broken off cpu 0
> <4>IRQ 264 affinity broken off cpu 0
> <4>cpu 0 (hwid 0) Ready to die...
> <7>clockevent: decrementer mult[83126e97] shift[32] cpu[0]
>   
Sorry i was looking at only one script. Looking more closely
at the test there are 6 different sub tests. The rest of the
tests do seem to hotplug CPU 0.

Thanks
-Sachin


-- 

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-15 13:47             ` Sachin Sant
  2009-12-15 15:03               ` Peter Zijlstra
@ 2009-12-16  6:56               ` Xiaotian Feng
  1 sibling, 0 replies; 22+ messages in thread
From: Xiaotian Feng @ 2009-12-16  6:56 UTC (permalink / raw)
  To: Sachin Sant
  Cc: Peter Zijlstra, linux-kernel, Linux/PPC Development, linux-next,
	Ingo Molnar

On Tue, Dec 15, 2009 at 9:47 PM, Sachin Sant <sachinp@in.ibm.com> wrote:
> Peter Zijlstra wrote:
>>>
>>> I added some debug statements within the above code. This is a 2 cpu
>>> machine.
>>>
>>> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
>>> XMON dest_cpu = 1024
>>> XMON dest_cpu = 1024 . dead_cpu = 1
>>> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
>>> XMON dest_cpu = 1024
>>> XMON dest_cpu = 1024 . dead_cpu = 1
>>> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
>>> XMON dest_cpu = 1024
>>> XMON dest_cpu = 1024 . dead_cpu = 1
>>>
>>> Seems to me that the control is stuck in an infinite loop and hence the
>>> machine appears to be in hung state. The dest_cpu value is always 1024
>>> and never changes, which result in an infinite loop.
>>>
>>> In working scenario the o/p is something on the following lines
>>>
>>> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
>>> XMON dest_cpu = 0
>>> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
>>> XMON dest_cpu = 0
>>> XMON dest_cpu = 1024 . dead_cpu = 1 . nr_cpu_ids = 2
>>> XMON dest_cpu = 0
>>> Let me know if i should try to record any specific value ?
>>>
>>
>> Could you possibly print the two masks themselves? cpumask_scnprintf()
>> and friend come in handy for this.
>>
>> The dest_cpu=1024 thing seem to suggest the intersection between
>> p->cpus_allowed and cpu_active_mask is empty for some reason, even
>> though we forcefully reset p->cpus_allowed to the full set using
>> cpuset_cpus_allowed_locked().
>>
>
> So here is the data related to the two masks.
>
> cpu_active_mask = 00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000
> XMON dest_cpu = 1024
>

How about cpu_online_mask? Commit 6ad4c1 switches from cpu_online_mask
to cpu_active_mask.
Is there a mismatch between cpu_online_mask and cpu_active_mask?

> while p->cpus_allowed =  00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000001
> XMON dest_cpu = 1024
>
> In working scenario the above data looks like
>
> cpu_active_mask = 00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000002
> XMON dest_cpu = 1
>
> while p->cpus_allowed =  00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
> 00000000,00000000,00000002
> XMON dest_cpu = 1
>
>
> hope i got the data correct.
>
> Thanks
> -Sachin
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-16  5:38                 ` Sachin Sant
@ 2009-12-16  7:14                   ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2009-12-16  7:14 UTC (permalink / raw)
  To: Sachin Sant; +Cc: Ingo Molnar, linux-next, linux-kernel, Linux/PPC Development

On Wed, 2009-12-16 at 11:08 +0530, Sachin Sant wrote:
> Peter Zijlstra wrote:
> > Could you try the below?
> >   
> No luck. Still the same issue. The mask values don't change.

Bugger, that patch did solve a similar problem for a patch I'm working
on.

Can you maybe add a print of cpu_active_mask in set_cpu_active()
using WARN() so we can see where it changes the mask, and why it thinks
it's empty?

> > ---
> >  init/main.c |    7 +------
> >  1 files changed, 1 insertions(+), 6 deletions(-)
> >
> > diff --git a/init/main.c b/init/main.c
> > index 4051d75..4be7de2 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -369,12 +369,6 @@ static void __init smp_init(void)
> >  {
> >  	unsigned int cpu;
> >  
> > -	/*
> > -	 * Set up the current CPU as possible to migrate to.
> > -	 * The other ones will be done by cpu_up/cpu_down()
> > -	 */
> > -	set_cpu_active(smp_processor_id(), true);
> > -
> >  	/* FIXME: This should be done in userspace --RR */
> >  	for_each_present_cpu(cpu) {
> >  		if (num_online_cpus() >= setup_max_cpus)
> > @@ -486,6 +480,7 @@ static void __init boot_cpu_init(void)
> >  	int cpu = smp_processor_id();
> >  	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
> >  	set_cpu_online(cpu, true);
> > +	set_cpu_active(cpu, true);
> >  	set_cpu_present(cpu, true);
> >  	set_cpu_possible(cpu, true);
> >  }
> >
> >
> >   
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-16  6:54       ` Sachin Sant
@ 2009-12-16  7:18         ` Peter Zijlstra
  2009-12-16  7:57           ` Xiaotian Feng
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-12-16  7:18 UTC (permalink / raw)
  To: Sachin Sant
  Cc: linux-kernel, Linux/PPC Development, linux-next, Xiaotian Feng,
	Ingo Molnar

On Wed, 2009-12-16 at 12:24 +0530, Sachin Sant wrote:
> Xiaotian Feng wrote:
> > On Wed, Dec 16, 2009 at 2:41 PM, Sachin Sant <sachinp@in.ibm.com> wrote:
> >   
> >> Xiaotian Feng wrote:
> >>     
> >>> Does this testcase hotplug cpu 0 off?
> >>>
> >>>       
> >> No, i don't think so. It skips cpu0 during online/offline
> >> process.
> >>     
> >
> > Then how could this happen ? Looks like cpu 0 is offline ....
> > 0:mon> <4>IRQ 17 affinity broken off cpu 0
> > <4>IRQ 18 affinity broken off cpu 0
> > <4>IRQ 19 affinity broken off cpu 0
> > <4>IRQ 264 affinity broken off cpu 0
> > <4>cpu 0 (hwid 0) Ready to die...
> > <7>clockevent: decrementer mult[83126e97] shift[32] cpu[0]
> >   
> Sorry i was looking at only one script. Looking more closely
> at the test there are 6 different sub tests. The rest of the
> tests do seem to hotplug CPU 0.

Ooh, cute, so you can actually hotplug cpu 0.. no wonder that didn't get
exposed on x86.

Still, the only time cpu_active_mask should not be equal to
cpu_online_mask is when we're in the middle of a hotplug: we clear
active early and set it late, but it's all done under the hotplug mutex,
so the two masks can differ by at most one cpu.

Unless of course, I messed up, which appears to be rather likely given
these problems ;-)
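
The invariant described above can be written down as a small check. This is a user-space sketch only: masks_consistent() and popcount64() are hypothetical helpers, and a 64-bit word stands in for a real cpumask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Count set bits by clearing the lowest set bit until none remain. */
static unsigned int popcount64(uint64_t x)
{
	unsigned int n = 0;

	for (; x; x &= x - 1)
		n++;
	return n;
}

/*
 * Hypothetical check of the expected relationship: every active CPU is
 * online, and at most one online CPU is not active (the one currently
 * being hotplugged).
 */
static bool masks_consistent(uint64_t online, uint64_t active)
{
	if (active & ~online)
		return false;
	return popcount64(online & ~active) <= 1;
}
```

For example, on a 2 cpu machine with both CPUs online (online = 0x3), an empty active mask would violate the check, since two online CPUs would be inactive at once.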

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-16  7:18         ` Peter Zijlstra
@ 2009-12-16  7:57           ` Xiaotian Feng
  2009-12-16  8:24             ` Sachin Sant
  0 siblings, 1 reply; 22+ messages in thread
From: Xiaotian Feng @ 2009-12-16  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Linux/PPC Development, linux-next, Ingo Molnar

On Wed, Dec 16, 2009 at 3:18 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2009-12-16 at 12:24 +0530, Sachin Sant wrote:
>> Xiaotian Feng wrote:
>> > On Wed, Dec 16, 2009 at 2:41 PM, Sachin Sant <sachinp@in.ibm.com> wrote:
>> >
>> >> Xiaotian Feng wrote:
>> >>
>> >>> Does this testcase hotplug cpu 0 off?
>> >>>
>> >>>
>> >> No, i don't think so. It skips cpu0 during online/offline
>> >> process.
>> >>
>> >
>> > Then how could this happen ? Looks like cpu 0 is offline ....
>> > 0:mon> <4>IRQ 17 affinity broken off cpu 0
>> > <4>IRQ 18 affinity broken off cpu 0
>> > <4>IRQ 19 affinity broken off cpu 0
>> > <4>IRQ 264 affinity broken off cpu 0
>> > <4>cpu 0 (hwid 0) Ready to die...
>> > <7>clockevent: decrementer mult[83126e97] shift[32] cpu[0]
>> >
>> Sorry i was looking at only one script. Looking more closely
>> at the test there are 6 different sub tests. The rest of the
>> tests do seem to hotplug CPU 0.
>
> Ooh, cute, so you can actually hotplug cpu 0.. no wonder that didn't get
> exposed on x86.
>
> Still, the only time cpu_active_mask should not be equal to
> cpu_online_mask is when we're in the middle of a hotplug, we clear
> active early and set it late, but its all done under the hotplug mutex,
> so we can at most have 1 cpu differences with online mask.
>

Could the following be possible?  We know there's cpu 0 and cpu 1,

offline cpu1 > done
offline cpu0 > false

consider this in cpu_down code,


int __ref cpu_down(unsigned int cpu)
{
<snip>
        set_cpu_active(cpu, false); // here, we set cpu 0 to inactive

        synchronize_sched();

        err = _cpu_down(cpu, 0);
out:
<snip>
}

Then in _cpu_down code:

static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
{
<snip>
        /* If we're trying to offline cpu0, num_online_cpus() will be 1. */
        if (num_online_cpus() == 1)
                /* After returning to cpu_down(), cpu 0 is never set back to active. */
                return -EBUSY;

        if (!cpu_online(cpu))
                return -EINVAL;

        if (!alloc_cpumask_var(&old_allowed, GFP_KERNEL))
                return -ENOMEM;
<snip>
}

Then cpu 0 is online but not active, and then we try to offline cpu1, .......
This is not exposed on x86 because x86 does not have
/sys/devices/system/cpu/cpu0/online.
I guess the following patch fixes this bug.

---
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 291ac58..21ddace 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -199,14 +199,18 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
                .hcpu = hcpu,
        };

-       if (num_online_cpus() == 1)
+       if (num_online_cpus() == 1) {
+               set_cpu_active(cpu, true);
                return -EBUSY;
+       }

        if (!cpu_online(cpu))
                return -EINVAL;

-       if (!alloc_cpumask_var(&old_allowed, GFP_KERNEL))
+       if (!alloc_cpumask_var(&old_allowed, GFP_KERNEL)) {
+               set_cpu_active(cpu, true);
                return -ENOMEM;
+       }

        cpu_hotplug_begin();
        err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
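
The sequence described above can be replayed in a small user-space model. This is a sketch under stated assumptions, not kernel code: struct sim_state and the sim_* functions are hypothetical, 64-bit words stand in for the cpumasks, and the notifier chain and task migration are left out entirely. The restore_on_error flag selects the behaviour with or without the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-mask model; one bit per CPU. */
struct sim_state {
	uint64_t online;
	uint64_t active;
};

static unsigned int sim_num_online(const struct sim_state *s)
{
	unsigned int n = 0;

	for (uint64_t x = s->online; x; x &= x - 1)
		n++;
	return n;
}

static void sim_set_active(struct sim_state *s, unsigned int cpu, bool on)
{
	if (on)
		s->active |= UINT64_C(1) << cpu;
	else
		s->active &= ~(UINT64_C(1) << cpu);
}

/* Models _cpu_down(); restore_on_error selects the patched behaviour. */
static int sim_cpu_down_inner(struct sim_state *s, unsigned int cpu,
			      bool restore_on_error)
{
	if (sim_num_online(s) == 1) {
		if (restore_on_error)
			sim_set_active(s, cpu, true);
		return -EBUSY;
	}
	if (!(s->online & (UINT64_C(1) << cpu)))
		return -EINVAL;
	s->online &= ~(UINT64_C(1) << cpu);
	return 0;
}

/* Models cpu_down(): active is cleared before the inner call runs. */
static int sim_cpu_down(struct sim_state *s, unsigned int cpu,
			bool restore_on_error)
{
	sim_set_active(s, cpu, false);
	return sim_cpu_down_inner(s, cpu, restore_on_error);
}

/* Models cpu_up(): the CPU becomes online and active again. */
static void sim_cpu_up(struct sim_state *s, unsigned int cpu)
{
	s->online |= UINT64_C(1) << cpu;
	sim_set_active(s, cpu, true);
}
```

Starting from both CPUs online and active, the unpatched sequence "offline cpu1, try to offline cpu0 (-EBUSY), online cpu1, offline cpu1" leaves the model with cpu 0 online and an empty active mask, which matches the empty cpu_active_mask reported earlier in the thread. With restore_on_error set, cpu 0 is active again after the failed attempt.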


> Unless of course, I messed up, which appears to be rather likely given
> these problems ;-)
>
>

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-16  7:57           ` Xiaotian Feng
@ 2009-12-16  8:24             ` Sachin Sant
  2009-12-16  9:07               ` Xiaotian Feng
  0 siblings, 1 reply; 22+ messages in thread
From: Sachin Sant @ 2009-12-16  8:24 UTC (permalink / raw)
  To: Xiaotian Feng
  Cc: Peter Zijlstra, linux-kernel, Linux/PPC Development, linux-next,
	Ingo Molnar

Xiaotian Feng wrote:
> Could follow be possible?  We know there's cpu 0 and cpu 1,
>
> offline cpu1 > done
> offline cpu0 > false
>
> consider this in cpu_down code,
>
>
> int __ref cpu_down(unsigned int cpu)
> {
> <snip>
>         set_cpu_active(cpu, false); // here, we set cpu 0 to inactive
>
>         synchronize_sched();
>
>         err = _cpu_down(cpu, 0);
> out:
> <snip>
> }
>
> Then in _cpu_down code:
>
> static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
> {
> <snip>
>         /* If we're trying to offline cpu0, num_online_cpus() will be 1. */
>         if (num_online_cpus() == 1)
>                 /* After returning to cpu_down(), cpu 0 is never set back to active. */
>                 return -EBUSY;
>
>         if (!cpu_online(cpu))
>                 return -EINVAL;
>
>         if (!alloc_cpumask_var(&old_allowed, GFP_KERNEL))
>                 return -ENOMEM;
> <snip>
> }
>
> Then cpu 0 is online but not active, and then we try to offline cpu1, .......
> This cannot be exposed on x86 because x86 does not have
> /sys/devices/system/cpu/cpu0/online.
> I guess the following patch fixes this bug.
>   
Just tested this one on the POWER box and the test passed.
I did not observe the hang.

Thanks
-Sachin

> ---
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 291ac58..21ddace 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -199,14 +199,18 @@ static int __ref _cpu_down(unsigned int cpu, int
> tasks_frozen)
>                 .hcpu = hcpu,
>         };
>
> -       if (num_online_cpus() == 1)
> +       if (num_online_cpus() == 1) {
> +               set_cpu_active(cpu, true);
>                 return -EBUSY;
> +       }
>
>         if (!cpu_online(cpu))
>                 return -EINVAL;
>
> -       if (!alloc_cpumask_var(&old_allowed, GFP_KERNEL))
> +       if (!alloc_cpumask_var(&old_allowed, GFP_KERNEL)) {
> +               set_cpu_active(cpu, true);
>                 return -ENOMEM;
> +       }
>
>         cpu_hotplug_begin();
>         err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
>
>
>   
>> Unless of course, I messed up, which appears to be rather likely given
>> these problems ;-)
>>
>>
>>     
>
>   


-- 

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------


* Re: [Next] CPU Hotplug test failures on powerpc
  2009-12-16  8:24             ` Sachin Sant
@ 2009-12-16  9:07               ` Xiaotian Feng
  0 siblings, 0 replies; 22+ messages in thread
From: Xiaotian Feng @ 2009-12-16  9:07 UTC (permalink / raw)
  To: Sachin Sant
  Cc: Peter Zijlstra, linux-kernel, Linux/PPC Development, linux-next,
	Ingo Molnar

On Wed, Dec 16, 2009 at 4:24 PM, Sachin Sant <sachinp@in.ibm.com> wrote:
> Xiaotian Feng wrote:
>>
>> Could the following be possible?  We know there's cpu 0 and cpu 1,
>>
>> offline cpu1 > done
>> offline cpu0 > false
>>
>> consider this in cpu_down code,
>>
>>
>> int __ref cpu_down(unsigned int cpu)
>> {
>> <snip>
>>         set_cpu_active(cpu, false); // here, we set cpu 0 to inactive
>>
>>         synchronize_sched();
>>
>>         err = _cpu_down(cpu, 0);
>> out:
>> <snip>
>> }
>>
>> Then in _cpu_down code:
>>
>> static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
>> {
>> <snip>
>>         if (num_online_cpus() == 1)        // if we're trying to
>> offline cpu0, num_online_cpus will be 1
>>                 return -EBUSY;                    // after return back
>> to cpu_down, we didn't change cpu 0 back to active
>>
>>         if (!cpu_online(cpu))
>>                 return -EINVAL;
>>
>>         if (!alloc_cpumask_var(&old_allowed, GFP_KERNEL))
>>                 return -ENOMEM;
>> <snip>
>> }
>>
>> Then cpu 0 is online but not active, and then we try to offline cpu1, .......
>> This cannot be exposed on x86 because x86 does not have
>> /sys/devices/system/cpu/cpu0/online.
>> I guess the following patch fixes this bug.
>>
>
> Just tested this one on the POWER box and the test passed.
> I did not observe the hang.

Thanks for confirming. I will send a formatted patch upstream then :-)

>
> Thanks
> -Sachin
>
>> ---
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 291ac58..21ddace 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -199,14 +199,18 @@ static int __ref _cpu_down(unsigned int cpu, int
>> tasks_frozen)
>>                 .hcpu = hcpu,
>>         };
>>
>> -       if (num_online_cpus() == 1)
>> +       if (num_online_cpus() == 1) {
>> +               set_cpu_active(cpu, true);
>>                 return -EBUSY;
>> +       }
>>
>>         if (!cpu_online(cpu))
>>                 return -EINVAL;
>>
>> -       if (!alloc_cpumask_var(&old_allowed, GFP_KERNEL))
>> +       if (!alloc_cpumask_var(&old_allowed, GFP_KERNEL)) {
>> +               set_cpu_active(cpu, true);
>>                 return -ENOMEM;
>> +       }
>>
>>         cpu_hotplug_begin();
>>         err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
>>
>>
>>
>>>
>>> Unless of course, I messed up, which appears to be rather likely given
>>> these problems ;-)
>>>
>>>
>>>
>>
>>
>
>
> --
>
> ---------------------------------
> Sachin Sant
> IBM Linux Technology Center
> India Systems and Technology Labs
> Bangalore, India
> ---------------------------------
>
>


end of thread, other threads:[~2009-12-16  9:07 UTC | newest]

Thread overview: 22+ messages
2009-12-11 10:53 [Next] CPU Hotplug test failures on powerpc Sachin Sant
2009-12-14  2:48 ` Benjamin Herrenschmidt
2009-12-14  4:37   ` Sachin Sant
2009-12-14 10:22 ` Peter Zijlstra
2009-12-14 11:11   ` Sachin Sant
2009-12-14 12:19     ` Peter Zijlstra
2009-12-14 21:17       ` Benjamin Herrenschmidt
2009-12-15  9:44         ` Sachin Sant
2009-12-15 10:43           ` Peter Zijlstra
2009-12-15 13:47             ` Sachin Sant
2009-12-15 15:03               ` Peter Zijlstra
2009-12-16  5:38                 ` Sachin Sant
2009-12-16  7:14                   ` Peter Zijlstra
2009-12-16  6:56               ` Xiaotian Feng
2009-12-16  6:25 ` Xiaotian Feng
2009-12-16  6:41   ` Sachin Sant
2009-12-16  6:45     ` Xiaotian Feng
2009-12-16  6:54       ` Sachin Sant
2009-12-16  7:18         ` Peter Zijlstra
2009-12-16  7:57           ` Xiaotian Feng
2009-12-16  8:24             ` Sachin Sant
2009-12-16  9:07               ` Xiaotian Feng
