Linux panics when suspend cannot offline the secondary cores


* Linux panics when suspend cannot offline the secondary cores
@ 2016-06-10 15:41 Mason
  2016-06-10 21:35 ` Rafael J. Wysocki
  0 siblings, 1 reply; 9+ messages in thread
From: Mason @ 2016-06-10 15:41 UTC (permalink / raw)
  To: linux-arm-kernel

Hello,

I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
unhappy when the suspend framework fails to offline secondary cores.

Is this expected/by design, or could it fail more gracefully?
(It could also be something missing in my platform's code.)

Regards.


# echo mem > /sys/power/state 
[   30.722352] PM: Syncing filesystems ... done.
[   30.727146] PM: Preparing system for sleep (mem)
[   30.736927] Freezing user space processes ... (elapsed 0.001 seconds) done.
[   30.745519] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[   30.754098] PM: Suspending system (mem)
[   30.760934] PM: suspend of devices complete after 2.104 msecs
[   30.767638] PM: late suspend of devices complete after 0.883 msecs
[   30.774529] PM: noirq suspend of devices complete after 0.653 msecs
[   30.780846] Disabling non-boot CPUs ...
[   30.795697] CPU1: shutdown
[   30.795701] IN tango_cpu_die
[   30.795709] CPU1: smp_ops.cpu_die() returned, trying to resuscitate
[   30.795730] BUG: scheduling while atomic: swapper/1/0/0x00000002
[   30.795735] Modules linked in:
[   30.795756] Preemption disabled at:[<c04a5898>] schedule_preempt_disabled+0x20/0x24
[   30.795757] 
[   30.795766] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
[   30.795768] Hardware name: Sigma Tango DT
[   30.795773] Backtrace: 
[   30.795790] [<c010b974>] (dump_backtrace) from [<c010bb70>] (show_stack+0x18/0x1c)
[   30.795797]  r7:60000013 r6:c080eb04 r5:00000000 r4:c080eb04
[   30.795811] [<c010bb58>] (show_stack) from [<c02eb084>] (dump_stack+0x80/0x94)
[   30.795820] [<c02eb004>] (dump_stack) from [<c013cb34>] (__schedule_bug+0x6c/0xb8)
[   30.795827]  r7:c0802638 r6:e745f6c0 r5:e7ae8ec0 r4:e7460000
[   30.795833] [<c013cac8>] (__schedule_bug) from [<c04a522c>] (__schedule+0x434/0x530)
[   30.795837]  r5:e7ae8ec0 r4:c0736ec0
[   30.795842] [<c04a4df8>] (__schedule) from [<c04a5378>] (schedule+0x50/0xb0)
[   30.795852]  r10:00000000 r9:c08024f8 r8:c05b8b6c r7:c081e2d6 r6:c05ce0b8 r5:c0802494
[   30.795855]  r4:e7460000
[   30.795861] [<c04a5328>] (schedule) from [<c04a5890>] (schedule_preempt_disabled+0x18/0x24)
[   30.795865]  r5:c0802494 r4:e7460000
[   30.795876] [<c04a5878>] (schedule_preempt_disabled) from [<c0155f0c>] (cpu_startup_entry+0x10c/0x18c)
[   30.795884] [<c0155e00>] (cpu_startup_entry) from [<c010dc14>] (secondary_start_kernel+0x158/0x164)
[   30.795888]  r7:c081e2d6 r4:c080b530
[   30.795898] [<c010dabc>] (secondary_start_kernel) from [<c04a9208>] (_raw_spin_unlock_irqrestore+0x30/0x5c)
[   30.795902]  r5:c0802494 r4:00000001
[   30.952513] IN tango_cpu_kill
[   30.955537] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[   30.963668] pgd = c0004000
[   30.966382] [00000010] *pgd=00000000
[   30.969976] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[   30.975312] Modules linked in:
[   30.978379] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
[   30.989478] Hardware name: Sigma Tango DT
[   30.993503] task: e745f6c0 ti: e7460000 task.ti: e7460000
[   30.998933] PC is at __tick_nohz_idle_enter+0x2d8/0x444
[   31.004188] LR is at debug_smp_processor_id+0x20/0x24
[   31.009262] pc : [<c0184d1c>]    lr : [<c030305c>]    psr: 60000093
[   31.009262] sp : e7461f50  ip : e7461f20  fp : e7461fac
[   31.020800] r10: 00000000  r9 : 00000000  r8 : 00000000
[   31.026047] r7 : 00000000  r6 : 0032dcd5  r5 : 00000001  r4 : e7ae6e38
[   31.032605] r3 : 00000000  r2 : 0032dcd5  r1 : 00000000  r0 : 0032dcd5
[   31.039164] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
[   31.046420] Control: 10c5387d  Table: 8000404a  DAC: 00000051
[   31.052192] Process swapper/1 (pid: 0, stack limit = 0xe7460210)
[   31.058226] Stack: (0xe7461f50 to 0xe7462000)
[   31.062602] 1f40:                                     c04a4fcc c013c8b0 00000001 00000000
[   31.070821] 1f60: 35293313 00000007 34faa6c3 00000007 34f6563e 00000007 34faa6c3 00000007
[   31.079041] 1f80: ffffffff 7fffffff c0734e38 c0802494 c05ce0b8 c081e2d6 c05b8b6c c08024f8
[   31.087261] 1fa0: e7461fc4 e7461fb0 c0185294 c0184a50 e7460000 c0802494 e7461fdc e7461fc8
[   31.095480] 1fc0: c0155e58 c0185258 c080b530 c081e2d6 e7461ff4 e7461fe0 c010dc14 c0155e0c
[   31.103700] 1fe0: 00000001 c0802494 00000000 e7461ff8 c04a9208 c010dac8 454115f5 56b2e41b
[   31.111916] Backtrace: 
[   31.114376] [<c0184a44>] (__tick_nohz_idle_enter) from [<c0185294>] (tick_nohz_idle_enter+0x48/0x80)
[   31.123553]  r9:c08024f8 r8:c05b8b6c r7:c081e2d6 r6:c05ce0b8 r5:c0802494 r4:c0734e38
[   31.131353] [<c018524c>] (tick_nohz_idle_enter) from [<c0155e58>] (cpu_startup_entry+0x58/0x18c)
[   31.140181]  r5:c0802494 r4:e7460000
[   31.143778] [<c0155e00>] (cpu_startup_entry) from [<c010dc14>] (secondary_start_kernel+0x158/0x164)
[   31.152868]  r7:c081e2d6 r4:c080b530
[   31.156464] [<c010dabc>] (secondary_start_kernel) from [<c04a9208>] (_raw_spin_unlock_irqrestore+0x30/0x5c)
[   31.166253]  r5:c0802494 r4:00000001
[   31.169848] Code: e89dabf0 e14b24d4 e1a00004 ebffff22 (e1c821d0) 
[   31.175972] ---[ end trace 5e1e78cb2505c930 ]---
[   31.180611] Kernel panic - not syncing: Attempted to kill the idle task!
[   31.187346] CPU0: stopping
[   31.190064] CPU: 0 PID: 10 Comm: migration/0 Tainted: G      D W       4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
[   31.201426] Hardware name: Sigma Tango DT
[   31.205449] Backtrace: 
[   31.207911] [<c010b974>] (dump_backtrace) from [<c010bb70>] (show_stack+0x18/0x1c)
[   31.215516]  r7:20000193 r6:c080eb04 r5:00000000 r4:c080eb04
[   31.221218] [<c010bb58>] (show_stack) from [<c02eb084>] (dump_stack+0x80/0x94)
[   31.228478] [<c02eb004>] (dump_stack) from [<c010e034>] (handle_IPI+0x1a0/0x1b4)
[   31.235909]  r7:00000000 r6:00000004 r5:00000000 r4:c0735428
[   31.241607] [<c010de94>] (handle_IPI) from [<c01014ec>] (gic_handle_irq+0x90/0x94)
[   31.249212]  r9:e8803100 r8:e8802100 r7:e745de78 r6:e880210c r5:c080277c r4:c080ed20
[   31.257008] [<c010145c>] (gic_handle_irq) from [<c010c694>] (__irq_svc+0x54/0x90)
[   31.264527] Exception stack(0xe745de78 to 0xe745dec0)
[   31.269600] de60:                                                       00000000 c05bfe50
[   31.277820] de80: 00000000 00000001 e6e49cfc 00000001 e6e49ce8 20000013 00000000 e7ad9eec
[   31.286039] dea0: e6e49c90 e745deec e745deb8 e745dec8 c030305c c01910b8 60000013 ffffffff
[   31.294255]  r9:e7ad9eec r8:00000000 r7:e745deac r6:ffffffff r5:60000013 r4:c01910b8
[   31.302057] [<c0191008>] (multi_cpu_stop) from [<c0191304>] (cpu_stopper_thread+0xa8/0x120)
[   31.310448]  r9:e7ad9eec r8:e745c000 r7:e6e49ce8 r6:c0191008 r5:e7ad9ee4 r4:e7ad9ee0
[   31.318245] [<c019125c>] (cpu_stopper_thread) from [<c013b500>] (smpboot_thread_fn+0x164/0x288)
[   31.326985]  r10:ffffe000 r9:c080a9bc r8:00000000 r7:00000001 r6:00000000 r5:e7418680
[   31.334866]  r4:e745c000
[   31.337412] [<c013b39c>] (smpboot_thread_fn) from [<c0138434>] (kthread+0xe4/0xfc)
[   31.345017]  r10:00000000 r9:00000000 r8:00000000 r7:c013b39c r6:e7418680 r5:e7418500
[   31.352898]  r4:00000000 r3:e7452080
[   31.356493] [<c0138350>] (kthread) from [<c0107c18>] (ret_from_fork+0x14/0x3c)
[   31.363749]  r7:00000000 r6:00000000 r5:c0138350 r4:e7418500
[   31.369447] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!


* Linux panics when suspend cannot offline the secondary cores
  2016-06-10 15:41 Linux panics when suspend cannot offline the secondary cores Mason
@ 2016-06-10 21:35 ` Rafael J. Wysocki
  2016-06-10 21:37   ` Mason
  0 siblings, 1 reply; 9+ messages in thread
From: Rafael J. Wysocki @ 2016-06-10 21:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday, June 10, 2016 05:41:32 PM Mason wrote:
> Hello,
> 
> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
> unhappy when the suspend framework fails to offline secondary cores.
> 
> Is this expected/by design, or could it fail more gracefully?
> (It could also be something missing in my platform's code.)

This looks like a CPU offline bug to me which is more general than just
system suspend.




* Linux panics when suspend cannot offline the secondary cores
  2016-06-10 21:35 ` Rafael J. Wysocki
@ 2016-06-10 21:37   ` Mason
  2016-06-13 12:06     ` Mason
  0 siblings, 1 reply; 9+ messages in thread
From: Mason @ 2016-06-10 21:37 UTC (permalink / raw)
  To: linux-arm-kernel

On 10/06/2016 23:35, Rafael J. Wysocki wrote:
              ^^^^^

Your clock is 5 minutes ahead ;-)

> On Friday, June 10, 2016 05:41:32 PM Mason wrote:
>
>> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
>> unhappy when the suspend framework fails to offline secondary cores.
>>
>> Is this expected/by design, or could it fail more gracefully?
>> (It could also be something missing in my platform's code.)
> 
> This looks like a CPU offline bug to me which is more general than just
> system suspend.

You may be right, I will try just off-lining cpu1.
Suspend may be a red herring.

By the way, I know my implementation of tango_cpu_die
is incorrect, I was testing the failure mode.

Regards.


* Linux panics when suspend cannot offline the secondary cores
  2016-06-10 21:37   ` Mason
@ 2016-06-13 12:06     ` Mason
  2016-06-13 13:30       ` Rafael J. Wysocki
  0 siblings, 1 reply; 9+ messages in thread
From: Mason @ 2016-06-13 12:06 UTC (permalink / raw)
  To: linux-arm-kernel

On 10/06/2016 23:37, Mason wrote:

> On 10/06/2016 23:35, Rafael J. Wysocki wrote:
> 
>> On Friday, June 10, 2016 05:41:32 PM Mason wrote:
>>
>>> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
>>> unhappy when the suspend framework fails to offline secondary cores.
>>>
>>> Is this expected/by design, or could it fail more gracefully?
>>> (It could also be something missing in my platform's code.)
>>
>> This looks like a CPU offline bug to me which is more general than just
>> system suspend.
> 
> You may be right, I will try just off-lining cpu1.
> Suspend may be a red herring.
> 
> By the way, I know my implementation of tango_cpu_die
> is incorrect, I was testing the failure mode.

Hello Rafael,

Suspend was indeed a red herring. Manually requesting cpu1 off-lining
also makes Linux panic when cpu_die() unexpectedly returns.

The subject should perhaps have been:

  Linux panics when secondary core off-lining fails

Could it be made to fail more gracefully?
Or is this borkage inherent to the failed operation?
Or is it a bug in my platform code?
(A bug other than tango_cpu_die() failing to kill the core.)


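#ifdef CONFIG_HOTPLUG_CPU
/* Test stub: reports the core as dead without checking anything. */
static int tango_cpu_kill(unsigned int cpu)
{
	printk("IN %s\n", __func__);
	return 1;
}

/* Test stub: deliberately returns instead of taking the core down. */
static void tango_cpu_die(unsigned int cpu)
{
	printk("IN %s\n", __func__);
}
#endif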


Regards.


# echo 0 > /sys/devices/system/cpu/cpu1/online
[   60.619026] CPU1: shutdown
[   60.619031] IN tango_cpu_die
[   60.619041] CPU1: smp_ops.cpu_die() returned, trying to resuscitate
[   60.619063] BUG: scheduling while atomic: swapper/1/0/0x00000002
[   60.619069] Modules linked in:
[   60.619088] Preemption disabled at:[<c04a5898>] schedule_preempt_disabled+0x20/0x24
[   60.619089] 
[   60.619098] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
[   60.619099] Hardware name: Sigma Tango DT
[   60.619104] Backtrace: 
[   60.619121] [<c010b974>] (dump_backtrace) from [<c010bb70>] (show_stack+0x18/0x1c)
[   60.619129]  r7:60000013 r6:c080eb04 r5:00000000 r4:c080eb04
[   60.619141] [<c010bb58>] (show_stack) from [<c02eb084>] (dump_stack+0x80/0x94)
[   60.619150] [<c02eb004>] (dump_stack) from [<c013cb34>] (__schedule_bug+0x6c/0xb8)
[   60.619157]  r7:c0802638 r6:df45b6c0 r5:dfbeaec0 r4:df45c000
[   60.619162] [<c013cac8>] (__schedule_bug) from [<c04a522c>] (__schedule+0x434/0x530)
[   60.619167]  r5:dfbeaec0 r4:c0736ec0
[   60.619172] [<c04a4df8>] (__schedule) from [<c04a5378>] (schedule+0x50/0xb0)
[   60.619182]  r10:00000000 r9:c08024f8 r8:c05b8b6c r7:c081e2d6 r6:c05ce0b8 r5:c0802494
[   60.619184]  r4:df45c000
[   60.619190] [<c04a5328>] (schedule) from [<c04a5890>] (schedule_preempt_disabled+0x18/0x24)
[   60.619195]  r5:c0802494 r4:df45c000
[   60.619206] [<c04a5878>] (schedule_preempt_disabled) from [<c0155f0c>] (cpu_startup_entry+0x10c/0x18c)
[   60.619213] [<c0155e00>] (cpu_startup_entry) from [<c010dc14>] (secondary_start_kernel+0x158/0x164)
[   60.619218]  r7:c081e2d6 r4:c080b530
[   60.619226] [<c010dabc>] (secondary_start_kernel) from [<c04a9208>] (_raw_spin_unlock_irqrestore+0x30/0x5c)
[   60.619231]  r5:c0802494 r4:00000001
[   60.775838] IN tango_cpu_kill
[   60.779453] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[   60.787593] pgd = c0004000
[   60.790307] [00000010] *pgd=00000000
[   60.793901] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[   60.799324] Modules linked in:
[   60.802393] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
[   60.813493] Hardware name: Sigma Tango DT
[   60.817518] task: df45b6c0 ti: df45c000 task.ti: df45c000
[   60.822948] PC is at __tick_nohz_idle_enter+0x2d8/0x444
[   60.828204] LR is at debug_smp_processor_id+0x20/0x24
[   60.833278] pc : [<c0184d1c>]    lr : [<c030305c>]    psr: 60000093
[   60.833278] sp : df45df50  ip : df45df20  fp : df45dfac
[   60.844815] r10: 00000000  r9 : 00000000  r8 : 00000000
[   60.850063] r7 : 00000000  r6 : 0032dcd5  r5 : 00000001  r4 : dfbe8e38
[   60.856620] r3 : 00000000  r2 : 0032dcd5  r1 : 00000000  r0 : 0032dcd5
[   60.863179] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
[   60.870435] Control: 10c5387d  Table: 9ed8804a  DAC: 00000051
[   60.876206] Process swapper/1 (pid: 0, stack limit = 0xdf45c210)
[   60.882240] Stack: (0xdf45df50 to 0xdf45e000)
[   60.886616] df40:                                     c04a4fcc c013c8b0 00000001 00000000
[   60.894836] df60: 26c51b42 0000000e 269f8229 0000000e 26923e6d 0000000e 269f8229 0000000e
[   60.903057] df80: ffffffff 7fffffff c0734e38 c0802494 c05ce0b8 c081e2d6 c05b8b6c c08024f8
[   60.911276] dfa0: df45dfc4 df45dfb0 c0185294 c0184a50 df45c000 c0802494 df45dfdc df45dfc8
[   60.919495] dfc0: c0155e58 c0185258 c080b530 c081e2d6 df45dff4 df45dfe0 c010dc14 c0155e0c
[   60.927716] dfe0: 00000001 c0802494 00000000 df45dff8 c04a9208 c010dac8 c1640288 22a54aa8
[   60.935932] Backtrace: 
[   60.938391] [<c0184a44>] (__tick_nohz_idle_enter) from [<c0185294>] (tick_nohz_idle_enter+0x48/0x80)
[   60.947569]  r9:c08024f8 r8:c05b8b6c r7:c081e2d6 r6:c05ce0b8 r5:c0802494 r4:c0734e38
[   60.955370] [<c018524c>] (tick_nohz_idle_enter) from [<c0155e58>] (cpu_startup_entry+0x58/0x18c)
[   60.964198]  r5:c0802494 r4:df45c000
[   60.967796] [<c0155e00>] (cpu_startup_entry) from [<c010dc14>] (secondary_start_kernel+0x158/0x164)
[   60.976885]  r7:c081e2d6 r4:c080b530
[   60.980485] [<c010dabc>] (secondary_start_kernel) from [<c04a9208>] (_raw_spin_unlock_irqrestore+0x30/0x5c)
[   60.990273]  r5:c0802494 r4:00000001
[   60.993867] Code: e89dabf0 e14b24d4 e1a00004 ebffff22 (e1c821d0) 
[   60.999991] ---[ end trace b2639488439a8390 ]---
[   61.004631] Kernel panic - not syncing: Attempted to kill the idle task!
[   61.011368] CPU0: stopping
[   61.014087] CPU: 0 PID: 10 Comm: migration/0 Tainted: G      D W       4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
[   61.025448] Hardware name: Sigma Tango DT
[   61.029471] Backtrace: 
[   61.031936] [<c010b974>] (dump_backtrace) from [<c010bb70>] (show_stack+0x18/0x1c)
[   61.039542]  r7:20000193 r6:c080eb04 r5:00000000 r4:c080eb04
[   61.045246] [<c010bb58>] (show_stack) from [<c02eb084>] (dump_stack+0x80/0x94)
[   61.052507] [<c02eb004>] (dump_stack) from [<c010e034>] (handle_IPI+0x1a0/0x1b4)
[   61.059936]  r7:00000000 r6:00000004 r5:00000000 r4:c0735428
[   61.065635] [<c010de94>] (handle_IPI) from [<c01014ec>] (gic_handle_irq+0x90/0x94)
[   61.073240]  r9:e0803100 r8:e0802100 r7:df459e78 r6:e080210c r5:c080277c r4:c080ed20
[   61.081038] [<c010145c>] (gic_handle_irq) from [<c010c694>] (__irq_svc+0x54/0x90)
[   61.088556] Exception stack(0xdf459e78 to 0xdf459ec0)
[   61.093629] 9e60:                                                       00000000 c05bfe50
[   61.101849] 9e80: 00000000 00000001 dee37d54 00000001 dee37d40 20000013 00000000 dfbdbeec
[   61.110069] 9ea0: dee37ce8 df459eec df459eb8 df459ec8 c030305c c01910b8 60000013 ffffffff
[   61.118285]  r9:dfbdbeec r8:00000000 r7:df459eac r6:ffffffff r5:60000013 r4:c01910b8
[   61.126086] [<c0191008>] (multi_cpu_stop) from [<c0191304>] (cpu_stopper_thread+0xa8/0x120)
[   61.134477]  r9:dfbdbeec r8:df458000 r7:dee37d40 r6:c0191008 r5:dfbdbee4 r4:dfbdbee0
[   61.142274] [<c019125c>] (cpu_stopper_thread) from [<c013b500>] (smpboot_thread_fn+0x164/0x288)
[   61.151014]  r10:ffffe000 r9:c080a9bc r8:00000000 r7:00000001 r6:00000000 r5:df41a680
[   61.158894]  r4:df458000
[   61.161440] [<c013b39c>] (smpboot_thread_fn) from [<c0138434>] (kthread+0xe4/0xfc)
[   61.169045]  r10:00000000 r9:00000000 r8:00000000 r7:c013b39c r6:df41a680 r5:df41a500
[   61.176927]  r4:00000000 r3:df44e080
[   61.180523] [<c0138350>] (kthread) from [<c0107c18>] (ret_from_fork+0x14/0x3c)
[   61.187778]  r7:00000000 r6:00000000 r5:c0138350 r4:df41a500
[   61.193475] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!


* Linux panics when suspend cannot offline the secondary cores
  2016-06-13 12:06     ` Mason
@ 2016-06-13 13:30       ` Rafael J. Wysocki
  2016-06-13 13:50         ` Mason
  0 siblings, 1 reply; 9+ messages in thread
From: Rafael J. Wysocki @ 2016-06-13 13:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Monday, June 13, 2016 02:06:14 PM Mason wrote:
> On 10/06/2016 23:37, Mason wrote:
> 
> > On 10/06/2016 23:35, Rafael J. Wysocki wrote:
> > 
> >> On Friday, June 10, 2016 05:41:32 PM Mason wrote:
> >>
> >>> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
> >>> unhappy when the suspend framework fails to offline secondary cores.
> >>>
> >>> Is this expected/by design, or could it fail more gracefully?
> >>> (It could also be something missing in my platform's code.)
> >>
> >> This looks like a CPU offline bug to me which is more general than just
> >> system suspend.
> > 
> > You may be right, I will try just off-lining cpu1.
> > Suspend may be a red herring.
> > 
> > By the way, I know my implementation of tango_cpu_die
> > is incorrect, I was testing the failure mode.
> 
> Hello Rafael,
> 
> Suspend was indeed a red herring. Manually requesting cpu1 off-lining
> also makes Linux panic when cpu_die() unexpectedly returns.
> 
> The subject should perhaps have been:
> 
>   Linux panics when secondary core off-lining fails
> 
> Could it be made to fail more gracefully?
> Or is this borkage inherent to the failed operation?
> Or is it a bug in my platform code?
> (A bug other than tango_cpu_die() failing to kill the core.)

Well, smp_ops.cpu_die() is not expected to return AFAICS, so that may be
the reason why it fails for you the way it does.

Thanks,
Rafael


* Linux panics when suspend cannot offline the secondary cores
  2016-06-13 13:30       ` Rafael J. Wysocki
@ 2016-06-13 13:50         ` Mason
  2016-06-13 20:49           ` Rafael J. Wysocki
  0 siblings, 1 reply; 9+ messages in thread
From: Mason @ 2016-06-13 13:50 UTC (permalink / raw)
  To: linux-arm-kernel

On 13/06/2016 15:30, Rafael J. Wysocki wrote:

> On Monday, June 13, 2016 02:06:14 PM Mason wrote:
>
>> On 10/06/2016 23:37, Mason wrote:
>>
>>> On 10/06/2016 23:35, Rafael J. Wysocki wrote:
>>>
>>>> On Friday, June 10, 2016 05:41:32 PM Mason wrote:
>>>>
>>>>> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
>>>>> unhappy when the suspend framework fails to offline secondary cores.
>>>>>
>>>>> Is this expected/by design, or could it fail more gracefully?
>>>>> (It could also be something missing in my platform's code.)
>>>>
>>>> This looks like a CPU offline bug to me which is more general than just
>>>> system suspend.
>>>
>>> You may be right, I will try just off-lining cpu1.
>>> Suspend may be a red herring.
>>>
>>> By the way, I know my implementation of tango_cpu_die
>>> is incorrect, I was testing the failure mode.
>>
>> Hello Rafael,
>>
>> Suspend was indeed a red herring. Manually requesting cpu1 off-lining
>> also makes Linux panic when cpu_die() unexpectedly returns.
>>
>> The subject should perhaps have been:
>>
>>   Linux panics when secondary core off-lining fails
>>
>> Could it be made to fail more gracefully?
>> Or is this borkage inherent to the failed operation?
>> Or is it a bug in my platform code?
>> (A bug other than tango_cpu_die() failing to kill the core.)
> 
> Well, smp_ops.cpu_die() is not expected to return AFAICS, so that may be
> the reason why it fails for you the way it does.

I am aware that smp_ops.cpu_die() is not expected to return.
(I was wondering if the framework could handle it gracefully.)

The actual implementation for cpu_die() asks the firmware to off-line
the current core. If the operation fails, for whatever reason, firmware
is not supposed to return control to Linux?

Is panic the only safe thing to do in Linux?
(If yes, then why doesn't the framework panic immediately?)

static void tango_cpu_die(unsigned int cpu)
{
	ask_firmware_to_offline(cpu);
	/* if we return here, something went wrong */
	panic("firmware could not offline");
}

Regards.


* Linux panics when suspend cannot offline the secondary cores
  2016-06-13 13:50         ` Mason
@ 2016-06-13 20:49           ` Rafael J. Wysocki
  2016-06-13 21:02             ` Russell King - ARM Linux
  0 siblings, 1 reply; 9+ messages in thread
From: Rafael J. Wysocki @ 2016-06-13 20:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Monday, June 13, 2016 03:50:56 PM Mason wrote:
> On 13/06/2016 15:30, Rafael J. Wysocki wrote:
> 
> > On Monday, June 13, 2016 02:06:14 PM Mason wrote:
> >
> >> On 10/06/2016 23:37, Mason wrote:
> >>
> >>> On 10/06/2016 23:35, Rafael J. Wysocki wrote:
> >>>
> >>>> On Friday, June 10, 2016 05:41:32 PM Mason wrote:
> >>>>
> >>>>> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
> >>>>> unhappy when the suspend framework fails to offline secondary cores.
> >>>>>
> >>>>> Is this expected/by design, or could it fail more gracefully?
> >>>>> (It could also be something missing in my platform's code.)
> >>>>
> >>>> This looks like a CPU offline bug to me which is more general than just
> >>>> system suspend.
> >>>
> >>> You may be right, I will try just off-lining cpu1.
> >>> Suspend may be a red herring.
> >>>
> >>> By the way, I know my implementation of tango_cpu_die
> >>> is incorrect, I was testing the failure mode.
> >>
> >> Hello Rafael,
> >>
> >> Suspend was indeed a red herring. Manually requesting cpu1 off-lining
> >> also makes Linux panic when cpu_die() unexpectedly returns.
> >>
> >> The subject should perhaps have been:
> >>
> >>   Linux panics when secondary core off-lining fails
> >>
> >> Could it be made to fail more gracefully?
> >> Or is this borkage inherent to the failed operation?
> >> Or is it a bug in my platform code?
> >> (A bug other than tango_cpu_die() failing to kill the core.)
> > 
> > Well, smp_ops.cpu_die() is not expected to return AFAICS, so that may be
> > the reason why it fails for you the way it does.
> 
> I am aware that smp_ops.cpu_die() is not expected to return.
> (I was wondering if the framework could handle it gracefully.)
> 
> The actual implementation for cpu_die() asks the firmware to off-line
> the current core. If the operation fails, for whatever reason, firmware
> is not supposed to return control to Linux?

Firmware can do what it wants (although ideally it should just do what it is
asked for).  smp_ops.cpu_die() is not supposed to return to its caller anyway.

> Is panic the only safe thing to do in Linux?
> (If yes, then why doesn't the framework panic immediately?)

I guess all of the existing implementations of smp_ops.cpu_die() don't return
to the caller no matter what, so the caller did not have to consider anything
else.

And quite frankly I don't see why it would have to.  smp_ops.cpu_die() simply
needs to be implemented to never return.

Thanks,
Rafael

* Linux panics when suspend cannot offline the secondary cores
  2016-06-13 20:49           ` Rafael J. Wysocki
@ 2016-06-13 21:02             ` Russell King - ARM Linux
  2016-06-14 12:42               ` Mason
  0 siblings, 1 reply; 9+ messages in thread
From: Russell King - ARM Linux @ 2016-06-13 21:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 13, 2016 at 10:49:32PM +0200, Rafael J. Wysocki wrote:
> I guess all of the existing implementations of smp_ops.cpu_die() don't return
> to the caller no matter what, so the caller did not have to consider anything
> else.

Existing implementations for hardware which implements CPU hotplug
take the requested CPU down in such a way that smp_ops.cpu_die()
*never* returns.

We have a number of evaluation boards where it's desirable to emulate
CPU hotplug.  These boards have no power management abilities, and
have no way to power down or reset a CPU from software.  For these,
we implement CPU hotplug by taking the CPU down gracefully, taking
it out of coherency, and then placing it in a loop waiting for the
CPU up event to arrive.  At that point (and this is the only legal
time) smp_ops.cpu_die() returns - at which point you get the
resuscitating kernel message, and the CPU re-enters the kernel.
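
The emulated path described above can be modelled in plain userspace C.
This is only a rough sketch, not kernel code: pen_release, fake_wfi(),
emulated_cpu_die() and struct wake_script are hypothetical names chosen
for illustration, and the "WFI" is replaced by a scripted list of
wake-up events so the loop can be exercised outside a kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the value the booting CPU writes to
 * release a parked CPU; -1 means "nobody is being released". */
static volatile int pen_release = -1;

/* Scripted wake-ups: each entry is the value pen_release holds when
 * the parked CPU comes out of its (simulated) WFI. */
struct wake_script {
	const int *values;
	size_t len;
	size_t pos;
};

/* Stand-in for WFI: consumes the next scripted event.
 * Returns 0 once the script is exhausted (model only). */
static int fake_wfi(struct wake_script *s)
{
	assert(s != NULL);
	if (s->pos >= s->len)
		return 0;
	pen_release = s->values[s->pos++];
	return 1;
}

/* Model of an emulated cpu_die(): the CPU sits in a loop and may only
 * return once it sees the CPU up event addressed to it.
 * Returns the number of spurious wake-ups seen before release, or -1
 * if the script ended and the CPU is still parked. */
static int emulated_cpu_die(int cpu, struct wake_script *s)
{
	int spurious = 0;

	while (fake_wfi(s)) {
		if (pen_release == cpu)
			return spurious;	/* the only legal return point */
		spurious++;
	}
	return -1;
}
```

In this model, any return from emulated_cpu_die() other than through
the "released" check corresponds to the illegal return that produced
the backtrace at the start of this thread.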

This path is _only_ for these evaluation platforms which have no
hardware support for CPU hotplug, and therefore no PM and no kexec.

The *only* way to get working PM support on Mason's platform is
a properly implemented CPU hotplug - which means ensuring
that the CPU is either powered down or placed in reset during the
smp_ops.cpu_die() call.  Everything else (even the simulation of it)
is not good enough.

That can be done either by the dying CPU when it calls into
smp_ops.cpu_die(), or the CPU requesting the death of the CPU via
smp_ops.cpu_kill().
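
The two options can be illustrated with a small userspace model. It is
a sketch under stated assumptions, not platform code: power_state[],
model_cpu_die_self_off(), model_cpu_die_park() and model_cpu_kill() are
hypothetical names, and the "nonzero means success" return value of the
kill hook is an assumption of this model.

```c
#include <assert.h>

#define NR_CPUS 4	/* arbitrary size for the model */

enum cpu_power { CPU_ON, CPU_PARKED, CPU_OFF };

/* Simulated per-CPU power state; all CPUs start as CPU_ON. */
static enum cpu_power power_state[NR_CPUS];

/* Option 1: the dying CPU powers itself down from within its own
 * cpu_die() call (on real hardware this would not return). */
static void model_cpu_die_self_off(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	power_state[cpu] = CPU_OFF;
}

/* Option 2, first half: the dying CPU only parks itself and waits
 * for another CPU to finish the job. */
static void model_cpu_die_park(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	power_state[cpu] = CPU_PARKED;
}

/* Option 2, second half: the requesting CPU powers the parked CPU
 * down. Returns 1 on success and 0 on failure (model assumption). */
static int model_cpu_kill(unsigned int cpu)
{
	if (cpu >= NR_CPUS || power_state[cpu] != CPU_PARKED)
		return 0;
	power_state[cpu] = CPU_OFF;
	return 1;
}
```

Either way the modelled CPU ends up in CPU_OFF; a CPU that is merely
parked, with nothing to kill it, never reaches that state.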

Either way, it's up to the platform code to implement these, and as
I say, a correct and proper implementation of this is a fundamental
requirement for system power management (like suspend) and kexec in
an SMP system.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

* Linux panics when suspend cannot offline the secondary cores
  2016-06-13 21:02             ` Russell King - ARM Linux
@ 2016-06-14 12:42               ` Mason
  0 siblings, 0 replies; 9+ messages in thread
From: Mason @ 2016-06-14 12:42 UTC (permalink / raw)
  To: linux-arm-kernel

On 13/06/2016 23:02, Russell King - ARM Linux wrote:

> On Mon, Jun 13, 2016 at 10:49:32PM +0200, Rafael J. Wysocki wrote:
>
>> I guess all of the existing implementations of smp_ops.cpu_die() don't return
>> to the caller no matter what, so the caller did not have to consider anything
>> else.
> 
> Existing implementations for hardware which implements CPU hotplug
> take the requested CPU down in such a way that smp_ops.cpu_die()
> *never* returns.
> 
> We have a number of evaluation boards where it's desirable to emulate
> CPU hotplug.  These boards have no power management abilities, and
> have no way to power down or reset a CPU from software.  For these,
> we implement CPU hotplug by taking the CPU down gracefully, taking
> it out of coherency, and then placing it in a loop waiting for the
> CPU up event to arrive.  At that point (and this is the only legal
> time) smp_ops.cpu_die() returns - at which point you get the
> resuscitating kernel message, and the CPU re-enters the kernel.
> 
> This path is _only_ for these evaluation platforms which have no
> hardware support for CPU hotplug, and therefore no PM and no kexec.
> 
> The *only* way to get working PM support on Mason's platform is
> a properly implemented CPU hotplug - which means ensuring
> that the CPU is either powered down or placed in reset during the
> smp_ops.cpu_die() call.  Everything else (even the simulation of it)
> is not good enough.
> 
> That can be done either by the dying CPU when it calls into
> smp_ops.cpu_die(), or the CPU requesting the death of the CPU via
> smp_ops.cpu_kill().
> 
> Either way, it's up to the platform code to implement these, and as
> I say, a correct and proper implementation of this is a fundamental
> requirement for system power management (like suspend) and kexec in
> an SMP system.

Hello Russell,

The current plan is to have cpu_die() jump into the firmware, and have
the firmware "park" the calling core into a WFI loop until someone wants
to online the parked core, via the smp_boot_secondary() callback.

Would that work?

So far, I haven't cared about what HOTPLUG does with the parked core,
because we would just provide HOTPLUG as a requirement for suspend,
which offlines the secondary cores, and then we will power down the
entire SoC.

On a tangential subject, is the scheduler able to off-line idle cores?

Regards.

Thread overview: 9+ messages
2016-06-10 15:41 Linux panics when suspend cannot offline the secondary cores Mason
2016-06-10 21:35 ` Rafael J. Wysocki
2016-06-10 21:37   ` Mason
2016-06-13 12:06     ` Mason
2016-06-13 13:30       ` Rafael J. Wysocki
2016-06-13 13:50         ` Mason
2016-06-13 20:49           ` Rafael J. Wysocki
2016-06-13 21:02             ` Russell King - ARM Linux
2016-06-14 12:42               ` Mason
