All of lore.kernel.org
 help / color / mirror / Atom feed
From: Niklas Cassel <niklas.cassel@linaro.org>
To: peterz@infradead.org
Cc: linux-kernel@vger.kernel.org, bjorn.andersson@linaro.org
Subject: stop_machine() soft lockup
Date: Tue, 4 Sep 2018 21:03:22 +0200	[thread overview]
Message-ID: <20180904190322.GA21835@centauri.lan> (raw)

Hello Peter,

I'm seeing some lockups when booting linux-next on a db820c arm64 board.
I've tried to analyze, but I'm currently stuck.


[   30.974764] rcu: INFO: rcu_preempt self-detected stall on CPU
[   31.010121] rcu:     2-...0: (1 GPs behind) idle=3f6/1/0x4000000000000002 softirq=53/53 fqs=3099
[   31.035352] rcu:      (t=6518 jiffies g=-1119 q=1279)
[   31.054464] Task dump for CPU 2:
[   31.065349] migration/2     R  running task        0    22      2 0x0000002a
[   31.091937] Call trace:
[   31.103357]  dump_backtrace+0x0/0x200
[   31.114780]  show_stack+0x24/0x30
[   31.126183]  sched_show_task+0x20c/0x2d8
[   31.137604]  dump_cpu_task+0x48/0x58
[   31.149111]  rcu_dump_cpu_stacks+0xa0/0xe0
[   31.160620]  rcu_check_callbacks+0x85c/0xb60
[   31.172151]  update_process_times+0x34/0x60
[   31.183575]  tick_periodic+0x58/0x110
[   31.195048]  tick_handle_periodic+0x94/0xc8
[   31.206573]  arch_timer_handler_virt+0x38/0x58
[   31.218058]  handle_percpu_devid_irq+0xe4/0x458
[   31.229544]  generic_handle_irq+0x34/0x50
[   31.240952]  __handle_domain_irq+0x8c/0xf8
[   31.252375]  gic_handle_irq+0x84/0x180
[   31.263776]  el1_irq+0xec/0x198
[   31.275117]  multi_cpu_stop+0x198/0x220
[   31.286574]  cpu_stopper_thread+0xfc/0x188
[   31.298050]  smpboot_thread_fn+0x1bc/0x2c0
[   31.309562]  kthread+0x134/0x138
[   31.320888]  ret_from_fork+0x10/0x1c
[   31.332075] Task dump for CPU 3:
[   31.342925] migration/3     R  running task        0    27      2 0x0000002a
[   31.369347] Call trace:
[   31.380732]  __switch_to+0xa8/0xf0
[   31.391599]            (null)
[   32.813449] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[   32.853356] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[   32.996597] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [migration/3:27]
[   33.008710] Modules linked in:
[   33.026913] irq event stamp: 702
[   33.038360] hardirqs last  enabled at (701): [<ffff0000081bb60c>] multi_cpu_stop+0x194/0x220
[   33.050062] hardirqs last disabled at (702): [<ffff000008083e2c>] el1_irq+0xac/0x198
[   33.061600] softirqs last  enabled at (396): [<ffff000008082164>] __do_softirq+0x47c/0x574
[   33.073284] softirqs last disabled at (389): [<ffff0000080ca41c>] irq_exit+0x134/0x148
[   33.084852] CPU: 3 PID: 27 Comm: migration/3 Tainted: G        W         4.19.0-rc2-next-20180904-00001-g32563c06c8db-dirty #42
[   33.095929] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[   33.107086] pstate: 60400005 (nZCv daif +PAN -UAO)
[   33.118542] pc : multi_cpu_stop+0x198/0x220
[   33.129799] lr : multi_cpu_stop+0x194/0x220
[   33.140522] sp : ffff00000ae5bd50
[   33.151104] x29: ffff00000ae5bd50 x28: 0000000000000000
[   33.176154] x27: ffff00000803bc80 x26: 0000000000000000
[   33.201057] x25: 0000000000000000 x24: 0000000000000000
[   33.226014] x23: ffff0000090b4000 x22: 0000000000000003
[   33.250973] x21: ffff00000803bca4 x20: ffff00000803bc80
[   33.275806] x19: 0000000000000004 x18: ffffffffffffffff
[   33.300799] x17: 0000000000000727 x16: ffff000009fcfd00
[   33.325856] x15: ffff000009cbe1c8 x14: 31203a64656c6261
[   33.350852] x13: 000000006f1527a4 x12: ffff8000d958e3b0
[   33.375706] x11: 0000000000000000 x10: 0000000000000428
[   33.400629] x9 : ffff8000d9591000 x8 : 000000000000001c
[   33.425566] x7 : ffff8000d9c08400 x6 : 0000000000000444
[   33.450410] x5 : 0000000000007045 x4 : 0000000000000000
[   33.475399] x3 : 0000000000000000 x2 : ffff00000aabb000
[   33.500394] x1 : 0000000000000001 x0 : ffff8000d958db00
[   33.525460] Call trace:
[   33.536707]  multi_cpu_stop+0x198/0x220
[   33.548035]  cpu_stopper_thread+0xfc/0x188
[   33.559444]  smpboot_thread_fn+0x1bc/0x2c0
[   33.570610]  kthread+0x134/0x138
[   33.581944]  ret_from_fork+0x10/0x1c
[   36.813418] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[   36.853359] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[   40.813415] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[   40.853352] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[   44.813415] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[   44.853342] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[   48.813418] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[   48.853344] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[   52.813407] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[   52.853341] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[   56.813414] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[   56.853345] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[   56.921016] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [migration/2:22]
[   56.933363] Modules linked in:
[   56.951546] irq event stamp: 604
[   56.963173] hardirqs last  enabled at (603): [<ffff0000081bb60c>] multi_cpu_stop+0x194/0x220
[   56.974744] hardirqs last disabled at (604): [<ffff000008083e2c>] el1_irq+0xac/0x198
[   56.986463] softirqs last  enabled at (450): [<ffff000008082164>] __do_softirq+0x47c/0x574
[   56.998266] softirqs last disabled at (443): [<ffff0000080ca41c>] irq_exit+0x134/0x148
[   57.009651] CPU: 2 PID: 22 Comm: migration/2 Tainted: G        W    L    4.19.0-rc2-next-20180904-00001-g32563c06c8db-dirty #42
[   57.021070] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[   57.032279] pstate: 60400005 (nZCv daif +PAN -UAO)
[   57.043733] pc : multi_cpu_stop+0x198/0x220
[   57.055152] lr : multi_cpu_stop+0x194/0x220
[   57.065929] sp : ffff00000ae33d50
[   57.076598] x29: ffff00000ae33d50 x28: 0000000000000000
[   57.101700] x27: ffff00000803bc80 x26: 0000000000000000
[   57.126947] x25: 0000000000000000 x24: 0000000000000000
[   57.152160] x23: ffff0000090b4000 x22: 0000000000000002
[   57.177351] x21: ffff00000803bca4 x20: ffff00000803bc80
[   57.202378] x19: 0000000000000004 x18: ffffffffffffffff
[   57.227500] x17: 0000000000000708 x16: ffff000009fcfbe0
[   57.252639] x15: ffff000009cbe1c8 x14: 31203a64656c6261
[   57.277712] x13: 000000006f1527a4 x12: ffff8000d9fc63b0
[   57.302892] x11: 0000000000000000 x10: 0000000000000268
[   57.328157] x9 : ffff8000d9590400 x8 : 000000000000001c
[   57.353240] x7 : ffff8000d9c08400 x6 : 0000000000000284
[   57.378364] x5 : 0000000000004f5d x4 : 0000000000000000
[   57.403477] x3 : 0000000000000000 x2 : ffff00000aabb000
[   57.428601] x1 : 0000000000000001 x0 : ffff8000d9fc5b00
[   57.453865] Call trace:
[   57.465264]  multi_cpu_stop+0x198/0x220
[   57.476690]  cpu_stopper_thread+0xfc/0x188
[   57.487921]  smpboot_thread_fn+0x1bc/0x2c0
[   57.499416]  kthread+0x134/0x138
[   57.510752]  ret_from_fork+0x10/0x1c
[   60.813418] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[   60.853344] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[   63.137327] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 60s!
[   63.138392] Showing busy workqueues and worker pools:
[   63.145567] workqueue mm_percpu_wq: flags=0x8
[   63.150564]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[   63.154568]     pending: vmstat_update
[   64.813418] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[   64.853348] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[   67.233334] INFO: task swapper/0:1 blocked for more than 30 seconds.
[   67.233610]       Tainted: G        W    L    4.19.0-rc2-next-20180904-00001-g32563c06c8db-dirty #42
[   67.239301] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[   67.248525] swapper/0       D    0     1      0 0x00000028
[   67.256236] Call trace:
[   67.261631]  __switch_to+0xa8/0xf0
[   67.263546]  __schedule+0x34c/0xc90
[   67.267279]  schedule+0x34/0x98
[   67.270662]  schedule_timeout+0x250/0x540
[   67.273785]  wait_for_common+0xe0/0x1a0
[   67.277955]  wait_for_completion+0x28/0x38
[   67.281602]  __stop_cpus+0xd4/0xf8
[   67.285766]  stop_cpus+0x70/0xa8
[   67.289145]  stop_machine_cpuslocked+0x124/0x130
[   67.292276]  stop_machine+0x54/0x70
[   67.297299]  timekeeping_notify+0x44/0x70
[   67.300088]  __clocksource_select+0xa8/0x1d8
[   67.304526]  clocksource_done_booting+0x4c/0x64
[   67.309124]  do_one_initcall+0x94/0x3f8
[   67.312847]  kernel_init_freeable+0x47c/0x528
[   67.316671]  kernel_init+0x18/0x110
[   67.321603]  ret_from_fork+0x10/0x1c
[   67.324490]
[   67.324490] Showing all locks held in the system:
[   67.328566] 3 locks held by swapper/0/1:
[   67.334709]  #0: (____ptrval____) (clocksource_mutex){+.+.}, at: clocksource_done_booting+0x2c/0x64
[   67.338641]  #1: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x28/0x70
[   67.347569]  #2: (____ptrval____) (stop_cpus_mutex){+.+.}, at: stop_cpus+0x60/0xa8
[   67.356260] 1 lock held by khungtaskd/543:
[   67.363757]  #0: (____ptrval____) (rcu_read_lock){....}, at: debug_show_all_locks+0x14/0x1b0
[   67.367628]
[   67.376368] =============================================

These lines:
[   33.038360] hardirqs last  enabled at (701): [<ffff0000081bb60c>] multi_cpu_stop+0x194/0x220
[   33.050062] hardirqs last disabled at (702): [<ffff000008083e2c>] el1_irq+0xac/0x198
seem to indicate that an irq comes directly after local_irq_restore() in multi_cpu_stop().


I've added some debug prints. In the first stop_machine call, everything
completes as it should, in the second iteration, it looks like an irq comes
directly after local_irq_restore() in multi_cpu_stop(), and after that cpu2
and cpu3 is never completed, so wait_for_completion() never returns.


[    0.418197] stop_machine: calling stop_machine_cpuslocked with fn: __enable_cpu_capability+0x0/0x38
[    0.420051] stop_machine_cpuslocked: calling stop_cpus
[    0.429017] >>>>>>>>>>>>>>>>>>>>>>>> stop_cpus:450 taking stop_cpus_mutex
[    0.434205] __stop_cpus: queueing work
[    0.441027] queueing multi_cpu_stop+0x0/0x220 for cpu: 0
[    0.444641] queueing multi_cpu_stop+0x0/0x220 for cpu: 1
[    0.450145] queueing multi_cpu_stop+0x0/0x220 for cpu: 2
[    0.450188] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 1
[    0.455438] queueing multi_cpu_stop+0x0/0x220 for cpu: 3
[    0.458486] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 2
[    0.459101] multi_cpu_stop: cpu: 2 before local_save_flags, in atomic: 1 irqs_disabled: 0
[    0.459660] multi_cpu_stop: cpu: 2 after local_save_flags, in atomic: 1 irqs_disabled: 0
[    0.460713] multi_cpu_stop: cpu: 1 before local_save_flags, in atomic: 1 irqs_disabled: 0
[    0.467456] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 0
[    0.469026] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 3
[    0.469635] multi_cpu_stop: cpu: 3 before local_save_flags, in atomic: 1 irqs_disabled: 0
[    0.470190] multi_cpu_stop: cpu: 3 after local_save_flags, in atomic: 1 irqs_disabled: 0
[    0.472945] multi_cpu_stop: cpu: 1 after local_save_flags, in atomic: 1 irqs_disabled: 0
[    0.479629] multi_cpu_stop: cpu: 0 before local_save_flags, in atomic: 1 irqs_disabled: 0
[    0.542478] multi_cpu_stop: cpu: 0 after local_save_flags, in atomic: 1 irqs_disabled: 0
[    0.555518] multi_cpu_stop: cpu: 1 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[    0.555523] multi_cpu_stop: cpu: 0 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[    0.555830] multi_cpu_stop: cpu: 2 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[    0.556074] multi_cpu_stop: cpu: 3 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[    0.556366] multi_cpu_stop: cpu: 2 after local_irq_restore, in atomic: 1 irqs_disabled: 0
[    0.556624] multi_cpu_stop: cpu: 3 after local_irq_restore, in atomic: 1 irqs_disabled: 0
[    0.557133] cpu_stopper_thread: call to multi_cpu_stop+0x0/0x220 complete, setting done for cpu: 2
[    0.558722] multi_cpu_stop: cpu: 1 after local_irq_restore, in atomic: 1 irqs_disabled: 0
[    0.559213] cpu_stopper_thread: call to multi_cpu_stop+0x0/0x220 complete, setting done for cpu: 3
[    0.567138] multi_cpu_stop: cpu: 0 after local_irq_restore, in atomic: 1 irqs_disabled: 0
[    0.575551] cpu_stopper_thread: call to multi_cpu_stop+0x0/0x220 complete, setting done for cpu: 1
[    0.583972] cpu_stopper_thread: call to multi_cpu_stop+0x0/0x220 complete, setting done for cpu: 0
[    0.651730] <<<<<<<<<<<<<<<<<<<<<< __stop_cpus:411 now waiting for completion
[    0.660642] <<<<<<<<<<<<<<<<<<<<<< __stop_cpus:413 completion done
[    0.667836] >>>>>>>>>>>>>>>>>>>>>>>> stop_cpus:454 returning stop_cpus_mutex


[    5.016096] stop_machine: calling stop_machine_cpuslocked with fn: change_clocksource+0x0/0x118
[    5.016188] stop_machine_cpuslocked: calling stop_cpus
[    5.016271] >>>>>>>>>>>>>>>>>>>>>>>> stop_cpus:450 taking stop_cpus_mutex
[    5.016365] __stop_cpus: queueing work
[    5.016471] queueing multi_cpu_stop+0x0/0x220 for cpu: 0
[    5.016699] queueing multi_cpu_stop+0x0/0x220 for cpu: 1
[    5.016858] queueing multi_cpu_stop+0x0/0x220 for cpu: 2
[    5.016901] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 0
[    5.016984] multi_cpu_stop: cpu: 0 before local_save_flags, in atomic: 1 irqs_disabled: 0
[    5.017004] queueing multi_cpu_stop+0x0/0x220 for cpu: 3
[    5.017063] multi_cpu_stop: cpu: 0 after local_save_flags, in atomic: 1 irqs_disabled: 0
[    5.017270] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 1
[    5.017351] multi_cpu_stop: cpu: 1 before local_save_flags, in atomic: 1 irqs_disabled: 0
[    5.017429] multi_cpu_stop: cpu: 1 after local_save_flags, in atomic: 1 irqs_disabled: 0
[    5.046491] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 2
[    5.079044] multi_cpu_stop: cpu: 2 before local_save_flags, in atomic: 1 irqs_disabled: 0
[    5.097884] multi_cpu_stop: cpu: 2 after local_save_flags, in atomic: 1 irqs_disabled: 0
[    5.518726] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 3
[    5.571321] multi_cpu_stop: cpu: 3 before local_save_flags, in atomic: 1 irqs_disabled: 0
[    5.624144] multi_cpu_stop: cpu: 3 after local_save_flags, in atomic: 1 irqs_disabled: 0
[    5.680879] multi_cpu_stop: cpu: 1 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[    5.680891] multi_cpu_stop: cpu: 0 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[    5.681227] multi_cpu_stop: cpu: 0 after local_irq_restore, in atomic: 1 irqs_disabled: 0
[    5.682256] multi_cpu_stop: cpu: 3 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[    5.682986] multi_cpu_stop: cpu: 2 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[    5.688761] multi_cpu_stop: cpu: 1 after local_irq_restore, in atomic: 1 irqs_disabled: 0
[    5.697559] cpu_stopper_thread: call to multi_cpu_stop+0x0/0x220 complete, setting done for cpu: 0
[    5.705384] cpu_stopper_thread: call to multi_cpu_stop+0x0/0x220 complete, setting done for cpu: 1
[    5.705896] <<<<<<<<<<<<<<<<<<<<<< __stop_cpus:411 now waiting for completion



[   33.038360] hardirqs last  enabled at (701): [<ffff0000081bb60c>] multi_cpu_stop+0x194/0x220
translates to:
arch_local_irq_restore at /home/nks/linux-next/./arch/arm64/include/asm/irqflags.h:88 (discriminator 2)
 (inlined by) multi_cpu_stop at /home/nks/linux-next/kernel/stop_machine.c:238 (discriminator 2)

[   33.118542] pc : multi_cpu_stop+0x198/0x220
translates to:
get_current at /home/nks/linux-next/./arch/arm64/include/asm/current.h:19
 (inlined by) preempt_count at /home/nks/linux-next/./include/asm-generic/preempt.h:11
 (inlined by) multi_cpu_stop at /home/nks/linux-next/kernel/stop_machine.c:239
which equals:
 asm ("mrs %0, sp_el0" : "=r" (sp_el0));


If we got an irq directly after enabling irqs, I would assume that the execution
should have continued after the irq was done. Any thoughts?


Kind regards,
Niklas

             reply	other threads:[~2018-09-04 19:03 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-04 19:03 Niklas Cassel [this message]
2018-09-05  8:42 ` stop_machine() soft lockup Peter Zijlstra
2018-09-05 11:47   ` Niklas Cassel
2018-09-05 13:14     ` Peter Zijlstra
2018-09-05 13:45       ` Niklas Cassel
2018-09-06 12:56         ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180904190322.GA21835@centauri.lan \
    --to=niklas.cassel@linaro.org \
    --cc=bjorn.andersson@linaro.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.