All of lore.kernel.org
 help / color / mirror / Atom feed
From: Niklas Cassel <niklas.cassel@linaro.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, bjorn.andersson@linaro.org
Subject: Re: stop_machine() soft lockup
Date: Wed, 5 Sep 2018 13:47:49 +0200	[thread overview]
Message-ID: <20180905114749.GA5345@centauri.lan> (raw)
In-Reply-To: <20180905084241.GS24082@hirez.programming.kicks-ass.net>

On Wed, Sep 05, 2018 at 10:42:41AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 04, 2018 at 09:03:22PM +0200, Niklas Cassel wrote:
> > Hello Peter,
> >
> > I'm seeing some lockups when booting linux-next on a db820c arm64 board.
> > I've tried to analyze, but I'm currently stuck.
>
> Please see (should be in your Inbox too):
>
>   https://lkml.kernel.org/r/20180905084158.GR24124@hirez.programming.kicks-ass.net

I'm sorry if I mislead you by replying to your other mail thread,
both of them have timekeeping_notify() in the call trace,
but my problem has this call trace:

[  128.747853]  wait_for_common+0xe0/0x1a0
[  128.752023]  wait_for_completionx+0x28/0x38
[  128.755677]  __stop_cpus+0xd4/0xf8
[  128.759837]  stop_cpus+0x70/0xa8
[  128.762958]  stop_machine_cpuslocked+0x124/0x130
[  128.766345]  stop_machine+0x54/0x70
[  128.771373]  timekeeping_notify+0x44/0x70
[  128.774158]  __clocksource_select+0xa8/0x1d8
[  128.778605]  clocksource_done_booting+0x4c/0x64
[  128.782931]  do_one_initcall+0x94/0x3f8
[  128.786921]  kernel_init_freeable+0x47c/0x528
[  128.790742]  kernel_init+0x18/0x110
[  128.795673]  ret_from_fork+0x10/0x1c


while your other mail thread has this call trace:

* stop_machine()
* timekeeping_notify()
* __clocksource_select()
* clocksource_select()
* clocksource_watchdog_work()


So my problem is not related to the watchdog, I tried your revert anyway,
but unfortunately my problem persists.


In my problem, what appears to happen is that in the end of multi_cpu_stop(),
local_irq_restore() is called, and an irq comes directly after that,
then after 22 seconds soft lockup detector kicks in (so I appear to still
get timer ticks). The PC that the soft lockup detector prints is the line
after local_irq_restore(). Without any prints, that is simply end brace of
the function.

I'm booting with kernel command line: "earlycon ftrace=irqsoff"
I can't seem to reproduce the problem without ftrace=irqsoff,
not sure if it because of timing or because ftrace is involved.

I managed to get another call trace with ftrace included,
unfortunately both CPUs appear to be printing at the same time.

[   32.703910] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:22]
[   32.728589] Modules linked in:
[   32.747176] irq event stamp: 154
[   32.758851] hardirqs last  enabled at (153): [<ffff0000081bb59c>] multi_cpu_stop+0xfc/0x168
[   32.770736] hardirqs last disabled at (154): [<ffff000008083e2c>] el1_irq+0xac/0x198
[   32.774911] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:0H:30]
[   32.779946] Modules linked in:
[   32.786425] softirqs last  enabled at (114): [<ffff000008082164>] __do_softirq+0x47c/0x574
[   32.795450] irq event stamp: 62
[   32.805549] softirqs last disabled at (95): [<ffff0000080ca41c>] irq_exit+0x134/0x148
[   32.810450] hardirqs last  enabled at (61): [<ffff00000908efe8>] _raw_spin_unlock_irq+0x38/0x78
[   32.816075] hardirqs last disabled at (62): [<ffff000008083e2c>] el1_irq+0xac/0x198
[   32.821812] softirqs last  enabled at (0): [<ffff0000080bdb68>] copy_process.isra.5.part.6+0x338/0x1ac$
[   32.829394] CPU: 2 PID: 22 Comm: migration/2 Tainted: G        W         4.19.0-rc2-next-20180904-0000$
-gd2e5ea917660 #47
[   32.836772] softirqs last disabled at (0): [<0000000000000000>]           (null)
[   32.846193] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[   32.857546] CPU: 3 PID: 30 Comm: kworker/3:0H Tainted: G        W         4.19.0-rc2-next-20180904-000$
2-gd2e5ea917660 #47
[   32.864532] pstate: 60400005 (nZCv daif +PAN -UAO)
[   32.870789] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[   32.881909] pc : multi_cpu_stop+0x100/0x168
[   32.887583] lr : multi_cpu_stop+0xfc/0x168
[   32.892723] sp : ffff00000ae33d70
[   32.897717] x29: ffff00000ae33d70 x28: 0000000000000000
[   32.906981] pstate: 60400005 (nZCv daif +PAN -UAO)
[   32.916202] x27: 0000000000000002 x26: ffff00000803bc80
[   32.923256] pc : _raw_spin_unlock_irq+0x3c/0x78
[   32.933817] x25: ffff0000081bb4a0
[   32.937340] lr : _raw_spin_unlock_irq+0x38/0x78
[   32.942465] sp : ffff00000ae73c60
[   32.947455] x29: ffff00000ae73c60 x28: ffff00000978b000
[   32.953060] x24: 0000000000000000
[   32.967484] x23: 0000000000000000
[   32.971265] x27: ffff8000d958db00
[   32.976343] x22: 0000000000000000
[   32.982508] x26: 0000000000000001
[   32.996862] x21: ffff00000803bca4
[   33.000630] x25: ffff8000d9e8a328
[   33.005754] x20: ffff00000803bc80
[   33.011888] x24: ffff000009087f88
[   33.026057] x19: 0000000000000004
[   33.029982] x23: ffff8000d971ad80
[   33.035127] x18: 00000000000008bd
[   33.041263] x22: ffff000009cbf000
[   33.055312] x17: 00000000000008bc
[   33.059310] x21: ffff8000d958db00
[   33.064303] x16: ffff000009fd1360
[   33.070605] x20: ffff0000080fc20c
[   33.084645] x15: ffff000009f2a300
[   33.088605] x19: ffff8000da7f4d40
[   33.093572] x14: 00000000000017fc
[   33.099943] x18: ffffffffffffffff
[   33.114000] x13: 00000000c04846ac
[   33.117948] x17: 0000000000000693
[   33.122855] x12: ffff8000d9fc63b0
[   33.129195] x16: ffff000009fcf7e0
[   33.143370] x11: 0000000000000000
[   33.147294] x15: ffff000009cbe1c8
[   33.152152] x10: 0000000000000348
[   33.158535] x14: 0000000000002fff
[   33.172812] x9 : ffff8000d9590400
[   33.176620] x13: 0000000016163b60
[   33.181506] x8 : 000000000000001c
[   33.187866] x12: ffff8000d971b630
[   33.202251] x7 : ffff8000d9c08400
[   33.205880] x11: 0000000000000000
[   33.210939] x6 : 0000000000000364
[   33.217236] x10: 0000000000000a48
[   33.231721] x5 : 0000000000005356
[   33.235281] x9 : ffff8000d9590c00
[   33.240397] x4 : 0000000000000000
[   33.246592] x8 : 000000000000001c
[   33.261065] x3 : 0000000000000000
[   33.264749] x7 : ffff8000d9c08400
[   33.269874] x2 : ffff00000aabb000
[   33.276019] x6 : 0000000000000a64
[   33.290389] x1 : 0000000000000001
[   33.294192] x5 : 000000000000cf1a
[   33.299235] x0 : ffff8000d9fc5b00
[   33.305452] x4 : 0000000000000000
[   33.319989] Call trace:
[   33.324715] x3 : 0000000000000000 x2 : ffff00000aabb000
[   33.331717]  multi_cpu_stop+0x100/0x168
[   33.341973] x1 : 0000000000000001
[   33.345891]  cpu_stopper_thread+0xa8/0x118
[   33.351777]  smpboot_thread_fn+0x1bc/0x2c0
[   33.357845] x0 : ffff8000d971ad80
[   33.362176]  kthread+0x134/0x138
[   33.371281] Call trace:
[   33.377420]  ret_from_fork+0x10/0x1c
[   33.382212]  _raw_spin_unlock_irq+0x3c/0x78
[   33.387685]  finish_task_switch+0xa4/0x200
[   33.396875]  __schedule+0x350/0xc90
[   33.408272]  preempt_schedule_notrace+0x5c/0x130
[   33.419484]  ftrace_ops_no_ops+0xf4/0x180
[   33.430795]  ftrace_graph_call+0x0/0xc
[   33.442095]  preempt_count_add+0x1c/0x130
[   33.453259]  schedule+0x2c/0x98
[   33.464586]  worker_thread+0xdc/0x478
[   33.475886]  kthread+0x134/0x138
[   33.487099]  ret_from_fork+0x10/0x1c
[   34.464807] rcu: INFO: rcu_preempt self-detected stall on CPU
[   34.487238] rcu:     2-...0: (2757 ticks this GP) idle=20e/1/0x4000000000000002 softirq=120/120 fqs=316
5
[   34.509364] rcu:      (t=6514 jiffies g=-1095 q=1250)
[   34.528505] Task dump for CPU 2:
[   34.539317] migration/2     R  running task        0    22      2 0x0000002a
[   34.565716] Call trace:
[   34.577387]  dump_backtrace+0x0/0x200
[   34.588782]  show_stack+0x24/0x30
[   34.600098]  sched_show_task+0x20c/0x2d8
[   34.611726]  dump_cpu_task+0x48/0x58
[   34.623109]  rcu_dump_cpu_stacks+0xa0/0xe0
[   34.634637]  rcu_check_callbacks+0x85c/0xb60
[   34.646324]  update_process_times+0x34/0x60
[   34.657569]  tick_periodic+0x58/0x110
[   34.669169]  tick_handle_periodic+0x94/0xc8
[   34.680719]  arch_timer_handler_virt+0x38/0x58
[   34.692090]  handle_percpu_devid_irq+0xe4/0x458
[   34.703783]  generic_handle_irq+0x34/0x50
[   34.715121]  __handle_domain_irq+0x8c/0xf8
[   34.726529]  gic_handle_irq+0x84/0x180
[   34.738131]  el1_irq+0xec/0x198
[   34.749285]  multi_cpu_stop+0x100/0x168
[   34.760840]  cpu_stopper_thread+0xa8/0x118
[   34.772384]  smpboot_thread_fn+0x1bc/0x2c0
[   34.783587]  kthread+0x134/0x138
[   34.795188]  ret_from_fork+0x10/0x1c



Could perhaps
[   32.821812] softirqs last  enabled at (0): [<ffff0000080bdb68>] copy_process.isra.5.part.6+0x338/0x1ac
be involved in why the execution appears to never continue after the irq?


Kind regards,
Niklas

  reply	other threads:[~2018-09-05 11:48 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-04 19:03 stop_machine() soft lockup Niklas Cassel
2018-09-05  8:42 ` Peter Zijlstra
2018-09-05 11:47   ` Niklas Cassel [this message]
2018-09-05 13:14     ` Peter Zijlstra
2018-09-05 13:45       ` Niklas Cassel
2018-09-06 12:56         ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180905114749.GA5345@centauri.lan \
    --to=niklas.cassel@linaro.org \
    --cc=bjorn.andersson@linaro.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.