linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: Can RCU stall lead to hard lockups?
Date: Mon, 8 Jan 2018 20:24:25 -0800	[thread overview]
Message-ID: <20180109042425.GS9671@linux.vnet.ibm.com> (raw)
In-Reply-To: <20180109035207.GD3668920@devbig577.frc2.facebook.com>

On Mon, Jan 08, 2018 at 07:52:07PM -0800, Tejun Heo wrote:
> Hello, Paul.
> 
> So, I was looking at a machine which triggered crashdump from NMI hard
> lockup.  The dmesg was filled up with backtraces - many were stuck in
> reclaim path, which seems to be the culprit, the others were stuck in
> RCU path.  It looks like by the time crashdump was created, all CPUs
> got stuck in RCU path.
> 
> One of the CPUs looked like the following.
> 
>   PID: 4149739  TASK: ffff881a46baa880  CPU: 13  COMMAND: "CPUThreadPool8"
>    #0 [ffff881fff945e48] crash_nmi_callback at ffffffff8103f7d0
>    #1 [ffff881fff945e58] nmi_handle at ffffffff81020653
>    #2 [ffff881fff945eb0] default_do_nmi at ffffffff81020c36
>    #3 [ffff881fff945ed0] do_nmi at ffffffff81020d32
>    #4 [ffff881fff945ef0] end_repeat_nmi at ffffffff81956a7e
>       [exception RIP: io_serial_in+21]
>       RIP: ffffffff81630e55  RSP: ffff881fff943b88  RFLAGS: 00000002
>       RAX: 000000000000ca00  RBX: ffffffff8230e188  RCX: 0000000000000000
>       RDX: 00000000000002fd  RSI: 0000000000000005  RDI: ffffffff8230e188
>       RBP: ffff881fff943bb0   R8: 0000000000000000   R9: ffffffff820cb3c4
>       R10: 0000000000000019  R11: 0000000000002000  R12: 00000000000026e1
>       R13: 0000000000000020  R14: ffffffff820cd398  R15: 0000000000000035
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>   --- <NMI exception stack> ---
>    #5 [ffff881fff943b88] io_serial_in at ffffffff81630e55
>    #6 [ffff881fff943b90] wait_for_xmitr at ffffffff8163175c
>    #7 [ffff881fff943bb8] serial8250_console_putchar at ffffffff816317dc
>    #8 [ffff881fff943bd8] uart_console_write at ffffffff8162ac00
>    #9 [ffff881fff943c08] serial8250_console_write at ffffffff81634691
>   #10 [ffff881fff943c80] univ8250_console_write at ffffffff8162f7c2
>   #11 [ffff881fff943c90] console_unlock at ffffffff810dfc55
>   #12 [ffff881fff943cf0] vprintk_emit at ffffffff810dffb5
>   #13 [ffff881fff943d50] vprintk_default at ffffffff810e01bf
>   #14 [ffff881fff943d60] vprintk_func at ffffffff810e1127
>   #15 [ffff881fff943d70] printk at ffffffff8119a8a4
>   #16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c
>   #17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133
>   #18 [ffff881fff943ee8] update_process_times at ffffffff810f3497
>   #19 [ffff881fff943f10] tick_sched_timer at ffffffff81103037
>   #20 [ffff881fff943f38] __hrtimer_run_queues at ffffffff810f3f38
>   #21 [ffff881fff943f88] hrtimer_interrupt at ffffffff810f442b

You don't have a low-bandwidth console, do you?  (For example, the
output being directed to slow mass storage, across a slow network link,
or the like.)

> The CPU which triggered hardlockup:
> 
>   PID: 4149709  TASK: ffff88010f88c380  CPU: 26  COMMAND: "CPUThreadPool35"
>    #0 [ffff883fff1059d0] machine_kexec at ffffffff8104a874
>    #1 [ffff883fff105a30] __crash_kexec at ffffffff811116cc
>    #2 [ffff883fff105af0] __crash_kexec at ffffffff81111795
>    #3 [ffff883fff105b08] panic at ffffffff8119a6ae
>    #4 [ffff883fff105b98] watchdog_overflow_callback at ffffffff81135dbd
>    #5 [ffff883fff105bb0] __perf_event_overflow at ffffffff81186866
>    #6 [ffff883fff105be8] perf_event_overflow at ffffffff81192bc4
>    #7 [ffff883fff105bf8] intel_pmu_handle_irq at ffffffff8100b265
>    #8 [ffff883fff105df8] perf_event_nmi_handler at ffffffff8100489f
>    #9 [ffff883fff105e58] nmi_handle at ffffffff81020653
>   #10 [ffff883fff105eb0] default_do_nmi at ffffffff81020b94
>   #11 [ffff883fff105ed0] do_nmi at ffffffff81020d32
>   #12 [ffff883fff105ef0] end_repeat_nmi at ffffffff81956a7e
>       [exception RIP: queued_spin_lock_slowpath+248]
>       RIP: ffffffff810da958  RSP: ffff883fff103e68  RFLAGS: 00000046
>       RAX: 0000000000000000  RBX: 0000000000000046  RCX: 00000000006d0000
>       RDX: ffff883fff49a950  RSI: 0000000000d10101  RDI: ffffffff81e54300
>       RBP: ffff883fff103e80   R8: ffff883fff11a950   R9: 0000000000000000
>       R10: 000000000e5873ba  R11: 000000000000010f  R12: ffffffff81e54300
>       R13: 0000000000000000  R14: ffff88010f88c380  R15: ffffffff81e54300
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>   --- <NMI exception stack> ---
>   #13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958
>   #14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b
>   #15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18
>   #16 [ffff883fff103ee8] update_process_times at ffffffff810f3497
>   #17 [ffff883fff103f10] tick_sched_timer at ffffffff81103037
>   #18 [ffff883fff103f38] __hrtimer_run_queues at ffffffff810f3f38
>   #19 [ffff883fff103f88] hrtimer_interrupt at ffffffff810f442b
>   --- <IRQ stack> ---
>   bt: read error: kernel virtual address: ffffffffffffffd8  type: "pt_regs"

This might be unable to acquire the spinlock that CPU 13 was holding
while printing to the console.

> All other CPUs have something similar to the following:
> 
>   PID: 4150460  TASK: ffff88162c891b00  CPU: 8   COMMAND: "GlobalIOThreadP"
>    #0 [ffff881fff805e48] crash_nmi_callback at ffffffff8103f7d0
>    #1 [ffff881fff805e58] nmi_handle at ffffffff81020653
>    #2 [ffff881fff805eb0] default_do_nmi at ffffffff81020c36
>    #3 [ffff881fff805ed0] do_nmi at ffffffff81020d32
>    #4 [ffff881fff805ef0] end_repeat_nmi at ffffffff81956a7e
>       [exception RIP: queued_spin_lock_slowpath+248]
>       RIP: ffffffff810da958  RSP: ffff881fff803e68  RFLAGS: 00000046
>       RAX: 0000000000000000  RBX: 0000000000000046  RCX: 0000000000250000
>       RDX: ffff881fffb1a950  RSI: 00000000002d0101  RDI: ffffffff81e54300
>       RBP: ffff881fff803e80   R8: ffff881fff81a950   R9: 0000000000000000
>       R10: 000000000e5873ba  R11: ffff88207fffd000  R12: ffffffff81e54300
>       R13: 0000000000000000  R14: ffff88162c891b00  R15: ffffffff81e54300
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>   --- <NMI exception stack> ---
>    #5 [ffff881fff803e68] queued_spin_lock_slowpath at ffffffff810da958
>    #6 [ffff881fff803e70] _raw_spin_lock_irqsave at ffffffff8195550b
>    #7 [ffff881fff803e88] rcu_check_callbacks at ffffffff810eed18
>    #8 [ffff881fff803ee8] update_process_times at ffffffff810f3497
>    #9 [ffff881fff803f10] tick_sched_timer at ffffffff81103037
>   #10 [ffff881fff803f38] __hrtimer_run_queues at ffffffff810f3f38
>   #11 [ffff881fff803f88] hrtimer_interrupt at ffffffff810f442b
>   --- <IRQ stack> ---
> 
> I don't know the RCU code at all but it *looks* like the first CPU is
> taking a sweet while flushing printk buffer while holding a lock (the
> console is IPMI serial console, which faithfully emulates 115200 baud
> rate), and everyone else seems stuck waiting for that spinlock in
> rcu_check_callbacks().
> 
> Does this sound possible?

115200 baud?  Ouch!!!  That -will- result in trouble from console
printing, and often also in RCU CPU stall warnings.

							Thanx, Paul

  reply	other threads:[~2018-01-09  4:23 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-09  3:52 Can RCU stall lead to hard lockups? Tejun Heo
2018-01-09  4:24 ` Paul E. McKenney [this message]
2018-01-09 14:11   ` Tejun Heo
2018-01-09 15:22     ` Paul E. McKenney
2018-01-09 18:47       ` [PATCH] RCU: Call touch_nmi_watchdog() while printing stall warnings Tejun Heo
2018-01-09 18:52         ` [PATCH trivial-v2] " Tejun Heo
2018-01-09 20:05           ` Paul E. McKenney
2018-01-10  5:34         ` Sergey Senozhatsky
2018-01-10 14:57           ` Tejun Heo
2018-01-11  0:45             ` Sergey Senozhatsky
2018-02-02 23:44       ` Can RCU stall lead to hard lockups? Serge E. Hallyn
2018-02-03 20:50         ` Paul E. McKenney
2018-02-07  2:33           ` Serge E. Hallyn
2018-02-07  2:53             ` Paul E. McKenney
2018-02-07  2:55               ` Serge E. Hallyn
2018-02-07  3:16                 ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180109042425.GS9671@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).