All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: Can RCU stall lead to hard lockups?
Date: Mon, 8 Jan 2018 20:24:25 -0800	[thread overview]
Message-ID: <20180109042425.GS9671@linux.vnet.ibm.com> (raw)
In-Reply-To: <20180109035207.GD3668920@devbig577.frc2.facebook.com>

On Mon, Jan 08, 2018 at 07:52:07PM -0800, Tejun Heo wrote:
> Hello, Paul.
> 
> So, I was looking at a machine which triggered crashdump from NMI hard
> lockup.  The dmesg was filled up with backtraces - many were stuck in
> reclaim path, which seems to be the culprit, the others were stuck in
> RCU path.  It looks like by the time crashdump was created, all CPUs
> got stuck in RCU path.
> 
> One of the CPUs looked like the following.
> 
>   PID: 4149739  TASK: ffff881a46baa880  CPU: 13  COMMAND: "CPUThreadPool8"
>    #0 [ffff881fff945e48] crash_nmi_callback at ffffffff8103f7d0
>    #1 [ffff881fff945e58] nmi_handle at ffffffff81020653
>    #2 [ffff881fff945eb0] default_do_nmi at ffffffff81020c36
>    #3 [ffff881fff945ed0] do_nmi at ffffffff81020d32
>    #4 [ffff881fff945ef0] end_repeat_nmi at ffffffff81956a7e
>       [exception RIP: io_serial_in+21]
>       RIP: ffffffff81630e55  RSP: ffff881fff943b88  RFLAGS: 00000002
>       RAX: 000000000000ca00  RBX: ffffffff8230e188  RCX: 0000000000000000
>       RDX: 00000000000002fd  RSI: 0000000000000005  RDI: ffffffff8230e188
>       RBP: ffff881fff943bb0   R8: 0000000000000000   R9: ffffffff820cb3c4
>       R10: 0000000000000019  R11: 0000000000002000  R12: 00000000000026e1
>       R13: 0000000000000020  R14: ffffffff820cd398  R15: 0000000000000035
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>   --- <NMI exception stack> ---
>    #5 [ffff881fff943b88] io_serial_in at ffffffff81630e55
>    #6 [ffff881fff943b90] wait_for_xmitr at ffffffff8163175c
>    #7 [ffff881fff943bb8] serial8250_console_putchar at ffffffff816317dc
>    #8 [ffff881fff943bd8] uart_console_write at ffffffff8162ac00
>    #9 [ffff881fff943c08] serial8250_console_write at ffffffff81634691
>   #10 [ffff881fff943c80] univ8250_console_write at ffffffff8162f7c2
>   #11 [ffff881fff943c90] console_unlock at ffffffff810dfc55
>   #12 [ffff881fff943cf0] vprintk_emit at ffffffff810dffb5
>   #13 [ffff881fff943d50] vprintk_default at ffffffff810e01bf
>   #14 [ffff881fff943d60] vprintk_func at ffffffff810e1127
>   #15 [ffff881fff943d70] printk at ffffffff8119a8a4
>   #16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c
>   #17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133
>   #18 [ffff881fff943ee8] update_process_times at ffffffff810f3497
>   #19 [ffff881fff943f10] tick_sched_timer at ffffffff81103037
>   #20 [ffff881fff943f38] __hrtimer_run_queues at ffffffff810f3f38
>   #21 [ffff881fff943f88] hrtimer_interrupt at ffffffff810f442b

You don't have a low-bandwidth console, do you?  (For example, the
output being directed to slow mass storage, across a slow network link,
or the like.)

> The CPU which triggered hardlockup:
> 
>   PID: 4149709  TASK: ffff88010f88c380  CPU: 26  COMMAND: "CPUThreadPool35"
>    #0 [ffff883fff1059d0] machine_kexec at ffffffff8104a874
>    #1 [ffff883fff105a30] __crash_kexec at ffffffff811116cc
>    #2 [ffff883fff105af0] __crash_kexec at ffffffff81111795
>    #3 [ffff883fff105b08] panic at ffffffff8119a6ae
>    #4 [ffff883fff105b98] watchdog_overflow_callback at ffffffff81135dbd
>    #5 [ffff883fff105bb0] __perf_event_overflow at ffffffff81186866
>    #6 [ffff883fff105be8] perf_event_overflow at ffffffff81192bc4
>    #7 [ffff883fff105bf8] intel_pmu_handle_irq at ffffffff8100b265
>    #8 [ffff883fff105df8] perf_event_nmi_handler at ffffffff8100489f
>    #9 [ffff883fff105e58] nmi_handle at ffffffff81020653
>   #10 [ffff883fff105eb0] default_do_nmi at ffffffff81020b94
>   #11 [ffff883fff105ed0] do_nmi at ffffffff81020d32
>   #12 [ffff883fff105ef0] end_repeat_nmi at ffffffff81956a7e
>       [exception RIP: queued_spin_lock_slowpath+248]
>       RIP: ffffffff810da958  RSP: ffff883fff103e68  RFLAGS: 00000046
>       RAX: 0000000000000000  RBX: 0000000000000046  RCX: 00000000006d0000
>       RDX: ffff883fff49a950  RSI: 0000000000d10101  RDI: ffffffff81e54300
>       RBP: ffff883fff103e80   R8: ffff883fff11a950   R9: 0000000000000000
>       R10: 000000000e5873ba  R11: 000000000000010f  R12: ffffffff81e54300
>       R13: 0000000000000000  R14: ffff88010f88c380  R15: ffffffff81e54300
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>   --- <NMI exception stack> ---
>   #13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958
>   #14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b
>   #15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18
>   #16 [ffff883fff103ee8] update_process_times at ffffffff810f3497
>   #17 [ffff883fff103f10] tick_sched_timer at ffffffff81103037
>   #18 [ffff883fff103f38] __hrtimer_run_queues at ffffffff810f3f38
>   #19 [ffff883fff103f88] hrtimer_interrupt at ffffffff810f442b
>   --- <IRQ stack> ---
>   bt: read error: kernel virtual address: ffffffffffffffd8  type: "pt_regs"

This might be unable to acquire the spinlock that CPU 13 was holding
while printing to the console.

> All other CPUs have something similar to the following:
> 
>   PID: 4150460  TASK: ffff88162c891b00  CPU: 8   COMMAND: "GlobalIOThreadP"
>    #0 [ffff881fff805e48] crash_nmi_callback at ffffffff8103f7d0
>    #1 [ffff881fff805e58] nmi_handle at ffffffff81020653
>    #2 [ffff881fff805eb0] default_do_nmi at ffffffff81020c36
>    #3 [ffff881fff805ed0] do_nmi at ffffffff81020d32
>    #4 [ffff881fff805ef0] end_repeat_nmi at ffffffff81956a7e
>       [exception RIP: queued_spin_lock_slowpath+248]
>       RIP: ffffffff810da958  RSP: ffff881fff803e68  RFLAGS: 00000046
>       RAX: 0000000000000000  RBX: 0000000000000046  RCX: 0000000000250000
>       RDX: ffff881fffb1a950  RSI: 00000000002d0101  RDI: ffffffff81e54300
>       RBP: ffff881fff803e80   R8: ffff881fff81a950   R9: 0000000000000000
>       R10: 000000000e5873ba  R11: ffff88207fffd000  R12: ffffffff81e54300
>       R13: 0000000000000000  R14: ffff88162c891b00  R15: ffffffff81e54300
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>   --- <NMI exception stack> ---
>    #5 [ffff881fff803e68] queued_spin_lock_slowpath at ffffffff810da958
>    #6 [ffff881fff803e70] _raw_spin_lock_irqsave at ffffffff8195550b
>    #7 [ffff881fff803e88] rcu_check_callbacks at ffffffff810eed18
>    #8 [ffff881fff803ee8] update_process_times at ffffffff810f3497
>    #9 [ffff881fff803f10] tick_sched_timer at ffffffff81103037
>   #10 [ffff881fff803f38] __hrtimer_run_queues at ffffffff810f3f38
>   #11 [ffff881fff803f88] hrtimer_interrupt at ffffffff810f442b
>   --- <IRQ stack> ---
> 
> I don't know the RCU code at all but it *looks* like the first CPU is
> taking a sweet while flushing printk buffer while holding a lock (the
> console is IPMI serial console, which faithfully emulates 115200 baud
> rate), and everyone else seems stuck waiting for that spinlock in
> rcu_check_callbacks().
> 
> Does this sound possible?

115200 baud?  Ouch!!!  That -will- result in trouble from console
printing, and often also in RCU CPU stall warnings.

							Thanx, Paul

  reply	other threads:[~2018-01-09  4:23 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-09  3:52 Can RCU stall lead to hard lockups? Tejun Heo
2018-01-09  4:24 ` Paul E. McKenney [this message]
2018-01-09 14:11   ` Tejun Heo
2018-01-09 15:22     ` Paul E. McKenney
2018-01-09 18:47       ` [PATCH] RCU: Call touch_nmi_watchdog() while printing stall warnings Tejun Heo
2018-01-09 18:52         ` [PATCH trivial-v2] " Tejun Heo
2018-01-09 20:05           ` Paul E. McKenney
2018-01-10  5:34         ` Sergey Senozhatsky
2018-01-10 14:57           ` Tejun Heo
2018-01-11  0:45             ` Sergey Senozhatsky
2018-02-02 23:44       ` Can RCU stall lead to hard lockups? Serge E. Hallyn
2018-02-03 20:50         ` Paul E. McKenney
2018-02-07  2:33           ` Serge E. Hallyn
2018-02-07  2:53             ` Paul E. McKenney
2018-02-07  2:55               ` Serge E. Hallyn
2018-02-07  3:16                 ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180109042425.GS9671@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.