* Can RCU stall lead to hard lockups?
From: Tejun Heo @ 2018-01-09 3:52 UTC
To: Paul E. McKenney; +Cc: linux-kernel, kernel-team
Hello, Paul.
So, I was looking at a machine which triggered a crashdump from an NMI
hard lockup. The dmesg was filled with backtraces - many were stuck in
the reclaim path, which seems to be the culprit, and the others were
stuck in the RCU path. It looks like by the time the crashdump was
created, all CPUs had gotten stuck in the RCU path.
One of the CPUs looked like the following.
PID: 4149739 TASK: ffff881a46baa880 CPU: 13 COMMAND: "CPUThreadPool8"
#0 [ffff881fff945e48] crash_nmi_callback at ffffffff8103f7d0
#1 [ffff881fff945e58] nmi_handle at ffffffff81020653
#2 [ffff881fff945eb0] default_do_nmi at ffffffff81020c36
#3 [ffff881fff945ed0] do_nmi at ffffffff81020d32
#4 [ffff881fff945ef0] end_repeat_nmi at ffffffff81956a7e
[exception RIP: io_serial_in+21]
RIP: ffffffff81630e55 RSP: ffff881fff943b88 RFLAGS: 00000002
RAX: 000000000000ca00 RBX: ffffffff8230e188 RCX: 0000000000000000
RDX: 00000000000002fd RSI: 0000000000000005 RDI: ffffffff8230e188
RBP: ffff881fff943bb0 R8: 0000000000000000 R9: ffffffff820cb3c4
R10: 0000000000000019 R11: 0000000000002000 R12: 00000000000026e1
R13: 0000000000000020 R14: ffffffff820cd398 R15: 0000000000000035
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
--- <NMI exception stack> ---
#5 [ffff881fff943b88] io_serial_in at ffffffff81630e55
#6 [ffff881fff943b90] wait_for_xmitr at ffffffff8163175c
#7 [ffff881fff943bb8] serial8250_console_putchar at ffffffff816317dc
#8 [ffff881fff943bd8] uart_console_write at ffffffff8162ac00
#9 [ffff881fff943c08] serial8250_console_write at ffffffff81634691
#10 [ffff881fff943c80] univ8250_console_write at ffffffff8162f7c2
#11 [ffff881fff943c90] console_unlock at ffffffff810dfc55
#12 [ffff881fff943cf0] vprintk_emit at ffffffff810dffb5
#13 [ffff881fff943d50] vprintk_default at ffffffff810e01bf
#14 [ffff881fff943d60] vprintk_func at ffffffff810e1127
#15 [ffff881fff943d70] printk at ffffffff8119a8a4
#16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c
#17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133
#18 [ffff881fff943ee8] update_process_times at ffffffff810f3497
#19 [ffff881fff943f10] tick_sched_timer at ffffffff81103037
#20 [ffff881fff943f38] __hrtimer_run_queues at ffffffff810f3f38
#21 [ffff881fff943f88] hrtimer_interrupt at ffffffff810f442b
The CPU which triggered the hardlockup:
PID: 4149709 TASK: ffff88010f88c380 CPU: 26 COMMAND: "CPUThreadPool35"
#0 [ffff883fff1059d0] machine_kexec at ffffffff8104a874
#1 [ffff883fff105a30] __crash_kexec at ffffffff811116cc
#2 [ffff883fff105af0] __crash_kexec at ffffffff81111795
#3 [ffff883fff105b08] panic at ffffffff8119a6ae
#4 [ffff883fff105b98] watchdog_overflow_callback at ffffffff81135dbd
#5 [ffff883fff105bb0] __perf_event_overflow at ffffffff81186866
#6 [ffff883fff105be8] perf_event_overflow at ffffffff81192bc4
#7 [ffff883fff105bf8] intel_pmu_handle_irq at ffffffff8100b265
#8 [ffff883fff105df8] perf_event_nmi_handler at ffffffff8100489f
#9 [ffff883fff105e58] nmi_handle at ffffffff81020653
#10 [ffff883fff105eb0] default_do_nmi at ffffffff81020b94
#11 [ffff883fff105ed0] do_nmi at ffffffff81020d32
#12 [ffff883fff105ef0] end_repeat_nmi at ffffffff81956a7e
[exception RIP: queued_spin_lock_slowpath+248]
RIP: ffffffff810da958 RSP: ffff883fff103e68 RFLAGS: 00000046
RAX: 0000000000000000 RBX: 0000000000000046 RCX: 00000000006d0000
RDX: ffff883fff49a950 RSI: 0000000000d10101 RDI: ffffffff81e54300
RBP: ffff883fff103e80 R8: ffff883fff11a950 R9: 0000000000000000
R10: 000000000e5873ba R11: 000000000000010f R12: ffffffff81e54300
R13: 0000000000000000 R14: ffff88010f88c380 R15: ffffffff81e54300
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958
#14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b
#15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18
#16 [ffff883fff103ee8] update_process_times at ffffffff810f3497
#17 [ffff883fff103f10] tick_sched_timer at ffffffff81103037
#18 [ffff883fff103f38] __hrtimer_run_queues at ffffffff810f3f38
#19 [ffff883fff103f88] hrtimer_interrupt at ffffffff810f442b
--- <IRQ stack> ---
bt: read error: kernel virtual address: ffffffffffffffd8 type: "pt_regs"
All other CPUs have something similar to the following:
PID: 4150460 TASK: ffff88162c891b00 CPU: 8 COMMAND: "GlobalIOThreadP"
#0 [ffff881fff805e48] crash_nmi_callback at ffffffff8103f7d0
#1 [ffff881fff805e58] nmi_handle at ffffffff81020653
#2 [ffff881fff805eb0] default_do_nmi at ffffffff81020c36
#3 [ffff881fff805ed0] do_nmi at ffffffff81020d32
#4 [ffff881fff805ef0] end_repeat_nmi at ffffffff81956a7e
[exception RIP: queued_spin_lock_slowpath+248]
RIP: ffffffff810da958 RSP: ffff881fff803e68 RFLAGS: 00000046
RAX: 0000000000000000 RBX: 0000000000000046 RCX: 0000000000250000
RDX: ffff881fffb1a950 RSI: 00000000002d0101 RDI: ffffffff81e54300
RBP: ffff881fff803e80 R8: ffff881fff81a950 R9: 0000000000000000
R10: 000000000e5873ba R11: ffff88207fffd000 R12: ffffffff81e54300
R13: 0000000000000000 R14: ffff88162c891b00 R15: ffffffff81e54300
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
--- <NMI exception stack> ---
#5 [ffff881fff803e68] queued_spin_lock_slowpath at ffffffff810da958
#6 [ffff881fff803e70] _raw_spin_lock_irqsave at ffffffff8195550b
#7 [ffff881fff803e88] rcu_check_callbacks at ffffffff810eed18
#8 [ffff881fff803ee8] update_process_times at ffffffff810f3497
#9 [ffff881fff803f10] tick_sched_timer at ffffffff81103037
#10 [ffff881fff803f38] __hrtimer_run_queues at ffffffff810f3f38
#11 [ffff881fff803f88] hrtimer_interrupt at ffffffff810f442b
--- <IRQ stack> ---
I don't know the RCU code at all but it *looks* like the first CPU is
taking a sweet while flushing the printk buffer while holding a lock
(the console is an IPMI serial console, which faithfully emulates a
115200 baud rate), and everyone else seems stuck waiting for that
spinlock in rcu_check_callbacks().
Does this sound possible?
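Back-of-envelope: at 115200 baud with 8N1 framing, the console moves
roughly 11,520 characters per second, i.e. ~7ms per 80-column line.
Assuming the spinlock really is held across the whole console flush,
draining even a 128KB log buffer full of backtraces at that rate takes
more than ten seconds - squarely in hardlockup watchdog territory.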
Thanks.
--
tejun
* Re: Can RCU stall lead to hard lockups?
From: Paul E. McKenney @ 2018-01-09 4:24 UTC
To: Tejun Heo; +Cc: linux-kernel, kernel-team

On Mon, Jan 08, 2018 at 07:52:07PM -0800, Tejun Heo wrote:
> Hello, Paul.
>
> So, I was looking at a machine which triggered a crashdump from an NMI
> hard lockup. [...]
>
> One of the CPUs looked like the following.
>
> PID: 4149739 TASK: ffff881a46baa880 CPU: 13 COMMAND: "CPUThreadPool8"
> [...]
> #15 [ffff881fff943d70] printk at ffffffff8119a8a4
> #16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c
> #17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133
> [...]
> #21 [ffff881fff943f88] hrtimer_interrupt at ffffffff810f442b

You don't have a low-bandwidth console, do you? (For example, the
output being directed to slow mass storage, across a slow network
link, or the like.)
> The CPU which triggered the hardlockup:
>
> PID: 4149709 TASK: ffff88010f88c380 CPU: 26 COMMAND: "CPUThreadPool35"
> [...]
> #13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958
> #14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b
> #15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18
> [...]
> --- <IRQ stack> ---
> bt: read error: kernel virtual address: ffffffffffffffd8 type: "pt_regs"

This might be unable to acquire the spinlock that CPU 13 was holding
while printing to the console.
> All other CPUs have something similar to the following:
>
> PID: 4150460 TASK: ffff88162c891b00 CPU: 8 COMMAND: "GlobalIOThreadP"
> [...]
> #5 [ffff881fff803e68] queued_spin_lock_slowpath at ffffffff810da958
> #6 [ffff881fff803e70] _raw_spin_lock_irqsave at ffffffff8195550b
> #7 [ffff881fff803e88] rcu_check_callbacks at ffffffff810eed18
> [...]
> --- <IRQ stack> ---
>
> I don't know the RCU code at all but it *looks* like the first CPU is
> taking a sweet while flushing the printk buffer while holding a lock
> (the console is an IPMI serial console, which faithfully emulates a
> 115200 baud rate), and everyone else seems stuck waiting for that
> spinlock in rcu_check_callbacks().
>
> Does this sound possible?

115200 baud? Ouch!!! That -will- result in trouble from console
printing, and often also in RCU CPU stall warnings.

							Thanx, Paul
* Re: Can RCU stall lead to hard lockups?
From: Tejun Heo @ 2018-01-09 14:11 UTC
To: Paul E. McKenney; +Cc: linux-kernel, kernel-team

Hello, Paul.

On Mon, Jan 08, 2018 at 08:24:25PM -0800, Paul E. McKenney wrote:
> > I don't know the RCU code at all but it *looks* like the first CPU is
> > taking a sweet while flushing the printk buffer while holding a lock
> > (the console is an IPMI serial console, which faithfully emulates a
> > 115200 baud rate), and everyone else seems stuck waiting for that
> > spinlock in rcu_check_callbacks().
> >
> > Does this sound possible?
>
> 115200 baud? Ouch!!! That -will- result in trouble from console
> printing, and often also in RCU CPU stall warnings.

It could even be slower than 115200, and we occasionally see RCU
stall warnings caused by printk storms, for example, while the kernel
is trying to dump a lot of info after an OOM. That's an issue we
probably want to improve from the printk side; however, they don't
usually lead to the NMI hard lockup detector kicking in and crashing
the machine, which is the peculiarity here.

Hmmm... show_state_filter(), the function which dumps all task
backtraces, shares a similar problem and avoids it by explicitly
calling touch_nmi_watchdog(). Maybe we can do something like the
following from RCU too?

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index db85ca3..3c4c4d3 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -561,8 +561,14 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
 	}
 	t = list_entry(rnp->gp_tasks->prev,
 		       struct task_struct, rcu_node_entry);
-	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry)
+	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
+		touch_nmi_watchdog();
+		/*
+		 * We could be printing a lot of these messages while
+		 * holding a spinlock. Avoid triggering hard lockup.
+		 */
 		sched_show_task(t);
+	}
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }

@@ -1678,6 +1684,12 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
 	char *ticks_title;
 	unsigned long ticks_value;

+	/*
+	 * We could be printing a lot of these messages while holding a
+	 * spinlock. Avoid triggering hard lockup.
+	 */
+	touch_nmi_watchdog();
+
 	if (rsp->gpnum == rdp->gpnum) {
 		ticks_title = "ticks this GP";
 		ticks_value = rdp->ticks_this_gp;
* Re: Can RCU stall lead to hard lockups?
From: Paul E. McKenney @ 2018-01-09 15:22 UTC
To: Tejun Heo; +Cc: linux-kernel, kernel-team

On Tue, Jan 09, 2018 at 06:11:14AM -0800, Tejun Heo wrote:
> Hello, Paul.
>
> [...]
>
> Hmmm... show_state_filter(), the function which dumps all task
> backtraces, shares a similar problem and avoids it by explicitly
> calling touch_nmi_watchdog(). Maybe we can do something like the
> following from RCU too?

If this fixes things for you, I would welcome such a patch.

							Thanx, Paul

> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> [...]
* [PATCH] RCU: Call touch_nmi_watchdog() while printing stall warnings
From: Tejun Heo @ 2018-01-09 18:47 UTC
To: Paul E. McKenney; +Cc: linux-kernel, kernel-team

From 45f8e12038020e44ef7d589de9247ad4fcd62f44 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Tue, 9 Jan 2018 10:38:17 -0800

When an RCU stall warning triggers, it can print out a lot of messages
while holding spinlocks. If the console device is slow (e.g. an
actual or IPMI serial console), it may end up triggering the NMI hard
lockup watchdog like the following.

*** CPU printking while holding RCU spinlock

PID: 4149739 TASK: ffff881a46baa880 CPU: 13 COMMAND: "CPUThreadPool8"
[...]
#16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c
#17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133
[...]

*** CPU triggering the hardlockup watchdog

PID: 4149709 TASK: ffff88010f88c380 CPU: 26 COMMAND: "CPUThreadPool35"
[...]
#13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958
#14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b
#15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18
[...]

Avoid spuriously triggering the NMI hardlockup watchdog by touching it
from the print functions. show_state_filter() shares the same problem
and solution.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/rcu/tree_plugin.h | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index db85ca3..a89d72a 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -561,8 +561,14 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
 	}
 	t = list_entry(rnp->gp_tasks->prev,
 		       struct task_struct, rcu_node_entry);
-	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry)
+	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
+		touch_nmi_watchdog();
+		/*
+		 * We could be printing a lot while holding a spinlock.
+		 * Avoid triggering hard lockup.
+		 */
 		sched_show_task(t);
+	}
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }

@@ -1678,6 +1684,12 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
 	char *ticks_title;
 	unsigned long ticks_value;

+	/*
+	 * We could be printing a lot while holding a spinlock. Avoid
+	 * triggering hard lockup.
+	 */
+	touch_nmi_watchdog();
+
 	if (rsp->gpnum == rdp->gpnum) {
 		ticks_title = "ticks this GP";
 		ticks_value = rdp->ticks_this_gp;
--
2.9.5
* [PATCH trivial-v2] RCU: Call touch_nmi_watchdog() while printing stall warnings
From: Tejun Heo @ 2018-01-09 18:52 UTC
To: Paul E. McKenney; +Cc: linux-kernel, kernel-team

From c1d1ae75ba4bde7c833560248b3f561085fa850e Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Tue, 9 Jan 2018 10:38:17 -0800
Subject: [PATCH] RCU: Call touch_nmi_watchdog() while printing stall warnings

When an RCU stall warning triggers, it can print out a lot of messages
while holding spinlocks. If the console device is slow (e.g. an
actual or IPMI serial console), it may end up triggering the NMI hard
lockup watchdog like the following.

*** CPU printking while holding RCU spinlock

PID: 4149739 TASK: ffff881a46baa880 CPU: 13 COMMAND: "CPUThreadPool8"
[...]
#16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c
#17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133
[...]

*** CPU triggering the hardlockup watchdog

PID: 4149709 TASK: ffff88010f88c380 CPU: 26 COMMAND: "CPUThreadPool35"
[...]
#13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958
#14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b
#15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18
[...]

Avoid spuriously triggering the NMI hardlockup watchdog by touching it
from the print functions. show_state_filter() shares the same problem
and solution.

v2: Relocate the comment to where it belongs.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/rcu/tree_plugin.h | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index db85ca3..bf05bf9 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -561,8 +561,14 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
 	}
 	t = list_entry(rnp->gp_tasks->prev,
 		       struct task_struct, rcu_node_entry);
-	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry)
+	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
+		/*
+		 * We could be printing a lot while holding a spinlock.
+		 * Avoid triggering hard lockup.
+		 */
+		touch_nmi_watchdog();
 		sched_show_task(t);
+	}
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }

@@ -1678,6 +1684,12 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
 	char *ticks_title;
 	unsigned long ticks_value;

+	/*
+	 * We could be printing a lot while holding a spinlock. Avoid
+	 * triggering hard lockup.
+	 */
+	touch_nmi_watchdog();
+
 	if (rsp->gpnum == rdp->gpnum) {
 		ticks_title = "ticks this GP";
 		ticks_value = rdp->ticks_this_gp;
--
2.9.5
* Re: [PATCH trivial-v2] RCU: Call touch_nmi_watchdog() while printing stall warnings
From: Paul E. McKenney @ 2018-01-09 20:05 UTC
To: Tejun Heo; +Cc: linux-kernel, kernel-team

On Tue, Jan 09, 2018 at 10:52:22AM -0800, Tejun Heo wrote:
> From c1d1ae75ba4bde7c833560248b3f561085fa850e Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Tue, 9 Jan 2018 10:38:17 -0800
> Subject: [PATCH] RCU: Call touch_nmi_watchdog() while printing stall warnings
>
> When an RCU stall warning triggers, it can print out a lot of messages
> while holding spinlocks. If the console device is slow (e.g. an
> actual or IPMI serial console), it may end up triggering the NMI hard
> lockup watchdog like the following.

Queued for testing and review, thank you!

							Thanx, Paul

> [...]
* Re: RCU: Call touch_nmi_watchdog() while printing stall warnings
From: Sergey Senozhatsky @ 2018-01-10 5:34 UTC
To: Tejun Heo; +Cc: Paul E. McKenney, linux-kernel, kernel-team

Hello,

On (01/09/18 10:47), Tejun Heo wrote:
> When an RCU stall warning triggers, it can print out a lot of messages
> while holding spinlocks. If the console device is slow (e.g. an
> actual or IPMI serial console), it may end up triggering the NMI hard
> lockup watchdog like the following.
>
> [...]

Tejun, while we are at it, shall we also touch the NMI watchdog in
show_workqueue_state()? I suppose that show_pwq() can be quite verbose
on some systems, can't it?

for instance: sysrq-t [irq] -> show_workqueue_state()

---

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9960f8842b31..649c00a9af12 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -48,6 +48,7 @@
 #include <linux/moduleparam.h>
 #include <linux/uaccess.h>
 #include <linux/sched/isolation.h>
+#include <linux/nmi.h>

 #include "workqueue_internal.h"

@@ -4473,6 +4474,8 @@ void show_workqueue_state(void)
 			if (pwq->nr_active || !list_empty(&pwq->delayed_works))
 				show_pwq(pwq);
 			spin_unlock_irqrestore(&pwq->pool->lock, flags);
+
+			touch_nmi_watchdog();
 		}
 	}
* Re: RCU: Call touch_nmi_watchdog() while printing stall warnings
From: Tejun Heo @ 2018-01-10 14:57 UTC
To: Sergey Senozhatsky; +Cc: Paul E. McKenney, linux-kernel, kernel-team

Hello,

On Wed, Jan 10, 2018 at 02:34:50PM +0900, Sergey Senozhatsky wrote:
> Tejun, while we are at it, shall we also touch the NMI watchdog in
> show_workqueue_state()? I suppose that show_pwq() can be quite verbose
> on some systems, can't it?
>
> for instance: sysrq-t [irq] -> show_workqueue_state()

Ah, it's less likely but definitely possible.

> @@ -4473,6 +4474,8 @@ void show_workqueue_state(void)
>  			if (pwq->nr_active || !list_empty(&pwq->delayed_works))
>  				show_pwq(pwq);
>  			spin_unlock_irqrestore(&pwq->pool->lock, flags);
> +
> +			touch_nmi_watchdog();

Can you send a patch w/ signed-off-by adding touch_nmi_watchdog() to
both outer loops in that function? Theoretically, we can have a lot
of pools too.

Thanks.

--
tejun
* Re: RCU: Call touch_nmi_watchdog() while printing stall warnings
From: Sergey Senozhatsky @ 2018-01-11 0:45 UTC
To: Tejun Heo; +Cc: Sergey Senozhatsky, Paul E. McKenney, linux-kernel, kernel-team

On (01/10/18 06:57), Tejun Heo wrote:
[..]
> Can you send a patch w/ signed-off-by adding touch_nmi_watchdog() to
> both outer loops in that function? Theoretically, we can have a lot
> of pools too.

thanks. will do.

	-ss
* Re: Can RCU stall lead to hard lockups?
From: Serge E. Hallyn @ 2018-02-02 23:44 UTC
To: Paul E. McKenney; +Cc: Tejun Heo, linux-kernel, kernel-team

Quoting Paul E. McKenney (paulmck@linux.vnet.ibm.com):
> On Tue, Jan 09, 2018 at 06:11:14AM -0800, Tejun Heo wrote:
> > Hmmm... show_state_filter(), the function which dumps all task
> > backtraces, shares a similar problem and avoids it by explicitly
> > calling touch_nmi_watchdog(). Maybe we can do something like the
> > following from RCU too?
>
> If this fixes things for you, I would welcome such a patch.

Hi - would this also be relevant to 4.9-stable and 4.4-stable, or
has something elsewhere changed after 4.9 that actually triggers this?

thanks,
-serge

> [...]
* Re: Can RCU stall lead to hard lockups?
From: Paul E. McKenney @ 2018-02-03 20:50 UTC
To: Serge E. Hallyn; +Cc: Tejun Heo, linux-kernel, kernel-team

On Fri, Feb 02, 2018 at 05:44:30PM -0600, Serge E. Hallyn wrote:
> Quoting Paul E. McKenney (paulmck@linux.vnet.ibm.com):
> > [...]
>
> Hi - would this also be relevant to 4.9-stable and 4.4-stable, or
> has something elsewhere changed after 4.9 that actually triggers this?

As far as I can tell, slow console lines have been prone to RCU CPU
stall warnings for a very long time.

							Thanx, Paul
* Re: Can RCU stall lead to hard lockups?
From: Serge E. Hallyn @ 2018-02-07 2:33 UTC
To: Paul E. McKenney; +Cc: Serge E. Hallyn, Tejun Heo, linux-kernel, kernel-team

On Sat, Feb 03, 2018 at 12:50:32PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 02, 2018 at 05:44:30PM -0600, Serge E. Hallyn wrote:
> > Hi - would this also be relevant to 4.9-stable and 4.4-stable, or
> > has something elsewhere changed after 4.9 that actually triggers this?
>
> As far as I can tell, slow console lines have been prone to RCU CPU
> stall warnings for a very long time.

Ok, thanks Paul.

Tejun, were you going to push this?
* Re: Can RCU stall lead to hard lockups?
2018-02-07 2:33 ` Serge E. Hallyn
@ 2018-02-07 2:53 ` Paul E. McKenney
2018-02-07 2:55 ` Serge E. Hallyn
0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2018-02-07 2:53 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: Tejun Heo, linux-kernel, kernel-team

On Tue, Feb 06, 2018 at 08:33:03PM -0600, Serge E. Hallyn wrote:
> On Sat, Feb 03, 2018 at 12:50:32PM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 02, 2018 at 05:44:30PM -0600, Serge E. Hallyn wrote:
> > > Quoting Paul E. McKenney (paulmck@linux.vnet.ibm.com):
> > > > On Tue, Jan 09, 2018 at 06:11:14AM -0800, Tejun Heo wrote:
> > > > > Hello, Paul.
> > > > >
> > > > > On Mon, Jan 08, 2018 at 08:24:25PM -0800, Paul E. McKenney wrote:
> > > > > > > I don't know the RCU code at all but it *looks* like the first CPU is
> > > > > > > taking a sweet while flushing the printk buffer while holding a lock (the
> > > > > > > console is an IPMI serial console, which faithfully emulates a 115200 baud
> > > > > > > rate), and everyone else seems stuck waiting for that spinlock in
> > > > > > > rcu_check_callbacks().
> > > > > > >
> > > > > > > Does this sound possible?
> > > > > >
> > > > > > 115200 baud?  Ouch!!!  That -will- result in trouble from console
> > > > > > printing, and often also in RCU CPU stall warnings.
> > > > >
> > > > > It could even be slower than 115200, and we occasionally see RCU
> > > > > stall warnings caused by printk storms, for example, while the kernel
> > > > > is trying to dump a lot of info after an OOM.  That's an issue we
> > > > > probably want to improve from the printk side; however, such storms
> > > > > don't usually lead to the NMI hard lockup detector kicking in and
> > > > > crashing the machine, which is the peculiarity here.
> > > > >
> > > > > Hmmm... show_state_filter(), the function which dumps all task
> > > > > backtraces, shares a similar problem and avoids it by explicitly
> > > > > calling touch_nmi_watchdog().  Maybe we can do something like the
> > > > > following from RCU too?
> > > >
> > > > If this fixes things for you, I would welcome such a patch.
> > >
> > > Hi - would this also be relevant to 4.9-stable and 4.4-stable, or
> > > has something elsewhere changed after 4.9 that actually triggers this?
> >
> > As far as I can tell, slow consoles have been prone to RCU CPU stall
> > warnings for a very long time.
>
> Ok, thanks Paul.
>
> Tejun, were you going to push this?

I have it queued for the next merge window: 3eea9623926f ("rcu: Call
touch_nmi_watchdog() while printing stall warnings") in -rcu.

							Thanx, Paul

> > > thanks,
> > > -serge
> > >
> > > > 							Thanx, Paul
> > > >
> > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > index db85ca3..3c4c4d3 100644
> > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > @@ -561,8 +561,14 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
> > > > > 	}
> > > > > 	t = list_entry(rnp->gp_tasks->prev,
> > > > > 		       struct task_struct, rcu_node_entry);
> > > > > -	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry)
> > > > > +	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
> > > > > +		touch_nmi_watchdog();
> > > > > +		/*
> > > > > +		 * We could be printing a lot of these messages while
> > > > > +		 * holding a spinlock.  Avoid triggering hard lockup.
> > > > > +		 */
> > > > > 		sched_show_task(t);
> > > > > +	}
> > > > > 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > > > > }
> > > > >
> > > > > @@ -1678,6 +1684,12 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
> > > > > 	char *ticks_title;
> > > > > 	unsigned long ticks_value;
> > > > >
> > > > > +	/*
> > > > > +	 * We could be printing a lot of these messages while holding a
> > > > > +	 * spinlock.  Avoid triggering hard lockup.
> > > > > +	 */
> > > > > +	touch_nmi_watchdog();
> > > > > +
> > > > > 	if (rsp->gpnum == rdp->gpnum) {
> > > > > 		ticks_title = "ticks this GP";
> > > > > 		ticks_value = rdp->ticks_this_gp;
> > > > >

^ permalink raw reply	[flat|nested] 16+ messages in thread
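A quick back-of-the-envelope calculation shows why a 115200-baud serial console collides with the hard-lockup detector, whose detection window is on the order of ten seconds (watchdog_thresh) by default in kernels of this vintage. With 8N1 framing each character costs 10 bits, so the line moves roughly 11,520 characters per second. The 120 KiB dump size below is an assumed figure for illustration, not a number from the thread:

#include <stdio.h>

/* Back-of-the-envelope: how long does a big console flush take? */
int main(void)
{
	const double baud = 115200.0;       /* line rate, bits per second */
	const double bits_per_char = 10.0;  /* 8N1: start + 8 data + stop bit */
	const double chars_per_sec = baud / bits_per_char;

	/* Assumed size of a large stall/OOM dump; pick your own number. */
	const double dump_bytes = 120.0 * 1024.0;

	printf("console throughput: %.0f chars/s\n", chars_per_sec);
	printf("time to flush %.0f KiB: %.1f s\n",
	       dump_bytes / 1024.0, dump_bytes / chars_per_sec);
	return 0;
}

At that rate a dump of that size holds the console for roughly 10.7 seconds, longer than the detection window. If the flushing CPU sits inside a timer interrupt the whole time, as in the crashdump backtrace that opened this thread, the NMI watchdog sees no progress at all on that CPU.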
* Re: Can RCU stall lead to hard lockups?
2018-02-07 2:53 ` Paul E. McKenney
@ 2018-02-07 2:55 ` Serge E. Hallyn
2018-02-07 3:16 ` Paul E. McKenney
0 siblings, 1 reply; 16+ messages in thread
From: Serge E. Hallyn @ 2018-02-07 2:55 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Serge E. Hallyn, Tejun Heo, linux-kernel, kernel-team

On Tue, Feb 06, 2018 at 06:53:37PM -0800, Paul E. McKenney wrote:
> On Tue, Feb 06, 2018 at 08:33:03PM -0600, Serge E. Hallyn wrote:
> > On Sat, Feb 03, 2018 at 12:50:32PM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 02, 2018 at 05:44:30PM -0600, Serge E. Hallyn wrote:
> > > > Quoting Paul E. McKenney (paulmck@linux.vnet.ibm.com):
> > > > > On Tue, Jan 09, 2018 at 06:11:14AM -0800, Tejun Heo wrote:
> > > > > > Hello, Paul.
> > > > > >
> > > > > > On Mon, Jan 08, 2018 at 08:24:25PM -0800, Paul E. McKenney wrote:
> > > > > > > > I don't know the RCU code at all but it *looks* like the first CPU is
> > > > > > > > taking a sweet while flushing the printk buffer while holding a lock (the
> > > > > > > > console is an IPMI serial console, which faithfully emulates a 115200 baud
> > > > > > > > rate), and everyone else seems stuck waiting for that spinlock in
> > > > > > > > rcu_check_callbacks().
> > > > > > > >
> > > > > > > > Does this sound possible?
> > > > > > >
> > > > > > > 115200 baud?  Ouch!!!  That -will- result in trouble from console
> > > > > > > printing, and often also in RCU CPU stall warnings.
> > > > > >
> > > > > > It could even be slower than 115200, and we occasionally see RCU
> > > > > > stall warnings caused by printk storms, for example, while the kernel
> > > > > > is trying to dump a lot of info after an OOM.  That's an issue we
> > > > > > probably want to improve from the printk side; however, such storms
> > > > > > don't usually lead to the NMI hard lockup detector kicking in and
> > > > > > crashing the machine, which is the peculiarity here.
> > > > > >
> > > > > > Hmmm... show_state_filter(), the function which dumps all task
> > > > > > backtraces, shares a similar problem and avoids it by explicitly
> > > > > > calling touch_nmi_watchdog().  Maybe we can do something like the
> > > > > > following from RCU too?
> > > > >
> > > > > If this fixes things for you, I would welcome such a patch.
> > > >
> > > > Hi - would this also be relevant to 4.9-stable and 4.4-stable, or
> > > > has something elsewhere changed after 4.9 that actually triggers this?
> > >
> > > As far as I can tell, slow consoles have been prone to RCU CPU stall
> > > warnings for a very long time.
> >
> > Ok, thanks Paul.
> >
> > Tejun, were you going to push this?
>
> I have it queued for the next merge window: 3eea9623926f ("rcu: Call
> touch_nmi_watchdog() while printing stall warnings") in -rcu.

D'oh - thanks!

-serge

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Can RCU stall lead to hard lockups?
2018-02-07 2:55 ` Serge E. Hallyn
@ 2018-02-07 3:16 ` Paul E. McKenney
0 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2018-02-07 3:16 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: Tejun Heo, linux-kernel, kernel-team

On Tue, Feb 06, 2018 at 08:55:04PM -0600, Serge E. Hallyn wrote:
> On Tue, Feb 06, 2018 at 06:53:37PM -0800, Paul E. McKenney wrote:
> > On Tue, Feb 06, 2018 at 08:33:03PM -0600, Serge E. Hallyn wrote:
> > > On Sat, Feb 03, 2018 at 12:50:32PM -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 02, 2018 at 05:44:30PM -0600, Serge E. Hallyn wrote:
> > > > > Quoting Paul E. McKenney (paulmck@linux.vnet.ibm.com):
> > > > > > On Tue, Jan 09, 2018 at 06:11:14AM -0800, Tejun Heo wrote:
> > > > > > > Hello, Paul.
> > > > > > >
> > > > > > > On Mon, Jan 08, 2018 at 08:24:25PM -0800, Paul E. McKenney wrote:
> > > > > > > > > I don't know the RCU code at all but it *looks* like the first CPU is
> > > > > > > > > taking a sweet while flushing the printk buffer while holding a lock (the
> > > > > > > > > console is an IPMI serial console, which faithfully emulates a 115200 baud
> > > > > > > > > rate), and everyone else seems stuck waiting for that spinlock in
> > > > > > > > > rcu_check_callbacks().
> > > > > > > > >
> > > > > > > > > Does this sound possible?
> > > > > > > >
> > > > > > > > 115200 baud?  Ouch!!!  That -will- result in trouble from console
> > > > > > > > printing, and often also in RCU CPU stall warnings.
> > > > > > >
> > > > > > > It could even be slower than 115200, and we occasionally see RCU
> > > > > > > stall warnings caused by printk storms, for example, while the kernel
> > > > > > > is trying to dump a lot of info after an OOM.  That's an issue we
> > > > > > > probably want to improve from the printk side; however, such storms
> > > > > > > don't usually lead to the NMI hard lockup detector kicking in and
> > > > > > > crashing the machine, which is the peculiarity here.
> > > > > > >
> > > > > > > Hmmm... show_state_filter(), the function which dumps all task
> > > > > > > backtraces, shares a similar problem and avoids it by explicitly
> > > > > > > calling touch_nmi_watchdog().  Maybe we can do something like the
> > > > > > > following from RCU too?
> > > > > >
> > > > > > If this fixes things for you, I would welcome such a patch.
> > > > >
> > > > > Hi - would this also be relevant to 4.9-stable and 4.4-stable, or
> > > > > has something elsewhere changed after 4.9 that actually triggers this?
> > > >
> > > > As far as I can tell, slow consoles have been prone to RCU CPU stall
> > > > warnings for a very long time.
> > >
> > > Ok, thanks Paul.
> > >
> > > Tejun, were you going to push this?
> >
> > I have it queued for the next merge window: 3eea9623926f ("rcu: Call
> > touch_nmi_watchdog() while printing stall warnings") in -rcu.
>
> D'oh - thanks!

Not a problem at all!  Had I lost this commit, it would not have been
the first time.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 16+ messages in thread
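As a closing note on why touch_nmi_watchdog() is the right lever here: the hard-lockup detector of that era arms a perf NMI event on each CPU and, at each NMI, checks whether that CPU's timer-interrupt count has advanced. A CPU that spends the whole window flushing a serial console from inside an hrtimer interrupt shows no advance and gets declared hard-locked. touch_nmi_watchdog() sets a per-CPU flag that turns the next NMI check into a no-op, so calling it on every loop iteration keeps the detector quiet exactly as long as the loop keeps making progress. The sketch below is heavily simplified from kernel/watchdog.c of that era; names and signatures are approximate, not verbatim:

/* Heavily simplified sketch of the v4.x hard-lockup detector. */
#include <linux/percpu.h>
#include <linux/kernel.h>
#include <linux/smp.h>

static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);

void touch_nmi_watchdog(void)
{
	/* Grant this CPU a free pass at its next NMI check. */
	raw_cpu_write(watchdog_nmi_touch, true);
}

/* Runs in NMI context roughly every watchdog_thresh seconds. */
static void watchdog_overflow_callback(void)
{
	if (__this_cpu_read(watchdog_nmi_touch)) {
		__this_cpu_write(watchdog_nmi_touch, false);
		return;
	}

	/*
	 * If the per-CPU timer-interrupt count has not moved since the
	 * previous NMI, interrupts have been off for a full period:
	 * declare a hard lockup.
	 */
	if (__this_cpu_read(hrtimer_interrupts) ==
	    __this_cpu_read(hrtimer_interrupts_saved))
		panic("Watchdog detected hard LOCKUP on cpu %d",
		      smp_processor_id());

	__this_cpu_write(hrtimer_interrupts_saved,
			 __this_cpu_read(hrtimer_interrupts));
}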
end of thread, other threads:[~2018-02-07 3:16 UTC | newest]

Thread overview: 16+ messages:
2018-01-09 3:52 Can RCU stall lead to hard lockups? Tejun Heo
2018-01-09 4:24 ` Paul E. McKenney
2018-01-09 14:11 ` Tejun Heo
2018-01-09 15:22 ` Paul E. McKenney
2018-01-09 18:47 ` [PATCH] RCU: Call touch_nmi_watchdog() while printing stall warnings Tejun Heo
2018-01-09 18:52 ` [PATCH trivial-v2] " Tejun Heo
2018-01-09 20:05 ` Paul E. McKenney
2018-01-10 5:34 ` Sergey Senozhatsky
2018-01-10 14:57 ` Tejun Heo
2018-01-11 0:45 ` Sergey Senozhatsky
2018-02-02 23:44 ` Can RCU stall lead to hard lockups? Serge E. Hallyn
2018-02-03 20:50 ` Paul E. McKenney
2018-02-07 2:33 ` Serge E. Hallyn
2018-02-07 2:53 ` Paul E. McKenney
2018-02-07 2:55 ` Serge E. Hallyn
2018-02-07 3:16 ` Paul E. McKenney