* SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
@ 2024-08-27 20:01 Jon Kohler
2024-08-27 20:34 ` Paul E. McKenney
2024-09-04 9:19 ` Jan Kara
0 siblings, 2 replies; 12+ messages in thread
From: Jon Kohler @ 2024-08-27 20:01 UTC (permalink / raw)
To: paulmck@kernel.org, rcu@vger.kernel.org, jiangshanlai@gmail.com,
josh@joshtriplett.org
Cc: jack@suse.cz, linux-fsdevel@vger.kernel.org
Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
Reaching out about a tricky hung task issue that I'm running into. I've
got a virtualized Linux guest on top of a KVM based platform, running
a 5.10.y based kernel. The issue we're running into is a hung task that
*only* happens on shutdown/reboot of this particular VM once every
20-50 times.
The signature of the hung task is always similar to the output below,
where we appear to hang on the call to
synchronize_srcu(&fsnotify_mark_srcu)
in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
where two kernel threads are both calling synchronize_srcu, then
scheduling out in wait_for_completion, and completely going out to
lunch for over 4 minutes. This then triggers the hung task timeout and
things blow up.
We are running audit=1 for this system and are using an el8 based
userspace.
I've flipped through the fs/notify code base for both 5.10 as well as
upstream mainline to see if something jumped off the page, and I
haven't yet spotted any particular suspect code from the caller side.
This hang appears to come up at the very end of the shutdown/reboot
process, seemingly after the system starts to unwind through initrd.
What I'm working on now is adding some instrumentation to the dracut
shutdown initrd scripts to see if I can how far we get down that path
before the system fails to make forward progress, which may give some
hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
a plethora of DEBUG options [2], and didn't get anything interesting.
To be clear, we haven't seen lockdep spit out any complaints as of yet.
Reaching out to see if this sounds familar to anyone on the list, or if
there are any particular areas of the RCU code base that might be
suspect for this kind of issue. I'm happy to provide more information,
as frankly, I'm quite stumped at the moment.
Thanks all,
Jon
[1] panic trace
Normal shutdown process, then hangs on the following:
...
dracut Warning: Killing all remaining processes
...
INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds.
Tainted: G O 5.10.205-2.el8.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u20:7 state:D stack: 0 pid:1200701 ppid: 2 flags:0x00004080
Workqueue: events_unbound fsnotify_connector_destroy_workfn
Call Trace:
__schedule+0x267/0x790
schedule+0x3c/0xb0
schedule_timeout+0x219/0x2b0
wait_for_completion+0x9e/0x100
__synchronize_srcu.part.24+0x83/0xb0
? __bpf_trace_rcu_utilization+0x10/0x10
? synchronize_srcu+0x5d/0xf0
fsnotify_connector_destroy_workfn+0x46/0x80
process_one_work+0x1fc/0x390
worker_thread+0x2d/0x3e0
? process_one_work+0x390/0x390
kthread+0x114/0x130
? kthread_park+0x80/0x80
ret_from_fork+0x1f/0x30
INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds.
Tainted: G O 5.10.205-2.el8.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u20:8 state:D stack: 0 pid:1287360 ppid: 2 flags:0x00004080
Workqueue: events_unbound fsnotify_mark_destroy_workfn
Call Trace:
__schedule+0x267/0x790
schedule+0x3c/0xb0
schedule_timeout+0x219/0x2b0
? add_timer+0x14a/0x200
wait_for_completion+0x9e/0x100
__synchronize_srcu.part.24+0x83/0xb0
? __bpf_trace_rcu_utilization+0x10/0x10
fsnotify_mark_destroy_workfn+0x77/0xe0
process_one_work+0x1fc/0x390
? process_one_work+0x390/0x390
worker_thread+0x2d/0x3e0
? process_one_work+0x390/0x390
kthread+0x114/0x130
? kthread_park+0x80/0x80
ret_from_fork+0x1f/0x30
Kernel panic - not syncing: hung_task: blocked tasks
CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G O 5.10.205-2.el8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014
Call Trace:
dump_stack+0x6d/0x8c
panic+0x114/0x2ea
watchdog.cold.8+0xb5/0xb5
? hungtask_pm_notify+0x50/0x50
kthread+0x114/0x130
? kthread_park+0x80/0x80
ret_from_fork+0x1f/0x30
[2] additional debugging config knobs turned up.
CONFIG_PROVE_LOCKING=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_LOCKDEP_BITS=15
CONFIG_LOCKDEP_CHAINS_BITS=16
CONFIG_LOCKDEP_STACK_TRACE_BITS=19
CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
CONFIG_DEBUG_SHIRQ=y
CONFIG_WQ_WATCHDOG=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_PLIST=y
CONFIG_DEBUG_SG=y
CONFIG_DEBUG_NOTIFIERS=y
CONFIG_BUG_ON_DATA_CORRUPTION=y
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) 2024-08-27 20:01 SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) Jon Kohler @ 2024-08-27 20:34 ` Paul E. McKenney 2024-08-27 23:33 ` Jon Kohler 2024-09-04 9:19 ` Jan Kara 1 sibling, 1 reply; 12+ messages in thread From: Paul E. McKenney @ 2024-08-27 20:34 UTC (permalink / raw) To: Jon Kohler Cc: rcu@vger.kernel.org, jiangshanlai@gmail.com, josh@joshtriplett.org, jack@suse.cz, linux-fsdevel@vger.kernel.org, Z qiang On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote: > Hey Paul, Lai, Josh, and the RCU list and Jan/FS list - > Reaching out about a tricky hung task issue that I'm running into. I've > got a virtualized Linux guest on top of a KVM based platform, running > a 5.10.y based kernel. The issue we're running into is a hung task that > *only* happens on shutdown/reboot of this particular VM once every > 20-50 times. > > The signature of the hung task is always similar to the output below, > where we appear to hang on the call to > synchronize_srcu(&fsnotify_mark_srcu) One thing to try would be to add trace_printk() or similar to the SRCU readers, just in case someone was using srcu_read_lock_notrace() on fsnotify_mark_srcu, which I see no trace of in current mainline. Alternatively, if there is a version where this does not happen, try bisecting. Each bisection step would require something like 400-500 shutdown/reboots to prove the commit good. (Obviously, the first failure proves the commit bad, which for one-out-of-50 failures will take on average about 35 shutdown/reboots.) There could also be a bad SRCU backport from mainline, so please check what SRCU backports you have in your 5.10.y stable release. (Though maybe Jack has already done this?) Thanx, Paul > in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn, > where two kernel threads are both calling synchronize_srcu, then > scheduling out in wait_for_completion, and completely going out to > lunch for over 4 minutes. This then triggers the hung task timeout and > things blow up. > > We are running audit=1 for this system and are using an el8 based > userspace. > > I've flipped through the fs/notify code base for both 5.10 as well as > upstream mainline to see if something jumped off the page, and I > haven't yet spotted any particular suspect code from the caller side. > > This hang appears to come up at the very end of the shutdown/reboot > process, seemingly after the system starts to unwind through initrd. > > What I'm working on now is adding some instrumentation to the dracut > shutdown initrd scripts to see if I can how far we get down that path > before the system fails to make forward progress, which may give some > hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and > a plethora of DEBUG options [2], and didn't get anything interesting. > To be clear, we haven't seen lockdep spit out any complaints as of yet. > > Reaching out to see if this sounds familar to anyone on the list, or if > there are any particular areas of the RCU code base that might be > suspect for this kind of issue. I'm happy to provide more information, > as frankly, I'm quite stumped at the moment. > > Thanks all, > Jon > > [1] panic trace > Normal shutdown process, then hangs on the following: > ... > dracut Warning: Killing all remaining processes > ... > INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds. > Tainted: G O 5.10.205-2.el8.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:kworker/u20:7 state:D stack: 0 pid:1200701 ppid: 2 flags:0x00004080 > Workqueue: events_unbound fsnotify_connector_destroy_workfn > Call Trace: > __schedule+0x267/0x790 > schedule+0x3c/0xb0 > schedule_timeout+0x219/0x2b0 > wait_for_completion+0x9e/0x100 > __synchronize_srcu.part.24+0x83/0xb0 > ? __bpf_trace_rcu_utilization+0x10/0x10 > ? synchronize_srcu+0x5d/0xf0 > fsnotify_connector_destroy_workfn+0x46/0x80 > process_one_work+0x1fc/0x390 > worker_thread+0x2d/0x3e0 > ? process_one_work+0x390/0x390 > kthread+0x114/0x130 > ? kthread_park+0x80/0x80 > ret_from_fork+0x1f/0x30 > INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds. > Tainted: G O 5.10.205-2.el8.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:kworker/u20:8 state:D stack: 0 pid:1287360 ppid: 2 flags:0x00004080 > Workqueue: events_unbound fsnotify_mark_destroy_workfn > Call Trace: > __schedule+0x267/0x790 > schedule+0x3c/0xb0 > schedule_timeout+0x219/0x2b0 > ? add_timer+0x14a/0x200 > wait_for_completion+0x9e/0x100 > __synchronize_srcu.part.24+0x83/0xb0 > ? __bpf_trace_rcu_utilization+0x10/0x10 > fsnotify_mark_destroy_workfn+0x77/0xe0 > process_one_work+0x1fc/0x390 > ? process_one_work+0x390/0x390 > worker_thread+0x2d/0x3e0 > ? process_one_work+0x390/0x390 > kthread+0x114/0x130 > ? kthread_park+0x80/0x80 > ret_from_fork+0x1f/0x30 > Kernel panic - not syncing: hung_task: blocked tasks > CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G O 5.10.205-2.el8.x86_64 #1 > Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014 > Call Trace: > dump_stack+0x6d/0x8c > panic+0x114/0x2ea > watchdog.cold.8+0xb5/0xb5 > ? hungtask_pm_notify+0x50/0x50 > kthread+0x114/0x130 > ? kthread_park+0x80/0x80 > ret_from_fork+0x1f/0x30 > > [2] additional debugging config knobs turned up. > CONFIG_PROVE_LOCKING=y > CONFIG_LOCK_STAT=y > CONFIG_DEBUG_RT_MUTEXES=y > CONFIG_DEBUG_SPINLOCK=y > CONFIG_DEBUG_MUTEXES=y > CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y > CONFIG_DEBUG_RWSEMS=y > CONFIG_DEBUG_LOCK_ALLOC=y > CONFIG_LOCKDEP=y > CONFIG_LOCKDEP_BITS=15 > CONFIG_LOCKDEP_CHAINS_BITS=16 > CONFIG_LOCKDEP_STACK_TRACE_BITS=19 > CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14 > CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12 > CONFIG_DEBUG_SHIRQ=y > CONFIG_WQ_WATCHDOG=y > CONFIG_DEBUG_ATOMIC_SLEEP=y > CONFIG_DEBUG_LIST=y > CONFIG_DEBUG_PLIST=y > CONFIG_DEBUG_SG=y > CONFIG_DEBUG_NOTIFIERS=y > CONFIG_BUG_ON_DATA_CORRUPTION=y ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) 2024-08-27 20:34 ` Paul E. McKenney @ 2024-08-27 23:33 ` Jon Kohler 2024-08-28 1:21 ` Paul E. McKenney 2024-08-28 2:33 ` Neeraj Upadhyay 0 siblings, 2 replies; 12+ messages in thread From: Jon Kohler @ 2024-08-27 23:33 UTC (permalink / raw) To: paulmck@kernel.org Cc: rcu@vger.kernel.org, jiangshanlai@gmail.com, josh@joshtriplett.org, jack@suse.cz, linux-fsdevel@vger.kernel.org, Z qiang > On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote: >> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list - >> Reaching out about a tricky hung task issue that I'm running into. I've >> got a virtualized Linux guest on top of a KVM based platform, running >> a 5.10.y based kernel. The issue we're running into is a hung task that >> *only* happens on shutdown/reboot of this particular VM once every >> 20-50 times. >> >> The signature of the hung task is always similar to the output below, >> where we appear to hang on the call to >> synchronize_srcu(&fsnotify_mark_srcu) > > One thing to try would be to add trace_printk() or similar to the SRCU > readers, just in case someone was using srcu_read_lock_notrace() on > fsnotify_mark_srcu, which I see no trace of in current mainline. > > Alternatively, if there is a version where this does not happen, try > bisecting. Each bisection step would require something like 400-500 > shutdown/reboots to prove the commit good. (Obviously, the first failure > proves the commit bad, which for one-out-of-50 failures will take on > average about 35 shutdown/reboots.) > > There could also be a bad SRCU backport from mainline, so please check > what SRCU backports you have in your 5.10.y stable release. (Though > maybe Jack has already done this?) > > Thanx, Paul Thanks, Paul For posterity, this kernel is just built off of the regular ole stable tree, and here’s All of the backports to kernel/rcu are below. Stepping through this more, since we’re stalling at wait_for_completion, that must mean that wakeme_after_rcu() ... complete(&rcu->completion) is not happening, right? That or somehow wakeme_after_rcu() fires *before* wait_for_completion() is setup, causing the wait to hang forever? Is it possible for wakeme_after_rcu() to race, such that it fires somewhere else *before* wait_for_completion gets all the way to schedule()? [rcu]$ pwd /kernel/kernel/rcu [rcu]$ git remote -v origin https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (fetch) origin https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (push) [rcu]$ git log --oneline 2c85ebc57b3e..HEAD . ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>> ... 175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>> 55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle() 066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section 3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs 9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU 4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations 3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races 058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races 604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup() ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle() b62c816bdb5e rcuscale: Always log error message 8cd9917c13a7 rcuscale: Console output claims too few grace periods 7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle() eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug 2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait() 1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes() ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose 7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release 5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state() 0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE() 36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure 10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK 1c6c3f233664 rcu-tasks: Fix race in schedule and flush work a22d66eb518f rcu: Apply callbacks processing time limit only on softirq 40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched() fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state 0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue 657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs 12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks 0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass 70692b06208c rcu: Mark accesses to rcu_state.n_force_qs af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}() 226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup() 02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop 7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE 497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees 527b56d7856f rcu: Fix to include first blocked task in stall warning 4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods 450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter 641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period 86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader() 55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader() 35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives 23597afbe096 srcu: Fix broken node geometry after early ssp init 728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread() 7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter() 09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check 20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers 30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall() 9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context 5cacd18c5207 rcu,ftrace: Fix ftrace recursion 4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>> ... 2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> > >> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn, >> where two kernel threads are both calling synchronize_srcu, then >> scheduling out in wait_for_completion, and completely going out to >> lunch for over 4 minutes. This then triggers the hung task timeout and >> things blow up. >> >> We are running audit=1 for this system and are using an el8 based >> userspace. >> >> I've flipped through the fs/notify code base for both 5.10 as well as >> upstream mainline to see if something jumped off the page, and I >> haven't yet spotted any particular suspect code from the caller side. >> >> This hang appears to come up at the very end of the shutdown/reboot >> process, seemingly after the system starts to unwind through initrd. >> >> What I'm working on now is adding some instrumentation to the dracut >> shutdown initrd scripts to see if I can how far we get down that path >> before the system fails to make forward progress, which may give some >> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and >> a plethora of DEBUG options [2], and didn't get anything interesting. >> To be clear, we haven't seen lockdep spit out any complaints as of yet. >> >> Reaching out to see if this sounds familar to anyone on the list, or if >> there are any particular areas of the RCU code base that might be >> suspect for this kind of issue. I'm happy to provide more information, >> as frankly, I'm quite stumped at the moment. >> >> Thanks all, >> Jon >> >> [1] panic trace >> Normal shutdown process, then hangs on the following: >> ... >> dracut Warning: Killing all remaining processes >> ... >> INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds. >> Tainted: G O 5.10.205-2.el8.x86_64 #1 >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> task:kworker/u20:7 state:D stack: 0 pid:1200701 ppid: 2 flags:0x00004080 >> Workqueue: events_unbound fsnotify_connector_destroy_workfn >> Call Trace: >> __schedule+0x267/0x790 >> schedule+0x3c/0xb0 >> schedule_timeout+0x219/0x2b0 >> wait_for_completion+0x9e/0x100 >> __synchronize_srcu.part.24+0x83/0xb0 >> ? __bpf_trace_rcu_utilization+0x10/0x10 >> ? synchronize_srcu+0x5d/0xf0 >> fsnotify_connector_destroy_workfn+0x46/0x80 >> process_one_work+0x1fc/0x390 >> worker_thread+0x2d/0x3e0 >> ? process_one_work+0x390/0x390 >> kthread+0x114/0x130 >> ? kthread_park+0x80/0x80 >> ret_from_fork+0x1f/0x30 >> INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds. >> Tainted: G O 5.10.205-2.el8.x86_64 #1 >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> task:kworker/u20:8 state:D stack: 0 pid:1287360 ppid: 2 flags:0x00004080 >> Workqueue: events_unbound fsnotify_mark_destroy_workfn >> Call Trace: >> __schedule+0x267/0x790 >> schedule+0x3c/0xb0 >> schedule_timeout+0x219/0x2b0 >> ? add_timer+0x14a/0x200 >> wait_for_completion+0x9e/0x100 >> __synchronize_srcu.part.24+0x83/0xb0 >> ? __bpf_trace_rcu_utilization+0x10/0x10 >> fsnotify_mark_destroy_workfn+0x77/0xe0 >> process_one_work+0x1fc/0x390 >> ? process_one_work+0x390/0x390 >> worker_thread+0x2d/0x3e0 >> ? process_one_work+0x390/0x390 >> kthread+0x114/0x130 >> ? kthread_park+0x80/0x80 >> ret_from_fork+0x1f/0x30 >> Kernel panic - not syncing: hung_task: blocked tasks >> CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G O 5.10.205-2.el8.x86_64 #1 >> Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014 >> Call Trace: >> dump_stack+0x6d/0x8c >> panic+0x114/0x2ea >> watchdog.cold.8+0xb5/0xb5 >> ? hungtask_pm_notify+0x50/0x50 >> kthread+0x114/0x130 >> ? kthread_park+0x80/0x80 >> ret_from_fork+0x1f/0x30 >> >> [2] additional debugging config knobs turned up. >> CONFIG_PROVE_LOCKING=y >> CONFIG_LOCK_STAT=y >> CONFIG_DEBUG_RT_MUTEXES=y >> CONFIG_DEBUG_SPINLOCK=y >> CONFIG_DEBUG_MUTEXES=y >> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y >> CONFIG_DEBUG_RWSEMS=y >> CONFIG_DEBUG_LOCK_ALLOC=y >> CONFIG_LOCKDEP=y >> CONFIG_LOCKDEP_BITS=15 >> CONFIG_LOCKDEP_CHAINS_BITS=16 >> CONFIG_LOCKDEP_STACK_TRACE_BITS=19 >> CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14 >> CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12 >> CONFIG_DEBUG_SHIRQ=y >> CONFIG_WQ_WATCHDOG=y >> CONFIG_DEBUG_ATOMIC_SLEEP=y >> CONFIG_DEBUG_LIST=y >> CONFIG_DEBUG_PLIST=y >> CONFIG_DEBUG_SG=y >> CONFIG_DEBUG_NOTIFIERS=y >> CONFIG_BUG_ON_DATA_CORRUPTION=y ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) 2024-08-27 23:33 ` Jon Kohler @ 2024-08-28 1:21 ` Paul E. McKenney 2024-08-28 2:41 ` Jon Kohler 2024-08-28 2:33 ` Neeraj Upadhyay 1 sibling, 1 reply; 12+ messages in thread From: Paul E. McKenney @ 2024-08-28 1:21 UTC (permalink / raw) To: Jon Kohler Cc: rcu@vger.kernel.org, jiangshanlai@gmail.com, josh@joshtriplett.org, jack@suse.cz, linux-fsdevel@vger.kernel.org, Z qiang On Tue, Aug 27, 2024 at 11:33:32PM +0000, Jon Kohler wrote: > > > > On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote: > >> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list - > >> Reaching out about a tricky hung task issue that I'm running into. I've > >> got a virtualized Linux guest on top of a KVM based platform, running > >> a 5.10.y based kernel. The issue we're running into is a hung task that > >> *only* happens on shutdown/reboot of this particular VM once every > >> 20-50 times. > >> > >> The signature of the hung task is always similar to the output below, > >> where we appear to hang on the call to > >> synchronize_srcu(&fsnotify_mark_srcu) > > > > One thing to try would be to add trace_printk() or similar to the SRCU > > readers, just in case someone was using srcu_read_lock_notrace() on > > fsnotify_mark_srcu, which I see no trace of in current mainline. > > > > Alternatively, if there is a version where this does not happen, try > > bisecting. Each bisection step would require something like 400-500 > > shutdown/reboots to prove the commit good. (Obviously, the first failure > > proves the commit bad, which for one-out-of-50 failures will take on > > average about 35 shutdown/reboots.) > > > > There could also be a bad SRCU backport from mainline, so please check > > what SRCU backports you have in your 5.10.y stable release. (Though > > maybe Jack has already done this?) > > > > Thanx, Paul > > Thanks, Paul > > For posterity, this kernel is just built off of the regular ole stable tree, and here’s > All of the backports to kernel/rcu are below. > > Stepping through this more, since we’re stalling at wait_for_completion, > that must mean that wakeme_after_rcu() ... complete(&rcu->completion) > is not happening, right? > > That or somehow wakeme_after_rcu() fires *before* wait_for_completion() > is setup, causing the wait to hang forever? There is no problem with that misordering -- in that case, the call to wait_for_completion() simply won't wait. But that would be a good path to trace. After all, you might have noticed that what we believe software will do does not always match what it actually does. > Is it possible for wakeme_after_rcu() to race, such that it fires somewhere > else *before* wait_for_completion gets all the way to schedule()? That should also work just fine, but it is still a good assumption to check. Good list below, but I must focus on mainline. Apologies! Plus this assumes that v5.10 worked for you -- has that been tested? So again, what version has worked for you? Thanx, Paul > [rcu]$ pwd > /kernel/kernel/rcu > [rcu]$ git remote -v > origin https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (fetch) > origin https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (push) > [rcu]$ git log --oneline 2c85ebc57b3e..HEAD . > ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>> > ... > 175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>> > 55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle() > 066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t > d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section > 3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs > 9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader > ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU > 4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations > 3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races > 058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races > 604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale > d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup() > ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle() > b62c816bdb5e rcuscale: Always log error message > 8cd9917c13a7 rcuscale: Console output claims too few grace periods > 7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period > a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access > e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle() > eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed > d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug > 2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait() > 1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes() > ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls > b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose > 7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release > 5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state() > 0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE() > 36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure > 10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK > 1c6c3f233664 rcu-tasks: Fix race in schedule and flush work > a22d66eb518f rcu: Apply callbacks processing time limit only on softirq > 40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched() > fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state > 0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue > 657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs > 12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks > 0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass > 70692b06208c rcu: Mark accesses to rcu_state.n_force_qs > af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}() > 226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup() > 02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop > 7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT > d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE > 497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock > ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees > 527b56d7856f rcu: Fix to include first blocked task in stall warning > 4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection > b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods > 450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter > 641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period > f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods > fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period > 86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader() > 55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader() > 35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives > 23597afbe096 srcu: Fix broken node geometry after early ssp init > 728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread() > 7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter() > 09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument > e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check > 20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers > 30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall() > 9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context > 5cacd18c5207 rcu,ftrace: Fix ftrace recursion > 4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>> > ... > 2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> > > > > >> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn, > >> where two kernel threads are both calling synchronize_srcu, then > >> scheduling out in wait_for_completion, and completely going out to > >> lunch for over 4 minutes. This then triggers the hung task timeout and > >> things blow up. > >> > >> We are running audit=1 for this system and are using an el8 based > >> userspace. > >> > >> I've flipped through the fs/notify code base for both 5.10 as well as > >> upstream mainline to see if something jumped off the page, and I > >> haven't yet spotted any particular suspect code from the caller side. > >> > >> This hang appears to come up at the very end of the shutdown/reboot > >> process, seemingly after the system starts to unwind through initrd. > >> > >> What I'm working on now is adding some instrumentation to the dracut > >> shutdown initrd scripts to see if I can how far we get down that path > >> before the system fails to make forward progress, which may give some > >> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and > >> a plethora of DEBUG options [2], and didn't get anything interesting. > >> To be clear, we haven't seen lockdep spit out any complaints as of yet. > >> > >> Reaching out to see if this sounds familar to anyone on the list, or if > >> there are any particular areas of the RCU code base that might be > >> suspect for this kind of issue. I'm happy to provide more information, > >> as frankly, I'm quite stumped at the moment. > >> > >> Thanks all, > >> Jon > >> > >> [1] panic trace > >> Normal shutdown process, then hangs on the following: > >> ... > >> dracut Warning: Killing all remaining processes > >> ... > >> INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds. > >> Tainted: G O 5.10.205-2.el8.x86_64 #1 > >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > >> task:kworker/u20:7 state:D stack: 0 pid:1200701 ppid: 2 flags:0x00004080 > >> Workqueue: events_unbound fsnotify_connector_destroy_workfn > >> Call Trace: > >> __schedule+0x267/0x790 > >> schedule+0x3c/0xb0 > >> schedule_timeout+0x219/0x2b0 > >> wait_for_completion+0x9e/0x100 > >> __synchronize_srcu.part.24+0x83/0xb0 > >> ? __bpf_trace_rcu_utilization+0x10/0x10 > >> ? synchronize_srcu+0x5d/0xf0 > >> fsnotify_connector_destroy_workfn+0x46/0x80 > >> process_one_work+0x1fc/0x390 > >> worker_thread+0x2d/0x3e0 > >> ? process_one_work+0x390/0x390 > >> kthread+0x114/0x130 > >> ? kthread_park+0x80/0x80 > >> ret_from_fork+0x1f/0x30 > >> INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds. > >> Tainted: G O 5.10.205-2.el8.x86_64 #1 > >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > >> task:kworker/u20:8 state:D stack: 0 pid:1287360 ppid: 2 flags:0x00004080 > >> Workqueue: events_unbound fsnotify_mark_destroy_workfn > >> Call Trace: > >> __schedule+0x267/0x790 > >> schedule+0x3c/0xb0 > >> schedule_timeout+0x219/0x2b0 > >> ? add_timer+0x14a/0x200 > >> wait_for_completion+0x9e/0x100 > >> __synchronize_srcu.part.24+0x83/0xb0 > >> ? __bpf_trace_rcu_utilization+0x10/0x10 > >> fsnotify_mark_destroy_workfn+0x77/0xe0 > >> process_one_work+0x1fc/0x390 > >> ? process_one_work+0x390/0x390 > >> worker_thread+0x2d/0x3e0 > >> ? process_one_work+0x390/0x390 > >> kthread+0x114/0x130 > >> ? kthread_park+0x80/0x80 > >> ret_from_fork+0x1f/0x30 > >> Kernel panic - not syncing: hung_task: blocked tasks > >> CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G O 5.10.205-2.el8.x86_64 #1 > >> Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014 > >> Call Trace: > >> dump_stack+0x6d/0x8c > >> panic+0x114/0x2ea > >> watchdog.cold.8+0xb5/0xb5 > >> ? hungtask_pm_notify+0x50/0x50 > >> kthread+0x114/0x130 > >> ? kthread_park+0x80/0x80 > >> ret_from_fork+0x1f/0x30 > >> > >> [2] additional debugging config knobs turned up. > >> CONFIG_PROVE_LOCKING=y > >> CONFIG_LOCK_STAT=y > >> CONFIG_DEBUG_RT_MUTEXES=y > >> CONFIG_DEBUG_SPINLOCK=y > >> CONFIG_DEBUG_MUTEXES=y > >> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y > >> CONFIG_DEBUG_RWSEMS=y > >> CONFIG_DEBUG_LOCK_ALLOC=y > >> CONFIG_LOCKDEP=y > >> CONFIG_LOCKDEP_BITS=15 > >> CONFIG_LOCKDEP_CHAINS_BITS=16 > >> CONFIG_LOCKDEP_STACK_TRACE_BITS=19 > >> CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14 > >> CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12 > >> CONFIG_DEBUG_SHIRQ=y > >> CONFIG_WQ_WATCHDOG=y > >> CONFIG_DEBUG_ATOMIC_SLEEP=y > >> CONFIG_DEBUG_LIST=y > >> CONFIG_DEBUG_PLIST=y > >> CONFIG_DEBUG_SG=y > >> CONFIG_DEBUG_NOTIFIERS=y > >> CONFIG_BUG_ON_DATA_CORRUPTION=y > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) 2024-08-28 1:21 ` Paul E. McKenney @ 2024-08-28 2:41 ` Jon Kohler 2024-08-28 10:14 ` Paul E. McKenney 0 siblings, 1 reply; 12+ messages in thread From: Jon Kohler @ 2024-08-28 2:41 UTC (permalink / raw) To: paulmck@kernel.org Cc: rcu@vger.kernel.org, jiangshanlai@gmail.com, josh@joshtriplett.org, jack@suse.cz, linux-fsdevel@vger.kernel.org, Z qiang > On Aug 27, 2024, at 9:21 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Tue, Aug 27, 2024 at 11:33:32PM +0000, Jon Kohler wrote: >> >> >>> On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@kernel.org> wrote: >>> >>> On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote: >>>> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list - >>>> Reaching out about a tricky hung task issue that I'm running into. I've >>>> got a virtualized Linux guest on top of a KVM based platform, running >>>> a 5.10.y based kernel. The issue we're running into is a hung task that >>>> *only* happens on shutdown/reboot of this particular VM once every >>>> 20-50 times. >>>> >>>> The signature of the hung task is always similar to the output below, >>>> where we appear to hang on the call to >>>> synchronize_srcu(&fsnotify_mark_srcu) >>> >>> One thing to try would be to add trace_printk() or similar to the SRCU >>> readers, just in case someone was using srcu_read_lock_notrace() on >>> fsnotify_mark_srcu, which I see no trace of in current mainline. >>> >>> Alternatively, if there is a version where this does not happen, try >>> bisecting. Each bisection step would require something like 400-500 >>> shutdown/reboots to prove the commit good. (Obviously, the first failure >>> proves the commit bad, which for one-out-of-50 failures will take on >>> average about 35 shutdown/reboots.) >>> >>> There could also be a bad SRCU backport from mainline, so please check >>> what SRCU backports you have in your 5.10.y stable release. (Though >>> maybe Jack has already done this?) >>> >>> Thanx, Paul >> >> Thanks, Paul >> >> For posterity, this kernel is just built off of the regular ole stable tree, and here’s >> All of the backports to kernel/rcu are below. >> >> Stepping through this more, since we’re stalling at wait_for_completion, >> that must mean that wakeme_after_rcu() ... complete(&rcu->completion) >> is not happening, right? >> >> That or somehow wakeme_after_rcu() fires *before* wait_for_completion() >> is setup, causing the wait to hang forever? > > There is no problem with that misordering -- in that case, the call to > wait_for_completion() simply won't wait. > > But that would be a good path to trace. After all, you might have > noticed that what we believe software will do does not always match what > it actually does. Agreed, I’ll noodle on it some more, and try to bisect as well. I see Neeraj responded separately, I’ll check that out now. > > >> Is it possible for wakeme_after_rcu() to race, such that it fires somewhere >> else *before* wait_for_completion gets all the way to schedule()? > > That should also work just fine, but it is still a good assumption to > check. > > Good list below, but I must focus on mainline. Apologies! > > Plus this assumes that v5.10 worked for you -- has that been tested? > So again, what version has worked for you? We’ve had this particular service on 5.10.y for a while; however, this behavior just recently started bubbling up with a somewhat new(er) set of internal QA tests that very specifically stress shutdown/reboots in a loop for a completely unrelated durability test, so its possible this issue has existed for a while. > > Thanx, Paul > >> [rcu]$ pwd >> /kernel/kernel/rcu >> [rcu]$ git remote -v >> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=1Tj05_-3u5bekAXIOn3TZBxlsLepxSWdhZVB6OKs02CYfxx4O9XkBJaihu2H2SaF&s=i-EJON2mPUDM-dvijeXkVr4nsR1g_obFaq9F9DCfQNs&e= (fetch) >> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=1Tj05_-3u5bekAXIOn3TZBxlsLepxSWdhZVB6OKs02CYfxx4O9XkBJaihu2H2SaF&s=i-EJON2mPUDM-dvijeXkVr4nsR1g_obFaq9F9DCfQNs&e= (push) >> [rcu]$ git log --oneline 2c85ebc57b3e..HEAD . >> ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>> >> ... >> 175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>> >> 55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle() >> 066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t >> d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section >> 3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs >> 9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader >> ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU >> 4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations >> 3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races >> 058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races >> 604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale >> d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup() >> ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle() >> b62c816bdb5e rcuscale: Always log error message >> 8cd9917c13a7 rcuscale: Console output claims too few grace periods >> 7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period >> a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access >> e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle() >> eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed >> d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug >> 2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait() >> 1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes() >> ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls >> b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose >> 7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release >> 5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state() >> 0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE() >> 36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure >> 10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK >> 1c6c3f233664 rcu-tasks: Fix race in schedule and flush work >> a22d66eb518f rcu: Apply callbacks processing time limit only on softirq >> 40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched() >> fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state >> 0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue >> 657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs >> 12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks >> 0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass >> 70692b06208c rcu: Mark accesses to rcu_state.n_force_qs >> af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}() >> 226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup() >> 02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop >> 7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT >> d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE >> 497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock >> ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees >> 527b56d7856f rcu: Fix to include first blocked task in stall warning >> 4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection >> b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods >> 450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter >> 641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period >> f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods >> fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period >> 86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader() >> 55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader() >> 35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives >> 23597afbe096 srcu: Fix broken node geometry after early ssp init >> 728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread() >> 7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter() >> 09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument >> e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check >> 20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers >> 30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall() >> 9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context >> 5cacd18c5207 rcu,ftrace: Fix ftrace recursion >> 4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>> >> ... >> 2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> >> >>> >>>> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn, >>>> where two kernel threads are both calling synchronize_srcu, then >>>> scheduling out in wait_for_completion, and completely going out to >>>> lunch for over 4 minutes. This then triggers the hung task timeout and >>>> things blow up. >>>> >>>> We are running audit=1 for this system and are using an el8 based >>>> userspace. >>>> >>>> I've flipped through the fs/notify code base for both 5.10 as well as >>>> upstream mainline to see if something jumped off the page, and I >>>> haven't yet spotted any particular suspect code from the caller side. >>>> >>>> This hang appears to come up at the very end of the shutdown/reboot >>>> process, seemingly after the system starts to unwind through initrd. >>>> >>>> What I'm working on now is adding some instrumentation to the dracut >>>> shutdown initrd scripts to see if I can how far we get down that path >>>> before the system fails to make forward progress, which may give some >>>> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and >>>> a plethora of DEBUG options [2], and didn't get anything interesting. >>>> To be clear, we haven't seen lockdep spit out any complaints as of yet. >>>> >>>> Reaching out to see if this sounds familar to anyone on the list, or if >>>> there are any particular areas of the RCU code base that might be >>>> suspect for this kind of issue. I'm happy to provide more information, >>>> as frankly, I'm quite stumped at the moment. >>>> >>>> Thanks all, >>>> Jon >>>> >>>> [1] panic trace >>>> Normal shutdown process, then hangs on the following: >>>> ... >>>> dracut Warning: Killing all remaining processes >>>> ... >>>> INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds. >>>> Tainted: G O 5.10.205-2.el8.x86_64 #1 >>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>> task:kworker/u20:7 state:D stack: 0 pid:1200701 ppid: 2 flags:0x00004080 >>>> Workqueue: events_unbound fsnotify_connector_destroy_workfn >>>> Call Trace: >>>> __schedule+0x267/0x790 >>>> schedule+0x3c/0xb0 >>>> schedule_timeout+0x219/0x2b0 >>>> wait_for_completion+0x9e/0x100 >>>> __synchronize_srcu.part.24+0x83/0xb0 >>>> ? __bpf_trace_rcu_utilization+0x10/0x10 >>>> ? synchronize_srcu+0x5d/0xf0 >>>> fsnotify_connector_destroy_workfn+0x46/0x80 >>>> process_one_work+0x1fc/0x390 >>>> worker_thread+0x2d/0x3e0 >>>> ? process_one_work+0x390/0x390 >>>> kthread+0x114/0x130 >>>> ? kthread_park+0x80/0x80 >>>> ret_from_fork+0x1f/0x30 >>>> INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds. >>>> Tainted: G O 5.10.205-2.el8.x86_64 #1 >>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>> task:kworker/u20:8 state:D stack: 0 pid:1287360 ppid: 2 flags:0x00004080 >>>> Workqueue: events_unbound fsnotify_mark_destroy_workfn >>>> Call Trace: >>>> __schedule+0x267/0x790 >>>> schedule+0x3c/0xb0 >>>> schedule_timeout+0x219/0x2b0 >>>> ? add_timer+0x14a/0x200 >>>> wait_for_completion+0x9e/0x100 >>>> __synchronize_srcu.part.24+0x83/0xb0 >>>> ? __bpf_trace_rcu_utilization+0x10/0x10 >>>> fsnotify_mark_destroy_workfn+0x77/0xe0 >>>> process_one_work+0x1fc/0x390 >>>> ? process_one_work+0x390/0x390 >>>> worker_thread+0x2d/0x3e0 >>>> ? process_one_work+0x390/0x390 >>>> kthread+0x114/0x130 >>>> ? kthread_park+0x80/0x80 >>>> ret_from_fork+0x1f/0x30 >>>> Kernel panic - not syncing: hung_task: blocked tasks >>>> CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G O 5.10.205-2.el8.x86_64 #1 >>>> Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014 >>>> Call Trace: >>>> dump_stack+0x6d/0x8c >>>> panic+0x114/0x2ea >>>> watchdog.cold.8+0xb5/0xb5 >>>> ? hungtask_pm_notify+0x50/0x50 >>>> kthread+0x114/0x130 >>>> ? kthread_park+0x80/0x80 >>>> ret_from_fork+0x1f/0x30 >>>> >>>> [2] additional debugging config knobs turned up. >>>> CONFIG_PROVE_LOCKING=y >>>> CONFIG_LOCK_STAT=y >>>> CONFIG_DEBUG_RT_MUTEXES=y >>>> CONFIG_DEBUG_SPINLOCK=y >>>> CONFIG_DEBUG_MUTEXES=y >>>> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y >>>> CONFIG_DEBUG_RWSEMS=y >>>> CONFIG_DEBUG_LOCK_ALLOC=y >>>> CONFIG_LOCKDEP=y >>>> CONFIG_LOCKDEP_BITS=15 >>>> CONFIG_LOCKDEP_CHAINS_BITS=16 >>>> CONFIG_LOCKDEP_STACK_TRACE_BITS=19 >>>> CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14 >>>> CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12 >>>> CONFIG_DEBUG_SHIRQ=y >>>> CONFIG_WQ_WATCHDOG=y >>>> CONFIG_DEBUG_ATOMIC_SLEEP=y >>>> CONFIG_DEBUG_LIST=y >>>> CONFIG_DEBUG_PLIST=y >>>> CONFIG_DEBUG_SG=y >>>> CONFIG_DEBUG_NOTIFIERS=y >>>> CONFIG_BUG_ON_DATA_CORRUPTION=y ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) 2024-08-28 2:41 ` Jon Kohler @ 2024-08-28 10:14 ` Paul E. McKenney 0 siblings, 0 replies; 12+ messages in thread From: Paul E. McKenney @ 2024-08-28 10:14 UTC (permalink / raw) To: Jon Kohler Cc: rcu@vger.kernel.org, jiangshanlai@gmail.com, josh@joshtriplett.org, jack@suse.cz, linux-fsdevel@vger.kernel.org, Z qiang On Wed, Aug 28, 2024 at 02:41:09AM +0000, Jon Kohler wrote: > > On Aug 27, 2024, at 9:21 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Tue, Aug 27, 2024 at 11:33:32PM +0000, Jon Kohler wrote: > >>> On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > >>> On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote: > >>>> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list - > >>>> Reaching out about a tricky hung task issue that I'm running into. I've > >>>> got a virtualized Linux guest on top of a KVM based platform, running > >>>> a 5.10.y based kernel. The issue we're running into is a hung task that > >>>> *only* happens on shutdown/reboot of this particular VM once every > >>>> 20-50 times. > >>>> > >>>> The signature of the hung task is always similar to the output below, > >>>> where we appear to hang on the call to > >>>> synchronize_srcu(&fsnotify_mark_srcu) > >>> > >>> One thing to try would be to add trace_printk() or similar to the SRCU > >>> readers, just in case someone was using srcu_read_lock_notrace() on > >>> fsnotify_mark_srcu, which I see no trace of in current mainline. > >>> > >>> Alternatively, if there is a version where this does not happen, try > >>> bisecting. Each bisection step would require something like 400-500 > >>> shutdown/reboots to prove the commit good. (Obviously, the first failure > >>> proves the commit bad, which for one-out-of-50 failures will take on > >>> average about 35 shutdown/reboots.) > >>> > >>> There could also be a bad SRCU backport from mainline, so please check > >>> what SRCU backports you have in your 5.10.y stable release. (Though > >>> maybe Jack has already done this?) > >>> > >>> Thanx, Paul > >> > >> Thanks, Paul > >> > >> For posterity, this kernel is just built off of the regular ole stable tree, and here’s > >> All of the backports to kernel/rcu are below. > >> > >> Stepping through this more, since we’re stalling at wait_for_completion, > >> that must mean that wakeme_after_rcu() ... complete(&rcu->completion) > >> is not happening, right? > >> > >> That or somehow wakeme_after_rcu() fires *before* wait_for_completion() > >> is setup, causing the wait to hang forever? > > > > There is no problem with that misordering -- in that case, the call to > > wait_for_completion() simply won't wait. > > > > But that would be a good path to trace. After all, you might have > > noticed that what we believe software will do does not always match what > > it actually does. > > Agreed, I’ll noodle on it some more, and try to bisect as well. I see > Neeraj responded separately, I’ll check that out now. > > > > > >> Is it possible for wakeme_after_rcu() to race, such that it fires somewhere > >> else *before* wait_for_completion gets all the way to schedule()? > > > > That should also work just fine, but it is still a good assumption to > > check. > > > > Good list below, but I must focus on mainline. Apologies! > > > > Plus this assumes that v5.10 worked for you -- has that been tested? > > So again, what version has worked for you? > > We’ve had this particular service on 5.10.y for a while; however, this > behavior just recently started bubbling up with a somewhat new(er) set > of internal QA tests that very specifically stress shutdown/reboots in a > loop for a completely unrelated durability test, so its possible this issue > has existed for a while. I hope that the commit that Neeraj identified fixes this for you, and thank you very much, Neeraj! However, if it does not, I suggest: (1) running your new tests on older versions of 5.10.y in case the bug was introduced by partial back ports, (2) running your tests on newer versions of mainline in case there is another fix that needs to be backported, and, if necessary, (3) work on the tests so that the problem reproduces faster, reducing the costs of further experiments. Thanx, Paul > > Thanx, Paul > > > >> [rcu]$ pwd > >> /kernel/kernel/rcu > >> [rcu]$ git remote -v > >> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=1Tj05_-3u5bekAXIOn3TZBxlsLepxSWdhZVB6OKs02CYfxx4O9XkBJaihu2H2SaF&s=i-EJON2mPUDM-dvijeXkVr4nsR1g_obFaq9F9DCfQNs&e= (fetch) > >> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=1Tj05_-3u5bekAXIOn3TZBxlsLepxSWdhZVB6OKs02CYfxx4O9XkBJaihu2H2SaF&s=i-EJON2mPUDM-dvijeXkVr4nsR1g_obFaq9F9DCfQNs&e= (push) > >> [rcu]$ git log --oneline 2c85ebc57b3e..HEAD . > >> ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>> > >> ... > >> 175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>> > >> 55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle() > >> 066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t > >> d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section > >> 3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs > >> 9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader > >> ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU > >> 4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations > >> 3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races > >> 058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races > >> 604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale > >> d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup() > >> ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle() > >> b62c816bdb5e rcuscale: Always log error message > >> 8cd9917c13a7 rcuscale: Console output claims too few grace periods > >> 7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period > >> a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access > >> e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle() > >> eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed > >> d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug > >> 2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait() > >> 1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes() > >> ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls > >> b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose > >> 7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release > >> 5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state() > >> 0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE() > >> 36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure > >> 10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK > >> 1c6c3f233664 rcu-tasks: Fix race in schedule and flush work > >> a22d66eb518f rcu: Apply callbacks processing time limit only on softirq > >> 40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched() > >> fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state > >> 0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue > >> 657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs > >> 12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks > >> 0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass > >> 70692b06208c rcu: Mark accesses to rcu_state.n_force_qs > >> af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}() > >> 226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup() > >> 02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop > >> 7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT > >> d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE > >> 497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock > >> ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees > >> 527b56d7856f rcu: Fix to include first blocked task in stall warning > >> 4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection > >> b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods > >> 450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter > >> 641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period > >> f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods > >> fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period > >> 86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader() > >> 55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader() > >> 35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives > >> 23597afbe096 srcu: Fix broken node geometry after early ssp init > >> 728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread() > >> 7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter() > >> 09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument > >> e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check > >> 20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers > >> 30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall() > >> 9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context > >> 5cacd18c5207 rcu,ftrace: Fix ftrace recursion > >> 4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>> > >> ... > >> 2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> > >> > >>> > >>>> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn, > >>>> where two kernel threads are both calling synchronize_srcu, then > >>>> scheduling out in wait_for_completion, and completely going out to > >>>> lunch for over 4 minutes. This then triggers the hung task timeout and > >>>> things blow up. > >>>> > >>>> We are running audit=1 for this system and are using an el8 based > >>>> userspace. > >>>> > >>>> I've flipped through the fs/notify code base for both 5.10 as well as > >>>> upstream mainline to see if something jumped off the page, and I > >>>> haven't yet spotted any particular suspect code from the caller side. > >>>> > >>>> This hang appears to come up at the very end of the shutdown/reboot > >>>> process, seemingly after the system starts to unwind through initrd. > >>>> > >>>> What I'm working on now is adding some instrumentation to the dracut > >>>> shutdown initrd scripts to see if I can how far we get down that path > >>>> before the system fails to make forward progress, which may give some > >>>> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and > >>>> a plethora of DEBUG options [2], and didn't get anything interesting. > >>>> To be clear, we haven't seen lockdep spit out any complaints as of yet. > >>>> > >>>> Reaching out to see if this sounds familar to anyone on the list, or if > >>>> there are any particular areas of the RCU code base that might be > >>>> suspect for this kind of issue. I'm happy to provide more information, > >>>> as frankly, I'm quite stumped at the moment. > >>>> > >>>> Thanks all, > >>>> Jon > >>>> > >>>> [1] panic trace > >>>> Normal shutdown process, then hangs on the following: > >>>> ... > >>>> dracut Warning: Killing all remaining processes > >>>> ... > >>>> INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds. > >>>> Tainted: G O 5.10.205-2.el8.x86_64 #1 > >>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > >>>> task:kworker/u20:7 state:D stack: 0 pid:1200701 ppid: 2 flags:0x00004080 > >>>> Workqueue: events_unbound fsnotify_connector_destroy_workfn > >>>> Call Trace: > >>>> __schedule+0x267/0x790 > >>>> schedule+0x3c/0xb0 > >>>> schedule_timeout+0x219/0x2b0 > >>>> wait_for_completion+0x9e/0x100 > >>>> __synchronize_srcu.part.24+0x83/0xb0 > >>>> ? __bpf_trace_rcu_utilization+0x10/0x10 > >>>> ? synchronize_srcu+0x5d/0xf0 > >>>> fsnotify_connector_destroy_workfn+0x46/0x80 > >>>> process_one_work+0x1fc/0x390 > >>>> worker_thread+0x2d/0x3e0 > >>>> ? process_one_work+0x390/0x390 > >>>> kthread+0x114/0x130 > >>>> ? kthread_park+0x80/0x80 > >>>> ret_from_fork+0x1f/0x30 > >>>> INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds. > >>>> Tainted: G O 5.10.205-2.el8.x86_64 #1 > >>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > >>>> task:kworker/u20:8 state:D stack: 0 pid:1287360 ppid: 2 flags:0x00004080 > >>>> Workqueue: events_unbound fsnotify_mark_destroy_workfn > >>>> Call Trace: > >>>> __schedule+0x267/0x790 > >>>> schedule+0x3c/0xb0 > >>>> schedule_timeout+0x219/0x2b0 > >>>> ? add_timer+0x14a/0x200 > >>>> wait_for_completion+0x9e/0x100 > >>>> __synchronize_srcu.part.24+0x83/0xb0 > >>>> ? __bpf_trace_rcu_utilization+0x10/0x10 > >>>> fsnotify_mark_destroy_workfn+0x77/0xe0 > >>>> process_one_work+0x1fc/0x390 > >>>> ? process_one_work+0x390/0x390 > >>>> worker_thread+0x2d/0x3e0 > >>>> ? process_one_work+0x390/0x390 > >>>> kthread+0x114/0x130 > >>>> ? kthread_park+0x80/0x80 > >>>> ret_from_fork+0x1f/0x30 > >>>> Kernel panic - not syncing: hung_task: blocked tasks > >>>> CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G O 5.10.205-2.el8.x86_64 #1 > >>>> Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014 > >>>> Call Trace: > >>>> dump_stack+0x6d/0x8c > >>>> panic+0x114/0x2ea > >>>> watchdog.cold.8+0xb5/0xb5 > >>>> ? hungtask_pm_notify+0x50/0x50 > >>>> kthread+0x114/0x130 > >>>> ? kthread_park+0x80/0x80 > >>>> ret_from_fork+0x1f/0x30 > >>>> > >>>> [2] additional debugging config knobs turned up. > >>>> CONFIG_PROVE_LOCKING=y > >>>> CONFIG_LOCK_STAT=y > >>>> CONFIG_DEBUG_RT_MUTEXES=y > >>>> CONFIG_DEBUG_SPINLOCK=y > >>>> CONFIG_DEBUG_MUTEXES=y > >>>> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y > >>>> CONFIG_DEBUG_RWSEMS=y > >>>> CONFIG_DEBUG_LOCK_ALLOC=y > >>>> CONFIG_LOCKDEP=y > >>>> CONFIG_LOCKDEP_BITS=15 > >>>> CONFIG_LOCKDEP_CHAINS_BITS=16 > >>>> CONFIG_LOCKDEP_STACK_TRACE_BITS=19 > >>>> CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14 > >>>> CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12 > >>>> CONFIG_DEBUG_SHIRQ=y > >>>> CONFIG_WQ_WATCHDOG=y > >>>> CONFIG_DEBUG_ATOMIC_SLEEP=y > >>>> CONFIG_DEBUG_LIST=y > >>>> CONFIG_DEBUG_PLIST=y > >>>> CONFIG_DEBUG_SG=y > >>>> CONFIG_DEBUG_NOTIFIERS=y > >>>> CONFIG_BUG_ON_DATA_CORRUPTION=y > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) 2024-08-27 23:33 ` Jon Kohler 2024-08-28 1:21 ` Paul E. McKenney @ 2024-08-28 2:33 ` Neeraj Upadhyay 2024-08-28 3:01 ` Jon Kohler 1 sibling, 1 reply; 12+ messages in thread From: Neeraj Upadhyay @ 2024-08-28 2:33 UTC (permalink / raw) To: Jon Kohler Cc: paulmck@kernel.org, rcu@vger.kernel.org, jiangshanlai@gmail.com, josh@joshtriplett.org On Tue, Aug 27, 2024 at 11:33:32PM +0000, Jon Kohler wrote: > > > > On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote: > >> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list - > >> Reaching out about a tricky hung task issue that I'm running into. I've > >> got a virtualized Linux guest on top of a KVM based platform, running > >> a 5.10.y based kernel. The issue we're running into is a hung task that > >> *only* happens on shutdown/reboot of this particular VM once every > >> 20-50 times. > >> > >> The signature of the hung task is always similar to the output below, > >> where we appear to hang on the call to > >> synchronize_srcu(&fsnotify_mark_srcu) > > > > One thing to try would be to add trace_printk() or similar to the SRCU > > readers, just in case someone was using srcu_read_lock_notrace() on > > fsnotify_mark_srcu, which I see no trace of in current mainline. > > > > Alternatively, if there is a version where this does not happen, try > > bisecting. Each bisection step would require something like 400-500 > > shutdown/reboots to prove the commit good. (Obviously, the first failure > > proves the commit bad, which for one-out-of-50 failures will take on > > average about 35 shutdown/reboots.) > > > > There could also be a bad SRCU backport from mainline, so please check > > what SRCU backports you have in your 5.10.y stable release. (Though > > maybe Jack has already done this?) > > > > Thanx, Paul > > Thanks, Paul > > For posterity, this kernel is just built off of the regular ole stable tree, and here’s > All of the backports to kernel/rcu are below. > > Stepping through this more, since we’re stalling at wait_for_completion, > that must mean that wakeme_after_rcu() ... complete(&rcu->completion) > is not happening, right? > > That or somehow wakeme_after_rcu() fires *before* wait_for_completion() > is setup, causing the wait to hang forever? > > Is it possible for wakeme_after_rcu() to race, such that it fires somewhere > else *before* wait_for_completion gets all the way to schedule()? > > [rcu]$ pwd > /kernel/kernel/rcu > [rcu]$ git remote -v > origin https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (fetch) > origin https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (push) > [rcu]$ git log --oneline 2c85ebc57b3e..HEAD . > ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>> > ... > 175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>> > 55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle() > 066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t > d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section > 3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs > 9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader > ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU > 4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations > 3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races > 058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races > 604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale > d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup() > ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle() > b62c816bdb5e rcuscale: Always log error message > 8cd9917c13a7 rcuscale: Console output claims too few grace periods > 7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period > a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access > e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle() > eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed > d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug > 2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait() > 1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes() > ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls > b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose > 7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release > 5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state() > 0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE() > 36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure > 10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK > 1c6c3f233664 rcu-tasks: Fix race in schedule and flush work > a22d66eb518f rcu: Apply callbacks processing time limit only on softirq > 40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched() > fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state > 0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue > 657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs > 12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks > 0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass > 70692b06208c rcu: Mark accesses to rcu_state.n_force_qs > af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}() > 226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup() > 02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop > 7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT > d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE > 497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock > ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees > 527b56d7856f rcu: Fix to include first blocked task in stall warning > 4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection > b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods > 450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter > 641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period > f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods > fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period > 86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader() > 55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader() > 35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives > 23597afbe096 srcu: Fix broken node geometry after early ssp init > 728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread() > 7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter() > 09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument > e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check > 20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers > 30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall() > 9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context > 5cacd18c5207 rcu,ftrace: Fix ftrace recursion > 4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>> > ... > 2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> > Can you check if this commit [1] is present in your tree. If not, please try with it. Also, you might want to check the state of the kworker running process_srcu() work fn, to see where the worker thread corresponding to this SRCU domain is waiting. [1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=5573fdbc3423475aae4b0c2e3b0076d6216e9ed1 > > > >> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn, > >> where two kernel threads are both calling synchronize_srcu, then > >> scheduling out in wait_for_completion, and completely going out to > >> lunch for over 4 minutes. This then triggers the hung task timeout and > >> things blow up. > >> > >> We are running audit=1 for this system and are using an el8 based > >> userspace. > >> > >> I've flipped through the fs/notify code base for both 5.10 as well as > >> upstream mainline to see if something jumped off the page, and I > >> haven't yet spotted any particular suspect code from the caller side. > >> > >> This hang appears to come up at the very end of the shutdown/reboot > >> process, seemingly after the system starts to unwind through initrd. > >> > >> What I'm working on now is adding some instrumentation to the dracut > >> shutdown initrd scripts to see if I can how far we get down that path > >> before the system fails to make forward progress, which may give some > >> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and > >> a plethora of DEBUG options [2], and didn't get anything interesting. > >> To be clear, we haven't seen lockdep spit out any complaints as of yet. > >> > >> Reaching out to see if this sounds familar to anyone on the list, or if > >> there are any particular areas of the RCU code base that might be > >> suspect for this kind of issue. I'm happy to provide more information, > >> as frankly, I'm quite stumped at the moment. > >> > >> Thanks all, > >> Jon > >> > >> [1] panic trace > >> Normal shutdown process, then hangs on the following: > >> ... > >> dracut Warning: Killing all remaining processes > >> ... > >> INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds. > >> Tainted: G O 5.10.205-2.el8.x86_64 #1 > >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > >> task:kworker/u20:7 state:D stack: 0 pid:1200701 ppid: 2 flags:0x00004080 > >> Workqueue: events_unbound fsnotify_connector_destroy_workfn > >> Call Trace: > >> __schedule+0x267/0x790 > >> schedule+0x3c/0xb0 > >> schedule_timeout+0x219/0x2b0 > >> wait_for_completion+0x9e/0x100 > >> __synchronize_srcu.part.24+0x83/0xb0 > >> ? __bpf_trace_rcu_utilization+0x10/0x10 > >> ? synchronize_srcu+0x5d/0xf0 > >> fsnotify_connector_destroy_workfn+0x46/0x80 > >> process_one_work+0x1fc/0x390 > >> worker_thread+0x2d/0x3e0 > >> ? process_one_work+0x390/0x390 > >> kthread+0x114/0x130 > >> ? kthread_park+0x80/0x80 > >> ret_from_fork+0x1f/0x30 > >> INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds. > >> Tainted: G O 5.10.205-2.el8.x86_64 #1 > >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > >> task:kworker/u20:8 state:D stack: 0 pid:1287360 ppid: 2 flags:0x00004080 > >> Workqueue: events_unbound fsnotify_mark_destroy_workfn > >> Call Trace: > >> __schedule+0x267/0x790 > >> schedule+0x3c/0xb0 > >> schedule_timeout+0x219/0x2b0 > >> ? add_timer+0x14a/0x200 > >> wait_for_completion+0x9e/0x100 > >> __synchronize_srcu.part.24+0x83/0xb0 > >> ? __bpf_trace_rcu_utilization+0x10/0x10 > >> fsnotify_mark_destroy_workfn+0x77/0xe0 > >> process_one_work+0x1fc/0x390 > >> ? process_one_work+0x390/0x390 > >> worker_thread+0x2d/0x3e0 > >> ? process_one_work+0x390/0x390 > >> kthread+0x114/0x130 > >> ? kthread_park+0x80/0x80 > >> ret_from_fork+0x1f/0x30 > >> Kernel panic - not syncing: hung_task: blocked tasks > >> CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G O 5.10.205-2.el8.x86_64 #1 > >> Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014 > >> Call Trace: > >> dump_stack+0x6d/0x8c > >> panic+0x114/0x2ea > >> watchdog.cold.8+0xb5/0xb5 > >> ? hungtask_pm_notify+0x50/0x50 > >> kthread+0x114/0x130 > >> ? kthread_park+0x80/0x80 > >> ret_from_fork+0x1f/0x30 > >> > >> [2] additional debugging config knobs turned up. > >> CONFIG_PROVE_LOCKING=y > >> CONFIG_LOCK_STAT=y > >> CONFIG_DEBUG_RT_MUTEXES=y > >> CONFIG_DEBUG_SPINLOCK=y > >> CONFIG_DEBUG_MUTEXES=y > >> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y > >> CONFIG_DEBUG_RWSEMS=y > >> CONFIG_DEBUG_LOCK_ALLOC=y > >> CONFIG_LOCKDEP=y > >> CONFIG_LOCKDEP_BITS=15 > >> CONFIG_LOCKDEP_CHAINS_BITS=16 > >> CONFIG_LOCKDEP_STACK_TRACE_BITS=19 > >> CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14 > >> CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12 > >> CONFIG_DEBUG_SHIRQ=y > >> CONFIG_WQ_WATCHDOG=y > >> CONFIG_DEBUG_ATOMIC_SLEEP=y > >> CONFIG_DEBUG_LIST=y > >> CONFIG_DEBUG_PLIST=y > >> CONFIG_DEBUG_SG=y > >> CONFIG_DEBUG_NOTIFIERS=y > >> CONFIG_BUG_ON_DATA_CORRUPTION=y > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) 2024-08-28 2:33 ` Neeraj Upadhyay @ 2024-08-28 3:01 ` Jon Kohler 0 siblings, 0 replies; 12+ messages in thread From: Jon Kohler @ 2024-08-28 3:01 UTC (permalink / raw) To: Neeraj Upadhyay Cc: paulmck@kernel.org, rcu@vger.kernel.org, jiangshanlai@gmail.com, josh@joshtriplett.org > On Aug 27, 2024, at 10:33 PM, Neeraj Upadhyay <Neeraj.Upadhyay@kernel.org> wrote: > > On Tue, Aug 27, 2024 at 11:33:32PM +0000, Jon Kohler wrote: >> >> >>> On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@kernel.org> wrote: >>> >>> On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote: >>>> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list - >>>> Reaching out about a tricky hung task issue that I'm running into. I've >>>> got a virtualized Linux guest on top of a KVM based platform, running >>>> a 5.10.y based kernel. The issue we're running into is a hung task that >>>> *only* happens on shutdown/reboot of this particular VM once every >>>> 20-50 times. >>>> >>>> The signature of the hung task is always similar to the output below, >>>> where we appear to hang on the call to >>>> synchronize_srcu(&fsnotify_mark_srcu) >>> >>> One thing to try would be to add trace_printk() or similar to the SRCU >>> readers, just in case someone was using srcu_read_lock_notrace() on >>> fsnotify_mark_srcu, which I see no trace of in current mainline. >>> >>> Alternatively, if there is a version where this does not happen, try >>> bisecting. Each bisection step would require something like 400-500 >>> shutdown/reboots to prove the commit good. (Obviously, the first failure >>> proves the commit bad, which for one-out-of-50 failures will take on >>> average about 35 shutdown/reboots.) >>> >>> There could also be a bad SRCU backport from mainline, so please check >>> what SRCU backports you have in your 5.10.y stable release. (Though >>> maybe Jack has already done this?) >>> >>> Thanx, Paul >> >> Thanks, Paul >> >> For posterity, this kernel is just built off of the regular ole stable tree, and here’s >> All of the backports to kernel/rcu are below. >> >> Stepping through this more, since we’re stalling at wait_for_completion, >> that must mean that wakeme_after_rcu() ... complete(&rcu->completion) >> is not happening, right? >> >> That or somehow wakeme_after_rcu() fires *before* wait_for_completion() >> is setup, causing the wait to hang forever? >> >> Is it possible for wakeme_after_rcu() to race, such that it fires somewhere >> else *before* wait_for_completion gets all the way to schedule()? >> >> [rcu]$ pwd >> /kernel/kernel/rcu >> [rcu]$ git remote -v >> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=gL9jpTalkQKRr468aytGK0c3GmISBnwoMhllG4jXQN4du5Z7hXSogIB-19whbUs3&s=QaiqdZTW1g36HjvkDrG9prdjMhKoxNtp3Vz4B4Xwcas&e= (fetch) >> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=gL9jpTalkQKRr468aytGK0c3GmISBnwoMhllG4jXQN4du5Z7hXSogIB-19whbUs3&s=QaiqdZTW1g36HjvkDrG9prdjMhKoxNtp3Vz4B4Xwcas&e= (push) >> [rcu]$ git log --oneline 2c85ebc57b3e..HEAD . >> ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>> >> ... >> 175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>> >> 55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle() >> 066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t >> d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section >> 3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs >> 9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader >> ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU >> 4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations >> 3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races >> 058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races >> 604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale >> d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup() >> ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle() >> b62c816bdb5e rcuscale: Always log error message >> 8cd9917c13a7 rcuscale: Console output claims too few grace periods >> 7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period >> a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access >> e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle() >> eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed >> d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug >> 2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait() >> 1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes() >> ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls >> b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose >> 7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release >> 5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state() >> 0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE() >> 36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure >> 10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK >> 1c6c3f233664 rcu-tasks: Fix race in schedule and flush work >> a22d66eb518f rcu: Apply callbacks processing time limit only on softirq >> 40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched() >> fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state >> 0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue >> 657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs >> 12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks >> 0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass >> 70692b06208c rcu: Mark accesses to rcu_state.n_force_qs >> af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}() >> 226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup() >> 02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop >> 7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT >> d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE >> 497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock >> ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees >> 527b56d7856f rcu: Fix to include first blocked task in stall warning >> 4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection >> b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods >> 450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter >> 641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period >> f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods >> fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period >> 86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader() >> 55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader() >> 35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives >> 23597afbe096 srcu: Fix broken node geometry after early ssp init >> 728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread() >> 7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter() >> 09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument >> e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check >> 20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers >> 30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall() >> 9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context >> 5cacd18c5207 rcu,ftrace: Fix ftrace recursion >> 4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>> >> ... >> 2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> >> > > Can you check if this commit [1] is present in your tree. If not, please > try with it. > > Also, you might want to check the state of the kworker running > process_srcu() work fn, to see where the worker thread corresponding > to this SRCU domain is waiting. Neeraj, Thanks for the tip and advice. The commit msg sounds juicy and interesting, we’ll give it a try and see if that leads to any interesting findings. That mentioned commit (srcu: Fix callbacks acceleration mishandling) is not in 5.10.y LTS Both 6.6.y and 6.1.y have that commit https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/kernel/rcu/srcutree.c?h=v6.6.47 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/kernel/rcu/srcutree.c?h=v6.1.106 5.15.y and 5.10.y (and below) do not have that commit https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/kernel/rcu/srcutree.c?h=v5.15.165 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/kernel/rcu/srcutree.c?h=v5.10.224 > > > [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git_commit_-3Fid-3D5573fdbc3423475aae4b0c2e3b0076d6216e9ed1&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=gL9jpTalkQKRr468aytGK0c3GmISBnwoMhllG4jXQN4du5Z7hXSogIB-19whbUs3&s=lL8zRfuq9qPR4JHcfWJQdUn7cQmSH3UYXg-eY0Un28k&e= >>> >>>> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn, >>>> where two kernel threads are both calling synchronize_srcu, then >>>> scheduling out in wait_for_completion, and completely going out to >>>> lunch for over 4 minutes. This then triggers the hung task timeout and >>>> things blow up. >>>> >>>> We are running audit=1 for this system and are using an el8 based >>>> userspace. >>>> >>>> I've flipped through the fs/notify code base for both 5.10 as well as >>>> upstream mainline to see if something jumped off the page, and I >>>> haven't yet spotted any particular suspect code from the caller side. >>>> >>>> This hang appears to come up at the very end of the shutdown/reboot >>>> process, seemingly after the system starts to unwind through initrd. >>>> >>>> What I'm working on now is adding some instrumentation to the dracut >>>> shutdown initrd scripts to see if I can how far we get down that path >>>> before the system fails to make forward progress, which may give some >>>> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and >>>> a plethora of DEBUG options [2], and didn't get anything interesting. >>>> To be clear, we haven't seen lockdep spit out any complaints as of yet. >>>> >>>> Reaching out to see if this sounds familar to anyone on the list, or if >>>> there are any particular areas of the RCU code base that might be >>>> suspect for this kind of issue. I'm happy to provide more information, >>>> as frankly, I'm quite stumped at the moment. >>>> >>>> Thanks all, >>>> Jon >>>> >>>> [1] panic trace >>>> Normal shutdown process, then hangs on the following: >>>> ... >>>> dracut Warning: Killing all remaining processes >>>> ... >>>> INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds. >>>> Tainted: G O 5.10.205-2.el8.x86_64 #1 >>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>> task:kworker/u20:7 state:D stack: 0 pid:1200701 ppid: 2 flags:0x00004080 >>>> Workqueue: events_unbound fsnotify_connector_destroy_workfn >>>> Call Trace: >>>> __schedule+0x267/0x790 >>>> schedule+0x3c/0xb0 >>>> schedule_timeout+0x219/0x2b0 >>>> wait_for_completion+0x9e/0x100 >>>> __synchronize_srcu.part.24+0x83/0xb0 >>>> ? __bpf_trace_rcu_utilization+0x10/0x10 >>>> ? synchronize_srcu+0x5d/0xf0 >>>> fsnotify_connector_destroy_workfn+0x46/0x80 >>>> process_one_work+0x1fc/0x390 >>>> worker_thread+0x2d/0x3e0 >>>> ? process_one_work+0x390/0x390 >>>> kthread+0x114/0x130 >>>> ? kthread_park+0x80/0x80 >>>> ret_from_fork+0x1f/0x30 >>>> INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds. >>>> Tainted: G O 5.10.205-2.el8.x86_64 #1 >>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>> task:kworker/u20:8 state:D stack: 0 pid:1287360 ppid: 2 flags:0x00004080 >>>> Workqueue: events_unbound fsnotify_mark_destroy_workfn >>>> Call Trace: >>>> __schedule+0x267/0x790 >>>> schedule+0x3c/0xb0 >>>> schedule_timeout+0x219/0x2b0 >>>> ? add_timer+0x14a/0x200 >>>> wait_for_completion+0x9e/0x100 >>>> __synchronize_srcu.part.24+0x83/0xb0 >>>> ? __bpf_trace_rcu_utilization+0x10/0x10 >>>> fsnotify_mark_destroy_workfn+0x77/0xe0 >>>> process_one_work+0x1fc/0x390 >>>> ? process_one_work+0x390/0x390 >>>> worker_thread+0x2d/0x3e0 >>>> ? process_one_work+0x390/0x390 >>>> kthread+0x114/0x130 >>>> ? kthread_park+0x80/0x80 >>>> ret_from_fork+0x1f/0x30 >>>> Kernel panic - not syncing: hung_task: blocked tasks >>>> CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G O 5.10.205-2.el8.x86_64 #1 >>>> Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014 >>>> Call Trace: >>>> dump_stack+0x6d/0x8c >>>> panic+0x114/0x2ea >>>> watchdog.cold.8+0xb5/0xb5 >>>> ? hungtask_pm_notify+0x50/0x50 >>>> kthread+0x114/0x130 >>>> ? kthread_park+0x80/0x80 >>>> ret_from_fork+0x1f/0x30 >>>> >>>> [2] additional debugging config knobs turned up. >>>> CONFIG_PROVE_LOCKING=y >>>> CONFIG_LOCK_STAT=y >>>> CONFIG_DEBUG_RT_MUTEXES=y >>>> CONFIG_DEBUG_SPINLOCK=y >>>> CONFIG_DEBUG_MUTEXES=y >>>> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y >>>> CONFIG_DEBUG_RWSEMS=y >>>> CONFIG_DEBUG_LOCK_ALLOC=y >>>> CONFIG_LOCKDEP=y >>>> CONFIG_LOCKDEP_BITS=15 >>>> CONFIG_LOCKDEP_CHAINS_BITS=16 >>>> CONFIG_LOCKDEP_STACK_TRACE_BITS=19 >>>> CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14 >>>> CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12 >>>> CONFIG_DEBUG_SHIRQ=y >>>> CONFIG_WQ_WATCHDOG=y >>>> CONFIG_DEBUG_ATOMIC_SLEEP=y >>>> CONFIG_DEBUG_LIST=y >>>> CONFIG_DEBUG_PLIST=y >>>> CONFIG_DEBUG_SG=y >>>> CONFIG_DEBUG_NOTIFIERS=y >>>> CONFIG_BUG_ON_DATA_CORRUPTION=y >> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) 2024-08-27 20:01 SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) Jon Kohler 2024-08-27 20:34 ` Paul E. McKenney @ 2024-09-04 9:19 ` Jan Kara 2024-09-04 14:40 ` Jon Kohler 1 sibling, 1 reply; 12+ messages in thread From: Jan Kara @ 2024-09-04 9:19 UTC (permalink / raw) To: Jon Kohler Cc: paulmck@kernel.org, rcu@vger.kernel.org, jiangshanlai@gmail.com, josh@joshtriplett.org, jack@suse.cz, linux-fsdevel@vger.kernel.org On Tue 27-08-24 20:01:27, Jon Kohler wrote: > Hey Paul, Lai, Josh, and the RCU list and Jan/FS list - > Reaching out about a tricky hung task issue that I'm running into. I've > got a virtualized Linux guest on top of a KVM based platform, running > a 5.10.y based kernel. The issue we're running into is a hung task that > *only* happens on shutdown/reboot of this particular VM once every > 20-50 times. > > The signature of the hung task is always similar to the output below, > where we appear to hang on the call to > synchronize_srcu(&fsnotify_mark_srcu) > in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn, > where two kernel threads are both calling synchronize_srcu, then > scheduling out in wait_for_completion, and completely going out to > lunch for over 4 minutes. This then triggers the hung task timeout and > things blow up. Well, the most obvious reason for this would be that some process is hanging somewhere with fsnotify_mark_srcu held. When this happens, can you trigger sysrq-w in the VM and send here its output? > We are running audit=1 for this system and are using an el8 based > userspace. > > I've flipped through the fs/notify code base for both 5.10 as well as > upstream mainline to see if something jumped off the page, and I > haven't yet spotted any particular suspect code from the caller side. > > This hang appears to come up at the very end of the shutdown/reboot > process, seemingly after the system starts to unwind through initrd. > > What I'm working on now is adding some instrumentation to the dracut > shutdown initrd scripts to see if I can how far we get down that path > before the system fails to make forward progress, which may give some > hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and > a plethora of DEBUG options [2], and didn't get anything interesting. > To be clear, we haven't seen lockdep spit out any complaints as of yet. The fact that lockdep doesn't report anything is interesting but then lockdep doesn't track everything. In particular I think SRCU itself isn't tracked by lockdep. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) 2024-09-04 9:19 ` Jan Kara @ 2024-09-04 14:40 ` Jon Kohler 2024-09-04 19:51 ` Paul E. McKenney 2024-09-05 13:53 ` Jan Kara 0 siblings, 2 replies; 12+ messages in thread From: Jon Kohler @ 2024-09-04 14:40 UTC (permalink / raw) To: Jan Kara Cc: paulmck@kernel.org, rcu@vger.kernel.org, jiangshanlai@gmail.com, josh@joshtriplett.org, linux-fsdevel@vger.kernel.org > On Sep 4, 2024, at 5:19 AM, Jan Kara <jack@suse.cz> wrote: > > !-------------------------------------------------------------------| > CAUTION: External Email > > |-------------------------------------------------------------------! > > On Tue 27-08-24 20:01:27, Jon Kohler wrote: >> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list - >> Reaching out about a tricky hung task issue that I'm running into. I've >> got a virtualized Linux guest on top of a KVM based platform, running >> a 5.10.y based kernel. The issue we're running into is a hung task that >> *only* happens on shutdown/reboot of this particular VM once every >> 20-50 times. >> >> The signature of the hung task is always similar to the output below, >> where we appear to hang on the call to >> synchronize_srcu(&fsnotify_mark_srcu) >> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn, >> where two kernel threads are both calling synchronize_srcu, then >> scheduling out in wait_for_completion, and completely going out to >> lunch for over 4 minutes. This then triggers the hung task timeout and >> things blow up. > > Well, the most obvious reason for this would be that some process is > hanging somewhere with fsnotify_mark_srcu held. When this happens, can you > trigger sysrq-w in the VM and send here its output? Jan - Thanks for the ping, that is *exactly* what is happening here. Some developments since my last note, the patch Neeraj pointed out wasn't the issue, but rather a confluence of realtime thread configurations that ended up completely starving whatever CPU was processing per-CPU callbacks. So, one thread would go out to lunch completely, and it would just never yield. This particular system was configured with RT_RUNTIME_SHARE unfortunately, so that realtime thread going out to lunch ate the entire system. What was odd is that this never, ever happened during runtime on some of these systems that have been up for years and getting beat up heavily, but rather only on shutdown. We’ve got more to chase down internally on that. One thing I wanted to bring up here though while I have you, I have noticed through various hits on google, mailing lists, etc over the years that this specific type of lockup with fsnotify_mark_srcu seems to happen now and then for various oddball reasons, with various root causes. It made me think that I wonder if there is a better structure that could be used here that might be a bit more durable. To be clear, I’m not saying that SRCU *is not* durable or anything of the sort (I promise!) but rather wondering if there was anything we could think about tweaking on the fsnotify side of the house to be more efficient. Thoughts? > >> We are running audit=1 for this system and are using an el8 based >> userspace. >> >> I've flipped through the fs/notify code base for both 5.10 as well as >> upstream mainline to see if something jumped off the page, and I >> haven't yet spotted any particular suspect code from the caller side. >> >> This hang appears to come up at the very end of the shutdown/reboot >> process, seemingly after the system starts to unwind through initrd. >> >> What I'm working on now is adding some instrumentation to the dracut >> shutdown initrd scripts to see if I can how far we get down that path >> before the system fails to make forward progress, which may give some >> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and >> a plethora of DEBUG options [2], and didn't get anything interesting. >> To be clear, we haven't seen lockdep spit out any complaints as of yet. > > The fact that lockdep doesn't report anything is interesting but then > lockdep doesn't track everything. In particular I think SRCU itself isn't > tracked by lockdep. > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) 2024-09-04 14:40 ` Jon Kohler @ 2024-09-04 19:51 ` Paul E. McKenney 2024-09-05 13:53 ` Jan Kara 1 sibling, 0 replies; 12+ messages in thread From: Paul E. McKenney @ 2024-09-04 19:51 UTC (permalink / raw) To: Jon Kohler Cc: Jan Kara, rcu@vger.kernel.org, jiangshanlai@gmail.com, josh@joshtriplett.org, linux-fsdevel@vger.kernel.org On Wed, Sep 04, 2024 at 02:40:07PM +0000, Jon Kohler wrote: > > > > On Sep 4, 2024, at 5:19 AM, Jan Kara <jack@suse.cz> wrote: > > > > !-------------------------------------------------------------------| > > CAUTION: External Email > > > > |-------------------------------------------------------------------! > > > > On Tue 27-08-24 20:01:27, Jon Kohler wrote: > >> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list - > >> Reaching out about a tricky hung task issue that I'm running into. I've > >> got a virtualized Linux guest on top of a KVM based platform, running > >> a 5.10.y based kernel. The issue we're running into is a hung task that > >> *only* happens on shutdown/reboot of this particular VM once every > >> 20-50 times. > >> > >> The signature of the hung task is always similar to the output below, > >> where we appear to hang on the call to > >> synchronize_srcu(&fsnotify_mark_srcu) > >> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn, > >> where two kernel threads are both calling synchronize_srcu, then > >> scheduling out in wait_for_completion, and completely going out to > >> lunch for over 4 minutes. This then triggers the hung task timeout and > >> things blow up. > > > > Well, the most obvious reason for this would be that some process is > > hanging somewhere with fsnotify_mark_srcu held. When this happens, can you > > trigger sysrq-w in the VM and send here its output? > > Jan - Thanks for the ping, that is *exactly* what is happening here. > Some developments since my last note, the patch Neeraj pointed out > wasn't the issue, but rather a confluence of realtime thread configurations > that ended up completely starving whatever CPU was processing per-CPU > callbacks. So, one thread would go out to lunch completely, and it would > just never yield. This particular system was configured with RT_RUNTIME_SHARE > unfortunately, so that realtime thread going out to lunch ate the entire system. > > What was odd is that this never, ever happened during runtime on some > of these systems that have been up for years and getting beat up heavily, > but rather only on shutdown. We’ve got more to chase down internally on > that. > > One thing I wanted to bring up here though while I have you, I have > noticed through various hits on google, mailing lists, etc over the years that > this specific type of lockup with fsnotify_mark_srcu seems to happen now > and then for various oddball reasons, with various root causes. > > It made me think that I wonder if there is a better structure that could be > used here that might be a bit more durable. To be clear, I’m not saying that > SRCU *is not* durable or anything of the sort (I promise!) but rather > wondering if there was anything we could think about tweaking on the > fsnotify side of the house to be more efficient. > > Thoughts? For RCU in real-time environments, we have RCU priority boosting, which boost RCU readers that have been preempted for too long. However, this is SRCU, in which readers can simply block, in addition to being preempted. Of course, boosting the priority of a task that ha blocked (as opposed to being preempted) cannot help -- the task will remain blocked until awakened, regardless of its priority. But your case takes this one step further, in that the workqueue invoking callbacks is being preempted and starved, correct? The usual advice is to make sure that your housekeeping CPUs get sufficient CPU time. Easy to say, easy to do, harder to keep done uniformly across a large number of systems running diverse workloads. Still, this is the preferred approach. Just out of curiosity, is this a CONFIG_PREEMPT_RT kernel? Thanx, Paul > >> We are running audit=1 for this system and are using an el8 based > >> userspace. > >> > >> I've flipped through the fs/notify code base for both 5.10 as well as > >> upstream mainline to see if something jumped off the page, and I > >> haven't yet spotted any particular suspect code from the caller side. > >> > >> This hang appears to come up at the very end of the shutdown/reboot > >> process, seemingly after the system starts to unwind through initrd. > >> > >> What I'm working on now is adding some instrumentation to the dracut > >> shutdown initrd scripts to see if I can how far we get down that path > >> before the system fails to make forward progress, which may give some > >> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and > >> a plethora of DEBUG options [2], and didn't get anything interesting. > >> To be clear, we haven't seen lockdep spit out any complaints as of yet. > > > > The fact that lockdep doesn't report anything is interesting but then > > lockdep doesn't track everything. In particular I think SRCU itself isn't > > tracked by lockdep. > > > > Honza > > -- > > Jan Kara <jack@suse.com> > > SUSE Labs, CR > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) 2024-09-04 14:40 ` Jon Kohler 2024-09-04 19:51 ` Paul E. McKenney @ 2024-09-05 13:53 ` Jan Kara 1 sibling, 0 replies; 12+ messages in thread From: Jan Kara @ 2024-09-05 13:53 UTC (permalink / raw) To: Jon Kohler Cc: Jan Kara, paulmck@kernel.org, rcu@vger.kernel.org, jiangshanlai@gmail.com, josh@joshtriplett.org, linux-fsdevel@vger.kernel.org On Wed 04-09-24 14:40:07, Jon Kohler wrote: > > > > On Sep 4, 2024, at 5:19 AM, Jan Kara <jack@suse.cz> wrote: > > > > !-------------------------------------------------------------------| > > CAUTION: External Email > > > > |-------------------------------------------------------------------! > > > > On Tue 27-08-24 20:01:27, Jon Kohler wrote: > >> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list - > >> Reaching out about a tricky hung task issue that I'm running into. I've > >> got a virtualized Linux guest on top of a KVM based platform, running > >> a 5.10.y based kernel. The issue we're running into is a hung task that > >> *only* happens on shutdown/reboot of this particular VM once every > >> 20-50 times. > >> > >> The signature of the hung task is always similar to the output below, > >> where we appear to hang on the call to > >> synchronize_srcu(&fsnotify_mark_srcu) > >> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn, > >> where two kernel threads are both calling synchronize_srcu, then > >> scheduling out in wait_for_completion, and completely going out to > >> lunch for over 4 minutes. This then triggers the hung task timeout and > >> things blow up. > > > > Well, the most obvious reason for this would be that some process is > > hanging somewhere with fsnotify_mark_srcu held. When this happens, can you > > trigger sysrq-w in the VM and send here its output? > > Jan - Thanks for the ping, that is *exactly* what is happening here. > Some developments since my last note, the patch Neeraj pointed out > wasn't the issue, but rather a confluence of realtime thread configurations > that ended up completely starving whatever CPU was processing per-CPU > callbacks. So, one thread would go out to lunch completely, and it would > just never yield. This particular system was configured with RT_RUNTIME_SHARE > unfortunately, so that realtime thread going out to lunch ate the entire system. Glad to hear this is explained (at least partially) :) > What was odd is that this never, ever happened during runtime on some > of these systems that have been up for years and getting beat up heavily, > but rather only on shutdown. We’ve got more to chase down internally on > that. > > One thing I wanted to bring up here though while I have you, I have > noticed through various hits on google, mailing lists, etc over the years that > this specific type of lockup with fsnotify_mark_srcu seems to happen now > and then for various oddball reasons, with various root causes. > > It made me think that I wonder if there is a better structure that could be > used here that might be a bit more durable. To be clear, I’m not saying that > SRCU *is not* durable or anything of the sort (I promise!) but rather > wondering if there was anything we could think about tweaking on the > fsnotify side of the house to be more efficient. Well, fsnotify_mark_srcu used to be a big problem in the past where fanotify code was waiting for userspace response to fanotify event with it held. And when userspace didn't reply, the kernel got stuck. After we have removed that sore spot couple years ago, I'm not aware of any more problems with it. In fact your report is probably the first one in a few years. So I'm hoping your google hits are mostly from the past / with old kernels :) Honza > >> We are running audit=1 for this system and are using an el8 based > >> userspace. > >> > >> I've flipped through the fs/notify code base for both 5.10 as well as > >> upstream mainline to see if something jumped off the page, and I > >> haven't yet spotted any particular suspect code from the caller side. > >> > >> This hang appears to come up at the very end of the shutdown/reboot > >> process, seemingly after the system starts to unwind through initrd. > >> > >> What I'm working on now is adding some instrumentation to the dracut > >> shutdown initrd scripts to see if I can how far we get down that path > >> before the system fails to make forward progress, which may give some > >> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and > >> a plethora of DEBUG options [2], and didn't get anything interesting. > >> To be clear, we haven't seen lockdep spit out any complaints as of yet. > > > > The fact that lockdep doesn't report anything is interesting but then > > lockdep doesn't track everything. In particular I think SRCU itself isn't > > tracked by lockdep. > > > > Honza > > -- > > Jan Kara <jack@suse.com> > > SUSE Labs, CR > > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-09-05 13:53 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-08-27 20:01 SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) Jon Kohler 2024-08-27 20:34 ` Paul E. McKenney 2024-08-27 23:33 ` Jon Kohler 2024-08-28 1:21 ` Paul E. McKenney 2024-08-28 2:41 ` Jon Kohler 2024-08-28 10:14 ` Paul E. McKenney 2024-08-28 2:33 ` Neeraj Upadhyay 2024-08-28 3:01 ` Jon Kohler 2024-09-04 9:19 ` Jan Kara 2024-09-04 14:40 ` Jon Kohler 2024-09-04 19:51 ` Paul E. McKenney 2024-09-05 13:53 ` Jan Kara
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.