SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark

All of lore.kernel.org
 help / color / mirror / Atom feed

* SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
@ 2024-08-27 20:01 Jon Kohler
  2024-08-27 20:34 ` Paul E. McKenney
  2024-09-04  9:19 ` Jan Kara
  0 siblings, 2 replies; 12+ messages in thread
From: Jon Kohler @ 2024-08-27 20:01 UTC (permalink / raw)
  To: paulmck@kernel.org, rcu@vger.kernel.org, jiangshanlai@gmail.com,
	josh@joshtriplett.org
  Cc: jack@suse.cz, linux-fsdevel@vger.kernel.org

Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
Reaching out about a tricky hung task issue that I'm running into. I've
got a virtualized Linux guest on top of a KVM based platform, running
a 5.10.y based kernel. The issue we're running into is a hung task that
*only* happens on shutdown/reboot of this particular VM once every 
20-50 times.

The signature of the hung task is always similar to the output below,
where we appear to hang on the call to 
    synchronize_srcu(&fsnotify_mark_srcu)
in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
where two kernel threads are both calling synchronize_srcu, then
scheduling out in wait_for_completion, and completely going out to
lunch for over 4 minutes. This then triggers the hung task timeout and
things blow up.

We are running audit=1 for this system and are using an el8 based
userspace.

I've flipped through the fs/notify code base for both 5.10 as well as
upstream mainline to see if something jumped off the page, and I
haven't yet spotted any particular suspect code from the caller side.

This hang appears to come up at the very end of the shutdown/reboot
process, seemingly after the system starts to unwind through initrd.

What I'm working on now is adding some instrumentation to the dracut
shutdown initrd scripts to see if I can how far we get down that path
before the system fails to make forward progress, which may give some
hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
a plethora of DEBUG options [2], and didn't get anything interesting.
To be clear, we haven't seen lockdep spit out any complaints as of yet.

Reaching out to see if this sounds familar to anyone on the list, or if
there are any particular areas of the RCU code base that might be
suspect for this kind of issue. I'm happy to provide more information,
as frankly, I'm quite stumped at the moment.

Thanks all,
Jon

[1] panic trace
    Normal shutdown process, then hangs on the following:
    ...
    dracut Warning: Killing all remaining processes
    ...
    INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds.
          Tainted: G           O      5.10.205-2.el8.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:kworker/u20:7   state:D stack:    0 pid:1200701 ppid:     2 flags:0x00004080
    Workqueue: events_unbound fsnotify_connector_destroy_workfn
    Call Trace:
     __schedule+0x267/0x790
     schedule+0x3c/0xb0
     schedule_timeout+0x219/0x2b0
     wait_for_completion+0x9e/0x100
     __synchronize_srcu.part.24+0x83/0xb0
     ? __bpf_trace_rcu_utilization+0x10/0x10
     ? synchronize_srcu+0x5d/0xf0
     fsnotify_connector_destroy_workfn+0x46/0x80
     process_one_work+0x1fc/0x390
     worker_thread+0x2d/0x3e0
     ? process_one_work+0x390/0x390
     kthread+0x114/0x130
     ? kthread_park+0x80/0x80
     ret_from_fork+0x1f/0x30
    INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds.
          Tainted: G           O      5.10.205-2.el8.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:kworker/u20:8   state:D stack:    0 pid:1287360 ppid:     2 flags:0x00004080
    Workqueue: events_unbound fsnotify_mark_destroy_workfn
    Call Trace:
     __schedule+0x267/0x790
     schedule+0x3c/0xb0
     schedule_timeout+0x219/0x2b0
     ? add_timer+0x14a/0x200
     wait_for_completion+0x9e/0x100
     __synchronize_srcu.part.24+0x83/0xb0
     ? __bpf_trace_rcu_utilization+0x10/0x10
     fsnotify_mark_destroy_workfn+0x77/0xe0
     process_one_work+0x1fc/0x390
     ? process_one_work+0x390/0x390
     worker_thread+0x2d/0x3e0
     ? process_one_work+0x390/0x390
     kthread+0x114/0x130
     ? kthread_park+0x80/0x80
     ret_from_fork+0x1f/0x30
    Kernel panic - not syncing: hung_task: blocked tasks
    CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G           O      5.10.205-2.el8.x86_64 #1
    Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014
    Call Trace:
     dump_stack+0x6d/0x8c
     panic+0x114/0x2ea
     watchdog.cold.8+0xb5/0xb5
     ? hungtask_pm_notify+0x50/0x50
     kthread+0x114/0x130
     ? kthread_park+0x80/0x80
     ret_from_fork+0x1f/0x30

[2] additional debugging config knobs turned up.
    CONFIG_PROVE_LOCKING=y
    CONFIG_LOCK_STAT=y
    CONFIG_DEBUG_RT_MUTEXES=y
    CONFIG_DEBUG_SPINLOCK=y
    CONFIG_DEBUG_MUTEXES=y
    CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
    CONFIG_DEBUG_RWSEMS=y
    CONFIG_DEBUG_LOCK_ALLOC=y
    CONFIG_LOCKDEP=y
    CONFIG_LOCKDEP_BITS=15
    CONFIG_LOCKDEP_CHAINS_BITS=16
    CONFIG_LOCKDEP_STACK_TRACE_BITS=19
    CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
    CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
    CONFIG_DEBUG_SHIRQ=y
    CONFIG_WQ_WATCHDOG=y
    CONFIG_DEBUG_ATOMIC_SLEEP=y
    CONFIG_DEBUG_LIST=y
    CONFIG_DEBUG_PLIST=y
    CONFIG_DEBUG_SG=y
    CONFIG_DEBUG_NOTIFIERS=y
    CONFIG_BUG_ON_DATA_CORRUPTION=y

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
  2024-08-27 20:01 SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) Jon Kohler
@ 2024-08-27 20:34 ` Paul E. McKenney
  2024-08-27 23:33   ` Jon Kohler
  2024-09-04  9:19 ` Jan Kara
  1 sibling, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2024-08-27 20:34 UTC (permalink / raw)
  To: Jon Kohler
  Cc: rcu@vger.kernel.org, jiangshanlai@gmail.com,
	josh@joshtriplett.org, jack@suse.cz,
	linux-fsdevel@vger.kernel.org, Z qiang

On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote:
> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
> Reaching out about a tricky hung task issue that I'm running into. I've
> got a virtualized Linux guest on top of a KVM based platform, running
> a 5.10.y based kernel. The issue we're running into is a hung task that
> *only* happens on shutdown/reboot of this particular VM once every 
> 20-50 times.
> 
> The signature of the hung task is always similar to the output below,
> where we appear to hang on the call to 
>     synchronize_srcu(&fsnotify_mark_srcu)

One thing to try would be to add trace_printk() or similar to the SRCU
readers, just in case someone was using srcu_read_lock_notrace() on
fsnotify_mark_srcu, which I see no trace of in current mainline.

Alternatively, if there is a version where this does not happen, try
bisecting.  Each bisection step would require something like 400-500
shutdown/reboots to prove the commit good.  (Obviously, the first failure
proves the commit bad, which for one-out-of-50 failures will take on
average about 35 shutdown/reboots.)

There could also be a bad SRCU backport from mainline, so please check
what SRCU backports you have in your 5.10.y stable release.  (Though
maybe Jack has already done this?)

							Thanx, Paul

> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
> where two kernel threads are both calling synchronize_srcu, then
> scheduling out in wait_for_completion, and completely going out to
> lunch for over 4 minutes. This then triggers the hung task timeout and
> things blow up.
> 
> We are running audit=1 for this system and are using an el8 based
> userspace.
> 
> I've flipped through the fs/notify code base for both 5.10 as well as
> upstream mainline to see if something jumped off the page, and I
> haven't yet spotted any particular suspect code from the caller side.
> 
> This hang appears to come up at the very end of the shutdown/reboot
> process, seemingly after the system starts to unwind through initrd.
> 
> What I'm working on now is adding some instrumentation to the dracut
> shutdown initrd scripts to see if I can how far we get down that path
> before the system fails to make forward progress, which may give some
> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
> a plethora of DEBUG options [2], and didn't get anything interesting.
> To be clear, we haven't seen lockdep spit out any complaints as of yet.
> 
> Reaching out to see if this sounds familar to anyone on the list, or if
> there are any particular areas of the RCU code base that might be
> suspect for this kind of issue. I'm happy to provide more information,
> as frankly, I'm quite stumped at the moment.
> 
> Thanks all,
> Jon
> 
> [1] panic trace
>     Normal shutdown process, then hangs on the following:
>     ...
>     dracut Warning: Killing all remaining processes
>     ...
>     INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds.
>           Tainted: G           O      5.10.205-2.el8.x86_64 #1
>     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>     task:kworker/u20:7   state:D stack:    0 pid:1200701 ppid:     2 flags:0x00004080
>     Workqueue: events_unbound fsnotify_connector_destroy_workfn
>     Call Trace:
>      __schedule+0x267/0x790
>      schedule+0x3c/0xb0
>      schedule_timeout+0x219/0x2b0
>      wait_for_completion+0x9e/0x100
>      __synchronize_srcu.part.24+0x83/0xb0
>      ? __bpf_trace_rcu_utilization+0x10/0x10
>      ? synchronize_srcu+0x5d/0xf0
>      fsnotify_connector_destroy_workfn+0x46/0x80
>      process_one_work+0x1fc/0x390
>      worker_thread+0x2d/0x3e0
>      ? process_one_work+0x390/0x390
>      kthread+0x114/0x130
>      ? kthread_park+0x80/0x80
>      ret_from_fork+0x1f/0x30
>     INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds.
>           Tainted: G           O      5.10.205-2.el8.x86_64 #1
>     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>     task:kworker/u20:8   state:D stack:    0 pid:1287360 ppid:     2 flags:0x00004080
>     Workqueue: events_unbound fsnotify_mark_destroy_workfn
>     Call Trace:
>      __schedule+0x267/0x790
>      schedule+0x3c/0xb0
>      schedule_timeout+0x219/0x2b0
>      ? add_timer+0x14a/0x200
>      wait_for_completion+0x9e/0x100
>      __synchronize_srcu.part.24+0x83/0xb0
>      ? __bpf_trace_rcu_utilization+0x10/0x10
>      fsnotify_mark_destroy_workfn+0x77/0xe0
>      process_one_work+0x1fc/0x390
>      ? process_one_work+0x390/0x390
>      worker_thread+0x2d/0x3e0
>      ? process_one_work+0x390/0x390
>      kthread+0x114/0x130
>      ? kthread_park+0x80/0x80
>      ret_from_fork+0x1f/0x30
>     Kernel panic - not syncing: hung_task: blocked tasks
>     CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G           O      5.10.205-2.el8.x86_64 #1
>     Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014
>     Call Trace:
>      dump_stack+0x6d/0x8c
>      panic+0x114/0x2ea
>      watchdog.cold.8+0xb5/0xb5
>      ? hungtask_pm_notify+0x50/0x50
>      kthread+0x114/0x130
>      ? kthread_park+0x80/0x80
>      ret_from_fork+0x1f/0x30
> 
> [2] additional debugging config knobs turned up.
>     CONFIG_PROVE_LOCKING=y
>     CONFIG_LOCK_STAT=y
>     CONFIG_DEBUG_RT_MUTEXES=y
>     CONFIG_DEBUG_SPINLOCK=y
>     CONFIG_DEBUG_MUTEXES=y
>     CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
>     CONFIG_DEBUG_RWSEMS=y
>     CONFIG_DEBUG_LOCK_ALLOC=y
>     CONFIG_LOCKDEP=y
>     CONFIG_LOCKDEP_BITS=15
>     CONFIG_LOCKDEP_CHAINS_BITS=16
>     CONFIG_LOCKDEP_STACK_TRACE_BITS=19
>     CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
>     CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
>     CONFIG_DEBUG_SHIRQ=y
>     CONFIG_WQ_WATCHDOG=y
>     CONFIG_DEBUG_ATOMIC_SLEEP=y
>     CONFIG_DEBUG_LIST=y
>     CONFIG_DEBUG_PLIST=y
>     CONFIG_DEBUG_SG=y
>     CONFIG_DEBUG_NOTIFIERS=y
>     CONFIG_BUG_ON_DATA_CORRUPTION=y

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
  2024-08-27 20:34 ` Paul E. McKenney
@ 2024-08-27 23:33   ` Jon Kohler
  2024-08-28  1:21     ` Paul E. McKenney
  2024-08-28  2:33     ` Neeraj Upadhyay
  0 siblings, 2 replies; 12+ messages in thread
From: Jon Kohler @ 2024-08-27 23:33 UTC (permalink / raw)
  To: paulmck@kernel.org
  Cc: rcu@vger.kernel.org, jiangshanlai@gmail.com,
	josh@joshtriplett.org, jack@suse.cz,
	linux-fsdevel@vger.kernel.org, Z qiang

> On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote:
>> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
>> Reaching out about a tricky hung task issue that I'm running into. I've
>> got a virtualized Linux guest on top of a KVM based platform, running
>> a 5.10.y based kernel. The issue we're running into is a hung task that
>> *only* happens on shutdown/reboot of this particular VM once every 
>> 20-50 times.
>> 
>> The signature of the hung task is always similar to the output below,
>> where we appear to hang on the call to 
>>    synchronize_srcu(&fsnotify_mark_srcu)
> 
> One thing to try would be to add trace_printk() or similar to the SRCU
> readers, just in case someone was using srcu_read_lock_notrace() on
> fsnotify_mark_srcu, which I see no trace of in current mainline.
> 
> Alternatively, if there is a version where this does not happen, try
> bisecting.  Each bisection step would require something like 400-500
> shutdown/reboots to prove the commit good.  (Obviously, the first failure
> proves the commit bad, which for one-out-of-50 failures will take on
> average about 35 shutdown/reboots.)
> 
> There could also be a bad SRCU backport from mainline, so please check
> what SRCU backports you have in your 5.10.y stable release.  (Though
> maybe Jack has already done this?)
> 
> Thanx, Paul

Thanks, Paul

For posterity, this kernel is just built off of the regular ole stable tree, and here’s
All of the backports to kernel/rcu are below.

Stepping through this more, since we’re stalling at wait_for_completion,
that must mean that wakeme_after_rcu() ... complete(&rcu->completion)
is not happening, right?

That or somehow wakeme_after_rcu() fires *before* wait_for_completion()
is setup, causing the wait to hang forever?

Is it possible for wakeme_after_rcu() to race, such that it fires somewhere
else *before* wait_for_completion gets all the way to schedule()?

[rcu]$ pwd
/kernel/kernel/rcu
[rcu]$ git remote -v
origin https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (fetch)
origin https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (push)
[rcu]$ git log --oneline 2c85ebc57b3e..HEAD .
ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>>
...
175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>>
55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle()
066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t
d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section
3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs
9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU
4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations
3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races
058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races
604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale
d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()
ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle()
b62c816bdb5e rcuscale: Always log error message
8cd9917c13a7 rcuscale: Console output claims too few grace periods
7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle()
eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug
2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()
1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()
ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls
b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose
7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release
5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()
0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE()
36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure
10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK
1c6c3f233664 rcu-tasks: Fix race in schedule and flush work
a22d66eb518f rcu: Apply callbacks processing time limit only on softirq
40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched()
fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state
0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue
657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs
12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks
0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass
70692b06208c rcu: Mark accesses to rcu_state.n_force_qs
af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}()
226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup()
02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT
d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE
497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees
527b56d7856f rcu: Fix to include first blocked task in stall warning
4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection
b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods
450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter
641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period
f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods
fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period
86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader()
35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives
23597afbe096 srcu: Fix broken node geometry after early ssp init
728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()
7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()
09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument
e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check
20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall()
9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context
5cacd18c5207 rcu,ftrace: Fix ftrace recursion
4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>>
...
2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> 

> 
>> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
>> where two kernel threads are both calling synchronize_srcu, then
>> scheduling out in wait_for_completion, and completely going out to
>> lunch for over 4 minutes. This then triggers the hung task timeout and
>> things blow up.
>> 
>> We are running audit=1 for this system and are using an el8 based
>> userspace.
>> 
>> I've flipped through the fs/notify code base for both 5.10 as well as
>> upstream mainline to see if something jumped off the page, and I
>> haven't yet spotted any particular suspect code from the caller side.
>> 
>> This hang appears to come up at the very end of the shutdown/reboot
>> process, seemingly after the system starts to unwind through initrd.
>> 
>> What I'm working on now is adding some instrumentation to the dracut
>> shutdown initrd scripts to see if I can how far we get down that path
>> before the system fails to make forward progress, which may give some
>> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
>> a plethora of DEBUG options [2], and didn't get anything interesting.
>> To be clear, we haven't seen lockdep spit out any complaints as of yet.
>> 
>> Reaching out to see if this sounds familar to anyone on the list, or if
>> there are any particular areas of the RCU code base that might be
>> suspect for this kind of issue. I'm happy to provide more information,
>> as frankly, I'm quite stumped at the moment.
>> 
>> Thanks all,
>> Jon
>> 
>> [1] panic trace
>>    Normal shutdown process, then hangs on the following:
>>    ...
>>    dracut Warning: Killing all remaining processes
>>    ...
>>    INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds.
>>          Tainted: G           O      5.10.205-2.el8.x86_64 #1
>>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>    task:kworker/u20:7   state:D stack:    0 pid:1200701 ppid:     2 flags:0x00004080
>>    Workqueue: events_unbound fsnotify_connector_destroy_workfn
>>    Call Trace:
>>     __schedule+0x267/0x790
>>     schedule+0x3c/0xb0
>>     schedule_timeout+0x219/0x2b0
>>     wait_for_completion+0x9e/0x100
>>     __synchronize_srcu.part.24+0x83/0xb0
>>     ? __bpf_trace_rcu_utilization+0x10/0x10
>>     ? synchronize_srcu+0x5d/0xf0
>>     fsnotify_connector_destroy_workfn+0x46/0x80
>>     process_one_work+0x1fc/0x390
>>     worker_thread+0x2d/0x3e0
>>     ? process_one_work+0x390/0x390
>>     kthread+0x114/0x130
>>     ? kthread_park+0x80/0x80
>>     ret_from_fork+0x1f/0x30
>>    INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds.
>>          Tainted: G           O      5.10.205-2.el8.x86_64 #1
>>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>    task:kworker/u20:8   state:D stack:    0 pid:1287360 ppid:     2 flags:0x00004080
>>    Workqueue: events_unbound fsnotify_mark_destroy_workfn
>>    Call Trace:
>>     __schedule+0x267/0x790
>>     schedule+0x3c/0xb0
>>     schedule_timeout+0x219/0x2b0
>>     ? add_timer+0x14a/0x200
>>     wait_for_completion+0x9e/0x100
>>     __synchronize_srcu.part.24+0x83/0xb0
>>     ? __bpf_trace_rcu_utilization+0x10/0x10
>>     fsnotify_mark_destroy_workfn+0x77/0xe0
>>     process_one_work+0x1fc/0x390
>>     ? process_one_work+0x390/0x390
>>     worker_thread+0x2d/0x3e0
>>     ? process_one_work+0x390/0x390
>>     kthread+0x114/0x130
>>     ? kthread_park+0x80/0x80
>>     ret_from_fork+0x1f/0x30
>>    Kernel panic - not syncing: hung_task: blocked tasks
>>    CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G           O      5.10.205-2.el8.x86_64 #1
>>    Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014
>>    Call Trace:
>>     dump_stack+0x6d/0x8c
>>     panic+0x114/0x2ea
>>     watchdog.cold.8+0xb5/0xb5
>>     ? hungtask_pm_notify+0x50/0x50
>>     kthread+0x114/0x130
>>     ? kthread_park+0x80/0x80
>>     ret_from_fork+0x1f/0x30
>> 
>> [2] additional debugging config knobs turned up.
>>    CONFIG_PROVE_LOCKING=y
>>    CONFIG_LOCK_STAT=y
>>    CONFIG_DEBUG_RT_MUTEXES=y
>>    CONFIG_DEBUG_SPINLOCK=y
>>    CONFIG_DEBUG_MUTEXES=y
>>    CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
>>    CONFIG_DEBUG_RWSEMS=y
>>    CONFIG_DEBUG_LOCK_ALLOC=y
>>    CONFIG_LOCKDEP=y
>>    CONFIG_LOCKDEP_BITS=15
>>    CONFIG_LOCKDEP_CHAINS_BITS=16
>>    CONFIG_LOCKDEP_STACK_TRACE_BITS=19
>>    CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
>>    CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
>>    CONFIG_DEBUG_SHIRQ=y
>>    CONFIG_WQ_WATCHDOG=y
>>    CONFIG_DEBUG_ATOMIC_SLEEP=y
>>    CONFIG_DEBUG_LIST=y
>>    CONFIG_DEBUG_PLIST=y
>>    CONFIG_DEBUG_SG=y
>>    CONFIG_DEBUG_NOTIFIERS=y
>>    CONFIG_BUG_ON_DATA_CORRUPTION=y

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
  2024-08-27 23:33   ` Jon Kohler
@ 2024-08-28  1:21     ` Paul E. McKenney
  2024-08-28  2:41       ` Jon Kohler
  2024-08-28  2:33     ` Neeraj Upadhyay
  1 sibling, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2024-08-28  1:21 UTC (permalink / raw)
  To: Jon Kohler
  Cc: rcu@vger.kernel.org, jiangshanlai@gmail.com,
	josh@joshtriplett.org, jack@suse.cz,
	linux-fsdevel@vger.kernel.org, Z qiang

On Tue, Aug 27, 2024 at 11:33:32PM +0000, Jon Kohler wrote:
> 
> 
> > On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > 
> > On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote:
> >> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
> >> Reaching out about a tricky hung task issue that I'm running into. I've
> >> got a virtualized Linux guest on top of a KVM based platform, running
> >> a 5.10.y based kernel. The issue we're running into is a hung task that
> >> *only* happens on shutdown/reboot of this particular VM once every 
> >> 20-50 times.
> >> 
> >> The signature of the hung task is always similar to the output below,
> >> where we appear to hang on the call to 
> >>    synchronize_srcu(&fsnotify_mark_srcu)
> > 
> > One thing to try would be to add trace_printk() or similar to the SRCU
> > readers, just in case someone was using srcu_read_lock_notrace() on
> > fsnotify_mark_srcu, which I see no trace of in current mainline.
> > 
> > Alternatively, if there is a version where this does not happen, try
> > bisecting.  Each bisection step would require something like 400-500
> > shutdown/reboots to prove the commit good.  (Obviously, the first failure
> > proves the commit bad, which for one-out-of-50 failures will take on
> > average about 35 shutdown/reboots.)
> > 
> > There could also be a bad SRCU backport from mainline, so please check
> > what SRCU backports you have in your 5.10.y stable release.  (Though
> > maybe Jack has already done this?)
> > 
> > Thanx, Paul
> 
> Thanks, Paul
> 
> For posterity, this kernel is just built off of the regular ole stable tree, and here’s
> All of the backports to kernel/rcu are below.
> 
> Stepping through this more, since we’re stalling at wait_for_completion,
> that must mean that wakeme_after_rcu() ... complete(&rcu->completion)
> is not happening, right?
> 
> That or somehow wakeme_after_rcu() fires *before* wait_for_completion()
> is setup, causing the wait to hang forever?

There is no problem with that misordering -- in that case, the call to
wait_for_completion() simply won't wait.

But that would be a good path to trace.  After all, you might have
noticed that what we believe software will do does not always match what
it actually does.


> Is it possible for wakeme_after_rcu() to race, such that it fires somewhere
> else *before* wait_for_completion gets all the way to schedule()?

That should also work just fine, but it is still a good assumption to
check.

Good list below, but I must focus on mainline.  Apologies!

Plus this assumes that v5.10 worked for you -- has that been tested?
So again, what version has worked for you?

							Thanx, Paul

> [rcu]$ pwd
> /kernel/kernel/rcu
> [rcu]$ git remote -v
> origin https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (fetch)
> origin https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (push)
> [rcu]$ git log --oneline 2c85ebc57b3e..HEAD .
> ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>>
> ...
> 175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>>
> 55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle()
> 066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t
> d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section
> 3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs
> 9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
> ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU
> 4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations
> 3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races
> 058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races
> 604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale
> d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()
> ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle()
> b62c816bdb5e rcuscale: Always log error message
> 8cd9917c13a7 rcuscale: Console output claims too few grace periods
> 7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
> a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
> e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle()
> eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
> d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug
> 2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()
> 1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()
> ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls
> b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose
> 7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release
> 5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()
> 0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE()
> 36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure
> 10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK
> 1c6c3f233664 rcu-tasks: Fix race in schedule and flush work
> a22d66eb518f rcu: Apply callbacks processing time limit only on softirq
> 40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched()
> fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state
> 0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue
> 657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs
> 12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks
> 0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass
> 70692b06208c rcu: Mark accesses to rcu_state.n_force_qs
> af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}()
> 226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup()
> 02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
> 7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT
> d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE
> 497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
> ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees
> 527b56d7856f rcu: Fix to include first blocked task in stall warning
> 4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection
> b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods
> 450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter
> 641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period
> f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods
> fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period
> 86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
> 55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader()
> 35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives
> 23597afbe096 srcu: Fix broken node geometry after early ssp init
> 728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()
> 7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()
> 09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument
> e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check
> 20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
> 30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall()
> 9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context
> 5cacd18c5207 rcu,ftrace: Fix ftrace recursion
> 4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>>
> ...
> 2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> 
> 
> > 
> >> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
> >> where two kernel threads are both calling synchronize_srcu, then
> >> scheduling out in wait_for_completion, and completely going out to
> >> lunch for over 4 minutes. This then triggers the hung task timeout and
> >> things blow up.
> >> 
> >> We are running audit=1 for this system and are using an el8 based
> >> userspace.
> >> 
> >> I've flipped through the fs/notify code base for both 5.10 as well as
> >> upstream mainline to see if something jumped off the page, and I
> >> haven't yet spotted any particular suspect code from the caller side.
> >> 
> >> This hang appears to come up at the very end of the shutdown/reboot
> >> process, seemingly after the system starts to unwind through initrd.
> >> 
> >> What I'm working on now is adding some instrumentation to the dracut
> >> shutdown initrd scripts to see if I can how far we get down that path
> >> before the system fails to make forward progress, which may give some
> >> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
> >> a plethora of DEBUG options [2], and didn't get anything interesting.
> >> To be clear, we haven't seen lockdep spit out any complaints as of yet.
> >> 
> >> Reaching out to see if this sounds familar to anyone on the list, or if
> >> there are any particular areas of the RCU code base that might be
> >> suspect for this kind of issue. I'm happy to provide more information,
> >> as frankly, I'm quite stumped at the moment.
> >> 
> >> Thanks all,
> >> Jon
> >> 
> >> [1] panic trace
> >>    Normal shutdown process, then hangs on the following:
> >>    ...
> >>    dracut Warning: Killing all remaining processes
> >>    ...
> >>    INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds.
> >>          Tainted: G           O      5.10.205-2.el8.x86_64 #1
> >>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>    task:kworker/u20:7   state:D stack:    0 pid:1200701 ppid:     2 flags:0x00004080
> >>    Workqueue: events_unbound fsnotify_connector_destroy_workfn
> >>    Call Trace:
> >>     __schedule+0x267/0x790
> >>     schedule+0x3c/0xb0
> >>     schedule_timeout+0x219/0x2b0
> >>     wait_for_completion+0x9e/0x100
> >>     __synchronize_srcu.part.24+0x83/0xb0
> >>     ? __bpf_trace_rcu_utilization+0x10/0x10
> >>     ? synchronize_srcu+0x5d/0xf0
> >>     fsnotify_connector_destroy_workfn+0x46/0x80
> >>     process_one_work+0x1fc/0x390
> >>     worker_thread+0x2d/0x3e0
> >>     ? process_one_work+0x390/0x390
> >>     kthread+0x114/0x130
> >>     ? kthread_park+0x80/0x80
> >>     ret_from_fork+0x1f/0x30
> >>    INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds.
> >>          Tainted: G           O      5.10.205-2.el8.x86_64 #1
> >>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>    task:kworker/u20:8   state:D stack:    0 pid:1287360 ppid:     2 flags:0x00004080
> >>    Workqueue: events_unbound fsnotify_mark_destroy_workfn
> >>    Call Trace:
> >>     __schedule+0x267/0x790
> >>     schedule+0x3c/0xb0
> >>     schedule_timeout+0x219/0x2b0
> >>     ? add_timer+0x14a/0x200
> >>     wait_for_completion+0x9e/0x100
> >>     __synchronize_srcu.part.24+0x83/0xb0
> >>     ? __bpf_trace_rcu_utilization+0x10/0x10
> >>     fsnotify_mark_destroy_workfn+0x77/0xe0
> >>     process_one_work+0x1fc/0x390
> >>     ? process_one_work+0x390/0x390
> >>     worker_thread+0x2d/0x3e0
> >>     ? process_one_work+0x390/0x390
> >>     kthread+0x114/0x130
> >>     ? kthread_park+0x80/0x80
> >>     ret_from_fork+0x1f/0x30
> >>    Kernel panic - not syncing: hung_task: blocked tasks
> >>    CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G           O      5.10.205-2.el8.x86_64 #1
> >>    Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014
> >>    Call Trace:
> >>     dump_stack+0x6d/0x8c
> >>     panic+0x114/0x2ea
> >>     watchdog.cold.8+0xb5/0xb5
> >>     ? hungtask_pm_notify+0x50/0x50
> >>     kthread+0x114/0x130
> >>     ? kthread_park+0x80/0x80
> >>     ret_from_fork+0x1f/0x30
> >> 
> >> [2] additional debugging config knobs turned up.
> >>    CONFIG_PROVE_LOCKING=y
> >>    CONFIG_LOCK_STAT=y
> >>    CONFIG_DEBUG_RT_MUTEXES=y
> >>    CONFIG_DEBUG_SPINLOCK=y
> >>    CONFIG_DEBUG_MUTEXES=y
> >>    CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> >>    CONFIG_DEBUG_RWSEMS=y
> >>    CONFIG_DEBUG_LOCK_ALLOC=y
> >>    CONFIG_LOCKDEP=y
> >>    CONFIG_LOCKDEP_BITS=15
> >>    CONFIG_LOCKDEP_CHAINS_BITS=16
> >>    CONFIG_LOCKDEP_STACK_TRACE_BITS=19
> >>    CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
> >>    CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
> >>    CONFIG_DEBUG_SHIRQ=y
> >>    CONFIG_WQ_WATCHDOG=y
> >>    CONFIG_DEBUG_ATOMIC_SLEEP=y
> >>    CONFIG_DEBUG_LIST=y
> >>    CONFIG_DEBUG_PLIST=y
> >>    CONFIG_DEBUG_SG=y
> >>    CONFIG_DEBUG_NOTIFIERS=y
> >>    CONFIG_BUG_ON_DATA_CORRUPTION=y
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
  2024-08-28  1:21     ` Paul E. McKenney
@ 2024-08-28  2:41       ` Jon Kohler
  2024-08-28 10:14         ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: Jon Kohler @ 2024-08-28  2:41 UTC (permalink / raw)
  To: paulmck@kernel.org
  Cc: rcu@vger.kernel.org, jiangshanlai@gmail.com,
	josh@joshtriplett.org, jack@suse.cz,
	linux-fsdevel@vger.kernel.org, Z qiang



> On Aug 27, 2024, at 9:21 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Tue, Aug 27, 2024 at 11:33:32PM +0000, Jon Kohler wrote:
>> 
>> 
>>> On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
>>> 
>>> On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote:
>>>> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
>>>> Reaching out about a tricky hung task issue that I'm running into. I've
>>>> got a virtualized Linux guest on top of a KVM based platform, running
>>>> a 5.10.y based kernel. The issue we're running into is a hung task that
>>>> *only* happens on shutdown/reboot of this particular VM once every 
>>>> 20-50 times.
>>>> 
>>>> The signature of the hung task is always similar to the output below,
>>>> where we appear to hang on the call to 
>>>>   synchronize_srcu(&fsnotify_mark_srcu)
>>> 
>>> One thing to try would be to add trace_printk() or similar to the SRCU
>>> readers, just in case someone was using srcu_read_lock_notrace() on
>>> fsnotify_mark_srcu, which I see no trace of in current mainline.
>>> 
>>> Alternatively, if there is a version where this does not happen, try
>>> bisecting.  Each bisection step would require something like 400-500
>>> shutdown/reboots to prove the commit good.  (Obviously, the first failure
>>> proves the commit bad, which for one-out-of-50 failures will take on
>>> average about 35 shutdown/reboots.)
>>> 
>>> There could also be a bad SRCU backport from mainline, so please check
>>> what SRCU backports you have in your 5.10.y stable release.  (Though
>>> maybe Jack has already done this?)
>>> 
>>> Thanx, Paul
>> 
>> Thanks, Paul
>> 
>> For posterity, this kernel is just built off of the regular ole stable tree, and here’s
>> All of the backports to kernel/rcu are below.
>> 
>> Stepping through this more, since we’re stalling at wait_for_completion,
>> that must mean that wakeme_after_rcu() ... complete(&rcu->completion)
>> is not happening, right?
>> 
>> That or somehow wakeme_after_rcu() fires *before* wait_for_completion()
>> is setup, causing the wait to hang forever?
> 
> There is no problem with that misordering -- in that case, the call to
> wait_for_completion() simply won't wait.
> 
> But that would be a good path to trace.  After all, you might have
> noticed that what we believe software will do does not always match what
> it actually does.

Agreed, I’ll noodle on it some more, and try to bisect as well. I see 
Neeraj responded separately, I’ll check that out now.
> 
> 
>> Is it possible for wakeme_after_rcu() to race, such that it fires somewhere
>> else *before* wait_for_completion gets all the way to schedule()?
> 
> That should also work just fine, but it is still a good assumption to
> check.
> 
> Good list below, but I must focus on mainline.  Apologies!
> 
> Plus this assumes that v5.10 worked for you -- has that been tested?
> So again, what version has worked for you?

We’ve had this particular service on 5.10.y for a while; however, this
behavior just recently started bubbling up with a somewhat new(er) set
of internal QA tests that very specifically stress shutdown/reboots in a
loop for a completely unrelated durability test, so its possible this issue
has existed for a while.

> 
> Thanx, Paul
> 
>> [rcu]$ pwd
>> /kernel/kernel/rcu
>> [rcu]$ git remote -v
>> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=1Tj05_-3u5bekAXIOn3TZBxlsLepxSWdhZVB6OKs02CYfxx4O9XkBJaihu2H2SaF&s=i-EJON2mPUDM-dvijeXkVr4nsR1g_obFaq9F9DCfQNs&e=  (fetch)
>> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=1Tj05_-3u5bekAXIOn3TZBxlsLepxSWdhZVB6OKs02CYfxx4O9XkBJaihu2H2SaF&s=i-EJON2mPUDM-dvijeXkVr4nsR1g_obFaq9F9DCfQNs&e=  (push)
>> [rcu]$ git log --oneline 2c85ebc57b3e..HEAD .
>> ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>>
>> ...
>> 175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>>
>> 55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle()
>> 066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t
>> d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section
>> 3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs
>> 9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
>> ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU
>> 4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations
>> 3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races
>> 058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races
>> 604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale
>> d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()
>> ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle()
>> b62c816bdb5e rcuscale: Always log error message
>> 8cd9917c13a7 rcuscale: Console output claims too few grace periods
>> 7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
>> a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
>> e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle()
>> eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
>> d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug
>> 2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()
>> 1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()
>> ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls
>> b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose
>> 7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release
>> 5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()
>> 0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE()
>> 36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure
>> 10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK
>> 1c6c3f233664 rcu-tasks: Fix race in schedule and flush work
>> a22d66eb518f rcu: Apply callbacks processing time limit only on softirq
>> 40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched()
>> fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state
>> 0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue
>> 657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs
>> 12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks
>> 0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass
>> 70692b06208c rcu: Mark accesses to rcu_state.n_force_qs
>> af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}()
>> 226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup()
>> 02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
>> 7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT
>> d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE
>> 497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
>> ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees
>> 527b56d7856f rcu: Fix to include first blocked task in stall warning
>> 4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection
>> b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods
>> 450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter
>> 641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period
>> f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods
>> fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period
>> 86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
>> 55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader()
>> 35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives
>> 23597afbe096 srcu: Fix broken node geometry after early ssp init
>> 728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()
>> 7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()
>> 09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument
>> e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check
>> 20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
>> 30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall()
>> 9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context
>> 5cacd18c5207 rcu,ftrace: Fix ftrace recursion
>> 4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>>
>> ...
>> 2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> 
>> 
>>> 
>>>> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
>>>> where two kernel threads are both calling synchronize_srcu, then
>>>> scheduling out in wait_for_completion, and completely going out to
>>>> lunch for over 4 minutes. This then triggers the hung task timeout and
>>>> things blow up.
>>>> 
>>>> We are running audit=1 for this system and are using an el8 based
>>>> userspace.
>>>> 
>>>> I've flipped through the fs/notify code base for both 5.10 as well as
>>>> upstream mainline to see if something jumped off the page, and I
>>>> haven't yet spotted any particular suspect code from the caller side.
>>>> 
>>>> This hang appears to come up at the very end of the shutdown/reboot
>>>> process, seemingly after the system starts to unwind through initrd.
>>>> 
>>>> What I'm working on now is adding some instrumentation to the dracut
>>>> shutdown initrd scripts to see if I can how far we get down that path
>>>> before the system fails to make forward progress, which may give some
>>>> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
>>>> a plethora of DEBUG options [2], and didn't get anything interesting.
>>>> To be clear, we haven't seen lockdep spit out any complaints as of yet.
>>>> 
>>>> Reaching out to see if this sounds familar to anyone on the list, or if
>>>> there are any particular areas of the RCU code base that might be
>>>> suspect for this kind of issue. I'm happy to provide more information,
>>>> as frankly, I'm quite stumped at the moment.
>>>> 
>>>> Thanks all,
>>>> Jon
>>>> 
>>>> [1] panic trace
>>>>   Normal shutdown process, then hangs on the following:
>>>>   ...
>>>>   dracut Warning: Killing all remaining processes
>>>>   ...
>>>>   INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds.
>>>>         Tainted: G           O      5.10.205-2.el8.x86_64 #1
>>>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>   task:kworker/u20:7   state:D stack:    0 pid:1200701 ppid:     2 flags:0x00004080
>>>>   Workqueue: events_unbound fsnotify_connector_destroy_workfn
>>>>   Call Trace:
>>>>    __schedule+0x267/0x790
>>>>    schedule+0x3c/0xb0
>>>>    schedule_timeout+0x219/0x2b0
>>>>    wait_for_completion+0x9e/0x100
>>>>    __synchronize_srcu.part.24+0x83/0xb0
>>>>    ? __bpf_trace_rcu_utilization+0x10/0x10
>>>>    ? synchronize_srcu+0x5d/0xf0
>>>>    fsnotify_connector_destroy_workfn+0x46/0x80
>>>>    process_one_work+0x1fc/0x390
>>>>    worker_thread+0x2d/0x3e0
>>>>    ? process_one_work+0x390/0x390
>>>>    kthread+0x114/0x130
>>>>    ? kthread_park+0x80/0x80
>>>>    ret_from_fork+0x1f/0x30
>>>>   INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds.
>>>>         Tainted: G           O      5.10.205-2.el8.x86_64 #1
>>>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>   task:kworker/u20:8   state:D stack:    0 pid:1287360 ppid:     2 flags:0x00004080
>>>>   Workqueue: events_unbound fsnotify_mark_destroy_workfn
>>>>   Call Trace:
>>>>    __schedule+0x267/0x790
>>>>    schedule+0x3c/0xb0
>>>>    schedule_timeout+0x219/0x2b0
>>>>    ? add_timer+0x14a/0x200
>>>>    wait_for_completion+0x9e/0x100
>>>>    __synchronize_srcu.part.24+0x83/0xb0
>>>>    ? __bpf_trace_rcu_utilization+0x10/0x10
>>>>    fsnotify_mark_destroy_workfn+0x77/0xe0
>>>>    process_one_work+0x1fc/0x390
>>>>    ? process_one_work+0x390/0x390
>>>>    worker_thread+0x2d/0x3e0
>>>>    ? process_one_work+0x390/0x390
>>>>    kthread+0x114/0x130
>>>>    ? kthread_park+0x80/0x80
>>>>    ret_from_fork+0x1f/0x30
>>>>   Kernel panic - not syncing: hung_task: blocked tasks
>>>>   CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G           O      5.10.205-2.el8.x86_64 #1
>>>>   Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014
>>>>   Call Trace:
>>>>    dump_stack+0x6d/0x8c
>>>>    panic+0x114/0x2ea
>>>>    watchdog.cold.8+0xb5/0xb5
>>>>    ? hungtask_pm_notify+0x50/0x50
>>>>    kthread+0x114/0x130
>>>>    ? kthread_park+0x80/0x80
>>>>    ret_from_fork+0x1f/0x30
>>>> 
>>>> [2] additional debugging config knobs turned up.
>>>>   CONFIG_PROVE_LOCKING=y
>>>>   CONFIG_LOCK_STAT=y
>>>>   CONFIG_DEBUG_RT_MUTEXES=y
>>>>   CONFIG_DEBUG_SPINLOCK=y
>>>>   CONFIG_DEBUG_MUTEXES=y
>>>>   CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
>>>>   CONFIG_DEBUG_RWSEMS=y
>>>>   CONFIG_DEBUG_LOCK_ALLOC=y
>>>>   CONFIG_LOCKDEP=y
>>>>   CONFIG_LOCKDEP_BITS=15
>>>>   CONFIG_LOCKDEP_CHAINS_BITS=16
>>>>   CONFIG_LOCKDEP_STACK_TRACE_BITS=19
>>>>   CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
>>>>   CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
>>>>   CONFIG_DEBUG_SHIRQ=y
>>>>   CONFIG_WQ_WATCHDOG=y
>>>>   CONFIG_DEBUG_ATOMIC_SLEEP=y
>>>>   CONFIG_DEBUG_LIST=y
>>>>   CONFIG_DEBUG_PLIST=y
>>>>   CONFIG_DEBUG_SG=y
>>>>   CONFIG_DEBUG_NOTIFIERS=y
>>>>   CONFIG_BUG_ON_DATA_CORRUPTION=y



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
  2024-08-28  2:41       ` Jon Kohler
@ 2024-08-28 10:14         ` Paul E. McKenney
  0 siblings, 0 replies; 12+ messages in thread
From: Paul E. McKenney @ 2024-08-28 10:14 UTC (permalink / raw)
  To: Jon Kohler
  Cc: rcu@vger.kernel.org, jiangshanlai@gmail.com,
	josh@joshtriplett.org, jack@suse.cz,
	linux-fsdevel@vger.kernel.org, Z qiang

On Wed, Aug 28, 2024 at 02:41:09AM +0000, Jon Kohler wrote:
> > On Aug 27, 2024, at 9:21 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > On Tue, Aug 27, 2024 at 11:33:32PM +0000, Jon Kohler wrote:
> >>> On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> >>> On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote:
> >>>> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
> >>>> Reaching out about a tricky hung task issue that I'm running into. I've
> >>>> got a virtualized Linux guest on top of a KVM based platform, running
> >>>> a 5.10.y based kernel. The issue we're running into is a hung task that
> >>>> *only* happens on shutdown/reboot of this particular VM once every 
> >>>> 20-50 times.
> >>>> 
> >>>> The signature of the hung task is always similar to the output below,
> >>>> where we appear to hang on the call to 
> >>>>   synchronize_srcu(&fsnotify_mark_srcu)
> >>> 
> >>> One thing to try would be to add trace_printk() or similar to the SRCU
> >>> readers, just in case someone was using srcu_read_lock_notrace() on
> >>> fsnotify_mark_srcu, which I see no trace of in current mainline.
> >>> 
> >>> Alternatively, if there is a version where this does not happen, try
> >>> bisecting.  Each bisection step would require something like 400-500
> >>> shutdown/reboots to prove the commit good.  (Obviously, the first failure
> >>> proves the commit bad, which for one-out-of-50 failures will take on
> >>> average about 35 shutdown/reboots.)
> >>> 
> >>> There could also be a bad SRCU backport from mainline, so please check
> >>> what SRCU backports you have in your 5.10.y stable release.  (Though
> >>> maybe Jack has already done this?)
> >>> 
> >>> Thanx, Paul
> >> 
> >> Thanks, Paul
> >> 
> >> For posterity, this kernel is just built off of the regular ole stable tree, and here’s
> >> All of the backports to kernel/rcu are below.
> >> 
> >> Stepping through this more, since we’re stalling at wait_for_completion,
> >> that must mean that wakeme_after_rcu() ... complete(&rcu->completion)
> >> is not happening, right?
> >> 
> >> That or somehow wakeme_after_rcu() fires *before* wait_for_completion()
> >> is setup, causing the wait to hang forever?
> > 
> > There is no problem with that misordering -- in that case, the call to
> > wait_for_completion() simply won't wait.
> > 
> > But that would be a good path to trace.  After all, you might have
> > noticed that what we believe software will do does not always match what
> > it actually does.
> 
> Agreed, I’ll noodle on it some more, and try to bisect as well. I see 
> Neeraj responded separately, I’ll check that out now.
> > 
> > 
> >> Is it possible for wakeme_after_rcu() to race, such that it fires somewhere
> >> else *before* wait_for_completion gets all the way to schedule()?
> > 
> > That should also work just fine, but it is still a good assumption to
> > check.
> > 
> > Good list below, but I must focus on mainline.  Apologies!
> > 
> > Plus this assumes that v5.10 worked for you -- has that been tested?
> > So again, what version has worked for you?
> 
> We’ve had this particular service on 5.10.y for a while; however, this
> behavior just recently started bubbling up with a somewhat new(er) set
> of internal QA tests that very specifically stress shutdown/reboots in a
> loop for a completely unrelated durability test, so its possible this issue
> has existed for a while.

I hope that the commit that Neeraj identified fixes this for you, and
thank you very much, Neeraj!

However, if it does not, I suggest: (1) running your new tests on older
versions of 5.10.y in case the bug was introduced by partial back ports,
(2) running your tests on newer versions of mainline in case there is
another fix that needs to be backported, and, if necessary, (3) work on
the tests so that the problem reproduces faster, reducing the costs of
further experiments.

							Thanx, Paul

> > Thanx, Paul
> > 
> >> [rcu]$ pwd
> >> /kernel/kernel/rcu
> >> [rcu]$ git remote -v
> >> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=1Tj05_-3u5bekAXIOn3TZBxlsLepxSWdhZVB6OKs02CYfxx4O9XkBJaihu2H2SaF&s=i-EJON2mPUDM-dvijeXkVr4nsR1g_obFaq9F9DCfQNs&e=  (fetch)
> >> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=1Tj05_-3u5bekAXIOn3TZBxlsLepxSWdhZVB6OKs02CYfxx4O9XkBJaihu2H2SaF&s=i-EJON2mPUDM-dvijeXkVr4nsR1g_obFaq9F9DCfQNs&e=  (push)
> >> [rcu]$ git log --oneline 2c85ebc57b3e..HEAD .
> >> ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>>
> >> ...
> >> 175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>>
> >> 55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle()
> >> 066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t
> >> d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section
> >> 3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs
> >> 9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
> >> ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU
> >> 4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations
> >> 3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races
> >> 058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races
> >> 604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale
> >> d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()
> >> ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle()
> >> b62c816bdb5e rcuscale: Always log error message
> >> 8cd9917c13a7 rcuscale: Console output claims too few grace periods
> >> 7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
> >> a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
> >> e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle()
> >> eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
> >> d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug
> >> 2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()
> >> 1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()
> >> ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls
> >> b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose
> >> 7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release
> >> 5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()
> >> 0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE()
> >> 36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure
> >> 10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK
> >> 1c6c3f233664 rcu-tasks: Fix race in schedule and flush work
> >> a22d66eb518f rcu: Apply callbacks processing time limit only on softirq
> >> 40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched()
> >> fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state
> >> 0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue
> >> 657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs
> >> 12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks
> >> 0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass
> >> 70692b06208c rcu: Mark accesses to rcu_state.n_force_qs
> >> af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}()
> >> 226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup()
> >> 02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
> >> 7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT
> >> d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE
> >> 497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
> >> ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees
> >> 527b56d7856f rcu: Fix to include first blocked task in stall warning
> >> 4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection
> >> b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods
> >> 450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter
> >> 641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period
> >> f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods
> >> fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period
> >> 86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
> >> 55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader()
> >> 35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives
> >> 23597afbe096 srcu: Fix broken node geometry after early ssp init
> >> 728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()
> >> 7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()
> >> 09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument
> >> e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check
> >> 20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
> >> 30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall()
> >> 9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context
> >> 5cacd18c5207 rcu,ftrace: Fix ftrace recursion
> >> 4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>>
> >> ...
> >> 2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> 
> >> 
> >>> 
> >>>> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
> >>>> where two kernel threads are both calling synchronize_srcu, then
> >>>> scheduling out in wait_for_completion, and completely going out to
> >>>> lunch for over 4 minutes. This then triggers the hung task timeout and
> >>>> things blow up.
> >>>> 
> >>>> We are running audit=1 for this system and are using an el8 based
> >>>> userspace.
> >>>> 
> >>>> I've flipped through the fs/notify code base for both 5.10 as well as
> >>>> upstream mainline to see if something jumped off the page, and I
> >>>> haven't yet spotted any particular suspect code from the caller side.
> >>>> 
> >>>> This hang appears to come up at the very end of the shutdown/reboot
> >>>> process, seemingly after the system starts to unwind through initrd.
> >>>> 
> >>>> What I'm working on now is adding some instrumentation to the dracut
> >>>> shutdown initrd scripts to see if I can how far we get down that path
> >>>> before the system fails to make forward progress, which may give some
> >>>> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
> >>>> a plethora of DEBUG options [2], and didn't get anything interesting.
> >>>> To be clear, we haven't seen lockdep spit out any complaints as of yet.
> >>>> 
> >>>> Reaching out to see if this sounds familar to anyone on the list, or if
> >>>> there are any particular areas of the RCU code base that might be
> >>>> suspect for this kind of issue. I'm happy to provide more information,
> >>>> as frankly, I'm quite stumped at the moment.
> >>>> 
> >>>> Thanks all,
> >>>> Jon
> >>>> 
> >>>> [1] panic trace
> >>>>   Normal shutdown process, then hangs on the following:
> >>>>   ...
> >>>>   dracut Warning: Killing all remaining processes
> >>>>   ...
> >>>>   INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds.
> >>>>         Tainted: G           O      5.10.205-2.el8.x86_64 #1
> >>>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>>   task:kworker/u20:7   state:D stack:    0 pid:1200701 ppid:     2 flags:0x00004080
> >>>>   Workqueue: events_unbound fsnotify_connector_destroy_workfn
> >>>>   Call Trace:
> >>>>    __schedule+0x267/0x790
> >>>>    schedule+0x3c/0xb0
> >>>>    schedule_timeout+0x219/0x2b0
> >>>>    wait_for_completion+0x9e/0x100
> >>>>    __synchronize_srcu.part.24+0x83/0xb0
> >>>>    ? __bpf_trace_rcu_utilization+0x10/0x10
> >>>>    ? synchronize_srcu+0x5d/0xf0
> >>>>    fsnotify_connector_destroy_workfn+0x46/0x80
> >>>>    process_one_work+0x1fc/0x390
> >>>>    worker_thread+0x2d/0x3e0
> >>>>    ? process_one_work+0x390/0x390
> >>>>    kthread+0x114/0x130
> >>>>    ? kthread_park+0x80/0x80
> >>>>    ret_from_fork+0x1f/0x30
> >>>>   INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds.
> >>>>         Tainted: G           O      5.10.205-2.el8.x86_64 #1
> >>>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>>   task:kworker/u20:8   state:D stack:    0 pid:1287360 ppid:     2 flags:0x00004080
> >>>>   Workqueue: events_unbound fsnotify_mark_destroy_workfn
> >>>>   Call Trace:
> >>>>    __schedule+0x267/0x790
> >>>>    schedule+0x3c/0xb0
> >>>>    schedule_timeout+0x219/0x2b0
> >>>>    ? add_timer+0x14a/0x200
> >>>>    wait_for_completion+0x9e/0x100
> >>>>    __synchronize_srcu.part.24+0x83/0xb0
> >>>>    ? __bpf_trace_rcu_utilization+0x10/0x10
> >>>>    fsnotify_mark_destroy_workfn+0x77/0xe0
> >>>>    process_one_work+0x1fc/0x390
> >>>>    ? process_one_work+0x390/0x390
> >>>>    worker_thread+0x2d/0x3e0
> >>>>    ? process_one_work+0x390/0x390
> >>>>    kthread+0x114/0x130
> >>>>    ? kthread_park+0x80/0x80
> >>>>    ret_from_fork+0x1f/0x30
> >>>>   Kernel panic - not syncing: hung_task: blocked tasks
> >>>>   CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G           O      5.10.205-2.el8.x86_64 #1
> >>>>   Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014
> >>>>   Call Trace:
> >>>>    dump_stack+0x6d/0x8c
> >>>>    panic+0x114/0x2ea
> >>>>    watchdog.cold.8+0xb5/0xb5
> >>>>    ? hungtask_pm_notify+0x50/0x50
> >>>>    kthread+0x114/0x130
> >>>>    ? kthread_park+0x80/0x80
> >>>>    ret_from_fork+0x1f/0x30
> >>>> 
> >>>> [2] additional debugging config knobs turned up.
> >>>>   CONFIG_PROVE_LOCKING=y
> >>>>   CONFIG_LOCK_STAT=y
> >>>>   CONFIG_DEBUG_RT_MUTEXES=y
> >>>>   CONFIG_DEBUG_SPINLOCK=y
> >>>>   CONFIG_DEBUG_MUTEXES=y
> >>>>   CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> >>>>   CONFIG_DEBUG_RWSEMS=y
> >>>>   CONFIG_DEBUG_LOCK_ALLOC=y
> >>>>   CONFIG_LOCKDEP=y
> >>>>   CONFIG_LOCKDEP_BITS=15
> >>>>   CONFIG_LOCKDEP_CHAINS_BITS=16
> >>>>   CONFIG_LOCKDEP_STACK_TRACE_BITS=19
> >>>>   CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
> >>>>   CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
> >>>>   CONFIG_DEBUG_SHIRQ=y
> >>>>   CONFIG_WQ_WATCHDOG=y
> >>>>   CONFIG_DEBUG_ATOMIC_SLEEP=y
> >>>>   CONFIG_DEBUG_LIST=y
> >>>>   CONFIG_DEBUG_PLIST=y
> >>>>   CONFIG_DEBUG_SG=y
> >>>>   CONFIG_DEBUG_NOTIFIERS=y
> >>>>   CONFIG_BUG_ON_DATA_CORRUPTION=y
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
  2024-08-27 23:33   ` Jon Kohler
  2024-08-28  1:21     ` Paul E. McKenney
@ 2024-08-28  2:33     ` Neeraj Upadhyay
  2024-08-28  3:01       ` Jon Kohler
  1 sibling, 1 reply; 12+ messages in thread
From: Neeraj Upadhyay @ 2024-08-28  2:33 UTC (permalink / raw)
  To: Jon Kohler
  Cc: paulmck@kernel.org, rcu@vger.kernel.org, jiangshanlai@gmail.com,
	josh@joshtriplett.org

On Tue, Aug 27, 2024 at 11:33:32PM +0000, Jon Kohler wrote:
> 
> 
> > On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > 
> > On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote:
> >> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
> >> Reaching out about a tricky hung task issue that I'm running into. I've
> >> got a virtualized Linux guest on top of a KVM based platform, running
> >> a 5.10.y based kernel. The issue we're running into is a hung task that
> >> *only* happens on shutdown/reboot of this particular VM once every 
> >> 20-50 times.
> >> 
> >> The signature of the hung task is always similar to the output below,
> >> where we appear to hang on the call to 
> >>    synchronize_srcu(&fsnotify_mark_srcu)
> > 
> > One thing to try would be to add trace_printk() or similar to the SRCU
> > readers, just in case someone was using srcu_read_lock_notrace() on
> > fsnotify_mark_srcu, which I see no trace of in current mainline.
> > 
> > Alternatively, if there is a version where this does not happen, try
> > bisecting.  Each bisection step would require something like 400-500
> > shutdown/reboots to prove the commit good.  (Obviously, the first failure
> > proves the commit bad, which for one-out-of-50 failures will take on
> > average about 35 shutdown/reboots.)
> > 
> > There could also be a bad SRCU backport from mainline, so please check
> > what SRCU backports you have in your 5.10.y stable release.  (Though
> > maybe Jack has already done this?)
> > 
> > Thanx, Paul
> 
> Thanks, Paul
> 
> For posterity, this kernel is just built off of the regular ole stable tree, and here’s
> All of the backports to kernel/rcu are below.
> 
> Stepping through this more, since we’re stalling at wait_for_completion,
> that must mean that wakeme_after_rcu() ... complete(&rcu->completion)
> is not happening, right?
> 
> That or somehow wakeme_after_rcu() fires *before* wait_for_completion()
> is setup, causing the wait to hang forever?
> 
> Is it possible for wakeme_after_rcu() to race, such that it fires somewhere
> else *before* wait_for_completion gets all the way to schedule()?
> 
> [rcu]$ pwd
> /kernel/kernel/rcu
> [rcu]$ git remote -v
> origin https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (fetch)
> origin https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (push)
> [rcu]$ git log --oneline 2c85ebc57b3e..HEAD .
> ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>>
> ...
> 175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>>
> 55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle()
> 066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t
> d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section
> 3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs
> 9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
> ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU
> 4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations
> 3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races
> 058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races
> 604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale
> d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()
> ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle()
> b62c816bdb5e rcuscale: Always log error message
> 8cd9917c13a7 rcuscale: Console output claims too few grace periods
> 7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
> a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
> e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle()
> eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
> d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug
> 2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()
> 1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()
> ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls
> b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose
> 7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release
> 5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()
> 0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE()
> 36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure
> 10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK
> 1c6c3f233664 rcu-tasks: Fix race in schedule and flush work
> a22d66eb518f rcu: Apply callbacks processing time limit only on softirq
> 40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched()
> fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state
> 0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue
> 657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs
> 12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks
> 0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass
> 70692b06208c rcu: Mark accesses to rcu_state.n_force_qs
> af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}()
> 226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup()
> 02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
> 7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT
> d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE
> 497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
> ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees
> 527b56d7856f rcu: Fix to include first blocked task in stall warning
> 4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection
> b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods
> 450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter
> 641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period
> f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods
> fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period
> 86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
> 55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader()
> 35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives
> 23597afbe096 srcu: Fix broken node geometry after early ssp init
> 728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()
> 7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()
> 09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument
> e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check
> 20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
> 30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall()
> 9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context
> 5cacd18c5207 rcu,ftrace: Fix ftrace recursion
> 4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>>
> ...
> 2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> 
>

Can you check if this commit [1] is present in your tree. If not, please
try with it.

Also, you might want to check the state of the kworker running
process_srcu() work fn, to see where the worker thread corresponding
to this SRCU domain is waiting.


[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=5573fdbc3423475aae4b0c2e3b0076d6216e9ed1
> > 
> >> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
> >> where two kernel threads are both calling synchronize_srcu, then
> >> scheduling out in wait_for_completion, and completely going out to
> >> lunch for over 4 minutes. This then triggers the hung task timeout and
> >> things blow up.
> >> 
> >> We are running audit=1 for this system and are using an el8 based
> >> userspace.
> >> 
> >> I've flipped through the fs/notify code base for both 5.10 as well as
> >> upstream mainline to see if something jumped off the page, and I
> >> haven't yet spotted any particular suspect code from the caller side.
> >> 
> >> This hang appears to come up at the very end of the shutdown/reboot
> >> process, seemingly after the system starts to unwind through initrd.
> >> 
> >> What I'm working on now is adding some instrumentation to the dracut
> >> shutdown initrd scripts to see if I can how far we get down that path
> >> before the system fails to make forward progress, which may give some
> >> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
> >> a plethora of DEBUG options [2], and didn't get anything interesting.
> >> To be clear, we haven't seen lockdep spit out any complaints as of yet.
> >> 
> >> Reaching out to see if this sounds familar to anyone on the list, or if
> >> there are any particular areas of the RCU code base that might be
> >> suspect for this kind of issue. I'm happy to provide more information,
> >> as frankly, I'm quite stumped at the moment.
> >> 
> >> Thanks all,
> >> Jon
> >> 
> >> [1] panic trace
> >>    Normal shutdown process, then hangs on the following:
> >>    ...
> >>    dracut Warning: Killing all remaining processes
> >>    ...
> >>    INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds.
> >>          Tainted: G           O      5.10.205-2.el8.x86_64 #1
> >>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>    task:kworker/u20:7   state:D stack:    0 pid:1200701 ppid:     2 flags:0x00004080
> >>    Workqueue: events_unbound fsnotify_connector_destroy_workfn
> >>    Call Trace:
> >>     __schedule+0x267/0x790
> >>     schedule+0x3c/0xb0
> >>     schedule_timeout+0x219/0x2b0
> >>     wait_for_completion+0x9e/0x100
> >>     __synchronize_srcu.part.24+0x83/0xb0
> >>     ? __bpf_trace_rcu_utilization+0x10/0x10
> >>     ? synchronize_srcu+0x5d/0xf0
> >>     fsnotify_connector_destroy_workfn+0x46/0x80
> >>     process_one_work+0x1fc/0x390
> >>     worker_thread+0x2d/0x3e0
> >>     ? process_one_work+0x390/0x390
> >>     kthread+0x114/0x130
> >>     ? kthread_park+0x80/0x80
> >>     ret_from_fork+0x1f/0x30
> >>    INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds.
> >>          Tainted: G           O      5.10.205-2.el8.x86_64 #1
> >>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>    task:kworker/u20:8   state:D stack:    0 pid:1287360 ppid:     2 flags:0x00004080
> >>    Workqueue: events_unbound fsnotify_mark_destroy_workfn
> >>    Call Trace:
> >>     __schedule+0x267/0x790
> >>     schedule+0x3c/0xb0
> >>     schedule_timeout+0x219/0x2b0
> >>     ? add_timer+0x14a/0x200
> >>     wait_for_completion+0x9e/0x100
> >>     __synchronize_srcu.part.24+0x83/0xb0
> >>     ? __bpf_trace_rcu_utilization+0x10/0x10
> >>     fsnotify_mark_destroy_workfn+0x77/0xe0
> >>     process_one_work+0x1fc/0x390
> >>     ? process_one_work+0x390/0x390
> >>     worker_thread+0x2d/0x3e0
> >>     ? process_one_work+0x390/0x390
> >>     kthread+0x114/0x130
> >>     ? kthread_park+0x80/0x80
> >>     ret_from_fork+0x1f/0x30
> >>    Kernel panic - not syncing: hung_task: blocked tasks
> >>    CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G           O      5.10.205-2.el8.x86_64 #1
> >>    Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014
> >>    Call Trace:
> >>     dump_stack+0x6d/0x8c
> >>     panic+0x114/0x2ea
> >>     watchdog.cold.8+0xb5/0xb5
> >>     ? hungtask_pm_notify+0x50/0x50
> >>     kthread+0x114/0x130
> >>     ? kthread_park+0x80/0x80
> >>     ret_from_fork+0x1f/0x30
> >> 
> >> [2] additional debugging config knobs turned up.
> >>    CONFIG_PROVE_LOCKING=y
> >>    CONFIG_LOCK_STAT=y
> >>    CONFIG_DEBUG_RT_MUTEXES=y
> >>    CONFIG_DEBUG_SPINLOCK=y
> >>    CONFIG_DEBUG_MUTEXES=y
> >>    CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> >>    CONFIG_DEBUG_RWSEMS=y
> >>    CONFIG_DEBUG_LOCK_ALLOC=y
> >>    CONFIG_LOCKDEP=y
> >>    CONFIG_LOCKDEP_BITS=15
> >>    CONFIG_LOCKDEP_CHAINS_BITS=16
> >>    CONFIG_LOCKDEP_STACK_TRACE_BITS=19
> >>    CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
> >>    CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
> >>    CONFIG_DEBUG_SHIRQ=y
> >>    CONFIG_WQ_WATCHDOG=y
> >>    CONFIG_DEBUG_ATOMIC_SLEEP=y
> >>    CONFIG_DEBUG_LIST=y
> >>    CONFIG_DEBUG_PLIST=y
> >>    CONFIG_DEBUG_SG=y
> >>    CONFIG_DEBUG_NOTIFIERS=y
> >>    CONFIG_BUG_ON_DATA_CORRUPTION=y
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
  2024-08-28  2:33     ` Neeraj Upadhyay
@ 2024-08-28  3:01       ` Jon Kohler
  0 siblings, 0 replies; 12+ messages in thread
From: Jon Kohler @ 2024-08-28  3:01 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: paulmck@kernel.org, rcu@vger.kernel.org, jiangshanlai@gmail.com,
	josh@joshtriplett.org



> On Aug 27, 2024, at 10:33 PM, Neeraj Upadhyay <Neeraj.Upadhyay@kernel.org> wrote:
> 
> On Tue, Aug 27, 2024 at 11:33:32PM +0000, Jon Kohler wrote:
>> 
>> 
>>> On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
>>> 
>>> On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote:
>>>> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
>>>> Reaching out about a tricky hung task issue that I'm running into. I've
>>>> got a virtualized Linux guest on top of a KVM based platform, running
>>>> a 5.10.y based kernel. The issue we're running into is a hung task that
>>>> *only* happens on shutdown/reboot of this particular VM once every 
>>>> 20-50 times.
>>>> 
>>>> The signature of the hung task is always similar to the output below,
>>>> where we appear to hang on the call to 
>>>>   synchronize_srcu(&fsnotify_mark_srcu)
>>> 
>>> One thing to try would be to add trace_printk() or similar to the SRCU
>>> readers, just in case someone was using srcu_read_lock_notrace() on
>>> fsnotify_mark_srcu, which I see no trace of in current mainline.
>>> 
>>> Alternatively, if there is a version where this does not happen, try
>>> bisecting.  Each bisection step would require something like 400-500
>>> shutdown/reboots to prove the commit good.  (Obviously, the first failure
>>> proves the commit bad, which for one-out-of-50 failures will take on
>>> average about 35 shutdown/reboots.)
>>> 
>>> There could also be a bad SRCU backport from mainline, so please check
>>> what SRCU backports you have in your 5.10.y stable release.  (Though
>>> maybe Jack has already done this?)
>>> 
>>> Thanx, Paul
>> 
>> Thanks, Paul
>> 
>> For posterity, this kernel is just built off of the regular ole stable tree, and here’s
>> All of the backports to kernel/rcu are below.
>> 
>> Stepping through this more, since we’re stalling at wait_for_completion,
>> that must mean that wakeme_after_rcu() ... complete(&rcu->completion)
>> is not happening, right?
>> 
>> That or somehow wakeme_after_rcu() fires *before* wait_for_completion()
>> is setup, causing the wait to hang forever?
>> 
>> Is it possible for wakeme_after_rcu() to race, such that it fires somewhere
>> else *before* wait_for_completion gets all the way to schedule()?
>> 
>> [rcu]$ pwd
>> /kernel/kernel/rcu
>> [rcu]$ git remote -v
>> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=gL9jpTalkQKRr468aytGK0c3GmISBnwoMhllG4jXQN4du5Z7hXSogIB-19whbUs3&s=QaiqdZTW1g36HjvkDrG9prdjMhKoxNtp3Vz4B4Xwcas&e=  (fetch)
>> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=gL9jpTalkQKRr468aytGK0c3GmISBnwoMhllG4jXQN4du5Z7hXSogIB-19whbUs3&s=QaiqdZTW1g36HjvkDrG9prdjMhKoxNtp3Vz4B4Xwcas&e=  (push)
>> [rcu]$ git log --oneline 2c85ebc57b3e..HEAD .
>> ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>>
>> ...
>> 175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>>
>> 55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle()
>> 066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t
>> d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section
>> 3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs
>> 9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
>> ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU
>> 4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations
>> 3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races
>> 058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races
>> 604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale
>> d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()
>> ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle()
>> b62c816bdb5e rcuscale: Always log error message
>> 8cd9917c13a7 rcuscale: Console output claims too few grace periods
>> 7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
>> a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
>> e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle()
>> eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
>> d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug
>> 2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()
>> 1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()
>> ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls
>> b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose
>> 7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release
>> 5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()
>> 0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE()
>> 36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure
>> 10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK
>> 1c6c3f233664 rcu-tasks: Fix race in schedule and flush work
>> a22d66eb518f rcu: Apply callbacks processing time limit only on softirq
>> 40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched()
>> fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state
>> 0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue
>> 657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs
>> 12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks
>> 0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass
>> 70692b06208c rcu: Mark accesses to rcu_state.n_force_qs
>> af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}()
>> 226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup()
>> 02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
>> 7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT
>> d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE
>> 497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
>> ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees
>> 527b56d7856f rcu: Fix to include first blocked task in stall warning
>> 4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection
>> b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods
>> 450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter
>> 641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period
>> f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods
>> fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period
>> 86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
>> 55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader()
>> 35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives
>> 23597afbe096 srcu: Fix broken node geometry after early ssp init
>> 728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()
>> 7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()
>> 09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument
>> e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check
>> 20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
>> 30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall()
>> 9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context
>> 5cacd18c5207 rcu,ftrace: Fix ftrace recursion
>> 4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>>
>> ...
>> 2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> 
>> 
> 
> Can you check if this commit [1] is present in your tree. If not, please
> try with it.
> 
> Also, you might want to check the state of the kworker running
> process_srcu() work fn, to see where the worker thread corresponding
> to this SRCU domain is waiting.

Neeraj,
Thanks for the tip and advice. The commit msg sounds juicy and interesting, we’ll give it a
try and see if that leads to any interesting findings.

That mentioned commit (srcu: Fix callbacks acceleration mishandling) is not in 5.10.y LTS

Both 6.6.y and 6.1.y have that commit
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/kernel/rcu/srcutree.c?h=v6.6.47
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/kernel/rcu/srcutree.c?h=v6.1.106

5.15.y and 5.10.y (and below) do not have that commit
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/kernel/rcu/srcutree.c?h=v5.15.165
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/kernel/rcu/srcutree.c?h=v5.10.224

> 
> 
> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git_commit_-3Fid-3D5573fdbc3423475aae4b0c2e3b0076d6216e9ed1&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=gL9jpTalkQKRr468aytGK0c3GmISBnwoMhllG4jXQN4du5Z7hXSogIB-19whbUs3&s=lL8zRfuq9qPR4JHcfWJQdUn7cQmSH3UYXg-eY0Un28k&e=
>>> 
>>>> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
>>>> where two kernel threads are both calling synchronize_srcu, then
>>>> scheduling out in wait_for_completion, and completely going out to
>>>> lunch for over 4 minutes. This then triggers the hung task timeout and
>>>> things blow up.
>>>> 
>>>> We are running audit=1 for this system and are using an el8 based
>>>> userspace.
>>>> 
>>>> I've flipped through the fs/notify code base for both 5.10 as well as
>>>> upstream mainline to see if something jumped off the page, and I
>>>> haven't yet spotted any particular suspect code from the caller side.
>>>> 
>>>> This hang appears to come up at the very end of the shutdown/reboot
>>>> process, seemingly after the system starts to unwind through initrd.
>>>> 
>>>> What I'm working on now is adding some instrumentation to the dracut
>>>> shutdown initrd scripts to see if I can how far we get down that path
>>>> before the system fails to make forward progress, which may give some
>>>> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
>>>> a plethora of DEBUG options [2], and didn't get anything interesting.
>>>> To be clear, we haven't seen lockdep spit out any complaints as of yet.
>>>> 
>>>> Reaching out to see if this sounds familar to anyone on the list, or if
>>>> there are any particular areas of the RCU code base that might be
>>>> suspect for this kind of issue. I'm happy to provide more information,
>>>> as frankly, I'm quite stumped at the moment.
>>>> 
>>>> Thanks all,
>>>> Jon
>>>> 
>>>> [1] panic trace
>>>>   Normal shutdown process, then hangs on the following:
>>>>   ...
>>>>   dracut Warning: Killing all remaining processes
>>>>   ...
>>>>   INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds.
>>>>         Tainted: G           O      5.10.205-2.el8.x86_64 #1
>>>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>   task:kworker/u20:7   state:D stack:    0 pid:1200701 ppid:     2 flags:0x00004080
>>>>   Workqueue: events_unbound fsnotify_connector_destroy_workfn
>>>>   Call Trace:
>>>>    __schedule+0x267/0x790
>>>>    schedule+0x3c/0xb0
>>>>    schedule_timeout+0x219/0x2b0
>>>>    wait_for_completion+0x9e/0x100
>>>>    __synchronize_srcu.part.24+0x83/0xb0
>>>>    ? __bpf_trace_rcu_utilization+0x10/0x10
>>>>    ? synchronize_srcu+0x5d/0xf0
>>>>    fsnotify_connector_destroy_workfn+0x46/0x80
>>>>    process_one_work+0x1fc/0x390
>>>>    worker_thread+0x2d/0x3e0
>>>>    ? process_one_work+0x390/0x390
>>>>    kthread+0x114/0x130
>>>>    ? kthread_park+0x80/0x80
>>>>    ret_from_fork+0x1f/0x30
>>>>   INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds.
>>>>         Tainted: G           O      5.10.205-2.el8.x86_64 #1
>>>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>   task:kworker/u20:8   state:D stack:    0 pid:1287360 ppid:     2 flags:0x00004080
>>>>   Workqueue: events_unbound fsnotify_mark_destroy_workfn
>>>>   Call Trace:
>>>>    __schedule+0x267/0x790
>>>>    schedule+0x3c/0xb0
>>>>    schedule_timeout+0x219/0x2b0
>>>>    ? add_timer+0x14a/0x200
>>>>    wait_for_completion+0x9e/0x100
>>>>    __synchronize_srcu.part.24+0x83/0xb0
>>>>    ? __bpf_trace_rcu_utilization+0x10/0x10
>>>>    fsnotify_mark_destroy_workfn+0x77/0xe0
>>>>    process_one_work+0x1fc/0x390
>>>>    ? process_one_work+0x390/0x390
>>>>    worker_thread+0x2d/0x3e0
>>>>    ? process_one_work+0x390/0x390
>>>>    kthread+0x114/0x130
>>>>    ? kthread_park+0x80/0x80
>>>>    ret_from_fork+0x1f/0x30
>>>>   Kernel panic - not syncing: hung_task: blocked tasks
>>>>   CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G           O      5.10.205-2.el8.x86_64 #1
>>>>   Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014
>>>>   Call Trace:
>>>>    dump_stack+0x6d/0x8c
>>>>    panic+0x114/0x2ea
>>>>    watchdog.cold.8+0xb5/0xb5
>>>>    ? hungtask_pm_notify+0x50/0x50
>>>>    kthread+0x114/0x130
>>>>    ? kthread_park+0x80/0x80
>>>>    ret_from_fork+0x1f/0x30
>>>> 
>>>> [2] additional debugging config knobs turned up.
>>>>   CONFIG_PROVE_LOCKING=y
>>>>   CONFIG_LOCK_STAT=y
>>>>   CONFIG_DEBUG_RT_MUTEXES=y
>>>>   CONFIG_DEBUG_SPINLOCK=y
>>>>   CONFIG_DEBUG_MUTEXES=y
>>>>   CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
>>>>   CONFIG_DEBUG_RWSEMS=y
>>>>   CONFIG_DEBUG_LOCK_ALLOC=y
>>>>   CONFIG_LOCKDEP=y
>>>>   CONFIG_LOCKDEP_BITS=15
>>>>   CONFIG_LOCKDEP_CHAINS_BITS=16
>>>>   CONFIG_LOCKDEP_STACK_TRACE_BITS=19
>>>>   CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
>>>>   CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
>>>>   CONFIG_DEBUG_SHIRQ=y
>>>>   CONFIG_WQ_WATCHDOG=y
>>>>   CONFIG_DEBUG_ATOMIC_SLEEP=y
>>>>   CONFIG_DEBUG_LIST=y
>>>>   CONFIG_DEBUG_PLIST=y
>>>>   CONFIG_DEBUG_SG=y
>>>>   CONFIG_DEBUG_NOTIFIERS=y
>>>>   CONFIG_BUG_ON_DATA_CORRUPTION=y
>> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
  2024-08-27 20:01 SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) Jon Kohler
  2024-08-27 20:34 ` Paul E. McKenney
@ 2024-09-04  9:19 ` Jan Kara
  2024-09-04 14:40   ` Jon Kohler
  1 sibling, 1 reply; 12+ messages in thread
From: Jan Kara @ 2024-09-04  9:19 UTC (permalink / raw)
  To: Jon Kohler
  Cc: paulmck@kernel.org, rcu@vger.kernel.org, jiangshanlai@gmail.com,
	josh@joshtriplett.org, jack@suse.cz,
	linux-fsdevel@vger.kernel.org

On Tue 27-08-24 20:01:27, Jon Kohler wrote:
> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
> Reaching out about a tricky hung task issue that I'm running into. I've
> got a virtualized Linux guest on top of a KVM based platform, running
> a 5.10.y based kernel. The issue we're running into is a hung task that
> *only* happens on shutdown/reboot of this particular VM once every 
> 20-50 times.
> 
> The signature of the hung task is always similar to the output below,
> where we appear to hang on the call to 
>     synchronize_srcu(&fsnotify_mark_srcu)
> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
> where two kernel threads are both calling synchronize_srcu, then
> scheduling out in wait_for_completion, and completely going out to
> lunch for over 4 minutes. This then triggers the hung task timeout and
> things blow up.

Well, the most obvious reason for this would be that some process is
hanging somewhere with fsnotify_mark_srcu held. When this happens, can you
trigger sysrq-w in the VM and send here its output?

> We are running audit=1 for this system and are using an el8 based
> userspace.
> 
> I've flipped through the fs/notify code base for both 5.10 as well as
> upstream mainline to see if something jumped off the page, and I
> haven't yet spotted any particular suspect code from the caller side.
> 
> This hang appears to come up at the very end of the shutdown/reboot
> process, seemingly after the system starts to unwind through initrd.
> 
> What I'm working on now is adding some instrumentation to the dracut
> shutdown initrd scripts to see if I can how far we get down that path
> before the system fails to make forward progress, which may give some
> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
> a plethora of DEBUG options [2], and didn't get anything interesting.
> To be clear, we haven't seen lockdep spit out any complaints as of yet.

The fact that lockdep doesn't report anything is interesting but then
lockdep doesn't track everything. In particular I think SRCU itself isn't
tracked by lockdep.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
  2024-09-04  9:19 ` Jan Kara
@ 2024-09-04 14:40   ` Jon Kohler
  2024-09-04 19:51     ` Paul E. McKenney
  2024-09-05 13:53     ` Jan Kara
  0 siblings, 2 replies; 12+ messages in thread
From: Jon Kohler @ 2024-09-04 14:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: paulmck@kernel.org, rcu@vger.kernel.org, jiangshanlai@gmail.com,
	josh@joshtriplett.org, linux-fsdevel@vger.kernel.org

> On Sep 4, 2024, at 5:19 AM, Jan Kara <jack@suse.cz> wrote:
> 
> !-------------------------------------------------------------------|
>  CAUTION: External Email
> 
> |-------------------------------------------------------------------!
> 
> On Tue 27-08-24 20:01:27, Jon Kohler wrote:
>> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
>> Reaching out about a tricky hung task issue that I'm running into. I've
>> got a virtualized Linux guest on top of a KVM based platform, running
>> a 5.10.y based kernel. The issue we're running into is a hung task that
>> *only* happens on shutdown/reboot of this particular VM once every 
>> 20-50 times.
>> 
>> The signature of the hung task is always similar to the output below,
>> where we appear to hang on the call to 
>>    synchronize_srcu(&fsnotify_mark_srcu)
>> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
>> where two kernel threads are both calling synchronize_srcu, then
>> scheduling out in wait_for_completion, and completely going out to
>> lunch for over 4 minutes. This then triggers the hung task timeout and
>> things blow up.
> 
> Well, the most obvious reason for this would be that some process is
> hanging somewhere with fsnotify_mark_srcu held. When this happens, can you
> trigger sysrq-w in the VM and send here its output?

Jan - Thanks for the ping, that is *exactly* what is happening here.
Some developments since my last note, the patch Neeraj pointed out
wasn't the issue, but rather a confluence of realtime thread configurations
that ended up completely starving whatever CPU was processing per-CPU
callbacks. So, one thread would go out to lunch completely, and it would
just never yield. This particular system was configured with RT_RUNTIME_SHARE
unfortunately, so that realtime thread going out to lunch ate the entire system.

What was odd is that this never, ever happened during runtime on some
of these systems that have been up for years and getting beat up heavily,
but rather only on shutdown. We’ve got more to chase down internally on
that.

One thing I wanted to bring up here though while I have you, I have
noticed through various hits on google, mailing lists, etc over the years that
this specific type of lockup with fsnotify_mark_srcu seems to happen now
and then for various oddball reasons, with various root causes. 

It made me think that I wonder if there is a better structure that could be
used here that might be a bit more durable. To be clear, I’m not saying that
SRCU *is not* durable or anything of the sort (I promise!) but rather
wondering if there was anything we could think about tweaking on the
fsnotify side of the house to be more efficient.

Thoughts?

> 
>> We are running audit=1 for this system and are using an el8 based
>> userspace.
>> 
>> I've flipped through the fs/notify code base for both 5.10 as well as
>> upstream mainline to see if something jumped off the page, and I
>> haven't yet spotted any particular suspect code from the caller side.
>> 
>> This hang appears to come up at the very end of the shutdown/reboot
>> process, seemingly after the system starts to unwind through initrd.
>> 
>> What I'm working on now is adding some instrumentation to the dracut
>> shutdown initrd scripts to see if I can how far we get down that path
>> before the system fails to make forward progress, which may give some
>> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
>> a plethora of DEBUG options [2], and didn't get anything interesting.
>> To be clear, we haven't seen lockdep spit out any complaints as of yet.
> 
> The fact that lockdep doesn't report anything is interesting but then
> lockdep doesn't track everything. In particular I think SRCU itself isn't
> tracked by lockdep.
> 
> Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
  2024-09-04 14:40   ` Jon Kohler
@ 2024-09-04 19:51     ` Paul E. McKenney
  2024-09-05 13:53     ` Jan Kara
  1 sibling, 0 replies; 12+ messages in thread
From: Paul E. McKenney @ 2024-09-04 19:51 UTC (permalink / raw)
  To: Jon Kohler
  Cc: Jan Kara, rcu@vger.kernel.org, jiangshanlai@gmail.com,
	josh@joshtriplett.org, linux-fsdevel@vger.kernel.org

On Wed, Sep 04, 2024 at 02:40:07PM +0000, Jon Kohler wrote:
> 
> 
> > On Sep 4, 2024, at 5:19 AM, Jan Kara <jack@suse.cz> wrote:
> > 
> > !-------------------------------------------------------------------|
> >  CAUTION: External Email
> > 
> > |-------------------------------------------------------------------!
> > 
> > On Tue 27-08-24 20:01:27, Jon Kohler wrote:
> >> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
> >> Reaching out about a tricky hung task issue that I'm running into. I've
> >> got a virtualized Linux guest on top of a KVM based platform, running
> >> a 5.10.y based kernel. The issue we're running into is a hung task that
> >> *only* happens on shutdown/reboot of this particular VM once every 
> >> 20-50 times.
> >> 
> >> The signature of the hung task is always similar to the output below,
> >> where we appear to hang on the call to 
> >>    synchronize_srcu(&fsnotify_mark_srcu)
> >> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
> >> where two kernel threads are both calling synchronize_srcu, then
> >> scheduling out in wait_for_completion, and completely going out to
> >> lunch for over 4 minutes. This then triggers the hung task timeout and
> >> things blow up.
> > 
> > Well, the most obvious reason for this would be that some process is
> > hanging somewhere with fsnotify_mark_srcu held. When this happens, can you
> > trigger sysrq-w in the VM and send here its output?
> 
> Jan - Thanks for the ping, that is *exactly* what is happening here.
> Some developments since my last note, the patch Neeraj pointed out
> wasn't the issue, but rather a confluence of realtime thread configurations
> that ended up completely starving whatever CPU was processing per-CPU
> callbacks. So, one thread would go out to lunch completely, and it would
> just never yield. This particular system was configured with RT_RUNTIME_SHARE
> unfortunately, so that realtime thread going out to lunch ate the entire system.
> 
> What was odd is that this never, ever happened during runtime on some
> of these systems that have been up for years and getting beat up heavily,
> but rather only on shutdown. We’ve got more to chase down internally on
> that.
> 
> One thing I wanted to bring up here though while I have you, I have
> noticed through various hits on google, mailing lists, etc over the years that
> this specific type of lockup with fsnotify_mark_srcu seems to happen now
> and then for various oddball reasons, with various root causes. 
> 
> It made me think that I wonder if there is a better structure that could be
> used here that might be a bit more durable. To be clear, I’m not saying that
> SRCU *is not* durable or anything of the sort (I promise!) but rather
> wondering if there was anything we could think about tweaking on the
> fsnotify side of the house to be more efficient.
> 
> Thoughts?

For RCU in real-time environments, we have RCU priority boosting, which
boost RCU readers that have been preempted for too long.  However, this is
SRCU, in which readers can simply block, in addition to being preempted.
Of course, boosting the priority of a task that ha blocked (as opposed
to being preempted) cannot help -- the task will remain blocked until
awakened, regardless of its priority.

But your case takes this one step further, in that the workqueue invoking
callbacks is being preempted and starved, correct?

The usual advice is to make sure that your housekeeping CPUs get
sufficient CPU time.  Easy to say, easy to do, harder to keep done
uniformly across a large number of systems running diverse workloads.
Still, this is the preferred approach.

Just out of curiosity, is this a CONFIG_PREEMPT_RT kernel?

							Thanx, Paul

> >> We are running audit=1 for this system and are using an el8 based
> >> userspace.
> >> 
> >> I've flipped through the fs/notify code base for both 5.10 as well as
> >> upstream mainline to see if something jumped off the page, and I
> >> haven't yet spotted any particular suspect code from the caller side.
> >> 
> >> This hang appears to come up at the very end of the shutdown/reboot
> >> process, seemingly after the system starts to unwind through initrd.
> >> 
> >> What I'm working on now is adding some instrumentation to the dracut
> >> shutdown initrd scripts to see if I can how far we get down that path
> >> before the system fails to make forward progress, which may give some
> >> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
> >> a plethora of DEBUG options [2], and didn't get anything interesting.
> >> To be clear, we haven't seen lockdep spit out any complaints as of yet.
> > 
> > The fact that lockdep doesn't report anything is interesting but then
> > lockdep doesn't track everything. In particular I think SRCU itself isn't
> > tracked by lockdep.
> > 
> > Honza
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)
  2024-09-04 14:40   ` Jon Kohler
  2024-09-04 19:51     ` Paul E. McKenney
@ 2024-09-05 13:53     ` Jan Kara
  1 sibling, 0 replies; 12+ messages in thread
From: Jan Kara @ 2024-09-05 13:53 UTC (permalink / raw)
  To: Jon Kohler
  Cc: Jan Kara, paulmck@kernel.org, rcu@vger.kernel.org,
	jiangshanlai@gmail.com, josh@joshtriplett.org,
	linux-fsdevel@vger.kernel.org

On Wed 04-09-24 14:40:07, Jon Kohler wrote:
> 
> 
> > On Sep 4, 2024, at 5:19 AM, Jan Kara <jack@suse.cz> wrote:
> > 
> > !-------------------------------------------------------------------|
> >  CAUTION: External Email
> > 
> > |-------------------------------------------------------------------!
> > 
> > On Tue 27-08-24 20:01:27, Jon Kohler wrote:
> >> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
> >> Reaching out about a tricky hung task issue that I'm running into. I've
> >> got a virtualized Linux guest on top of a KVM based platform, running
> >> a 5.10.y based kernel. The issue we're running into is a hung task that
> >> *only* happens on shutdown/reboot of this particular VM once every 
> >> 20-50 times.
> >> 
> >> The signature of the hung task is always similar to the output below,
> >> where we appear to hang on the call to 
> >>    synchronize_srcu(&fsnotify_mark_srcu)
> >> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
> >> where two kernel threads are both calling synchronize_srcu, then
> >> scheduling out in wait_for_completion, and completely going out to
> >> lunch for over 4 minutes. This then triggers the hung task timeout and
> >> things blow up.
> > 
> > Well, the most obvious reason for this would be that some process is
> > hanging somewhere with fsnotify_mark_srcu held. When this happens, can you
> > trigger sysrq-w in the VM and send here its output?
> 
> Jan - Thanks for the ping, that is *exactly* what is happening here.
> Some developments since my last note, the patch Neeraj pointed out
> wasn't the issue, but rather a confluence of realtime thread configurations
> that ended up completely starving whatever CPU was processing per-CPU
> callbacks. So, one thread would go out to lunch completely, and it would
> just never yield. This particular system was configured with RT_RUNTIME_SHARE
> unfortunately, so that realtime thread going out to lunch ate the entire system.

Glad to hear this is explained (at least partially) :)

> What was odd is that this never, ever happened during runtime on some
> of these systems that have been up for years and getting beat up heavily,
> but rather only on shutdown. We’ve got more to chase down internally on
> that.
> 
> One thing I wanted to bring up here though while I have you, I have
> noticed through various hits on google, mailing lists, etc over the years that
> this specific type of lockup with fsnotify_mark_srcu seems to happen now
> and then for various oddball reasons, with various root causes. 
> 
> It made me think that I wonder if there is a better structure that could be
> used here that might be a bit more durable. To be clear, I’m not saying that
> SRCU *is not* durable or anything of the sort (I promise!) but rather
> wondering if there was anything we could think about tweaking on the
> fsnotify side of the house to be more efficient.

Well, fsnotify_mark_srcu used to be a big problem in the past where
fanotify code was waiting for userspace response to fanotify event with it
held. And when userspace didn't reply, the kernel got stuck. After we have
removed that sore spot couple years ago, I'm not aware of any more problems
with it. In fact your report is probably the first one in a few years. So
I'm hoping your google hits are mostly from the past / with old kernels :)

								Honza

> >> We are running audit=1 for this system and are using an el8 based
> >> userspace.
> >> 
> >> I've flipped through the fs/notify code base for both 5.10 as well as
> >> upstream mainline to see if something jumped off the page, and I
> >> haven't yet spotted any particular suspect code from the caller side.
> >> 
> >> This hang appears to come up at the very end of the shutdown/reboot
> >> process, seemingly after the system starts to unwind through initrd.
> >> 
> >> What I'm working on now is adding some instrumentation to the dracut
> >> shutdown initrd scripts to see if I can how far we get down that path
> >> before the system fails to make forward progress, which may give some
> >> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
> >> a plethora of DEBUG options [2], and didn't get anything interesting.
> >> To be clear, we haven't seen lockdep spit out any complaints as of yet.
> > 
> > The fact that lockdep doesn't report anything is interesting but then
> > lockdep doesn't track everything. In particular I think SRCU itself isn't
> > tracked by lockdep.
> > 
> > Honza
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-09-05 13:53 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-27 20:01 SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu) Jon Kohler
2024-08-27 20:34 ` Paul E. McKenney
2024-08-27 23:33   ` Jon Kohler
2024-08-28  1:21     ` Paul E. McKenney
2024-08-28  2:41       ` Jon Kohler
2024-08-28 10:14         ` Paul E. McKenney
2024-08-28  2:33     ` Neeraj Upadhyay
2024-08-28  3:01       ` Jon Kohler
2024-09-04  9:19 ` Jan Kara
2024-09-04 14:40   ` Jon Kohler
2024-09-04 19:51     ` Paul E. McKenney
2024-09-05 13:53     ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.