Linux RCU subsystem development
* Re: [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race
@ 2026-04-21 18:22 Tejun Heo
  2026-04-23  9:01 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sonam Sanju
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2026-04-21 18:22 UTC (permalink / raw)
  To: Sonam Sanju
  Cc: vineeth, dmaluka, kunwu.chan, kvm, linux-kernel, paulmck,
	pbonzini, rcu, seanjc, stable

Hello, Sonam.

On Tue, Apr 21, 2026 at 10:24:55PM +0530, Sonam Sanju wrote:
> 3. The first stuck worker (kworker/2:0, PID 33) shows the preemption
>    in wq_worker_sleeping:
> 
>    kworker/2:0  state:D  Workqueue: kvm-irqfd-cleanup irqfd_shutdown
>      __schedule+0x87a/0xd60
>      preempt_schedule_irq+0x4a/0x90
>      asm_fred_entrypoint_kernel+0x41/0x70
>      ___ratelimit+0x1a1/0x1f0            <-- inside pr_info_ratelimited
>      wq_worker_sleeping+0x53/0x190       <-- preempted HERE
>      schedule+0x30/0xe0
>      schedule_preempt_disabled+0x10/0x20
>      __mutex_lock+0x413/0xe40
>      irqfd_resampler_shutdown+0x53/0x200
>      irqfd_shutdown+0xfa/0x190
> 
>    This confirms the exact race: a reschedule IPI interrupted
>    wq_worker_sleeping() after worker->sleeping was set to 1 but
>    before pool->nr_running was decremented. The preemption triggered
>    wq_worker_running() which incremented nr_running (1->2), then
>    on resume the decrement brought it back to 1 instead of 0.
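
[AI-generated aside - a toy model of the sequence claimed above, in plain
Python. The split of wq_worker_sleeping() into two halves is a hypothetical
simplification marking the claimed preemption point; names and structure
are illustrative, not the kernel's code.]

```python
class Pool:
    def __init__(self):
        self.nr_running = 1  # one worker accounted as running

class Worker:
    def __init__(self, pool):
        self.pool = pool
        self.sleeping = 0

# Hypothetical split of wq_worker_sleeping() at the claimed preemption
# point: the flag is set, then the decrement is deferred past an IPI.
def sleeping_before_preempt(w):
    w.sleeping = 1

def sleeping_after_resume(w):
    w.pool.nr_running -= 1

def worker_running(w):
    # Mirrors the guard in wq_worker_running(): only act if the
    # sleeping flag is set, then increment and clear it.
    if not w.sleeping:
        return
    w.pool.nr_running += 1
    w.sleeping = 0

pool = Pool()
w = Worker(pool)
sleeping_before_preempt(w)   # sleeping = 1
worker_running(w)            # preemption path: nr_running 1 -> 2
sleeping_after_resume(w)     # resume: nr_running 2 -> 1, never reaches 0
print(pool.nr_running)       # -> 1
```

The claimed end state is nr_running=1 for a worker that is actually
blocked, so the pool never wakes another worker.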

The problem with this theory is that this kworker, while preempted, is still
runnable and should be dispatched to its CPU once it becomes available
again. Workqueue doesn't care whether the task gets preempted or when it
gets the CPU back; it only cares about whether the task enters a blocking
state (!runnable). A task that is preempted, even on its way to blocking,
is still runnable and will be put back on a CPU by the scheduler.

If you can take a crashdump of the deadlocked state, can you see whether the
task is still on the scheduler's runqueue?

[Diagnostic notes below are AI-generated - apply judgment.]

The decisive field is `task->on_rq`:

  - 0: dequeued, truly blocked - your theory requires this. Then look at
    `task->sched_contributes_to_load` (set by block_task), and if
    CONFIG_SCHED_PROXY_EXEC is on, `task->blocked_on` and
    find_proxy_task() behavior.
  - 1: still queued - scheduler should pick it and self-heal the drift,
    so the "never woken up" step doesn't hold. Then the question becomes
    why EEVDF is not picking a queued task. Check `se->sched_delayed`
    first (DELAY_DEQUEUE leaves on_rq=1 but unrunnable until next pick),
    then cfs_rq throttling up the task_group hierarchy, then the rb-tree
    contents (vruntime/deadline/vlag of the stuck se vs others).
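
Folding those checks into one helper (a toy triage function; the field
names mirror the struct fields above, the verdict strings are mine):

```python
def triage(on_rq, sched_delayed=0, throttled=0):
    """Map the fields above to the next diagnostic step (sketch)."""
    if not on_rq:
        return "blocked: check sched_contributes_to_load and blocked_on"
    if sched_delayed:
        return "DELAY_DEQUEUE: on_rq=1 but unrunnable until next pick"
    if throttled:
        return "throttled: walk the cfs_rq hierarchy"
    return "pickable: inspect vruntime/deadline/vlag vs peers"

print(triage(0))
print(triage(1, sched_delayed=1))
print(triage(1))
```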

One snippet covering both branches, for each hung worker and for the
affected CPU's rq:

  from drgn.helpers.linux.sched import task_cpu
  from drgn.helpers.linux.pid import find_task
  from drgn.helpers.linux.percpu import per_cpu

  t = find_task(prog, PID)          # PID of the hung kworker
  cpu = task_cpu(t)
  rq = per_cpu(prog["runqueues"], cpu)
  cfs = rq.cfs

  print(f"state={hex(t.__state)} on_rq={int(t.on_rq)} "
        f"se.on_rq={int(t.se.on_rq)} sched_delayed={int(t.se.sched_delayed)} "
        f"cpu={cpu} on_cpu={int(t.on_cpu)}")
  print(f"vruntime={int(t.se.vruntime)} deadline={int(t.se.deadline)} "
        f"vlag={int(t.se.vlag)}")
  if hasattr(t, "blocked_on"):      # only present on some configs
      print(f"blocked_on={t.blocked_on}")

  print(f"rq.curr={rq.curr.comm.string_().decode()} "
        f"nr_running={int(rq.nr_running)} "
        f"cfs.h_nr_queued={int(cfs.h_nr_queued)} "
        f"cfs.h_nr_delayed={int(cfs.h_nr_delayed)} "
        f"min_vruntime={int(cfs.min_vruntime)}")
  # Walk the throttle hierarchy (needs CONFIG_FAIR_GROUP_SCHED)
  c = t.se.cfs_rq
  while c:
      print(f"  cfs_rq throttled={int(c.throttled)} "
            f"throttle_count={int(c.throttle_count)}")
      c = c.tg.parent.cfs_rq[cpu] if c.tg.parent else None
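
If the verdict is on_rq=1 with nothing delayed or throttled, dumping the
rb-tree shows whether the stuck entity is even among the pick candidates.
A sketch, reusing `cfs` from the snippet above; it assumes the EEVDF-era
sched_entity fields (vlag, sched_delayed) of recent kernels and needs a
crashdump or live kernel under drgn to run:

```python
from drgn.helpers.linux.rbtree import rbtree_inorder_for_each_entry

# cfs_rq->tasks_timeline is an rb_root_cached; runnable entities
# hang off it via sched_entity->run_node.
for se in rbtree_inorder_for_each_entry(
        "struct sched_entity", cfs.tasks_timeline.rb_root, "run_node"):
    print(f"se={hex(se.value_())} vruntime={int(se.vruntime)} "
          f"deadline={int(se.deadline)} vlag={int(se.vlag)} "
          f"delayed={int(se.sched_delayed)}")
```

Compare the stuck worker's se against the others: if it is absent from the
tree, or present but with sched_delayed set, that points back at the
branches above.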

Thanks.

--
tejun

