Linux RCU subsystem development
* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
       [not found] ` <20260323064248.1660757-1-sonam.sanju@intel.com>
@ 2026-03-31 18:17   ` Sean Christopherson
  2026-03-31 20:51     ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2026-03-31 18:17 UTC (permalink / raw)
  To: Sonam Sanju, Paul E. McKenney, Lai Jiangshan, Josh Triplett
  Cc: Paolo Bonzini, Vineeth Pillai, Dmitry Maluka, kvm, linux-kernel,
	stable, Steven Rostedt, Mathieu Desnoyers, rcu

+srcu folks

Please don't post subsequent versions In-Reply-To previous versions, it tends to
muck up tooling.

On Mon, Mar 23, 2026, Sonam Sanju wrote:
> irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> This can deadlock when multiple irqfd workers run concurrently on the
> kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> created and destroyed:
> 
>   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
>   irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
>    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
>     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
>     list_del_rcu(...)                     ...blocked...
>     synchronize_srcu_expedited()      // Waiters block workqueue,
>       // waits for SRCU grace            preventing SRCU grace
>       // period which requires            period from completing
>       // workqueue progress          --- DEADLOCK ---
> 
> In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> the else branch is called directly within the mutex.  In the if-last
> branch, kvm_unregister_irq_ack_notifier() also calls
> synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
> synchronize_srcu_expedited() is called after list_add_rcu() but
> before mutex_unlock().  All paths can block indefinitely because:
> 
>   1. synchronize_srcu_expedited() waits for an SRCU grace period
>   2. SRCU grace period completion needs workqueue workers to run
>   3. The blocked mutex waiters occupy workqueue slots preventing progress

Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
underlying flaw.  Essentially, this would be establishing a rule that
synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
not viable.

>   4. The mutex holder never releases the lock -> deadlock

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
  2026-03-31 18:17   ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sean Christopherson
@ 2026-03-31 20:51     ` Paul E. McKenney
  2026-04-01  9:47       ` Sonam Sanju
  2026-04-06 23:09       ` Paul E. McKenney
  0 siblings, 2 replies; 11+ messages in thread
From: Paul E. McKenney @ 2026-03-31 20:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Sonam Sanju, Lai Jiangshan, Josh Triplett, Paolo Bonzini,
	Vineeth Pillai, Dmitry Maluka, kvm, linux-kernel, stable,
	Steven Rostedt, Mathieu Desnoyers, rcu

On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> +srcu folks
> 
> Please don't post subsequent versions In-Reply-To previous versions, it tends to
> muck up tooling.
> 
> On Mon, Mar 23, 2026, Sonam Sanju wrote:
> > irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> > synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> > This can deadlock when multiple irqfd workers run concurrently on the
> > kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> > created and destroyed:
> > 
> >   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
> >   irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
> >    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
> >     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
> >     list_del_rcu(...)                     ...blocked...
> >     synchronize_srcu_expedited()      // Waiters block workqueue,
> >       // waits for SRCU grace            preventing SRCU grace
> >       // period which requires            period from completing
> >       // workqueue progress          --- DEADLOCK ---
> > 
> > In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> > the else branch is called directly within the mutex.  In the if-last
> > branch, kvm_unregister_irq_ack_notifier() also calls
> > synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
> > synchronize_srcu_expedited() is called after list_add_rcu() but
> > before mutex_unlock().  All paths can block indefinitely because:
> > 
> >   1. synchronize_srcu_expedited() waits for an SRCU grace period
> >   2. SRCU grace period completion needs workqueue workers to run
> >   3. The blocked mutex waiters occupy workqueue slots preventing progress
> 
> Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> underlying flaw.  Essentially, this would be establishing a rule that
> synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
> not viable.

First, it is OK to invoke synchronize_srcu_expedited() while holding
a mutex.  Second, the synchronize_srcu_expedited() function's use of
workqueues is the same as that of synchronize_srcu(), so in an alternate
universe where it was not OK to invoke synchronize_srcu_expedited() while
holding a mutex, it would also not be OK to invoke synchronize_srcu()
while holding that same mutex.  Third, it is also OK to acquire that
same mutex within a workqueue handler.  Fourth, SRCU and RCU use their
own workqueue, which no one else should be using (and that prohibition
most definitely includes the irqfd workers).

As a result, I do have to ask...  When you say "multiple irqfd workers",
exactly how many such workers are you running?

							Thanx, Paul

> >   4. The mutex holder never releases the lock -> deadlock

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
  2026-03-31 20:51     ` Paul E. McKenney
@ 2026-04-01  9:47       ` Sonam Sanju
  2026-04-06 23:09       ` Paul E. McKenney
  1 sibling, 0 replies; 11+ messages in thread
From: Sonam Sanju @ 2026-04-01  9:47 UTC (permalink / raw)
  To: Paul E . McKenney, Sean Christopherson
  Cc: Paolo Bonzini, Vineeth Pillai, Dmitry Maluka, Lai Jiangshan,
	Josh Triplett, Steven Rostedt, Mathieu Desnoyers, kvm,
	linux-kernel, stable, rcu, Sonam Sanju

From: Sonam Sanju <sonam.sanju@intel.com>

On Tue, Mar 31, 2026 at 01:51:00PM -0700, Paul E. McKenney wrote:
> On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> > Please don't post subsequent versions In-Reply-To previous versions, it tends to
> > muck up tooling.

Noted, will send future versions as new top-level threads. Sorry about
that.

> > Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> > underlying flaw.  Essentially, this would be establishing a rule that
> > synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
> > not viable.
>
> First, it is OK to invoke synchronize_srcu_expedited() while holding
> a mutex.  Second, the synchronize_srcu_expedited() function's use of
> workqueues is the same as that of synchronize_srcu(), so in an alternate
> universe where it was not OK to invoke synchronize_srcu_expedited() while
> holding a mutex, it would also not be OK to invoke synchronize_srcu()
> while holding that same mutex.  Third, it is also OK to acquire that
> same mutex within a workqueue handler.  Fourth, SRCU and RCU use their
> own workqueue, which no one else should be using (and that prohibition
> most definitely includes the irqfd workers).

Thank you for clarifying this. 

> As a result, I do have to ask...  When you say "multiple irqfd workers",
> exactly how many such workers are you running?

While running cold-reboot/warm-reboot cycling on our Android platforms
with a 6.18 kernel, the hung_task traces consistently show 8-15
kvm-irqfd-cleanup workers in D state.  These are crosvm instances with
roughly 10-16 irqfd lines per VM (virtio-blk, virtio-net, virtio-input,
virtio-snd, etc., each with a resampler).

Vineeth Pillai (Google) reproduced a related scenario under a VM
create/destroy stress test where the workqueue reached active=1024
refcnt=2062, though that is a much more extreme case than what we see
during normal shutdown.

The first half of the deadlock cycle is genuinely there: one worker
holds resampler_lock and blocks in synchronize_srcu_expedited(), while
the remaining 8-15 workers block in __mutex_lock() inside
irqfd_resampler_shutdown().
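
For reference, the shape of that shutdown path is roughly the following
(hand-paraphrased from virt/kvm/eventfd.c, so treat the names and the
offset annotations as approximate):

  static void irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
  {
          struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
          struct kvm *kvm = resampler->kvm;

          mutex_lock(&kvm->irqfds.resampler_lock);        /* +0x23 in our traces */

          list_del_rcu(&irqfd->resampler_link);

          if (list_empty(&resampler->list)) {
                  list_del_rcu(&resampler->link);
                  /* calls synchronize_srcu_expedited(&kvm->irq_srcu) internally */
                  kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
                  kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
                              resampler->notifier.gsi, 0, false);
                  kfree(resampler);
          } else {
                  /* +0xf0: waits for an SRCU GP while still holding the mutex */
                  synchronize_srcu_expedited(&kvm->irq_srcu);
          }

          mutex_unlock(&kvm->irqfds.resampler_lock);
  }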

Thanks,
Sonam

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
       [not found] <5194cf52-f8a8-4479-a95e-233104272839@linux.dev>
@ 2026-04-01 14:24 ` Sonam Sanju
  2026-04-06 14:20   ` Kunwu Chan
  0 siblings, 1 reply; 11+ messages in thread
From: Sonam Sanju @ 2026-04-01 14:24 UTC (permalink / raw)
  To: Kunwu Chan, Sean Christopherson, Paul E . McKenney
  Cc: Paolo Bonzini, Vineeth Pillai, Dmitry Maluka, kvm, linux-kernel,
	stable, rcu, Sonam Sanju


From: Sonam Sanju <sonam.sanju@intel.com>

On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> Building on the discussion so far, it would be helpful from the SRCU
> side to gather a bit more evidence to classify the issue.
>
> Calling synchronize_srcu_expedited() while holding a mutex is generally
> valid, so the observed behavior may be workload-dependent.

> The reported deadlock seems to rely on the assumption that SRCU grace
> period progress is indirectly blocked by irqfd workqueue saturation.
> It would be good to confirm whether that assumption actually holds.

I went back through our logs from two independent crash instances and
can now provide data for each of your questions.

> 1) Are SRCU GP kthreads/workers still making forward progress when
> the system is stuck?

No.  In both crash instances, process_srcu work items remain permanently
"pending" (never "in-flight") throughout the entire hang.

Instance 1 —  kernel 6.18.8, pool 14 (cpus=3):

  [  62.712760] workqueue rcu_gp: flags=0x108
  [  62.717801]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
  [  62.717801]     pending: 2*process_srcu

  [  187.735092] workqueue rcu_gp: flags=0x108           (125 seconds later)
  [  187.735093]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
  [  187.735093]     pending: 2*process_srcu              (still pending)

  9 consecutive dumps from t=62s to t=312s — process_srcu never runs.

Instance 2 —  kernel 6.18.2, pool 22 (cpus=5):

  [  93.280711] workqueue rcu_gp: flags=0x108
  [  93.280713]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
  [  93.280716]     pending: process_srcu

  [  309.040801] workqueue rcu_gp: flags=0x108           (216 seconds later)
  [  309.040806]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
  [  309.040806]     pending: process_srcu               (still pending)

  8 consecutive dumps from t=93s to t=341s — process_srcu never runs.

In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
where the kvm-irqfd-cleanup workers are blocked.  Both pools have idle
workers but are marked as hung/stalled:

  Instance 1: pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
  Instance 2: pool 22: cpus=5 hung=297s workers=12 idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)

> 2) How many irqfd workers are active in the reported scenario, and
> can they saturate CPU or worker pools?

4 kvm-irqfd-cleanup workers in both instances, consistently across all
dumps:

Instance 1 ( pool 14 / cpus=3):

  [  62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
  [  62.837838]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
  [  62.837838]     in-flight: 157:irqfd_shutdown ,4044:irqfd_shutdown ,
                               102:irqfd_shutdown ,39:irqfd_shutdown

Instance 2 ( pool 22 / cpus=5):

  [  93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
  [  93.280896]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
  [  93.280900]     in-flight: 151:irqfd_shutdown ,4246:irqfd_shutdown ,
                               4241:irqfd_shutdown ,4243:irqfd_shutdown

These are from crosvm instances with multiple virtio devices
(virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
with a resampler.  During VM shutdown, all irqfds are detached
concurrently, queueing that many irqfd_shutdown work items.

The 4 workers are not saturating CPU — they're all in D state.  But they
ARE all bound to the same per-CPU pool as rcu_gp's process_srcu work.

> 3) Do we have a concrete wait-for cycle showing that tasks blocked
> on resampler_lock are in turn preventing SRCU GP completion?

Yes, in both instances the hung task dump identifies the mutex holder
stuck in synchronize_srcu, with the other workers waiting on the mutex.

Instance 1 (t=314s):

  Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:

    [  315.963979] task:kworker/3:8     state:D  pid:4044
    [  315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
    [  316.012504]  __synchronize_srcu+0x100/0x130
    [  316.023157]  irqfd_resampler_shutdown+0xf0/0x150  <-- offset 0xf0 (synchronize_srcu)

  Workers pid 39, 102, 157 — MUTEX WAITERS:

    [  314.793025] task:kworker/3:4     state:D  pid:157
    [  314.837472]  __mutex_lock+0x409/0xd90
    [  314.843100]  irqfd_resampler_shutdown+0x23/0x150  <-- offset 0x23 (mutex_lock)

Instance 2 (t=343s):

  Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:

    [  343.193294] task:kworker/5:4     state:D  pid:4241
    [  343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
    [  343.193328]  __synchronize_srcu+0x100/0x130
    [  343.193335]  irqfd_resampler_shutdown+0xf0/0x150  <-- offset 0xf0 (synchronize_srcu)

  Workers pid 151, 4243, 4246 — MUTEX WAITERS:

    [  343.193369] task:kworker/5:6     state:D  pid:4243
    [  343.193397]  __mutex_lock+0x37d/0xbb0
    [  343.193397]  irqfd_resampler_shutdown+0x23/0x150  <-- offset 0x23 (mutex_lock)

Both instances show the identical wait-for cycle:

  1. One worker holds resampler_lock, blocks in __synchronize_srcu
     (waiting for SRCU grace period)
  2. SRCU GP needs process_srcu to run — but it stays "pending"
     on the same pool
  3. Other irqfd workers block on __mutex_lock in the same pool
  4. The pool is marked "hung" and no pending work makes progress
     for 250-300 seconds until kernel panic

> 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> and kvm_irqfd_assign() paths?

In our 4 crash instances the stuck mutex holder is always in 
irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu).  This 
is consistent — these are all VM shutdown scenarios where only 
irqfd_shutdown workqueue items run.

The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
during a VM create/destroy stress test where assign and shutdown race.
His traces showed kvm_irqfd (the assign path) stuck in
synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.
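
For completeness, the assign-side critical section he hit has the same
shape (roughly, heavily trimmed from kvm_irqfd_assign() in
virt/kvm/eventfd.c):

  mutex_lock(&kvm->irqfds.resampler_lock);
  /* ... find or allocate the resampler for this GSI ... */
  list_add_rcu(&irqfd->resampler_link, &irqfd->resampler->list);
  synchronize_srcu_expedited(&kvm->irq_srcu);  /* still holding resampler_lock */
  mutex_unlock(&kvm->irqfds.resampler_lock);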

> If SRCU GP remains independent, it would help distinguish whether
> this is a strict deadlock or a form of workqueue starvation / lock
> contention.

Based on the data from both instances, SRCU GP is NOT remaining
independent.  process_srcu stays permanently pending on the affected
per-CPU pool for 250-300 seconds.  But it's not just process_srcu —
ALL pending work on the pool is stuck, including items from events,
cgroup, mm, slub, and other workqueues.


> A timestamp-correlated dump (blocked stacks + workqueue state +
> SRCU GP activity) would likely be sufficient to classify this.

I hope the correlated dumps above from both instances are helpful.
To summarize the timeline (consistent across both):

  t=0:   VM shutdown begins, crosvm detaches irqfds
  t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
         One worker acquires resampler_lock, enters synchronize_srcu
         Other 3 workers block on __mutex_lock
  t=~43: First "BUG: workqueue lockup" — pool detected stuck
         rcu_gp: process_srcu shown as "pending" on same pool
  t=~93  Through t=~312: Repeated dumps every ~30s
         process_srcu remains permanently "pending"
         Pool has idle workers but no pending work executes
  t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
  t=~316: init triggers sysrq crash → kernel panic

> Happy to help look at traces if available.

I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
instances.  Shall I post them or send them off-list?

Thanks,
Sonam

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
  2026-04-01 14:24 ` Sonam Sanju
@ 2026-04-06 14:20   ` Kunwu Chan
  2026-04-17  1:18     ` Vineeth Pillai
  2026-04-21  5:12     ` Sonam Sanju
  0 siblings, 2 replies; 11+ messages in thread
From: Kunwu Chan @ 2026-04-06 14:20 UTC (permalink / raw)
  To: Sonam Sanju, Sean  Christopherson, Paul E . McKenney
  Cc: Paolo Bonzini, Vineeth Pillai, Dmitry Maluka, kvm, linux-kernel,
	stable, rcu, Sonam Sanju

On April 1, 2026 at 10:24 PM, "Sonam Sanju" <sonam.sanju@intel.corp-partner.google.com> wrote:


> 
> From: Sonam Sanju <sonam.sanju@intel.com>
> 
> On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> 
> > 
> > Building on the discussion so far, it would be helpful from the SRCU
> >  side to gather a bit more evidence to classify the issue.
> > 
> >  Calling synchronize_srcu_expedited() while holding a mutex is generally
> >  valid, so the observed behavior may be workload-dependent.
> > 
> >  The reported deadlock seems to rely on the assumption that SRCU grace
> >  period progress is indirectly blocked by irqfd workqueue saturation.
> >  It would be good to confirm whether that assumption actually holds.
> > 
> I went back through our logs from two independent crash instances and
> can now provide data for each of your questions.
> 
> > 
> > 1) Are SRCU GP kthreads/workers still making forward progress when
> >  the system is stuck?
> > 
> No. In both crash instances, process_srcu work items remain permanently
> "pending" (never "in-flight") throughout the entire hang.
> 
> Instance 1 — kernel 6.18.8, pool 14 (cpus=3):
> 
>  [ 62.712760] workqueue rcu_gp: flags=0x108
>  [ 62.717801] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
>  [ 62.717801] pending: 2*process_srcu
> 
>  [ 187.735092] workqueue rcu_gp: flags=0x108 (125 seconds later)
>  [ 187.735093] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
>  [ 187.735093] pending: 2*process_srcu (still pending)
> 
>  9 consecutive dumps from t=62s to t=312s — process_srcu never runs.
> 
> Instance 2 — kernel 6.18.2, pool 22 (cpus=5):
> 
>  [ 93.280711] workqueue rcu_gp: flags=0x108
>  [ 93.280713] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
>  [ 93.280716] pending: process_srcu
> 
>  [ 309.040801] workqueue rcu_gp: flags=0x108 (216 seconds later)
>  [ 309.040806] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
>  [ 309.040806] pending: process_srcu (still pending)
> 
>  8 consecutive dumps from t=93s to t=341s — process_srcu never runs.
> 
> In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
> where the kvm-irqfd-cleanup workers are blocked. Both pools have idle
> workers but are marked as hung/stalled:
> 
>  Instance 1: pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
>  Instance 2: pool 22: cpus=5 hung=297s workers=12 idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)
> 
> > 
> > 2) How many irqfd workers are active in the reported scenario, and
> >  can they saturate CPU or worker pools?
> > 
> 4 kvm-irqfd-cleanup workers in both instances, consistently across all
> dumps:
> 
> Instance 1 ( pool 14 / cpus=3):
> 
>  [ 62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
>  [ 62.837838] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
>  [ 62.837838] in-flight: 157:irqfd_shutdown ,4044:irqfd_shutdown ,
>  102:irqfd_shutdown ,39:irqfd_shutdown
> 
> Instance 2 ( pool 22 / cpus=5):
> 
>  [ 93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
>  [ 93.280896] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
>  [ 93.280900] in-flight: 151:irqfd_shutdown ,4246:irqfd_shutdown ,
>  4241:irqfd_shutdown ,4243:irqfd_shutdown
> 
> These are from crosvm instances with multiple virtio devices
> (virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
> with a resampler. During VM shutdown, all irqfds are detached
> concurrently, queueing that many irqfd_shutdown work items.
> 
> The 4 workers are not saturating CPU — they're all in D state. But they
> ARE all bound to the same per-CPU pool as rcu_gp's process_srcu work.
> 
> > 
> > 3) Do we have a concrete wait-for cycle showing that tasks blocked
> >  on resampler_lock are in turn preventing SRCU GP completion?
> > 
> Yes, in both instances the hung task dump identifies the mutex holder
> stuck in synchronize_srcu, with the other workers waiting on the mutex.
> 
> Instance 1 (t=314s):
> 
>  Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:
> 
>  [ 315.963979] task:kworker/3:8 state:D pid:4044
>  [ 315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
>  [ 316.012504] __synchronize_srcu+0x100/0x130
>  [ 316.023157] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)
> 
>  Workers pid 39, 102, 157 — MUTEX WAITERS:
> 
>  [ 314.793025] task:kworker/3:4 state:D pid:157
>  [ 314.837472] __mutex_lock+0x409/0xd90
>  [ 314.843100] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)
> 
> Instance 2 (t=343s):
> 
>  Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:
> 
>  [ 343.193294] task:kworker/5:4 state:D pid:4241
>  [ 343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
>  [ 343.193328] __synchronize_srcu+0x100/0x130
>  [ 343.193335] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)
> 
>  Workers pid 151, 4243, 4246 — MUTEX WAITERS:
> 
>  [ 343.193369] task:kworker/5:6 state:D pid:4243
>  [ 343.193397] __mutex_lock+0x37d/0xbb0
>  [ 343.193397] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)
> 
> Both instances show the identical wait-for cycle:
> 
>  1. One worker holds resampler_lock, blocks in __synchronize_srcu
>  (waiting for SRCU grace period)
>  2. SRCU GP needs process_srcu to run — but it stays "pending"
>  on the same pool
>  3. Other irqfd workers block on __mutex_lock in the same pool
>  4. The pool is marked "hung" and no pending work makes progress
>  for 250-300 seconds until kernel panic
> 
> > 
> > 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> >  and kvm_irqfd_assign() paths?
> > 
> In our 4 crash instances the stuck mutex holder is always in 
> irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu). This 
> is consistent — these are all VM shutdown scenarios where only 
> irqfd_shutdown workqueue items run.
> 
> The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
> during a VM create/destroy stress test where assign and shutdown race.
> His traces showed kvm_irqfd (the assign path) stuck in
> synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
> the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.
> 
> > 
> > If SRCU GP remains independent, it would help distinguish whether
> >  this is a strict deadlock or a form of workqueue starvation / lock
> >  contention.
> > 
> Based on the data from both instances, SRCU GP is NOT remaining
> independent. process_srcu stays permanently pending on the affected
> per-CPU pool for 250-300 seconds. But it's not just process_srcu —
> ALL pending work on the pool is stuck, including items from events,
> cgroup, mm, slub, and other workqueues.
> 
> > 
> > A timestamp-correlated dump (blocked stacks + workqueue state +
> >  SRCU GP activity) would likely be sufficient to classify this.
> > 
> I hope the correlated dumps above from both instances are helpful.
> To summarize the timeline (consistent across both):
> 
>  t=0: VM shutdown begins, crosvm detaches irqfds
>  t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
>  One worker acquires resampler_lock, enters synchronize_srcu
>  Other 3 workers block on __mutex_lock
>  t=~43: First "BUG: workqueue lockup" — pool detected stuck
>  rcu_gp: process_srcu shown as "pending" on same pool
>  t=~93 Through t=~312: Repeated dumps every ~30s
>  process_srcu remains permanently "pending"
>  Pool has idle workers but no pending work executes
>  t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
>  t=~316: init triggers sysrq crash → kernel panic
> 

Thanks, this is useful and much clearer.

One thing that is still unclear is dispatch behavior:
`process_srcu` stays pending for a long time, while the same pwq dump shows idle workers.

So the key question is: what prevents pending work from being dispatched on that pwq?
Is it due to:
  1) pwq stalled/hung state,
  2) worker availability/affinity constraints,
  3) or another dispatch-side condition?

Also, for scope:
- your crash instances consistently show the shutdown path
  (irqfd_resampler_shutdown + synchronize_srcu),
- while the assign-path evidence, based on the data in this thread,
  appears to come from a separate stress case.

A time-aligned dump with pwq state, pending/in-flight lists, and worker states
should help clarify this.


> > 
> > Happy to help look at traces if available.
> > 
> I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
> instances. Shall I post them or send them off-list?
> 

If possible, please post sanitized ramoops/dmesg logs on-list so others can validate.

Thanx, Kunwu

> Thanks,
> Sonam
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
  2026-03-31 20:51     ` Paul E. McKenney
  2026-04-01  9:47       ` Sonam Sanju
@ 2026-04-06 23:09       ` Paul E. McKenney
  1 sibling, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2026-04-06 23:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Sonam Sanju, Lai Jiangshan, Josh Triplett, Paolo Bonzini,
	Vineeth Pillai, Dmitry Maluka, kvm, linux-kernel, stable,
	Steven Rostedt, Mathieu Desnoyers, rcu

On Tue, Mar 31, 2026 at 01:51:11PM -0700, Paul E. McKenney wrote:
> On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> > +srcu folks

[ . . . ]

> > Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> > underlying flaw.  Essentially, this would be establishing a rule that
> > synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
> > not viable.
> 
> First, it is OK to invoke synchronize_srcu_expedited() while holding
> a mutex.  Second, the synchronize_srcu_expedited() function's use of
> workqueues is the same as that of synchronize_srcu(), so in an alternate
> universe where it was not OK to invoke synchronize_srcu_expedited() while
> holding a mutex, it would also not be OK to invoke synchronize_srcu()
> while holding that same mutex.  Third, it is also OK to acquire that
> same mutex within a workqueue handler.  Fourth, SRCU and RCU use their
> own workqueue, which no one else should be using (and that prohibition
> most definitely includes the irqfd workers).
> 
> As a result, I do have to ask...  When you say "multiple irqfd workers",
> exactly how many such workers are you running?

Just to be clear, I am guessing that you have the workqueues counterpart
to a fork bomb.  However, if you are using a small finite number of
workqueue handlers, then we need to make adjustments in SRCU, workqueues,
or maybe SRCU's use of workqueues.

So if my fork-bomb guess is incorrect, please let me know.

							Thanx, Paul

> > >   4. The mutex holder never releases the lock -> deadlock

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
  2026-04-06 14:20   ` Kunwu Chan
@ 2026-04-17  1:18     ` Vineeth Pillai
  2026-04-19  3:03       ` Vineeth Remanan Pillai
  2026-04-21  5:12     ` Sonam Sanju
  1 sibling, 1 reply; 11+ messages in thread
From: Vineeth Pillai @ 2026-04-17  1:18 UTC (permalink / raw)
  To: kunwu.chan, paulmck
  Cc: dmaluka, kvm, linux-kernel, pbonzini, rcu, seanjc, sonam.sanju,
	sonam.sanju, stable, vineeth

Consolidating replies into one thread.

Hi Kunwu,

> One thing that is still unclear is dispatch behavior:
> `process_srcu` stays pending for a long time, while the same pwq dump shows idle workers.
>
> So the key question is: what prevents pending work from being dispatched on that pwq?
> Is it due to:
>   1) pwq stalled/hung state,
>   2) worker availability/affinity constraints,
>   3) or another dispatch-side condition?
>
> Also, for scope:
> - your crash instances consistently show the shutdown path
>   (irqfd_resampler_shutdown + synchronize_srcu),
> - while assign-path evidence, per current thread data, appears to come
>   from a separate stress case.

> A time-aligned dump with pwq state, pending/in-flight lists, and worker states
> should help clarify this.

I have a dmesg log showing this issue. This is from an automated stress
reboot test. The log is very similar to what Sonam shared.

<0>[  434.338427] BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 293s!
<6>[  434.339037] Showing busy workqueues and worker pools:
<6>[  434.339387] workqueue events: flags=0x100
<6>[  434.339667]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=2 refcnt=3
<6>[  434.339691]     pending: 2*xhci_dbc_handle_events
<6>[  434.340512] workqueue events: flags=0x100
<6>[  434.340789]   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[  434.340793]     pending: vmstat_shepherd
<6>[  434.341507]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=45 refcnt=46
<6>[  434.341511]     pending: delayed_vfree_work, kernfs_notify_workfn, 5*destroy_super_work, 3*bpf_prog_free_deferred, 5*destroy_super_work, binder_deferred_func, bpf_prog_free_deferred, 25*destroy_super_work, drain_local_memcg_stock, update_stats_workfn, psi_avgs_work
<6>[  434.343578]   pwq 30: cpus=7 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[  434.343582]     in-flight: 325:do_emergency_remount
<6>[  434.344376] workqueue events_unbound: flags=0x2
<6>[  434.344688]   pwq 34: cpus=0-7 node=0 flags=0x4 nice=0 active=2 refcnt=3
<6>[  434.344693]     in-flight: 339:fsnotify_connector_destroy_workfn fsnotify_connector_destroy_workfn
<6>[  434.345755]   pwq 34: cpus=0-7 node=0 flags=0x4 nice=0 active=2 refcnt=8
<6>[  434.345759]     in-flight: 153:fsnotify_mark_destroy_workfn BAR(3098) BAR(2564) BAR(2299) fsnotify_mark_destroy_workfn BAR(416) BAR(1116)
<6>[  434.347151] workqueue events_freezable: flags=0x104
<6>[  434.347590]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[  434.347595]     pending: pci_pme_list_scan
<6>[  434.348681] workqueue events_power_efficient: flags=0x180
<6>[  434.349221]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[  434.349226]     pending: check_lifetime
<6>[  434.350397] workqueue rcu_gp: flags=0x108
<6>[  434.350853]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
<6>[  434.350857]     pending: 3*process_srcu
<6>[  434.351918] workqueue slub_flushwq: flags=0x8
<6>[  434.352409]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=3
<6>[  434.352413]     pending: flush_cpu_slab BAR(1)
<6>[  434.353529] workqueue mm_percpu_wq: flags=0x8
<6>[  434.354087]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[  434.354092]     pending: vmstat_update
<6>[  434.355205] workqueue quota_events_unbound: flags=0xa
<6>[  434.355725]   pwq 34: cpus=0-7 node=0 flags=0x4 nice=0 active=1 refcnt=3
<6>[  434.355730]     in-flight: 354:quota_release_workfn BAR(325)
<6>[  434.356980] workqueue kvm-irqfd-cleanup: flags=0x0
<6>[  434.357582]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
<6>[  434.357586]     in-flight: 51:irqfd_shutdown ,3453:irqfd_shutdown ,3449:irqfd_shutdown
<6>[  434.359101] pool 22: cpus=5 node=0 flags=0x0 nice=0 hung=293s workers=11 idle: 282 154 3452 3451 3448 3450 3455 3454
<6>[  434.359989] pool 30: cpus=7 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3460 332
<6>[  434.360539] pool 34: cpus=0-7 node=0 flags=0x4 nice=0 hung=0s workers=5 idle: 256 66

The relevant pwq is pwq 22. All three irqfd_shutdown workers are in-flight
but in D state. rcu_gp's process_srcu items are stuck pending.

Worker 51 (kworker/5:0) — blocked acquiring resampler_lock:
<6>[  440.576612] task:kworker/5:0     state:D stack:0     pid:51    tgid:51    ppid:2      task_flags:0x4208060 flags:0x00080000
<6>[  440.577379] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[  440.578085]  <TASK>
<6>[  440.578337]  preempt_schedule_irq+0x4a/0x90
<6>[  440.583712]  __mutex_lock+0x413/0xe40
<6>[  440.583969]  irqfd_resampler_shutdown+0x23/0x150
<6>[  440.584288]  irqfd_shutdown+0x66/0xc0
<6>[  440.584546]  process_scheduled_works+0x219/0x450
<6>[  440.584864]  worker_thread+0x2a7/0x3b0
<6>[  440.585421]  kthread+0x230/0x270

Worker 3449 (kworker/5:4) — same, blocked acquiring resampler_lock:
<6>[  440.671294] task:kworker/5:4     state:D stack:0     pid:3449  tgid:3449  ppid:2      task_flags:0x4208060 flags:0x00080000
<6>[  440.672088] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[  440.672662]  <TASK>
<6>[  440.673069]  schedule+0x5e/0xe0
<6>[  440.673708]  __mutex_lock+0x413/0xe40
<6>[  440.674059]  irqfd_resampler_shutdown+0x23/0x150
<6>[  440.674381]  irqfd_shutdown+0x66/0xc0
<6>[  440.674638]  process_scheduled_works+0x219/0x450
<6>[  440.674956]  worker_thread+0x2a7/0x3b0
<6>[  440.675308]  kthread+0x230/0x270

Worker 3453 (kworker/5:8) — holds resampler_lock, blocked waiting for SRCU GP:
<6>[  440.677368] task:kworker/5:8     state:D stack:0     pid:3453  tgid:3453  ppid:2      task_flags:0x4208060 flags:0x00080000
<6>[  440.678185] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[  440.678720]  <TASK>
<6>[  440.679127]  schedule+0x5e/0xe0
<6>[  440.679354]  schedule_timeout+0x2e/0x130
<6>[  440.680084]  wait_for_common+0xf7/0x1f0
<6>[  440.680355]  synchronize_srcu_expedited+0x109/0x140
<6>[  440.681164]  irqfd_resampler_shutdown+0xf0/0x150
<6>[  440.681481]  irqfd_shutdown+0x66/0xc0
<6>[  440.681738]  process_scheduled_works+0x219/0x450
<6>[  440.682055]  worker_thread+0x2a7/0x3b0
<6>[  440.682403]  kthread+0x230/0x270

The sequence is: worker 3453 acquires resampler_lock, and calls
synchronize_srcu_expedited() while holding the lock. This queues
process_srcu on rcu_gp, then blocks waiting for the GP to complete.
Workers 51 and 3449 are blocked trying to acquire the same resampler_lock.

Regarding your dispatch question: all three workers are in D state, so
they have all called schedule() and wq_worker_sleeping() should have
decremented pool->nr_running to zero. With nr_running == 0 and
process_srcu in the worklist, need_more_worker() should be true and an
idle worker should be woken via kick_pool() when process_srcu is enqueued.
Why none of the 8 idle workers end up dispatching process_srcu is not
entirely clear to me.
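
For reference, the dispatch-side logic I am reasoning about looks roughly
like this (paraphrased and simplified from kernel/workqueue.c; details
differ across kernel versions):

  static bool need_more_worker(struct worker_pool *pool)
  {
          /* pending work, and no worker currently counted as running */
          return !list_empty(&pool->worklist) && !pool->nr_running;
  }

  /* called from sched_submit_work() when a kworker is about to block */
  void wq_worker_sleeping(struct task_struct *task)
  {
          struct worker *worker = kthread_data(task);
          struct worker_pool *pool;

          if (worker->flags & WORKER_NOT_RUNNING)
                  return;

          pool = worker->pool;
          if (worker->sleeping)
                  return;
          worker->sleeping = 1;

          raw_spin_lock_irq(&pool->lock);
          pool->nr_running--;
          if (need_more_worker(pool))
                  kick_pool(pool);        /* wake an idle worker to take over */
          raw_spin_unlock_irq(&pool->lock);
  }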

Moving the synchronize_srcu_expedited() out of the lock does solve this
issue, but I am not exactly sure why the deadlock between irqfd-shutdown
workers is causing the workqueue to stall.

The full dmesg is at: https://gist.github.com/vineethrp/883db560a4503612448db9b10e02a9b5

Hi Paul,

> Just to be clear, I am guessing that you have the workqueues counterpart
> to a fork bomb. However, if you are using a small finite number of
> workqueue handlers, then we need to make adjustments in SRCU, workqueues,
> or maybe SRCU's use of workqueues.

In this log, I am not seeing a workqueue being stressed out. There are
8 idle workers, but for some reason no worker is assigned to run process_srcu.
I am not sure if it's a workqueue-related race condition or if it's working
as intended by not kicking new workers while there are in-flight workers in
D state.

> SRCU and RCU use their own workqueue, which no one else should be
> using (and that prohibition most definitely includes the irqfd workers).

kvm-irqfd-cleanup and rcu_gp, while being separate workqueues, share the
same per-CPU pool (pwq 22).  Both are CPU-bound: rcu_gp's flags=0x108 do
not include WQ_UNBOUND, and its pwq for CPU 5 resolves to the same
per-CPU pool (pool 22, flags=0x0) as kvm-irqfd-cleanup (flags=0x0).
I think CPU-bound workqueues share the per-CPU pools regardless of being
separate workqueues, and these two workqueues end up competing for the
same underlying pool's workers.

Making kvm-irqfd-cleanup unbound (WQ_UNBOUND) would place it on a
separate pool from rcu_gp, which I guess would prevent this interference
and fix the stall.
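
Something along these lines, completely untested (against kvm_irqfd_init()
in virt/kvm/eventfd.c; the second argument currently passed there may
differ on 6.18, so treat this as a sketch):

  -	irqfd_cleanup_wq = alloc_workqueue("kvm-irqfd-cleanup", 0, 0);
  +	irqfd_cleanup_wq = alloc_workqueue("kvm-irqfd-cleanup", WQ_UNBOUND, 0);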

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
  2026-04-17  1:18     ` Vineeth Pillai
@ 2026-04-19  3:03       ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 11+ messages in thread
From: Vineeth Remanan Pillai @ 2026-04-19  3:03 UTC (permalink / raw)
  To: kunwu.chan, paulmck, Tejun Heo
  Cc: dmaluka, kvm, linux-kernel, pbonzini, rcu, seanjc, sonam.sanju,
	sonam.sanju, stable

On Thu, Apr 16, 2026 at 9:18 PM Vineeth Pillai <vineeth@bitbyteword.org> wrote:
>
> Consolidating replies into one thread.
>
> Hi Kunwu,
>
> > One thing that is still unclear is dispatch behavior:
> > `process_srcu` stays pending for a long time, while the same pwq dump shows idle workers.
> >
> > So the key question is: what prevents pending work from being dispatched on that pwq?
> > Is it due to:
> >   1) pwq stalled/hung state,
> >   2) worker availability/affinity constraints,
> >   3) or another dispatch-side condition?
> >
> > Also, for scope:
> > - your crash instances consistently show the shutdown path
> >   (irqfd_resampler_shutdown + synchronize_srcu),
> > - while assign-path evidence, per current thread data, appears to come
> >   from a separate stress case.
>
> > A time-aligned dump with pwq state, pending/in-flight lists, and worker states
> > should help clarify this.
>
> I have a dmesg log showing this issue. This is from an automated stress
> reboot test. The log is very similar to what Sonam shared.
>
> <0>[  434.338427] BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 293s!
> <6>[  434.339037] Showing busy workqueues and worker pools:
> <6>[  434.339387] workqueue events: flags=0x100
>  ...
> <6>[  434.350853]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
> <6>[  434.350857]     pending: 3*process_srcu
> ...
> <6>[  434.356980] workqueue kvm-irqfd-cleanup: flags=0x0
> <6>[  434.357582]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
> <6>[  434.357586]     in-flight: 51:irqfd_shutdown ,3453:irqfd_shutdown ,3449:irqfd_shutdown
>
> The relevant pwq is pwq 22. All three irqfd_shutdown workers are in-flight
> but in D state. rcu_gp's process_srcu items are stuck pending.
>
> Worker 51 (kworker/5:0) — blocked acquiring resampler_lock:
> <6>[  440.576612] task:kworker/5:0     state:D stack:0     pid:51    tgid:51    ppid:2      task_flags:0x4208060 flags:0x00080000
> <6>[  440.577379] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> <6>[  440.578085]  <TASK>
> <6>[  440.578337]  preempt_schedule_irq+0x4a/0x90
> <6>[  440.583712]  __mutex_lock+0x413/0xe40
> <6>[  440.583969]  irqfd_resampler_shutdown+0x23/0x150
> <6>[  440.584288]  irqfd_shutdown+0x66/0xc0
> <6>[  440.584546]  process_scheduled_works+0x219/0x450
> <6>[  440.584864]  worker_thread+0x2a7/0x3b0
> <6>[  440.585421]  kthread+0x230/0x270
>
> Worker 3449 (kworker/5:4) — same, blocked acquiring resampler_lock:
> <6>[  440.671294] task:kworker/5:4     state:D stack:0     pid:3449  tgid:3449  ppid:2      task_flags:0x4208060 flags:0x00080000
> <6>[  440.672088] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> <6>[  440.672662]  <TASK>
> <6>[  440.673069]  schedule+0x5e/0xe0
> <6>[  440.673708]  __mutex_lock+0x413/0xe40
> <6>[  440.674059]  irqfd_resampler_shutdown+0x23/0x150
> <6>[  440.674381]  irqfd_shutdown+0x66/0xc0
> <6>[  440.674638]  process_scheduled_works+0x219/0x450
> <6>[  440.674956]  worker_thread+0x2a7/0x3b0
> <6>[  440.675308]  kthread+0x230/0x270
>
> Worker 3453 (kworker/5:8) — holds resampler_lock, blocked waiting for SRCU GP:
> <6>[  440.677368] task:kworker/5:8     state:D stack:0     pid:3453  tgid:3453  ppid:2      task_flags:0x4208060 flags:0x00080000
> <6>[  440.678185] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> <6>[  440.678720]  <TASK>
> <6>[  440.679127]  schedule+0x5e/0xe0
> <6>[  440.679354]  schedule_timeout+0x2e/0x130
> <6>[  440.680084]  wait_for_common+0xf7/0x1f0
> <6>[  440.680355]  synchronize_srcu_expedited+0x109/0x140
> <6>[  440.681164]  irqfd_resampler_shutdown+0xf0/0x150
> <6>[  440.681481]  irqfd_shutdown+0x66/0xc0
> <6>[  440.681738]  process_scheduled_works+0x219/0x450
> <6>[  440.682055]  worker_thread+0x2a7/0x3b0
> <6>[  440.682403]  kthread+0x230/0x270
>
> The sequence is: worker 3453 acquires resampler_lock, and calls
> synchronize_srcu_expedited() while holding the lock. This queues
> process_srcu on rcu_gp, then blocks waiting for the GP to complete.
> Workers 51 and 3449 are blocked trying to acquire the same resampler_lock.
>
> Regarding your dispatch question: all three workers are in D state, so
> they have all called schedule() and wq_worker_sleeping() should have
> decremented pool->nr_running to zero. With nr_running == 0 and
> process_srcu in the worklist, needs_more_worker() should be true and an
> idle worker should be woken via kick_pool() when process_srcu is enqueued.
> Why none of the 8 idle workers end up dispatching process_srcu is not
> entirely clear to me.
>
> Moving the synchronize_srcu_expedited() out of the lock does solve this
> issue, but I am not exactly sure why the deadlock between irqfd-shutdown
> workers is causing the workqueue to stall.
>

I think I know what is happening now. After adding some more debug
prints, I see that worker->sleeping is 0 for one of the workers
waiting for the mutex (pid 51) in the example above, and
pool->nr_running is 1. This prevents the pool from waking idle
workers to run the pending work.

This time I got a more descriptive stack trace as well:

<6>[18433.604285][T10987] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[18433.611204][T10987] Call Trace:
<6>[18433.615001][T10987]  <TASK>
<6>[18433.618414][T10987]  __schedule+0x8cf/0xdb0
<6>[18433.623372][T10987]  preempt_schedule_irq+0x4a/0x90
<6>[18433.629112][T10987]  asm_sysvec_reschedule_ipi+0x1a/0x20
<6>[18433.635340][T10987] RIP: 0010:kthread_data+0x15/0x30
<6>[18433.715343][T10987]  wq_worker_sleeping+0xc/0x90
<6>[18433.720806][T10987]  schedule+0x30/0xe0
<6>[18433.725379][T10987]  schedule_preempt_disabled+0x10/0x20
<6>[18433.731604][T10987]  __mutex_lock+0x413/0xe40
<6>[18433.736763][T10987]  irqfd_resampler_shutdown+0x23/0x150
<6>[18433.742989][T10987]  irqfd_shutdown+0x66/0xc0
<6>[18433.748145][T10987]  process_scheduled_works+0x219/0x450
<6>[18433.754370][T10987]  worker_thread+0x30b/0x450
<6>[18433.765460][T10987]  kthread+0x227/0x2a0
<6>[18433.775383][T10987]  ret_from_fork+0xfe/0x1b0

If I am reading the stack correctly, an IPI was serviced while the task
was in wq_worker_sleeping() (which is responsible for marking the worker
as sleeping and decrementing pool->nr_running). I guess the worker was
interrupted before it could update nr_running and worker->sleeping.
After the IPI was serviced, preempt_schedule_irq() was called and then
__schedule(), which schedules the task out before it could decrement
nr_running. And it is never woken up because the mutex holder is waiting
for the GP to complete. But process_srcu cannot proceed because the
workqueue pool is not kicking idle workers as nr_running is 1.
Effectively a deadlock.

So, basically what happens is (based on the above example):
- the SRCU GP work and the irqfd workers (3453, 51) are on the same per-CPU pool
- worker 3453 acquires resampler_lock and calls
synchronize_srcu_expedited() while holding the lock.
- worker 51 waits on the lock, but is unable to update the critical
workqueue counters (nr_running and sleeping) before it schedules out.
- The workqueue pool is stalled, thereby preventing SRCU GP progress.

This also explains why the issue is not seen when the
synchronize_srcu_expedited is called outside the lock.

Going directly to __schedule() after servicing the IPI is the main
problem, as wq_worker_sleeping() could not complete. Without the IPI in
the picture, the schedule-out path would be:
__mutex_lock
 schedule()
    sched_submit_work()
        wq_worker_sleeping()
    __schedule_loop()
        __schedule()

With the IPI in the picture, it becomes:
__mutex_lock
 schedule()
    sched_submit_work()
        wq_worker_sleeping() <-- half way through
              IPI
           preempt_schedule_irq()
              __schedule()
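
For reference, the two entry points, paraphrased from kernel/sched/core.c
(simplified sketch; the exact code differs by kernel version):

  asmlinkage __visible void __sched schedule(void)
  {
          struct task_struct *tsk = current;

          if (!task_is_running(tsk))
                  sched_submit_work(tsk); /* -> wq_worker_sleeping() for kworkers */
          __schedule_loop(SM_NONE);
          sched_update_worker(tsk);       /* -> wq_worker_running() on return */
  }

  asmlinkage __visible void __sched preempt_schedule_irq(void)
  {
          /* ... */
          __schedule(SM_PREEMPT);         /* no sched_submit_work() on this path */
          /* ... */
  }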

Moving `sched_submit_work()` into __schedule() might solve this issue, but
I'm not sure whether it would cause other issues. Adding Tejun for an
expert opinion on the workqueue side :-)

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
  2026-04-06 14:20   ` Kunwu Chan
  2026-04-17  1:18     ` Vineeth Pillai
@ 2026-04-21  5:12     ` Sonam Sanju
  1 sibling, 0 replies; 11+ messages in thread
From: Sonam Sanju @ 2026-04-21  5:12 UTC (permalink / raw)
  To: kunwu.chan
  Cc: dmaluka, kvm, linux-kernel, paulmck, pbonzini, rcu, seanjc,
	sonam.sanju, stable, vineeth


> Could you provide a time-aligned dump that includes:
>   - pwq state (active/pending/in-flight)
>   - pending and in-flight work items with their queue/start times
>   - worker task states

Below are time-aligned extracts from both instances.  A link to the full
logs is at the end of this email.

=== Instance 1: kernel 6.18.8, pool 14 (cpus=3) ===

--- t=62s: First workqueue lockup dump (pool stuck 49s, since ~t=13s) ---

  kvm-irqfd-cleanup: pwq 14: active=4 refcnt=5
    in-flight: 157:irqfd_shutdown ,4044:irqfd_shutdown ,
               102:irqfd_shutdown ,39:irqfd_shutdown

  rcu_gp: pwq 14: active=2 refcnt=3
    pending: 2*process_srcu

  events: pwq 14: active=43 refcnt=44
    pending: binder_deferred_func, kernfs_notify_workfn,
             delayed_vfree_work, 5*destroy_super_work,
             3*bpf_prog_free_deferred, 10*destroy_super_work, ...

  mm_percpu_wq: pwq 14: active=2 refcnt=4
    pending: vmstat_update, lru_add_drain_per_cpu

  pm: pwq 14: active=1 refcnt=2
    pending: pm_runtime_work

  pool 14: cpus=3 flags=0x0 hung=49s workers=11
    idle: 4046 4038 4045 4039 4043 156 77  (7 idle)

  Active busy worker backtrace (pid 102):
    __schedule → schedule → schedule_preempt_disabled →
    __mutex_lock → irqfd_resampler_shutdown+0x23 →
    irqfd_shutdown → process_scheduled_works → worker_thread

--- t=312s: Last workqueue lockup dump (pool stuck 298s) ---

  kvm-irqfd-cleanup: pwq 14: active=4 (same 4 in-flight)
  rcu_gp: pwq 14: pending: 2*process_srcu  (still pending, 250s later)
  events: pwq 14: active=43  (same, no progress)
  pool 14: hung=298s workers=11 idle: 4046 4038 4045 4039 4043 156 77

--- t=314s: Hung task dump ---

  Worker 4044 (MUTEX HOLDER):
    task:kworker/3:8   state:D  pid:4044
    Workqueue: kvm-irqfd-cleanup irqfd_shutdown
      __synchronize_srcu+0x100/0x130
      irqfd_resampler_shutdown+0xf0/0x150  ← synchronize_srcu call

  Worker 157 (MUTEX WAITER):
    task:kworker/3:4   state:D  pid:157
      __mutex_lock+0x409/0xd90
      irqfd_resampler_shutdown+0x23/0x150  ← mutex_lock call

  (Workers 39 and 102 show identical mutex_lock stacks)

=== Instance 2: kernel 6.18.2, pool 22 (cpus=5) ===

--- t=93s: First workqueue lockup dump (pool stuck 79s, since ~t=14s) ---

  kvm-irqfd-cleanup: pwq 22: active=4 refcnt=5
    in-flight: 151:irqfd_shutdown ,4246:irqfd_shutdown ,
               4241:irqfd_shutdown ,4243:irqfd_shutdown

  rcu_gp: pwq 22: active=1 refcnt=2
    pending: process_srcu

  events: pwq 22: active=56 refcnt=57
    pending: kernfs_notify_workfn, delayed_vfree_work,
             binder_deferred_func, 47*destroy_super_work, ...

  pool 22: cpus=5 flags=0x0 hung=79s workers=12
    idle: 4242 51 4248 4247 4245 435 4244 4239  (8 idle)

--- t=341s: Last workqueue lockup dump (pool stuck 327s) ---

  kvm-irqfd-cleanup: pwq 22: active=4 (same)
  rcu_gp: pwq 22: pending: process_srcu  (still pending, 248s later)
  events: pwq 22: active=56  (56 pending items, zero progress)
  pool 22: hung=327s workers=12 idle: same 8 workers

--- t=343s: Hung task dump ---

  Worker 4241 (MUTEX HOLDER):
    task:kworker/5:4   state:D  pid:4241
    Workqueue: kvm-irqfd-cleanup irqfd_shutdown
      __synchronize_srcu+0x100/0x130
      irqfd_resampler_shutdown+0xf0/0x150

  Worker 4243 (MUTEX WAITER):
    task:kworker/5:6   state:D  pid:4243
      __mutex_lock+0x37d/0xbb0
      irqfd_resampler_shutdown+0x23/0x150

  (Workers 151 and 4246 show identical mutex_lock stacks)

> Please post sanitized ramoops/dmesg logs on-list so others can
> validate.

Full logs: https://gist.github.com/sonam-sanju/773855aa2cbe156ca19f3a87bbebc15e

Thanks,
Sonam

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
  2026-04-21 18:22 [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race Tejun Heo
@ 2026-04-23  9:01 ` Sonam Sanju
  2026-04-23 13:25   ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 11+ messages in thread
From: Sonam Sanju @ 2026-04-23  9:01 UTC (permalink / raw)
  To: tj
  Cc: dmaluka, kunwu.chan, kvm, linux-kernel, paulmck, pbonzini, rcu,
	seanjc, sonam.sanju, stable, vineeth


Hello Tejun,

Thank you for the detailed analysis.

On Wed, Apr 23, 2026, Tejun Heo wrote:
> The problem with this theory is that this kworker, while preempted, is still
> runnable and should be dispatched to its CPU once it becomes available
> again. Workqueue doesn't care whether the task gets preempted or when it
> gets the CPU back. It only cares about whether the task enters blocking
> state (!runnable). A task which is preempted, even on the way to blocking,
> still is runnable and should get put back on the CPU by the scheduler.
>
> If you can take a crashdump of the deadlocked state, can you see whether the
> task is still on the scheduler's runqueue?

I instrumented show_one_worker_pool() to dump scheduler state for each busy worker 
when the pool has been hung for >30 seconds.
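
The instrumentation is roughly the following (a reconstructed, simplified
sketch of our debug change inside show_one_worker_pool(); the blocked_on
column below is dumped by a separate hunk and omitted here):

  struct worker *worker;

  if (jiffies_to_msecs(jiffies - pool->watchdog_ts) > 30 * MSEC_PER_SEC) {
          list_for_each_entry(worker, &pool->workers, node) {
                  struct task_struct *p = worker->task;

                  /* scheduler view of each worker in the hung pool */
                  pr_info("  pid=%d state=0x%x on_rq=%d se.on_rq=%d "
                          "sched_delayed=%d sleeping=%d\n",
                          task_pid_nr(p), READ_ONCE(p->__state),
                          p->on_rq, p->se.on_rq,
                          p->se.sched_delayed, worker->sleeping);
          }
  }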

All workers show on_rq=0.

== Pool state ==

  pool 2: cpus=0 node=0 flags=0x0 nice=0 hung=47s
  workers=13 nr_running=1 nr_idle=7

== Per-worker scheduler state (first dump at t=62.5s) ==

  PID  | state | on_rq | se.on_rq | sched_delayed | sleeping | blocked_on
  -----|-------|-------|----------|---------------|----------|-------------------
  4819 | 0x2   | 0     | 0        | 0             | 1        | ffff953608205210 type=1
  4823 | 0x2   | 0     | 0        | 0             | 1        | ffff953608205210 type=1
  4818 | 0x2   | 0     | 0        | 0             | 0        | ffff953608205210 type=1
  11   | 0x2   | 0     | 0        | 0             | 1        | ffff953608205210 type=1
  9    | 0x2   | 0     | 0        | 0             | 1        | ffff953608205210 type=1
  4814 | 0x2   | 0     | 0        | 0             | 1        | (mutex holder)


All 6 workers are in kvm-irqfd-cleanup, calling irqfd_shutdown →
irqfd_resampler_shutdown. They all contend on the same resampler_lock
mutex (ffff953608205210).

Full logs: https://gist.github.com/sonam-sanju/08042878542b7a58d2818e6076554211

Thanks,
Sonam

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
  2026-04-23  9:01 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sonam Sanju
@ 2026-04-23 13:25   ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 11+ messages in thread
From: Vineeth Remanan Pillai @ 2026-04-23 13:25 UTC (permalink / raw)
  To: Sonam Sanju
  Cc: tj, dmaluka, kunwu.chan, kvm, linux-kernel, paulmck, pbonzini,
	rcu, seanjc, stable

On Thu, Apr 23, 2026 at 5:05 AM Sonam Sanju <sonam.sanju@intel.com> wrote:
>
> Hello Tejun,
>
> Thank you for the detailed analysis.
>
> On Wed, Apr 23, 2026, Tejun Heo wrote:
> > The problem with this theory is that this kworker, while preempted, is still
> > runnable and should be dispatched to its CPU once it becomes available
> > again. Workqueue doesn't care whether the task gets preempted or when it
> > gets the CPU back. It only cares about whether the task enters blocking
> > state (!runnable). A task which is preempted, even on the way to blocking,
> > still is runnable and should get put back on the CPU by the scheduler.
> >
> > If you can take a crashdump of the deadlocked state, can you see whether the
> > task is still on the scheduler's runqueue?
>
> I instrumented show_one_worker_pool() to dump scheduler state for each busy worker
> when the pool has been hung for >30 seconds.
>
> All workers show on_rq=0.
>
> == Pool state ==
>
>   pool 2: cpus=0 node=0 flags=0x0 nice=0 hung=47s
>   workers=13 nr_running=1 nr_idle=7
>
> == Per-worker scheduler state (first dump at t=62.5s) ==
>
>   PID  | state | on_rq | se.on_rq | sched_delayed | sleeping | blocked_on
>   -----|-------|-------|----------|---------------|----------|-------------------
>   4819 | 0x2   | 0     | 0        | 0             | 1        | ffff953608205210 type=1
>   4823 | 0x2   | 0     | 0        | 0             | 1        | ffff953608205210 type=1
>   4818 | 0x2   | 0     | 0        | 0             | 0        | ffff953608205210 type=1
>   11   | 0x2   | 0     | 0        | 0             | 1        | ffff953608205210 type=1
>   9    | 0x2   | 0     | 0        | 0             | 1        | ffff953608205210 type=1
>   4814 | 0x2   | 0     | 0        | 0             | 1        | (mutex holder)
>
>
> All 6 workers are in kvm-irqfd-cleanup, calling irqfd_shutdown →
> irqfd_resampler_shutdown. They contend on the same resampler->lock
> mutex (ffff953608205210).
>

Sorry for the late disclosure; I was running the 6.18 Android kernel
and missed this relevant detail because the bug discussion initially
started with KVM, and I had verified that the irqfd-related code was
the same as in the vanilla kernel. Now, after going through Tejun's
response and reviewing the __schedule() code regarding SM_PREEMPT, I
realized the Android kernel has extra logic related to proxy execution
that might be triggering this issue. I tested on a vanilla 6.18.23
kernel and was not able to reproduce this.

Sonam, just checking: are you able to reproduce this issue with a
vanilla 6.18 kernel?

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-04-23 13:25 UTC | newest]

Thread overview: 11+ messages
     [not found] <20260323053353.805336-1-sonam.sanju@intel.com>
     [not found] ` <20260323064248.1660757-1-sonam.sanju@intel.com>
2026-03-31 18:17   ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sean Christopherson
2026-03-31 20:51     ` Paul E. McKenney
2026-04-01  9:47       ` Sonam Sanju
2026-04-06 23:09       ` Paul E. McKenney
     [not found] <5194cf52-f8a8-4479-a95e-233104272839@linux.dev>
2026-04-01 14:24 ` Sonam Sanju
2026-04-06 14:20   ` Kunwu Chan
2026-04-17  1:18     ` Vineeth Pillai
2026-04-19  3:03       ` Vineeth Remanan Pillai
2026-04-21  5:12     ` Sonam Sanju
2026-04-21 18:22 [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race Tejun Heo
2026-04-23  9:01 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sonam Sanju
2026-04-23 13:25   ` Vineeth Remanan Pillai
