* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
[not found] ` <20260323064248.1660757-1-sonam.sanju@intel.com>
@ 2026-03-31 18:17 ` Sean Christopherson
2026-03-31 20:51 ` Paul E. McKenney
0 siblings, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2026-03-31 18:17 UTC (permalink / raw)
To: Sonam Sanju, Paul E. McKenney, Lai Jiangshan, Josh Triplett
Cc: Paolo Bonzini, Vineeth Pillai, Dmitry Maluka, kvm, linux-kernel,
stable, Steven Rostedt, Mathieu Desnoyers, rcu
+srcu folks
Please don't post subsequent versions In-Reply-To previous versions, it tends to
muck up tooling.
On Mon, Mar 23, 2026, Sonam Sanju wrote:
> irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> This can deadlock when multiple irqfd workers run concurrently on the
> kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> created and destroyed:
>
> CPU A (mutex holder) CPU B/C/D (mutex waiters)
> irqfd_shutdown() irqfd_shutdown() / kvm_irqfd_assign()
> irqfd_resampler_shutdown() irqfd_resampler_shutdown()
> mutex_lock(resampler_lock) <---- mutex_lock(resampler_lock) //BLOCKED
> list_del_rcu(...) ...blocked...
> synchronize_srcu_expedited() // Waiters block workqueue,
> // waits for SRCU grace preventing SRCU grace
> // period which requires period from completing
> // workqueue progress --- DEADLOCK ---
>
> In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> the else branch is called directly within the mutex. In the if-last
> branch, kvm_unregister_irq_ack_notifier() also calls
> synchronize_srcu_expedited() internally. In kvm_irqfd_assign(),
> synchronize_srcu_expedited() is called after list_add_rcu() but
> before mutex_unlock(). All paths can block indefinitely because:
>
> 1. synchronize_srcu_expedited() waits for an SRCU grace period
> 2. SRCU grace period completion needs workqueue workers to run
> 3. The blocked mutex waiters occupy workqueue slots preventing progress
Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
underlying flaw. Essentially, this would be establishing a rule that
synchronize_srcu_expedited() can *never* be called while holding a mutex. That's
not viable.
> 4. The mutex holder never releases the lock -> deadlock
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
2026-03-31 18:17 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sean Christopherson
@ 2026-03-31 20:51 ` Paul E. McKenney
2026-04-01 9:47 ` Sonam Sanju
2026-04-06 23:09 ` Paul E. McKenney
0 siblings, 2 replies; 11+ messages in thread
From: Paul E. McKenney @ 2026-03-31 20:51 UTC (permalink / raw)
To: Sean Christopherson
Cc: Sonam Sanju, Lai Jiangshan, Josh Triplett, Paolo Bonzini,
Vineeth Pillai, Dmitry Maluka, kvm, linux-kernel, stable,
Steven Rostedt, Mathieu Desnoyers, rcu
On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> +srcu folks
>
> Please don't post subsequent versions In-Reply-To previous versions, it tends to
> muck up tooling.
>
> On Mon, Mar 23, 2026, Sonam Sanju wrote:
> > irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> > synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> > This can deadlock when multiple irqfd workers run concurrently on the
> > kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> > created and destroyed:
> >
> > CPU A (mutex holder) CPU B/C/D (mutex waiters)
> > irqfd_shutdown() irqfd_shutdown() / kvm_irqfd_assign()
> > irqfd_resampler_shutdown() irqfd_resampler_shutdown()
> > mutex_lock(resampler_lock) <---- mutex_lock(resampler_lock) //BLOCKED
> > list_del_rcu(...) ...blocked...
> > synchronize_srcu_expedited() // Waiters block workqueue,
> > // waits for SRCU grace preventing SRCU grace
> > // period which requires period from completing
> > // workqueue progress --- DEADLOCK ---
> >
> > In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> > the else branch is called directly within the mutex. In the if-last
> > branch, kvm_unregister_irq_ack_notifier() also calls
> > synchronize_srcu_expedited() internally. In kvm_irqfd_assign(),
> > synchronize_srcu_expedited() is called after list_add_rcu() but
> > before mutex_unlock(). All paths can block indefinitely because:
> >
> > 1. synchronize_srcu_expedited() waits for an SRCU grace period
> > 2. SRCU grace period completion needs workqueue workers to run
> > 3. The blocked mutex waiters occupy workqueue slots preventing progress
>
> Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> underlying flaw. Essentially, this would be establishing a rule that
> synchronize_srcu_expedited() can *never* be called while holding a mutex. That's
> not viable.
First, it is OK to invoke synchronize_srcu_expedited() while holding
a mutex. Second, the synchronize_srcu_expedited() function's use of
workqueues is the same as that of synchronize_srcu(), so in an alternate
universe where it was not OK to invoke synchronize_srcu_expedited() while
holding a mutex, it would also not be OK to invoke synchronize_srcu()
while holding that same mutex. Third, it is also OK to acquire that
same mutex within a workqueue handler. Fourth, SRCU and RCU use their
own workqueue, which no one else should be using (and that prohibition
most definitely includes the irqfd workers).
As a result, I do have to ask... When you say "multiple irqfd workers",
exactly how many such workers are you running?
Thanx, Paul
> > 4. The mutex holder never releases the lock -> deadlock
* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
2026-03-31 20:51 ` Paul E. McKenney
@ 2026-04-01 9:47 ` Sonam Sanju
2026-04-06 23:09 ` Paul E. McKenney
1 sibling, 0 replies; 11+ messages in thread
From: Sonam Sanju @ 2026-04-01 9:47 UTC (permalink / raw)
To: Paul E. McKenney, Sean Christopherson
Cc: Paolo Bonzini, Vineeth Pillai, Dmitry Maluka, Lai Jiangshan,
Josh Triplett, Steven Rostedt, Mathieu Desnoyers, kvm,
linux-kernel, stable, rcu, Sonam Sanju
From: Sonam Sanju <sonam.sanju@intel.com>
On Tue, Mar 31, 2026 at 01:51:00PM -0700, Paul E. McKenney wrote:
> On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> > Please don't post subsequent versions In-Reply-To previous versions, it tends to
> > muck up tooling.
Noted, will send future versions as new top-level threads. Sorry about
that.
> > Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> > underlying flaw. Essentially, this would be establishing a rule that
> > synchronize_srcu_expedited() can *never* be called while holding a mutex. That's
> > not viable.
>
> First, it is OK to invoke synchronize_srcu_expedited() while holding
> a mutex. Second, the synchronize_srcu_expedited() function's use of
> workqueues is the same as that of synchronize_srcu(), so in an alternate
> universe where it was not OK to invoke synchronize_srcu_expedited() while
> holding a mutex, it would also not be OK to invoke synchronize_srcu()
> while holding that same mutex. Third, it is also OK to acquire that
> same mutex within a workqueue handler. Fourth, SRCU and RCU use their
> own workqueue, which no one else should be using (and that prohibition
> most definitely includes the irqfd workers).
Thank you for clarifying this.
> As a result, I do have to ask... When you say "multiple irqfd workers",
> exactly how many such workers are you running?
While running cold/warm reboot cycling on our Android platforms with a
6.18 kernel, the hung_task traces consistently show 8-15
kvm-irqfd-cleanup workers in D state. These are crosvm instances with
roughly 10-16 irqfd lines per VM (virtio-blk, virtio-net, virtio-input,
virtio-snd, etc., each with a resampler).
Vineeth Pillai (Google) reproduced a related scenario under a VM
create/destroy stress test where the workqueue reached active=1024
refcnt=2062, though that is a much more extreme case than what we see
during normal shutdown.
The core of the deadlock is genuinely there: one worker holds
resampler_lock and blocks in synchronize_srcu_expedited() while the
remaining 8-15 workers block in __mutex_lock at
irqfd_resampler_shutdown.
Thanks,
Sonam
* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
[not found] <5194cf52-f8a8-4479-a95e-233104272839@linux.dev>
@ 2026-04-01 14:24 ` Sonam Sanju
2026-04-06 14:20 ` Kunwu Chan
0 siblings, 1 reply; 11+ messages in thread
From: Sonam Sanju @ 2026-04-01 14:24 UTC (permalink / raw)
To: Kunwu Chan, Sean Christopherson, Paul E. McKenney
Cc: Paolo Bonzini, Vineeth Pillai, Dmitry Maluka, kvm, linux-kernel,
stable, rcu, Sonam Sanju
From: Sonam Sanju <sonam.sanju@intel.com>
On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> Building on the discussion so far, it would be helpful from the SRCU
> side to gather a bit more evidence to classify the issue.
>
> Calling synchronize_srcu_expedited() while holding a mutex is generally
> valid, so the observed behavior may be workload-dependent.
> The reported deadlock seems to rely on the assumption that SRCU grace
> period progress is indirectly blocked by irqfd workqueue saturation.
> It would be good to confirm whether that assumption actually holds.
I went back through our logs from two independent crash instances and
can now provide data for each of your questions.
> 1) Are SRCU GP kthreads/workers still making forward progress when
> the system is stuck?
No. In both crash instances, process_srcu work items remain permanently
"pending" (never "in-flight") throughout the entire hang.
Instance 1 — kernel 6.18.8, pool 14 (cpus=3):
[ 62.712760] workqueue rcu_gp: flags=0x108
[ 62.717801] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
[ 62.717801] pending: 2*process_srcu
[ 187.735092] workqueue rcu_gp: flags=0x108 (125 seconds later)
[ 187.735093] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
[ 187.735093] pending: 2*process_srcu (still pending)
9 consecutive dumps from t=62s to t=312s — process_srcu never runs.
Instance 2 — kernel 6.18.2, pool 22 (cpus=5):
[ 93.280711] workqueue rcu_gp: flags=0x108
[ 93.280713] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
[ 93.280716] pending: process_srcu
[ 309.040801] workqueue rcu_gp: flags=0x108 (216 seconds later)
[ 309.040806] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
[ 309.040806] pending: process_srcu (still pending)
8 consecutive dumps from t=93s to t=341s — process_srcu never runs.
In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
where the kvm-irqfd-cleanup workers are blocked. Both pools have idle
workers but are marked as hung/stalled:
Instance 1: pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
Instance 2: pool 22: cpus=5 hung=297s workers=12 idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)
> 2) How many irqfd workers are active in the reported scenario, and
> can they saturate CPU or worker pools?
4 kvm-irqfd-cleanup workers in both instances, consistently across all
dumps:
Instance 1 (pool 14 / cpus=3):
[ 62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
[ 62.837838] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
[ 62.837838] in-flight: 157:irqfd_shutdown ,4044:irqfd_shutdown ,
102:irqfd_shutdown ,39:irqfd_shutdown
Instance 2 (pool 22 / cpus=5):
[ 93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
[ 93.280896] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
[ 93.280900] in-flight: 151:irqfd_shutdown ,4246:irqfd_shutdown ,
4241:irqfd_shutdown ,4243:irqfd_shutdown
These are from crosvm instances with multiple virtio devices
(virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
with a resampler. During VM shutdown, all irqfds are detached
concurrently, queueing that many irqfd_shutdown work items.
The 4 workers are not saturating CPU — they're all in D state. But they
ARE all bound to the same per-CPU pool as rcu_gp's process_srcu work.
> 3) Do we have a concrete wait-for cycle showing that tasks blocked
> on resampler_lock are in turn preventing SRCU GP completion?
Yes, in both instances the hung task dump identifies the mutex holder
stuck in synchronize_srcu, with the other workers waiting on the mutex.
Instance 1 (t=314s):
Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:
[ 315.963979] task:kworker/3:8 state:D pid:4044
[ 315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
[ 316.012504] __synchronize_srcu+0x100/0x130
[ 316.023157] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)
Workers pid 39, 102, 157 — MUTEX WAITERS:
[ 314.793025] task:kworker/3:4 state:D pid:157
[ 314.837472] __mutex_lock+0x409/0xd90
[ 314.843100] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)
Instance 2 (t=343s):
Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:
[ 343.193294] task:kworker/5:4 state:D pid:4241
[ 343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
[ 343.193328] __synchronize_srcu+0x100/0x130
[ 343.193335] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)
Workers pid 151, 4243, 4246 — MUTEX WAITERS:
[ 343.193369] task:kworker/5:6 state:D pid:4243
[ 343.193397] __mutex_lock+0x37d/0xbb0
[ 343.193397] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)
Both instances show the identical wait-for cycle:
1. One worker holds resampler_lock, blocks in __synchronize_srcu
(waiting for SRCU grace period)
2. SRCU GP needs process_srcu to run — but it stays "pending"
on the same pool
3. Other irqfd workers block on __mutex_lock in the same pool
4. The pool is marked "hung" and no pending work makes progress
for 250-300 seconds until kernel panic
> 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> and kvm_irqfd_assign() paths?
In our 4 crash instances the stuck mutex holder is always in
irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu). This
is consistent — these are all VM shutdown scenarios where only
irqfd_shutdown workqueue items run.
The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
during a VM create/destroy stress test where assign and shutdown race.
His traces showed kvm_irqfd (the assign path) stuck in
synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.
> If SRCU GP remains independent, it would help distinguish whether
> this is a strict deadlock or a form of workqueue starvation / lock
> contention.
Based on the data from both instances, SRCU GP is NOT remaining
independent. process_srcu stays permanently pending on the affected
per-CPU pool for 250-300 seconds. But it's not just process_srcu —
ALL pending work on the pool is stuck, including items from events,
cgroup, mm, slub, and other workqueues.
> A timestamp-correlated dump (blocked stacks + workqueue state +
> SRCU GP activity) would likely be sufficient to classify this.
I hope the correlated dumps above from both instances are helpful.
To summarize the timeline (consistent across both):
t=0: VM shutdown begins, crosvm detaches irqfds
t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
One worker acquires resampler_lock, enters synchronize_srcu
Other 3 workers block on __mutex_lock
t=~43: First "BUG: workqueue lockup" — pool detected stuck
rcu_gp: process_srcu shown as "pending" on same pool
t=~93 through t=~312: Repeated dumps every ~30s
process_srcu remains permanently "pending"
Pool has idle workers but no pending work executes
t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
t=~316: init triggers sysrq crash → kernel panic
> Happy to help look at traces if available.
I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
instances. Shall I post them or send them off-list?
Thanks,
Sonam
* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
2026-04-01 14:24 ` Sonam Sanju
@ 2026-04-06 14:20 ` Kunwu Chan
2026-04-17 1:18 ` Vineeth Pillai
2026-04-21 5:12 ` Sonam Sanju
0 siblings, 2 replies; 11+ messages in thread
From: Kunwu Chan @ 2026-04-06 14:20 UTC (permalink / raw)
To: Sonam Sanju, Sean Christopherson, Paul E. McKenney
Cc: Paolo Bonzini, Vineeth Pillai, Dmitry Maluka, kvm, linux-kernel,
stable, rcu, Sonam Sanju
April 1, 2026 at 10:24 PM, "Sonam Sanju" <sonam.sanju@intel.corp-partner.google.com> wrote:
>
> From: Sonam Sanju <sonam.sanju@intel.com>
>
> On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
[ . . . ]
> > A timestamp-correlated dump (blocked stacks + workqueue state +
> > SRCU GP activity) would likely be sufficient to classify this.
> >
> I hope the correlated dumps above from both instances are helpful.
> To summarize the timeline (consistent across both):
>
> t=0: VM shutdown begins, crosvm detaches irqfds
> t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
> One worker acquires resampler_lock, enters synchronize_srcu
> Other 3 workers block on __mutex_lock
> t=~43: First "BUG: workqueue lockup" — pool detected stuck
> rcu_gp: process_srcu shown as "pending" on same pool
> t=~93 through t=~312: Repeated dumps every ~30s
> process_srcu remains permanently "pending"
> Pool has idle workers but no pending work executes
> t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
> t=~316: init triggers sysrq crash → kernel panic
>
Thanks, this is useful and much clearer.
One thing that is still unclear is dispatch behavior:
`process_srcu` stays pending for a long time, while the same pwq dump shows idle workers.
So the key question is: what prevents pending work from being dispatched on that pwq?
Is it due to:
1) pwq stalled/hung state,
2) worker availability/affinity constraints,
3) or another dispatch-side condition?
Also, for scope:
- your crash instances consistently show the shutdown path
(irqfd_resampler_shutdown + synchronize_srcu),
- while assign-path evidence, per current thread data, appears to come
from a separate stress case.
A time-aligned dump with pwq state, pending/in-flight lists, and worker states
should help clarify this.
> >
> > Happy to help look at traces if available.
> >
> I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
> instances. Shall I post them or send them off-list?
>
If possible, please post sanitized ramoops/dmesg logs on-list so others can validate.
Thanx, Kunwu
> Thanks,
> Sonam
>
* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
2026-03-31 20:51 ` Paul E. McKenney
2026-04-01 9:47 ` Sonam Sanju
@ 2026-04-06 23:09 ` Paul E. McKenney
1 sibling, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2026-04-06 23:09 UTC (permalink / raw)
To: Sean Christopherson
Cc: Sonam Sanju, Lai Jiangshan, Josh Triplett, Paolo Bonzini,
Vineeth Pillai, Dmitry Maluka, kvm, linux-kernel, stable,
Steven Rostedt, Mathieu Desnoyers, rcu
On Tue, Mar 31, 2026 at 01:51:11PM -0700, Paul E. McKenney wrote:
> On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> > +srcu folks
[ . . . ]
> > Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> > underlying flaw. Essentially, this would be establishing a rule that
> > synchronize_srcu_expedited() can *never* be called while holding a mutex. That's
> > not viable.
>
> First, it is OK to invoke synchronize_srcu_expedited() while holding
> a mutex. Second, the synchronize_srcu_expedited() function's use of
> workqueues is the same as that of synchronize_srcu(), so in an alternate
> universe where it was not OK to invoke synchronize_srcu_expedited() while
> holding a mutex, it would also not be OK to invoke synchronize_srcu()
> while holding that same mutex. Third, it is also OK to acquire that
> same mutex within a workqueue handler. Fourth, SRCU and RCU use their
> own workqueue, which no one else should be using (and that prohibition
> most definitely includes the irqfd workers).
>
> As a result, I do have to ask... When you say "multiple irqfd workers",
> exactly how many such workers are you running?
Just to be clear, I am guessing that you have the workqueues counterpart
to a fork bomb. However, if you are using a small finite number of
workqueue handlers, then we need to make adjustments in SRCU, workqueues,
or maybe SRCU's use of workqueues.
So if my fork-bomb guess is incorrect, please let me know.
Thanx, Paul
> > > 4. The mutex holder never releases the lock -> deadlock
* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
2026-04-06 14:20 ` Kunwu Chan
@ 2026-04-17 1:18 ` Vineeth Pillai
2026-04-19 3:03 ` Vineeth Remanan Pillai
2026-04-21 5:12 ` Sonam Sanju
1 sibling, 1 reply; 11+ messages in thread
From: Vineeth Pillai @ 2026-04-17 1:18 UTC (permalink / raw)
To: kunwu.chan, paulmck
Cc: dmaluka, kvm, linux-kernel, pbonzini, rcu, seanjc, sonam.sanju,
sonam.sanju, stable, vineeth
Consolidating replies into one thread.
Hi Kunwu,
> One thing that is still unclear is dispatch behavior:
> `process_srcu` stays pending for a long time, while the same pwq dump shows idle workers.
>
> So the key question is: what prevents pending work from being dispatched on that pwq?
> Is it due to:
> 1) pwq stalled/hung state,
> 2) worker availability/affinity constraints,
> 3) or another dispatch-side condition?
>
> Also, for scope:
> - your crash instances consistently show the shutdown path
> (irqfd_resampler_shutdown + synchronize_srcu),
> - while assign-path evidence, per current thread data, appears to come
> from a separate stress case.
> A time-aligned dump with pwq state, pending/in-flight lists, and worker states
> should help clarify this.
I have a dmesg log showing this issue. This is from an automated stress
reboot test. The log is very similar to what Sonam shared.
<0>[ 434.338427] BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 293s!
<6>[ 434.339037] Showing busy workqueues and worker pools:
<6>[ 434.339387] workqueue events: flags=0x100
<6>[ 434.339667] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=2 refcnt=3
<6>[ 434.339691] pending: 2*xhci_dbc_handle_events
<6>[ 434.340512] workqueue events: flags=0x100
<6>[ 434.340789] pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.340793] pending: vmstat_shepherd
<6>[ 434.341507] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=45 refcnt=46
<6>[ 434.341511] pending: delayed_vfree_work, kernfs_notify_workfn, 5*destroy_super_work, 3*bpf_prog_free_deferred, 5*destroy_super_work, binder_deferred_func, bpf_prog_free_deferred, 25*destroy_super_work, drain_local_memcg_stock, update_stats_workfn, psi_avgs_work
<6>[ 434.343578] pwq 30: cpus=7 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.343582] in-flight: 325:do_emergency_remount
<6>[ 434.344376] workqueue events_unbound: flags=0x2
<6>[ 434.344688] pwq 34: cpus=0-7 node=0 flags=0x4 nice=0 active=2 refcnt=3
<6>[ 434.344693] in-flight: 339:fsnotify_connector_destroy_workfn fsnotify_connector_destroy_workfn
<6>[ 434.345755] pwq 34: cpus=0-7 node=0 flags=0x4 nice=0 active=2 refcnt=8
<6>[ 434.345759] in-flight: 153:fsnotify_mark_destroy_workfn BAR(3098) BAR(2564) BAR(2299) fsnotify_mark_destroy_workfn BAR(416) BAR(1116)
<6>[ 434.347151] workqueue events_freezable: flags=0x104
<6>[ 434.347590] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.347595] pending: pci_pme_list_scan
<6>[ 434.348681] workqueue events_power_efficient: flags=0x180
<6>[ 434.349221] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.349226] pending: check_lifetime
<6>[ 434.350397] workqueue rcu_gp: flags=0x108
<6>[ 434.350853] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
<6>[ 434.350857] pending: 3*process_srcu
<6>[ 434.351918] workqueue slub_flushwq: flags=0x8
<6>[ 434.352409] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=3
<6>[ 434.352413] pending: flush_cpu_slab BAR(1)
<6>[ 434.353529] workqueue mm_percpu_wq: flags=0x8
<6>[ 434.354087] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.354092] pending: vmstat_update
<6>[ 434.355205] workqueue quota_events_unbound: flags=0xa
<6>[ 434.355725] pwq 34: cpus=0-7 node=0 flags=0x4 nice=0 active=1 refcnt=3
<6>[ 434.355730] in-flight: 354:quota_release_workfn BAR(325)
<6>[ 434.356980] workqueue kvm-irqfd-cleanup: flags=0x0
<6>[ 434.357582] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
<6>[ 434.357586] in-flight: 51:irqfd_shutdown ,3453:irqfd_shutdown ,3449:irqfd_shutdown
<6>[ 434.359101] pool 22: cpus=5 node=0 flags=0x0 nice=0 hung=293s workers=11 idle: 282 154 3452 3451 3448 3450 3455 3454
<6>[ 434.359989] pool 30: cpus=7 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3460 332
<6>[ 434.360539] pool 34: cpus=0-7 node=0 flags=0x4 nice=0 hung=0s workers=5 idle: 256 66
The relevant pwq is pwq 22. All three irqfd_shutdown workers are in-flight
but in D state. rcu_gp's process_srcu items are stuck pending.
Worker 51 (kworker/5:0) — blocked acquiring resampler_lock:
<6>[ 440.576612] task:kworker/5:0 state:D stack:0 pid:51 tgid:51 ppid:2 task_flags:0x4208060 flags:0x00080000
<6>[ 440.577379] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[ 440.578085] <TASK>
<6>[ 440.578337] preempt_schedule_irq+0x4a/0x90
<6>[ 440.583712] __mutex_lock+0x413/0xe40
<6>[ 440.583969] irqfd_resampler_shutdown+0x23/0x150
<6>[ 440.584288] irqfd_shutdown+0x66/0xc0
<6>[ 440.584546] process_scheduled_works+0x219/0x450
<6>[ 440.584864] worker_thread+0x2a7/0x3b0
<6>[ 440.585421] kthread+0x230/0x270
Worker 3449 (kworker/5:4) — same, blocked acquiring resampler_lock:
<6>[ 440.671294] task:kworker/5:4 state:D stack:0 pid:3449 tgid:3449 ppid:2 task_flags:0x4208060 flags:0x00080000
<6>[ 440.672088] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[ 440.672662] <TASK>
<6>[ 440.673069] schedule+0x5e/0xe0
<6>[ 440.673708] __mutex_lock+0x413/0xe40
<6>[ 440.674059] irqfd_resampler_shutdown+0x23/0x150
<6>[ 440.674381] irqfd_shutdown+0x66/0xc0
<6>[ 440.674638] process_scheduled_works+0x219/0x450
<6>[ 440.674956] worker_thread+0x2a7/0x3b0
<6>[ 440.675308] kthread+0x230/0x270
Worker 3453 (kworker/5:8) — holds resampler_lock, blocked waiting for SRCU GP:
<6>[ 440.677368] task:kworker/5:8 state:D stack:0 pid:3453 tgid:3453 ppid:2 task_flags:0x4208060 flags:0x00080000
<6>[ 440.678185] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[ 440.678720] <TASK>
<6>[ 440.679127] schedule+0x5e/0xe0
<6>[ 440.679354] schedule_timeout+0x2e/0x130
<6>[ 440.680084] wait_for_common+0xf7/0x1f0
<6>[ 440.680355] synchronize_srcu_expedited+0x109/0x140
<6>[ 440.681164] irqfd_resampler_shutdown+0xf0/0x150
<6>[ 440.681481] irqfd_shutdown+0x66/0xc0
<6>[ 440.681738] process_scheduled_works+0x219/0x450
<6>[ 440.682055] worker_thread+0x2a7/0x3b0
<6>[ 440.682403] kthread+0x230/0x270
The sequence is: worker 3453 acquires resampler_lock, and calls
synchronize_srcu_expedited() while holding the lock. This queues
process_srcu on rcu_gp, then blocks waiting for the GP to complete.
Workers 51 and 3449 are blocked trying to acquire the same resampler_lock.
Regarding your dispatch question: all three workers are in D state, so
they have all called schedule() and wq_worker_sleeping() should have
decremented pool->nr_running to zero. With nr_running == 0 and
process_srcu in the worklist, needs_more_worker() should be true and an
idle worker should be woken via kick_pool() when process_srcu is enqueued.
Why none of the 8 idle workers end up dispatching process_srcu is not
entirely clear to me.
Moving the synchronize_srcu_expedited() outside the lock does solve this
issue, but it is not exactly clear to me why the deadlock between
irqfd-shutdown workers causes the workqueue to stall.
The full dmesg is at: https://gist.github.com/vineethrp/883db560a4503612448db9b10e02a9b5
Hi Paul,
> Just to be clear, I am guessing that you have the workqueues counterpart
> to a fork bomb. However, if you are using a small finite number of
> workqueue handlers, then we need to make adjustments in SRCU, workqueues,
> or maybe SRCU's use of workqueues.
In this log, I am not seeing a workqueue being stressed out. There are
8 idle workers, but for some reason no worker is assigned to run process_srcu.
Not sure if it's a workqueue-related race condition or if it is working as
intended by not kicking new workers while there are in-flight workers in D state.
> SRCU and RCU use their own workqueue, which no one else should be
> using (and that prohibition most definitely includes the irqfd workers).
kvm-irqfd-cleanup and rcu_gp, while separate workqueues, share the
same per-CPU worker pool (pool 22). Both are CPU-bound: rcu_gp has
flags=0x108, which does not include WQ_UNBOUND (1 << 1), and its pwq
for CPU 5 resolves to the same per-CPU pool (pool 22, flags=0x0) as
kvm-irqfd-cleanup (flags=0x0).
I think CPU-bound workqueues share the per-CPU pools regardless of being
separate workqueues, so these two workqueues end up competing for the
same underlying pool's workers.
Making kvm-irqfd-cleanup unbound (WQ_UNBOUND) would place it on a
separate pool from rcu_gp, which should prevent this interference and,
I suspect, fix the stall.
Thanks,
Vineeth
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
2026-04-17 1:18 ` Vineeth Pillai
@ 2026-04-19 3:03 ` Vineeth Remanan Pillai
0 siblings, 0 replies; 11+ messages in thread
From: Vineeth Remanan Pillai @ 2026-04-19 3:03 UTC (permalink / raw)
To: kunwu.chan, paulmck, Tejun Heo
Cc: dmaluka, kvm, linux-kernel, pbonzini, rcu, seanjc, sonam.sanju,
sonam.sanju, stable
On Thu, Apr 16, 2026 at 9:18 PM Vineeth Pillai <vineeth@bitbyteword.org> wrote:
>
> Consolidating replies into one thread.
>
> Hi Kunwu,
>
> > One thing that is still unclear is dispatch behavior:
> > `process_srcu` stays pending for a long time, while the same pwq dump shows idle workers.
> >
> > So the key question is: what prevents pending work from being dispatched on that pwq?
> > Is it due to:
> > 1) pwq stalled/hung state,
> > 2) worker availability/affinity constraints,
> > 3) or another dispatch-side condition?
> >
> > Also, for scope:
> > - your crash instances consistently show the shutdown path
> > (irqfd_resampler_shutdown + synchronize_srcu),
> > - while assign-path evidence, per current thread data, appears to come
> > from a separate stress case.
>
> > A time-aligned dump with pwq state, pending/in-flight lists, and worker states
> > should help clarify this.
>
> I have a dmesg log showing this issue. This is from an automated stress
> reboot test. The log is very similar to what Sonam shared.
>
> <0>[ 434.338427] BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 293s!
> <6>[ 434.339037] Showing busy workqueues and worker pools:
> <6>[ 434.339387] workqueue events: flags=0x100
> ...
> <6>[ 434.350853] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
> <6>[ 434.350857] pending: 3*process_srcu
> ...
> <6>[ 434.356980] workqueue kvm-irqfd-cleanup: flags=0x0
> <6>[ 434.357582] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
> <6>[ 434.357586] in-flight: 51:irqfd_shutdown ,3453:irqfd_shutdown ,3449:irqfd_shutdown
>
> The relevant pwq is pwq 22. All three irqfd_shutdown workers are in-flight
> but in D state. rcu_gp's process_srcu items are stuck pending.
>
> Worker 51 (kworker/5:0) — blocked acquiring resampler_lock:
> <6>[ 440.576612] task:kworker/5:0 state:D stack:0 pid:51 tgid:51 ppid:2 task_flags:0x4208060 flags:0x00080000
> <6>[ 440.577379] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> <6>[ 440.578085] <TASK>
> <6>[ 440.578337] preempt_schedule_irq+0x4a/0x90
> <6>[ 440.583712] __mutex_lock+0x413/0xe40
> <6>[ 440.583969] irqfd_resampler_shutdown+0x23/0x150
> <6>[ 440.584288] irqfd_shutdown+0x66/0xc0
> <6>[ 440.584546] process_scheduled_works+0x219/0x450
> <6>[ 440.584864] worker_thread+0x2a7/0x3b0
> <6>[ 440.585421] kthread+0x230/0x270
>
> Worker 3449 (kworker/5:4) — same, blocked acquiring resampler_lock:
> <6>[ 440.671294] task:kworker/5:4 state:D stack:0 pid:3449 tgid:3449 ppid:2 task_flags:0x4208060 flags:0x00080000
> <6>[ 440.672088] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> <6>[ 440.672662] <TASK>
> <6>[ 440.673069] schedule+0x5e/0xe0
> <6>[ 440.673708] __mutex_lock+0x413/0xe40
> <6>[ 440.674059] irqfd_resampler_shutdown+0x23/0x150
> <6>[ 440.674381] irqfd_shutdown+0x66/0xc0
> <6>[ 440.674638] process_scheduled_works+0x219/0x450
> <6>[ 440.674956] worker_thread+0x2a7/0x3b0
> <6>[ 440.675308] kthread+0x230/0x270
>
> Worker 3453 (kworker/5:8) — holds resampler_lock, blocked waiting for SRCU GP:
> <6>[ 440.677368] task:kworker/5:8 state:D stack:0 pid:3453 tgid:3453 ppid:2 task_flags:0x4208060 flags:0x00080000
> <6>[ 440.678185] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> <6>[ 440.678720] <TASK>
> <6>[ 440.679127] schedule+0x5e/0xe0
> <6>[ 440.679354] schedule_timeout+0x2e/0x130
> <6>[ 440.680084] wait_for_common+0xf7/0x1f0
> <6>[ 440.680355] synchronize_srcu_expedited+0x109/0x140
> <6>[ 440.681164] irqfd_resampler_shutdown+0xf0/0x150
> <6>[ 440.681481] irqfd_shutdown+0x66/0xc0
> <6>[ 440.681738] process_scheduled_works+0x219/0x450
> <6>[ 440.682055] worker_thread+0x2a7/0x3b0
> <6>[ 440.682403] kthread+0x230/0x270
>
> The sequence is: worker 3453 acquires resampler_lock, and calls
> synchronize_srcu_expedited() while holding the lock. This queues
> process_srcu on rcu_gp, then blocks waiting for the GP to complete.
> Workers 51 and 3449 are blocked trying to acquire the same resampler_lock.
>
> Regarding your dispatch question: all three workers are in D state, so
> they have all called schedule() and wq_worker_sleeping() should have
> decremented pool->nr_running to zero. With nr_running == 0 and
> process_srcu in the worklist, needs_more_worker() should be true and an
> idle worker should be woken via kick_pool() when process_srcu is enqueued.
> Why none of the 8 idle workers end up dispatching process_srcu is not
> entirely clear to me.
>
> Moving the synchronize_srcu_expedited() outside the lock does solve this
> issue, but it is not exactly clear to me why the deadlock between
> irqfd-shutdown workers causes the workqueue to stall.
>
I think I know what is happening now. After adding some more debug
prints, I see that worker->sleeping is 0 for one of the workers
waiting for the mutex (pid 51 in the example above), and
pool->nr_running is 1. This prevents the pool from dispatching idle
workers.
This time I got a more descriptive stack trace as well:
<6>[18433.604285][T10987] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[18433.611204][T10987] Call Trace:
<6>[18433.615001][T10987] <TASK>
<6>[18433.618414][T10987] __schedule+0x8cf/0xdb0
<6>[18433.623372][T10987] preempt_schedule_irq+0x4a/0x90
<6>[18433.629112][T10987] asm_sysvec_reschedule_ipi+0x1a/0x20
<6>[18433.635340][T10987] RIP: 0010:kthread_data+0x15/0x30
<6>[18433.715343][T10987] wq_worker_sleeping+0xc/0x90
<6>[18433.720806][T10987] schedule+0x30/0xe0
<6>[18433.725379][T10987] schedule_preempt_disabled+0x10/0x20
<6>[18433.731604][T10987] __mutex_lock+0x413/0xe40
<6>[18433.736763][T10987] irqfd_resampler_shutdown+0x23/0x150
<6>[18433.742989][T10987] irqfd_shutdown+0x66/0xc0
<6>[18433.748145][T10987] process_scheduled_works+0x219/0x450
<6>[18433.754370][T10987] worker_thread+0x30b/0x450
<6>[18433.765460][T10987] kthread+0x227/0x2a0
<6>[18433.775383][T10987] ret_from_fork+0xfe/0x1b0
If I am reading the stack correctly, an IPI was serviced while in
wq_worker_sleeping() (which is responsible for setting
worker->sleeping and decrementing nr_running). The task was
interrupted before it could update nr_running and sleeping. After the
IPI was serviced, preempt_schedule_irq() was called and then
__schedule(), which schedules out the task before it could decrement
nr_running. The task is never woken up because the mutex holder is
waiting for the GP to complete, but process_srcu cannot proceed
because the workqueue pool does not kick idle workers while
nr_running is 1. The result is an effective deadlock.
So, basically, what happens is (based on the above example):
- The SRCU GP worker and the irqfd workers (3453, 51) are on the same
per-CPU pool.
- Worker 3453 acquires resampler_lock and calls
synchronize_srcu_expedited() while holding the lock.
- Worker 51 waits on the lock, but is unable to update the critical
workqueue counters (nr_running and worker->sleeping) before it
schedules out.
- The workqueue pool is stalled, thereby preventing SRCU GP progress.
This also explains why the issue is not seen when
synchronize_srcu_expedited() is called outside the lock.
Going directly to __schedule() after servicing the IPI is the main
problem, as wq_worker_sleeping() could not complete. Without the IPI
in the picture, the schedule-out path would be:
__mutex_lock
schedule()
sched_submit_work()
wq_worker_sleeping()
__schedule_loop()
__schedule()
With the IPI in the picture, it is:
__mutex_lock
schedule()
sched_submit_work()
wq_worker_sleeping() <-- interrupted halfway through
IPI
preempt_schedule_irq()
__schedule()
Moving `sched_submit_work()` into __schedule() might solve this issue,
but I'm not sure whether it would cause other problems. Adding Tejun
for an expert opinion on the workqueue side :-)
Thanks,
Vineeth
* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
2026-04-06 14:20 ` Kunwu Chan
2026-04-17 1:18 ` Vineeth Pillai
@ 2026-04-21 5:12 ` Sonam Sanju
1 sibling, 0 replies; 11+ messages in thread
From: Sonam Sanju @ 2026-04-21 5:12 UTC (permalink / raw)
To: kunwu.chan
Cc: dmaluka, kvm, linux-kernel, paulmck, pbonzini, rcu, seanjc,
sonam.sanju, stable, vineeth
> Could you provide a time-aligned dump that includes:
> - pwq state (active/pending/in-flight)
> - pending and in-flight work items with their queue/start times
> - worker task states
Below are time-aligned extracts from both instances. Full logs are
included further down in this email.
=== Instance 1: kernel 6.18.8, pool 14 (cpus=3) ===
--- t=62s: First workqueue lockup dump (pool stuck 49s, since ~t=13s) ---
kvm-irqfd-cleanup: pwq 14: active=4 refcnt=5
in-flight: 157:irqfd_shutdown ,4044:irqfd_shutdown ,
102:irqfd_shutdown ,39:irqfd_shutdown
rcu_gp: pwq 14: active=2 refcnt=3
pending: 2*process_srcu
events: pwq 14: active=43 refcnt=44
pending: binder_deferred_func, kernfs_notify_workfn,
delayed_vfree_work, 5*destroy_super_work,
3*bpf_prog_free_deferred, 10*destroy_super_work, ...
mm_percpu_wq: pwq 14: active=2 refcnt=4
pending: vmstat_update, lru_add_drain_per_cpu
pm: pwq 14: active=1 refcnt=2
pending: pm_runtime_work
pool 14: cpus=3 flags=0x0 hung=49s workers=11
idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
Active busy worker backtrace (pid 102):
__schedule → schedule → schedule_preempt_disabled →
__mutex_lock → irqfd_resampler_shutdown+0x23 →
irqfd_shutdown → process_scheduled_works → worker_thread
--- t=312s: Last workqueue lockup dump (pool stuck 298s) ---
kvm-irqfd-cleanup: pwq 14: active=4 (same 4 in-flight)
rcu_gp: pwq 14: pending: 2*process_srcu (still pending, 250s later)
events: pwq 14: active=43 (same, no progress)
pool 14: hung=298s workers=11 idle: 4046 4038 4045 4039 4043 156 77
--- t=314s: Hung task dump ---
Worker 4044 (MUTEX HOLDER):
task:kworker/3:8 state:D pid:4044
Workqueue: kvm-irqfd-cleanup irqfd_shutdown
__synchronize_srcu+0x100/0x130
irqfd_resampler_shutdown+0xf0/0x150 ← synchronize_srcu call
Worker 157 (MUTEX WAITER):
task:kworker/3:4 state:D pid:157
__mutex_lock+0x409/0xd90
irqfd_resampler_shutdown+0x23/0x150 ← mutex_lock call
(Workers 39 and 102 show identical mutex_lock stacks)
=== Instance 2: kernel 6.18.2, pool 22 (cpus=5) ===
--- t=93s: First workqueue lockup dump (pool stuck 79s, since ~t=14s) ---
kvm-irqfd-cleanup: pwq 22: active=4 refcnt=5
in-flight: 151:irqfd_shutdown ,4246:irqfd_shutdown ,
4241:irqfd_shutdown ,4243:irqfd_shutdown
rcu_gp: pwq 22: active=1 refcnt=2
pending: process_srcu
events: pwq 22: active=56 refcnt=57
pending: kernfs_notify_workfn, delayed_vfree_work,
binder_deferred_func, 47*destroy_super_work, ...
pool 22: cpus=5 flags=0x0 hung=79s workers=12
idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)
--- t=341s: Last workqueue lockup dump (pool stuck 327s) ---
kvm-irqfd-cleanup: pwq 22: active=4 (same)
rcu_gp: pwq 22: pending: process_srcu (still pending, 248s later)
events: pwq 22: active=56 (56 pending items, zero progress)
pool 22: hung=327s workers=12 idle: same 8 workers
--- t=343s: Hung task dump ---
Worker 4241 (MUTEX HOLDER):
task:kworker/5:4 state:D pid:4241
Workqueue: kvm-irqfd-cleanup irqfd_shutdown
__synchronize_srcu+0x100/0x130
irqfd_resampler_shutdown+0xf0/0x150
Worker 4243 (MUTEX WAITER):
task:kworker/5:6 state:D pid:4243
__mutex_lock+0x37d/0xbb0
irqfd_resampler_shutdown+0x23/0x150
(Workers 151 and 4246 show identical mutex_lock stacks)
> Please post sanitized ramoops/dmesg logs on-list so others can
> validate.
Full logs: https://gist.github.com/sonam-sanju/773855aa2cbe156ca19f3a87bbebc15e
Thanks,
Sonam
* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
2026-04-21 18:22 [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race Tejun Heo
@ 2026-04-23 9:01 ` Sonam Sanju
2026-04-23 13:25 ` Vineeth Remanan Pillai
0 siblings, 1 reply; 11+ messages in thread
From: Sonam Sanju @ 2026-04-23 9:01 UTC (permalink / raw)
To: tj
Cc: dmaluka, kunwu.chan, kvm, linux-kernel, paulmck, pbonzini, rcu,
seanjc, sonam.sanju, stable, vineeth
Hello Tejun,
Thank you for the detailed analysis.
On Wed, Apr 23, 2026, Tejun Heo wrote:
> The problem with this theory is that this kworker, while preempted, is still
> runnable and should be dispatched to its CPU once it becomes available
> again. Workqueue doesn't care whether the task gets preempted or when it
> gets the CPU back. It only cares about whether the task enters blocking
> state (!runnable). A task which is preempted, even on the way to blocking,
> still is runnable and should get put back on the CPU by the scheduler.
>
> If you can take a crashdump of the deadlocked state, can you see whether the
> task is still on the scheduler's runqueue?
I instrumented show_one_worker_pool() to dump scheduler state for each busy worker
when the pool has been hung for >30 seconds.
All workers show on_rq=0.
== Pool state ==
pool 2: cpus=0 node=0 flags=0x0 nice=0 hung=47s
workers=13 nr_running=1 nr_idle=7
== Per-worker scheduler state (first dump at t=62.5s) ==
PID | state | on_rq | se.on_rq | sched_delayed | sleeping | blocked_on
-----|-------|-------|----------|---------------|----------|-------------------
4819 | 0x2 | 0 | 0 | 0 | 1 | ffff953608205210 type=1
4823 | 0x2 | 0 | 0 | 0 | 1 | ffff953608205210 type=1
4818 | 0x2 | 0 | 0 | 0 | 0 | ffff953608205210 type=1
11 | 0x2 | 0 | 0 | 0 | 1 | ffff953608205210 type=1
9 | 0x2 | 0 | 0 | 0 | 1 | ffff953608205210 type=1
4814 | 0x2 | 0 | 0 | 0 | 1 | (mutex holder)
All 6 workers are in kvm-irqfd-cleanup, calling irqfd_shutdown →
irqfd_resampler_shutdown. They contend on the same resampler->lock
mutex (ffff953608205210).
Full logs: https://gist.github.com/sonam-sanju/08042878542b7a58d2818e6076554211
Thanks,
Sonam
* Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
2026-04-23 9:01 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sonam Sanju
@ 2026-04-23 13:25 ` Vineeth Remanan Pillai
0 siblings, 0 replies; 11+ messages in thread
From: Vineeth Remanan Pillai @ 2026-04-23 13:25 UTC (permalink / raw)
To: Sonam Sanju
Cc: tj, dmaluka, kunwu.chan, kvm, linux-kernel, paulmck, pbonzini,
rcu, seanjc, stable
On Thu, Apr 23, 2026 at 5:05 AM Sonam Sanju <sonam.sanju@intel.com> wrote:
>
> Hello Tejun,
>
> Thank you for the detailed analysis.
>
> On Wed, Apr 23, 2026, Tejun Heo wrote:
> > The problem with this theory is that this kworker, while preempted, is still
> > runnable and should be dispatched to its CPU once it becomes available
> > again. Workqueue doesn't care whether the task gets preempted or when it
> > gets the CPU back. It only cares about whether the task enters blocking
> > state (!runnable). A task which is preempted, even on the way to blocking,
> > still is runnable and should get put back on the CPU by the scheduler.
> >
> > If you can take a crashdump of the deadlocked state, can you see whether the
> > task is still on the scheduler's runqueue?
>
> I instrumented show_one_worker_pool() to dump scheduler state for each busy worker
> when the pool has been hung for >30 seconds.
>
> All workers show on_rq=0.
>
> == Pool state ==
>
> pool 2: cpus=0 node=0 flags=0x0 nice=0 hung=47s
> workers=13 nr_running=1 nr_idle=7
>
> == Per-worker scheduler state (first dump at t=62.5s) ==
>
> PID | state | on_rq | se.on_rq | sched_delayed | sleeping | blocked_on
> -----|-------|-------|----------|---------------|----------|-------------------
> 4819 | 0x2 | 0 | 0 | 0 | 1 | ffff953608205210 type=1
> 4823 | 0x2 | 0 | 0 | 0 | 1 | ffff953608205210 type=1
> 4818 | 0x2 | 0 | 0 | 0 | 0 | ffff953608205210 type=1
> 11 | 0x2 | 0 | 0 | 0 | 1 | ffff953608205210 type=1
> 9 | 0x2 | 0 | 0 | 0 | 1 | ffff953608205210 type=1
> 4814 | 0x2 | 0 | 0 | 0 | 1 | (mutex holder)
>
>
> All 6 workers are in kvm-irqfd-cleanup, calling irqfd_shutdown →
> irqfd_resampler_shutdown. They contend on the same resampler->lock
> mutex (ffff953608205210).
>
Sorry for the late disclosure; I was running the 6.18 Android kernel
and missed this relevant detail because the bug discussion initially
started with KVM, and I had verified that the irqfd-related code was
the same as in the vanilla kernel. Now, after going through Tejun's
response and reviewing the __schedule() code regarding SM_PREEMPT, I
realize the Android kernel has extra logic related to proxy execution
that might be triggering this issue. I tested on the vanilla 6.18.23
kernel and was not able to reproduce this.
Sonam, just checking: are you able to reproduce this issue with the
vanilla 6.18 kernel?
Thanks,
Vineeth
end of thread, other threads:[~2026-04-23 13:25 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20260323053353.805336-1-sonam.sanju@intel.com>
[not found] ` <20260323064248.1660757-1-sonam.sanju@intel.com>
2026-03-31 18:17 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sean Christopherson
2026-03-31 20:51 ` Paul E. McKenney
2026-04-01 9:47 ` Sonam Sanju
2026-04-06 23:09 ` Paul E. McKenney
[not found] <5194cf52-f8a8-4479-a95e-233104272839@linux.dev>
2026-04-01 14:24 ` Sonam Sanju
2026-04-06 14:20 ` Kunwu Chan
2026-04-17 1:18 ` Vineeth Pillai
2026-04-19 3:03 ` Vineeth Remanan Pillai
2026-04-21 5:12 ` Sonam Sanju
2026-04-21 18:22 [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race Tejun Heo
2026-04-23 9:01 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sonam Sanju
2026-04-23 13:25 ` Vineeth Remanan Pillai
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox