From: "Kunwu Chan" <kunwu.chan@linux.dev>
To: "Sonam Sanju" <sonam.sanju@intel.corp-partner.google.com>,
"Sean Christopherson" <seanjc@google.com>,
"Paul E . McKenney" <paulmck@kernel.org>
Cc: "Paolo Bonzini" <pbonzini@redhat.com>,
"Vineeth Pillai" <vineeth@bitbyteword.org>,
"Dmitry Maluka" <dmaluka@chromium.org>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
stable@vger.kernel.org, rcu@vger.kernel.org,
"Sonam Sanju" <sonam.sanju@intel.com>
Subject: Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Date: Mon, 06 Apr 2026 14:20:56 +0000 [thread overview]
Message-ID: <87add1dc9bb95dc50bc20ce5c8fbfe2999185dd3@linux.dev> (raw)
In-Reply-To: <20260401142456.2632730-1-sonam.sanju@intel.corp-partner.google.com>
April 1, 2026 at 10:24 PM, "Sonam Sanju" <sonam.sanju@intel.corp-partner.google.com> wrote:
>
> From: Sonam Sanju <sonam.sanju@intel.com>
>
> On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
>
> >
> > Building on the discussion so far, it would be helpful from the SRCU
> > side to gather a bit more evidence to classify the issue.
> >
> > Calling synchronize_srcu_expedited() while holding a mutex is generally
> > valid, so the observed behavior may be workload-dependent.
> >
> > The reported deadlock seems to rely on the assumption that SRCU grace
> > period progress is indirectly blocked by irqfd workqueue saturation.
> > It would be good to confirm whether that assumption actually holds.
> >
> I went back through our logs from two independent crash instances and
> can now provide data for each of your questions.
>
> >
> > 1) Are SRCU GP kthreads/workers still making forward progress when
> > the system is stuck?
> >
> No. In both crash instances, process_srcu work items remain permanently
> "pending" (never "in-flight") throughout the entire hang.
>
> Instance 1 — kernel 6.18.8, pool 14 (cpus=3):
>
> [ 62.712760] workqueue rcu_gp: flags=0x108
> [ 62.717801] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
> [ 62.717801] pending: 2*process_srcu
>
> [ 187.735092] workqueue rcu_gp: flags=0x108 (125 seconds later)
> [ 187.735093] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
> [ 187.735093] pending: 2*process_srcu (still pending)
>
> 9 consecutive dumps from t=62s to t=312s — process_srcu never runs.
>
> Instance 2 — kernel 6.18.2, pool 22 (cpus=5):
>
> [ 93.280711] workqueue rcu_gp: flags=0x108
> [ 93.280713] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
> [ 93.280716] pending: process_srcu
>
> [ 309.040801] workqueue rcu_gp: flags=0x108 (216 seconds later)
> [ 309.040806] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
> [ 309.040806] pending: process_srcu (still pending)
>
> 8 consecutive dumps from t=93s to t=341s — process_srcu never runs.
>
> In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
> where the kvm-irqfd-cleanup workers are blocked. Both pools have idle
> workers but are marked as hung/stalled:
>
> Instance 1: pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
> Instance 2: pool 22: cpus=5 hung=297s workers=12 idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)
>
> >
> > 2) How many irqfd workers are active in the reported scenario, and
> > can they saturate CPU or worker pools?
> >
> 4 kvm-irqfd-cleanup workers in both instances, consistently across all
> dumps:
>
> Instance 1 (pool 14 / cpus=3):
>
> [ 62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
> [ 62.837838] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
> [ 62.837838] in-flight: 157:irqfd_shutdown ,4044:irqfd_shutdown ,
> 102:irqfd_shutdown ,39:irqfd_shutdown
>
> Instance 2 (pool 22 / cpus=5):
>
> [ 93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
> [ 93.280896] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
> [ 93.280900] in-flight: 151:irqfd_shutdown ,4246:irqfd_shutdown ,
> 4241:irqfd_shutdown ,4243:irqfd_shutdown
>
> These are from crosvm instances with multiple virtio devices
> (virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
> with a resampler. During VM shutdown, all irqfds are detached
> concurrently, queueing that many irqfd_shutdown work items.
>
> The 4 workers are not saturating CPU — they're all in D state. But they
> ARE all bound to the same per-CPU pool as rcu_gp's process_srcu work.
>
> >
> > 3) Do we have a concrete wait-for cycle showing that tasks blocked
> > on resampler_lock are in turn preventing SRCU GP completion?
> >
> Yes, in both instances the hung task dump identifies the mutex holder
> stuck in synchronize_srcu, with the other workers waiting on the mutex.
>
> Instance 1 (t=314s):
>
> Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:
>
> [ 315.963979] task:kworker/3:8 state:D pid:4044
> [ 315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> [ 316.012504] __synchronize_srcu+0x100/0x130
> [ 316.023157] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)
>
> Workers pid 39, 102, 157 — MUTEX WAITERS:
>
> [ 314.793025] task:kworker/3:4 state:D pid:157
> [ 314.837472] __mutex_lock+0x409/0xd90
> [ 314.843100] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)
>
> Instance 2 (t=343s):
>
> Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:
>
> [ 343.193294] task:kworker/5:4 state:D pid:4241
> [ 343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> [ 343.193328] __synchronize_srcu+0x100/0x130
> [ 343.193335] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)
>
> Workers pid 151, 4243, 4246 — MUTEX WAITERS:
>
> [ 343.193369] task:kworker/5:6 state:D pid:4243
> [ 343.193397] __mutex_lock+0x37d/0xbb0
> [ 343.193397] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)
>
> Both instances show the identical wait-for cycle:
>
> 1. One worker holds resampler_lock, blocks in __synchronize_srcu
> (waiting for SRCU grace period)
> 2. SRCU GP needs process_srcu to run — but it stays "pending"
> on the same pool
> 3. Other irqfd workers block on __mutex_lock in the same pool
> 4. The pool is marked "hung" and no pending work makes progress
> for 250-300 seconds until kernel panic
>
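Just to make sure we are describing the same shape of cycle, here is a toy userspace model of steps 1-4 above (plain Python threading, nothing kernel-specific; the names resampler_lock, process_srcu and irqfd_shutdown are only illustrative stand-ins for the real kernel objects):

```python
import queue
import threading

# Toy model of the reported cycle: ONE worker stands in for the shared
# per-CPU pool, a mutex stands in for resampler_lock, and an event
# stands in for SRCU grace-period completion.
work = queue.Queue()                 # FIFO dispatch, single worker
resampler_lock = threading.Lock()
gp_done = threading.Event()
result = []

def worker():
    while True:
        item = work.get()
        if item is None:             # sentinel: stop the worker
            break
        item()

def process_srcu():                  # the grace-period work item
    gp_done.set()

def irqfd_shutdown():                # caricature of irqfd_resampler_shutdown()
    with resampler_lock:
        # "synchronize_srcu": queue the GP work, then wait for it.
        # It lands on the same single-worker pool we are currently
        # occupying, so it cannot be dispatched while we block here.
        work.put(process_srcu)
        ok = gp_done.wait(timeout=1.0)
        result.append("gp completed" if ok else "gp work still pending")

t = threading.Thread(target=worker)
t.start()
work.put(irqfd_shutdown)
work.put(None)
t.join()
print(result[0])
```

In the toy, the worker times out after one second instead of hanging forever, but the dispatch failure is the same: the grace-period work stays permanently pending behind the very task that is waiting on it.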
> >
> > 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> > and kvm_irqfd_assign() paths?
> >
> In our 4 crash instances the stuck mutex holder is always in
> irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu). This
> is consistent — these are all VM shutdown scenarios where only
> irqfd_shutdown workqueue items run.
>
> The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
> during a VM create/destroy stress test where assign and shutdown race.
> His traces showed kvm_irqfd (the assign path) stuck in
> synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
> the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.
>
> >
> > If SRCU GP remains independent, it would help distinguish whether
> > this is a strict deadlock or a form of workqueue starvation / lock
> > contention.
> >
> Based on the data from both instances, SRCU GP is NOT remaining
> independent. process_srcu stays permanently pending on the affected
> per-CPU pool for 250-300 seconds. But it's not just process_srcu —
> ALL pending work on the pool is stuck, including items from events,
> cgroup, mm, slub, and other workqueues.
>
> >
> > A timestamp-correlated dump (blocked stacks + workqueue state +
> > SRCU GP activity) would likely be sufficient to classify this.
> >
> I hope the correlated dumps above from both instances are helpful.
> To summarize the timeline (consistent across both):
>
> t=0: VM shutdown begins, crosvm detaches irqfds
> t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
> One worker acquires resampler_lock, enters synchronize_srcu
> Other 3 workers block on __mutex_lock
> t=~43: First "BUG: workqueue lockup" — pool detected stuck
> rcu_gp: process_srcu shown as "pending" on same pool
> t=~93 to t=~312: repeated dumps every ~30s
> process_srcu remains permanently "pending"
> Pool has idle workers but no pending work executes
> t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
> t=~316: init triggers sysrq crash → kernel panic
>
Thanks, this is useful and much clearer.
One thing that is still unclear is dispatch behavior:
`process_srcu` stays pending for a long time, while the same pwq dump shows idle workers.
So the key question is: what prevents pending work from being dispatched on that pwq?
Is it due to:
1) pwq stalled/hung state,
2) worker availability/affinity constraints,
3) or another dispatch-side condition?
Also, for scope:
- your crash instances consistently show the shutdown path
(irqfd_resampler_shutdown + synchronize_srcu),
- while the assign-path evidence, based on the data in this thread, appears
to come from a separate stress test.
A time-aligned dump with pwq state, pending/in-flight lists, and worker states
should help clarify this.
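For the time-aligned capture, a sysrq-driven snapshot is usually enough (this assumes root and CONFIG_MAGIC_SYSRQ=y; on recent kernels the sysrq-t dump also prints busy workqueues and worker pools, which gives you the pwq pending/in-flight lists alongside the task stacks):

```shell
# Enable all sysrq functions, then take one time-aligned snapshot.
echo 1 > /proc/sys/kernel/sysrq
dmesg --clear
echo w > /proc/sysrq-trigger    # stacks of blocked (D-state) tasks
echo t > /proc/sysrq-trigger    # all task states + workqueue/pool dump
dmesg -T > snapshot-$(date +%s).txt
```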
> >
> > Happy to help look at traces if available.
> >
> I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
> instances. Shall I post them or send them off-list?
>
If possible, please post sanitized ramoops/dmesg logs on-list so others can validate.
Thanx, Kunwu
> Thanks,
> Sonam
>
Thread overview: 8+ messages
[not found] <20260323053353.805336-1-sonam.sanju@intel.com>
2026-03-23 6:42 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sonam Sanju
2026-03-31 18:17 ` Sean Christopherson
2026-03-31 20:51 ` Paul E. McKenney
2026-04-01 9:47 ` Sonam Sanju
2026-04-06 23:09 ` Paul E. McKenney
2026-04-01 9:34 ` Kunwu Chan
2026-04-01 14:24 ` Sonam Sanju
2026-04-06 14:20 ` Kunwu Chan [this message]