Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock


From: "Kunwu Chan" <kunwu.chan@linux.dev>
To: "Sonam Sanju" <sonam.sanju@intel.corp-partner.google.com>,
	"Sean Christopherson" <seanjc@google.com>,
	"Paul E . McKenney" <paulmck@kernel.org>
Cc: "Paolo Bonzini" <pbonzini@redhat.com>,
	"Vineeth Pillai" <vineeth@bitbyteword.org>,
	"Dmitry Maluka" <dmaluka@chromium.org>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org, rcu@vger.kernel.org,
	"Sonam Sanju" <sonam.sanju@intel.com>
Subject: Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Date: Mon, 06 Apr 2026 14:20:56 +0000
Message-ID: <87add1dc9bb95dc50bc20ce5c8fbfe2999185dd3@linux.dev>
In-Reply-To: <20260401142456.2632730-1-sonam.sanju@intel.corp-partner.google.com>

April 1, 2026 at 10:24 PM, "Sonam Sanju" <sonam.sanju@intel.corp-partner.google.com> wrote:


> 
> From: Sonam Sanju <sonam.sanju@intel.com>
> 
> On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> 
> > 
> > Building on the discussion so far, it would be helpful from the SRCU
> >  side to gather a bit more evidence to classify the issue.
> > 
> >  Calling synchronize_srcu_expedited() while holding a mutex is generally
> >  valid, so the observed behavior may be workload-dependent.
> > 
> >  The reported deadlock seems to rely on the assumption that SRCU grace
> >  period progress is indirectly blocked by irqfd workqueue saturation.
> >  It would be good to confirm whether that assumption actually holds.
> > 
> I went back through our logs from two independent crash instances and
> can now provide data for each of your questions.
> 
> > 
> > 1) Are SRCU GP kthreads/workers still making forward progress when
> >  the system is stuck?
> > 
> No. In both crash instances, process_srcu work items remain permanently
> "pending" (never "in-flight") throughout the entire hang.
> 
> Instance 1 — kernel 6.18.8, pool 14 (cpus=3):
> 
>  [ 62.712760] workqueue rcu_gp: flags=0x108
>  [ 62.717801] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
>  [ 62.717801] pending: 2*process_srcu
> 
>  [ 187.735092] workqueue rcu_gp: flags=0x108 (125 seconds later)
>  [ 187.735093] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
>  [ 187.735093] pending: 2*process_srcu (still pending)
> 
>  9 consecutive dumps from t=62s to t=312s — process_srcu never runs.
> 
> Instance 2 — kernel 6.18.2, pool 22 (cpus=5):
> 
>  [ 93.280711] workqueue rcu_gp: flags=0x108
>  [ 93.280713] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
>  [ 93.280716] pending: process_srcu
> 
>  [ 309.040801] workqueue rcu_gp: flags=0x108 (216 seconds later)
>  [ 309.040806] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
>  [ 309.040806] pending: process_srcu (still pending)
> 
>  8 consecutive dumps from t=93s to t=341s — process_srcu never runs.
> 
> In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
> where the kvm-irqfd-cleanup workers are blocked. Both pools have idle
> workers but are marked as hung/stalled:
> 
>  Instance 1: pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
>  Instance 2: pool 22: cpus=5 hung=297s workers=12 idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)
> 
> > 
> > 2) How many irqfd workers are active in the reported scenario, and
> >  can they saturate CPU or worker pools?
> > 
> 4 kvm-irqfd-cleanup workers in both instances, consistently across all
> dumps:
> 
> Instance 1 ( pool 14 / cpus=3):
> 
>  [ 62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
>  [ 62.837838] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
>  [ 62.837838] in-flight: 157:irqfd_shutdown ,4044:irqfd_shutdown ,
>  102:irqfd_shutdown ,39:irqfd_shutdown
> 
> Instance 2 ( pool 22 / cpus=5):
> 
>  [ 93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
>  [ 93.280896] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
>  [ 93.280900] in-flight: 151:irqfd_shutdown ,4246:irqfd_shutdown ,
>  4241:irqfd_shutdown ,4243:irqfd_shutdown
> 
> These are from crosvm instances with multiple virtio devices
> (virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
> with a resampler. During VM shutdown, all irqfds are detached
> concurrently, queueing that many irqfd_shutdown work items.
> 
> The 4 workers are not saturating CPU — they're all in D state. But they
> ARE all bound to the same per-CPU pool as rcu_gp's process_srcu work.
> 
> > 
> > 3) Do we have a concrete wait-for cycle showing that tasks blocked
> >  on resampler_lock are in turn preventing SRCU GP completion?
> > 
> Yes, in both instances the hung task dump identifies the mutex holder
> stuck in synchronize_srcu, with the other workers waiting on the mutex.
> 
> Instance 1 (t=314s):
> 
>  Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:
> 
>  [ 315.963979] task:kworker/3:8 state:D pid:4044
>  [ 315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
>  [ 316.012504] __synchronize_srcu+0x100/0x130
>  [ 316.023157] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)
> 
>  Workers pid 39, 102, 157 — MUTEX WAITERS:
> 
>  [ 314.793025] task:kworker/3:4 state:D pid:157
>  [ 314.837472] __mutex_lock+0x409/0xd90
>  [ 314.843100] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)
> 
> Instance 2 (t=343s):
> 
>  Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:
> 
>  [ 343.193294] task:kworker/5:4 state:D pid:4241
>  [ 343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
>  [ 343.193328] __synchronize_srcu+0x100/0x130
>  [ 343.193335] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)
> 
>  Workers pid 151, 4243, 4246 — MUTEX WAITERS:
> 
>  [ 343.193369] task:kworker/5:6 state:D pid:4243
>  [ 343.193397] __mutex_lock+0x37d/0xbb0
>  [ 343.193397] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)
> 
> Both instances show the identical wait-for cycle:
> 
>  1. One worker holds resampler_lock, blocks in __synchronize_srcu
>  (waiting for SRCU grace period)
>  2. SRCU GP needs process_srcu to run — but it stays "pending"
>  on the same pool
>  3. Other irqfd workers block on __mutex_lock in the same pool
>  4. The pool is marked "hung" and no pending work makes progress
>  for 250-300 seconds until kernel panic
> 
> > 
> > 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> >  and kvm_irqfd_assign() paths?
> > 
> In our 4 crash instances the stuck mutex holder is always in 
> irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu). This 
> is consistent — these are all VM shutdown scenarios where only 
> irqfd_shutdown workqueue items run.
> 
> The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
> during a VM create/destroy stress test where assign and shutdown race.
> His traces showed kvm_irqfd (the assign path) stuck in
> synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
> the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.
> 
> > 
> > If SRCU GP remains independent, it would help distinguish whether
> >  this is a strict deadlock or a form of workqueue starvation / lock
> >  contention.
> > 
> Based on the data from both instances, SRCU GP is NOT remaining
> independent. process_srcu stays permanently pending on the affected
> per-CPU pool for 250-300 seconds. But it's not just process_srcu —
> ALL pending work on the pool is stuck, including items from events,
> cgroup, mm, slub, and other workqueues.
> 
> > 
> > A timestamp-correlated dump (blocked stacks + workqueue state +
> >  SRCU GP activity) would likely be sufficient to classify this.
> > 
> I hope the correlated dumps above from both instances are helpful.
> To summarize the timeline (consistent across both):
> 
>  t=0: VM shutdown begins, crosvm detaches irqfds
>  t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
>  One worker acquires resampler_lock, enters synchronize_srcu
>  Other 3 workers block on __mutex_lock
>  t=~43: First "BUG: workqueue lockup" — pool detected stuck
>  rcu_gp: process_srcu shown as "pending" on same pool
>  t=~93 Through t=~312: Repeated dumps every ~30s
>  process_srcu remains permanently "pending"
>  Pool has idle workers but no pending work executes
>  t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
>  t=~316: init triggers sysrq crash → kernel panic
> 

Thanks, this is useful and much clearer.

One thing is still unclear, and that is the dispatch behavior:
`process_srcu` stays pending for 250-300 seconds, while the pool it is
queued on reports idle workers the whole time.

So the key question is: what prevents pending work from being dispatched
on that pool? Is it:
  1) the stalled/hung state of the pool itself,
  2) a worker availability or affinity constraint, or
  3) some other dispatch-side condition?
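
To make the question concrete, here is a minimal Python sketch (not a
kernel tool) that scans a workqueue dump for pools that report idle
workers while a pwq on the same pool still lists pending work. It assumes
the line shapes seen in the excerpts quoted above, and that the number
after "pwq" is the pool id. The function name and the SAMPLE text are
illustrative only.

```python
import re
from collections import defaultdict

# Assumption: these patterns match the line shapes quoted in this thread;
# other kernel versions may format the dump differently.
STAMP = re.compile(r"^\s*\[\s*\d+\.\d+\]\s*")
PWQ = re.compile(r"pwq (\d+):")
PENDING = re.compile(r"pending: (.+)")
POOL = re.compile(r"pool (\d+):.*?hung=(\d+)s workers=(\d+)(?: idle: ([\d ]+))?")


def pending_with_idle_workers(dump):
    """Return pools marked hung that have idle workers and pending work."""
    pending = defaultdict(list)   # pool id -> pending work items
    pools = {}                    # pool id -> hung time and idle count
    current = None                # pool id of the most recent pwq line
    for raw in dump.splitlines():
        line = STAMP.sub("", raw).strip()
        m = PWQ.match(line)
        if m:
            current = int(m.group(1))
            continue
        m = PENDING.match(line)
        if m and current is not None:
            pending[current] += [s.strip() for s in m.group(1).split(",") if s.strip()]
            continue
        m = POOL.match(line)
        if m:
            # Only pool lines carrying hung=...s are considered here.
            idle = len(m.group(4).split()) if m.group(4) else 0
            pools[int(m.group(1))] = {"hung_s": int(m.group(2)), "idle": idle}
    return {
        pid: {"pending": pending[pid], **info}
        for pid, info in pools.items()
        if info["idle"] > 0 and pending.get(pid)
    }


# Lines copied from the Instance 1 excerpt, with the annotations removed.
SAMPLE = """\
[ 62.712760] workqueue rcu_gp: flags=0x108
[ 62.717801] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
[ 62.717801] pending: 2*process_srcu
pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77
"""

result = pending_with_idle_workers(SAMPLE)
print(result)
```

Any pool that shows up in the result is one where the dump alone cannot
explain why the pending item was not picked up, which is the case that
needs a dispatch-side explanation.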

Also, on scope:
- your crash instances consistently show the shutdown path
  (irqfd_resampler_shutdown() blocked in synchronize_srcu),
- while the assign-path evidence in this thread so far comes from a
  separate create/destroy stress test.

A time-aligned dump with pwq state, pending/in-flight lists, and worker states
should help answer the dispatch question.
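
As a rough illustration of what "time-aligned" could mean here, the
Python sketch below walks a dmesg-style log and records, per
(workqueue, pool, work function), the first and last timestamp at which
the item was listed as pending. It is only a sketch under assumptions:
the regexes follow the excerpts quoted above, wrapped in-flight lists are
not handled, and all names in it are hypothetical.

```python
import re

# Assumption: line shapes follow the excerpts quoted in this thread.
LINE = re.compile(r"^\s*\[\s*(\d+\.\d+)\]\s*(.*)$")
WQ = re.compile(r"workqueue (\S+):")
PWQ = re.compile(r"pwq (\d+):")
PENDING = re.compile(r"pending: (.+)")


def pending_spans(log):
    """Map (workqueue, pool, function) -> (first_seen, last_seen) as pending."""
    spans = {}
    wq = pool = None
    for raw in log.splitlines():
        m = LINE.match(raw)
        if not m:
            continue  # lines without a timestamp are ignored
        ts, text = float(m.group(1)), m.group(2).strip()
        w = WQ.match(text)
        if w:
            wq, pool = w.group(1), None
            continue
        p = PWQ.match(text)
        if p:
            pool = int(p.group(1))
            continue
        q = PENDING.match(text)
        if q and wq is not None and pool is not None:
            for item in q.group(1).split(","):
                # Drop a leading repeat count such as "2*".
                fn = re.sub(r"^\d+\*", "", item.strip())
                if not fn:
                    continue
                key = (wq, pool, fn)
                first = spans.get(key, (ts, ts))[0]
                spans[key] = (first, ts)
    return spans


# Two dumps from the Instance 1 excerpt, with the annotations removed.
SAMPLE = """\
[ 62.712760] workqueue rcu_gp: flags=0x108
[ 62.717801] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
[ 62.717801] pending: 2*process_srcu
[ 187.735092] workqueue rcu_gp: flags=0x108
[ 187.735093] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
[ 187.735093] pending: 2*process_srcu
"""

spans = pending_spans(SAMPLE)
for key, (first, last) in spans.items():
    print(key, "pending for at least", round(last - first, 3), "s")
```

Note that an item seen as pending in two dumps may still have run and
been requeued in between, so a span from this sketch is only a hint.
Putting the spans next to the blocked-task stacks for the same pool would
give one view of how the pending work and the D-state workers line up in
time.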


> > 
> > Happy to help look at traces if available.
> > 
> I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
> instances. Shall I post them or send them off-list?
> 

If possible, please post sanitized ramoops/dmesg logs on-list so others can validate.

Thanx, Kunwu

> Thanks,
> Sonam
>
