From: Sonam Sanju <sonam.sanju@intel.com>
To: vineeth@bitbyteword.org
Cc: dmaluka@chromium.org, kunwu.chan@linux.dev, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, paulmck@kernel.org,
pbonzini@redhat.com, rcu@vger.kernel.org, seanjc@google.com,
sonam.sanju@intel.com, stable@vger.kernel.org, tj@kernel.org
Subject: Re: [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race
Date: Tue, 21 Apr 2026 22:24:55 +0530 [thread overview]
Message-ID: <20260421165455.2486211-1-sonam.sanju@intel.com> (raw)
In-Reply-To: <CAO7JXPjEtnsk9xer+_uSPQi9DBqCe0cSnfB=ePaKntoKv=N3tQ@mail.gmail.com>
Hi Vineeth, Kunwu, Tejun,
I've collected new crash logs with additional debug instrumentation in
wq_worker_sleeping(), kick_pool(), and show_one_worker_pool() to capture
pool state during the hang. The results conclusively confirm Vineeth's
preemption race theory.
From the new logs:
1. Pool dump with nr_running/nr_idle (added instrumentation):
pool 10: cpus=2 flags=0x0 hung=201s workers=11 nr_running=1 nr_idle=5
11 workers, 5 idle, 6 in D-state (all irqfd_shutdown) -- yet
nr_running=1. No worker is actually running on CPU 2.
2. NMI backtrace confirms CPU 2 is completely idle:
NMI backtrace for cpu 2 skipped: idling at intel_idle+0x57/0xa0
So nr_running=1 is a phantom count -- no worker is running, but
the pool thinks one is.
3. The first stuck worker (kworker/2:0, PID 33) shows the preemption
in wq_worker_sleeping:
kworker/2:0 state:D Workqueue: kvm-irqfd-cleanup irqfd_shutdown
__schedule+0x87a/0xd60
preempt_schedule_irq+0x4a/0x90
asm_fred_entrypoint_kernel+0x41/0x70
___ratelimit+0x1a1/0x1f0 <-- inside pr_info_ratelimited
wq_worker_sleeping+0x53/0x190 <-- preempted HERE
schedule+0x30/0xe0
schedule_preempt_disabled+0x10/0x20
__mutex_lock+0x413/0xe40
irqfd_resampler_shutdown+0x53/0x200
irqfd_shutdown+0xfa/0x190
This confirms the exact race: a reschedule IPI interrupted
wq_worker_sleeping() after worker->sleeping was set to 1 but
before pool->nr_running was decremented. When the worker was
scheduled back in, wq_worker_running() saw sleeping == 1 and
incremented nr_running (1 -> 2); the resumed decrement then
brought the count back to 1 instead of 0.
4. The second pool dump 31 seconds later shows the stall is permanent:
pool 10: cpus=2 flags=0x0 hung=232s workers=11 nr_running=1 nr_idle=5
Same phantom nr_running=1, hung time growing.
5. The deadlock chain:
- PID 33: holds resampler_lock mutex, stuck in wq_worker_sleeping
- PID 520: past mutex, stuck in synchronize_srcu_expedited
- PIDs 120, 4792, 4793, 4796: waiting on resampler_lock mutex
- crosvm_vcpu2: waiting in kvm_vm_release -> __flush_workqueue
- init (PID 1): stuck in pci_device_shutdown -> __flush_work
- Multiple userspace processes stuck in fsnotify_destroy_group
- Reboot thread timed out, system triggered sysrq crash
6. kick_pool_skip debug print fired for other pools but NOT for
pool 10 -- because need_more_worker() was never true (nr_running
was never 0), so kick_pool() was never even called for this pool.
As for a fix, one option is a workqueue-level change in
wq_worker_sleeping() itself:
 void wq_worker_sleeping(struct task_struct *task)
 {
 	...
 	if (READ_ONCE(worker->sleeping))
 		return;

+	preempt_disable();
 	WRITE_ONCE(worker->sleeping, 1);
 	raw_spin_lock_irq(&pool->lock);

 	if (worker->flags & WORKER_NOT_RUNNING) {
 		raw_spin_unlock_irq(&pool->lock);
+		preempt_enable();
 		return;
 	}

 	pool->nr_running--;
 	if (kick_pool(pool))
 		worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;

 	raw_spin_unlock_irq(&pool->lock);
+	preempt_enable();
 }
The idea is to disable preemption from sleeping=1 until we hold the pool
lock (which disables IRQs). This prevents the reschedule IPI from
triggering preempt_schedule_irq() in this window. Note that
wq_worker_running() already uses preempt_disable/enable around its
nr_running++ for a similar race against unbind_workers().
Does this approach look correct to you?
Thanks,
Sonam
Thread overview: 17+ messages
2026-03-16 7:20 [PATCH] KVM: irqfd: fix shutdown deadlock by moving SRCU sync outside resampler_lock Sonam Sanju
2026-03-17 16:27 ` Sonam Sanju
2026-03-20 12:56 ` Vineeth Pillai (Google)
2026-03-23 5:33 ` Sonam Sanju
2026-03-23 6:42 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sonam Sanju
2026-03-31 18:17 ` Sean Christopherson
2026-03-31 20:51 ` Paul E. McKenney
2026-04-01 9:47 ` Sonam Sanju
2026-04-06 23:09 ` Paul E. McKenney
2026-04-01 9:34 ` Kunwu Chan
2026-04-01 14:24 ` Sonam Sanju
2026-04-06 14:20 ` Kunwu Chan
2026-04-17 1:18 ` Vineeth Pillai
2026-04-19 3:03 ` Vineeth Remanan Pillai
2026-04-21 16:54 ` Sonam Sanju [this message]
2026-04-21 18:22 ` [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race Tejun Heo
2026-04-21 5:12 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sonam Sanju