From: Sonam Sanju <sonam.sanju@intel.com>
To: vineeth@bitbyteword.org
Cc: dmaluka@chromium.org, kunwu.chan@linux.dev, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, paulmck@kernel.org,
pbonzini@redhat.com, rcu@vger.kernel.org, seanjc@google.com,
sonam.sanju@intel.com, stable@vger.kernel.org, tj@kernel.org
Subject: Re: [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race
Date: Tue, 21 Apr 2026 22:24:55 +0530 [thread overview]
Message-ID: <20260421165455.2486211-1-sonam.sanju@intel.com> (raw)
In-Reply-To: <CAO7JXPjEtnsk9xer+_uSPQi9DBqCe0cSnfB=ePaKntoKv=N3tQ@mail.gmail.com>
Hi Vineeth, Kunwu, Tejun,
I've collected new crash logs with additional debug instrumentation in
wq_worker_sleeping(), kick_pool(), and show_one_worker_pool() to capture
pool state during the hang. The results conclusively confirm Vineeth's
preemption race theory.
From the new logs:
1. Pool dump with nr_running/nr_idle (added instrumentation):
pool 10: cpus=2 flags=0x0 hung=201s workers=11 nr_running=1 nr_idle=5
11 workers, 5 idle, 6 in D-state (all irqfd_shutdown) -- yet
nr_running=1. No worker is actually running on CPU 2.
2. NMI backtrace confirms CPU 2 is completely idle:
NMI backtrace for cpu 2 skipped: idling at intel_idle+0x57/0xa0
So nr_running=1 is a phantom count -- no worker is running, but
the pool thinks one is.
3. The first stuck worker (kworker/2:0, PID 33) shows the preemption
in wq_worker_sleeping:
kworker/2:0 state:D Workqueue: kvm-irqfd-cleanup irqfd_shutdown
__schedule+0x87a/0xd60
preempt_schedule_irq+0x4a/0x90
asm_fred_entrypoint_kernel+0x41/0x70
___ratelimit+0x1a1/0x1f0 <-- inside pr_info_ratelimited
wq_worker_sleeping+0x53/0x190 <-- preempted HERE
schedule+0x30/0xe0
schedule_preempt_disabled+0x10/0x20
__mutex_lock+0x413/0xe40
irqfd_resampler_shutdown+0x53/0x200
irqfd_shutdown+0xfa/0x190
This confirms the exact race: a reschedule IPI interrupted
wq_worker_sleeping() after worker->sleeping was set to 1 but
before pool->nr_running was decremented. When the worker was
scheduled back in, wq_worker_running() saw sleeping == 1 and
incremented nr_running (1 -> 2); the resumed decrement then
brought the count back to 1 instead of 0.
4. The second pool dump 31 seconds later shows the stall is permanent:
pool 10: cpus=2 flags=0x0 hung=232s workers=11 nr_running=1 nr_idle=5
Same phantom nr_running=1, hung time growing.
5. The deadlock chain:
- PID 33: holds resampler_lock mutex, stuck in wq_worker_sleeping
- PID 520: past mutex, stuck in synchronize_srcu_expedited
- PIDs 120, 4792, 4793, 4796: waiting on resampler_lock mutex
- crosvm_vcpu2: waiting in kvm_vm_release -> __flush_workqueue
- init (PID 1): stuck in pci_device_shutdown -> __flush_work
- Multiple userspace processes stuck in fsnotify_destroy_group
- Reboot thread timed out, system triggered sysrq crash
6. kick_pool_skip debug print fired for other pools but NOT for
pool 10 -- because need_more_worker() was never true (nr_running
was never 0), so kick_pool() was never even called for this pool.
As for a fix, one option is a workqueue-level change in
wq_worker_sleeping() itself:
 void wq_worker_sleeping(struct task_struct *task)
 {
 	...
 	if (READ_ONCE(worker->sleeping))
 		return;

+	preempt_disable();
 	WRITE_ONCE(worker->sleeping, 1);
 	raw_spin_lock_irq(&pool->lock);

 	if (worker->flags & WORKER_NOT_RUNNING) {
 		raw_spin_unlock_irq(&pool->lock);
+		preempt_enable();
 		return;
 	}

 	pool->nr_running--;
 	if (kick_pool(pool))
 		worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;

 	raw_spin_unlock_irq(&pool->lock);
+	preempt_enable();
 }
The idea is to disable preemption from sleeping=1 until we hold the pool
lock (which disables IRQs). This prevents the reschedule IPI from
triggering preempt_schedule_irq() in this window. Note that
wq_worker_running() already uses preempt_disable/enable around its
nr_running++ for a similar race against unbind_workers().
Does this approach look correct to you?
Thanks,
Sonam
Thread overview: 17+ messages
2026-03-16 7:20 [PATCH] KVM: irqfd: fix shutdown deadlock by moving SRCU sync outside resampler_lock Sonam Sanju
2026-03-17 16:27 ` Sonam Sanju
2026-03-20 12:56 ` Vineeth Pillai (Google)
2026-03-23 5:33 ` Sonam Sanju
2026-03-23 6:42 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sonam Sanju
2026-03-31 18:17 ` Sean Christopherson
2026-03-31 20:51 ` Paul E. McKenney
2026-04-01 9:47 ` Sonam Sanju
2026-04-06 23:09 ` Paul E. McKenney
2026-04-01 9:34 ` Kunwu Chan
2026-04-01 14:24 ` Sonam Sanju
2026-04-06 14:20 ` Kunwu Chan
2026-04-17 1:18 ` Vineeth Pillai
2026-04-19 3:03 ` Vineeth Remanan Pillai
2026-04-21 16:54 ` Sonam Sanju [this message]
2026-04-21 18:22 ` [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race Tejun Heo
2026-04-21 5:12 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sonam Sanju