From: Tejun Heo <tj@kernel.org>
To: Sonam Sanju <sonam.sanju@intel.com>
Cc: vineeth@bitbyteword.org, dmaluka@chromium.org,
kunwu.chan@linux.dev, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, paulmck@kernel.org,
pbonzini@redhat.com, rcu@vger.kernel.org, seanjc@google.com,
stable@vger.kernel.org
Subject: Re: [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race
Date: Tue, 21 Apr 2026 08:22:16 -1000
Message-ID: <aefAWGcAQHeRYbs8@slm.duckdns.org>
In-Reply-To: <20260421165455.2486211-1-sonam.sanju@intel.com>
Hello, Sonam.
On Tue, Apr 21, 2026 at 10:24:55PM +0530, Sonam Sanju wrote:
> 3. The first stuck worker (kworker/2:0, PID 33) shows the preemption
> in wq_worker_sleeping:
>
> kworker/2:0 state:D Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> __schedule+0x87a/0xd60
> preempt_schedule_irq+0x4a/0x90
> asm_fred_entrypoint_kernel+0x41/0x70
> ___ratelimit+0x1a1/0x1f0 <-- inside pr_info_ratelimited
> wq_worker_sleeping+0x53/0x190 <-- preempted HERE
> schedule+0x30/0xe0
> schedule_preempt_disabled+0x10/0x20
> __mutex_lock+0x413/0xe40
> irqfd_resampler_shutdown+0x53/0x200
> irqfd_shutdown+0xfa/0x190
>
> This confirms the exact race: a reschedule IPI interrupted
> wq_worker_sleeping() after worker->sleeping was set to 1 but
> before pool->nr_running was decremented. The preemption triggered
> wq_worker_running() which incremented nr_running (1->2), then
> on resume the decrement brought it back to 1 instead of 0.
The problem with this theory is that this kworker, while preempted, is still
runnable and should be dispatched to its CPU once it becomes available
again. Workqueue doesn't care whether the task gets preempted or when it
gets the CPU back. It only cares about whether the task enters blocking
state (!runnable). A task which is preempted, even on the way to blocking,
still is runnable and should get put back on the CPU by the scheduler.
If you can take a crashdump of the deadlocked state, can you see whether the
task is still on the scheduler's runqueue?
[Diagnostic notes below are AI-generated - apply judgment.]
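To make the arithmetic in the quoted report concrete, here is a toy model of the claimed interleaving. It is only a sketch: ToyPool, ToyWorker and both hook functions are hypothetical stand-ins, not the kernel's wq_worker_sleeping()/wq_worker_running(), and it says nothing about whether the kernel can actually take this path. It only shows that, if the "running" hook fires between setting sleeping and the decrement, the counter ends at 1 instead of 0.

```python
from dataclasses import dataclass


@dataclass
class ToyWorker:
    sleeping: int = 0      # stand-in for worker->sleeping


@dataclass
class ToyPool:
    nr_running: int = 1    # one worker currently counted as running


def running_hook(pool, worker):
    """Toy 'worker runs again' hook: only acts if the worker was marked sleeping."""
    if not worker.sleeping:
        return
    pool.nr_running += 1
    worker.sleeping = 0


def sleeping_hook(pool, worker, preempt_in_window=False):
    """Toy 'worker is about to block' hook.

    preempt_in_window models the interleaving claimed in the report:
    the running hook fires after sleeping is set but before the decrement.
    This is an assumption of the model, not verified kernel behavior.
    """
    if worker.sleeping:
        return
    worker.sleeping = 1
    if preempt_in_window:
        running_hook(pool, worker)
    pool.nr_running -= 1


def simulate(preempt_in_window):
    pool, worker = ToyPool(), ToyWorker()
    sleeping_hook(pool, worker, preempt_in_window)
    return pool.nr_running


print("no preemption:", simulate(False))
print("claimed interleaving:", simulate(True))
```

Even if the counter does drift this way, the open question from the reply above remains whether the task ever left the runqueue.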
The decisive field is `task->on_rq`:
- 0: dequeued, truly blocked. Your theory requires this. Then look at
`task->sched_contributes_to_load` (set on the blocking path in
__schedule()), and if CONFIG_SCHED_PROXY_EXEC is on, `task->blocked_on`
and find_proxy_task() behavior.
- 1: still queued. The scheduler should pick it, which self-heals the
drift, so the "never woken up" step doesn't hold. Then the question
becomes why EEVDF is not picking a queued task. Check `se->sched_delayed`
first (DELAY_DEQUEUE keeps a task that has already blocked on the
runqueue with on_rq=1 until it is next picked or woken), then cfs_rq
throttling up the task_group hierarchy, then the rb-tree contents
(vruntime/deadline/vlag of the stuck se vs others).
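The two branches above can be written down as a small triage helper, which may be handy when checking several hung workers. This is a sketch in plain Python: the function name triage() and the returned labels are made up for illustration, and the inputs are meant to be the values read from the crashdump (on_rq, se.sched_delayed, and whether any cfs_rq in the hierarchy is throttled).

```python
def triage(on_rq, sched_delayed=False, throttled=False):
    """Return a label for the next thing to check, following the list above.

    on_rq         -- task->on_rq (0 means dequeued)
    sched_delayed -- task->se.sched_delayed
    throttled     -- True if any cfs_rq up the hierarchy is throttled
    """
    if not on_rq:
        # Truly blocked: consistent with the reporter's theory.
        return "blocked"
    if sched_delayed:
        # Queued only because of delayed dequeue; the task did block.
        return "delayed-dequeue"
    if throttled:
        # Queued but its group cannot run right now.
        return "throttled"
    # Queued and eligible on paper: compare vruntime/deadline/vlag.
    return "inspect-tree"


for sample in [(0, False, False), (1, True, False), (1, False, True), (1, False, False)]:
    print(sample, "->", triage(*sample))
```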
One snippet covering both branches, for each hung worker and for the
affected CPU's rq:
from drgn.helpers.linux.percpu import per_cpu
from drgn.helpers.linux.pid import find_task
from drgn.helpers.linux.sched import task_cpu

# Run inside the drgn CLI, which provides `prog`.
# PID: pid of the hung kworker (33 for kworker/2:0 in the trace above).
t = find_task(prog, PID)
cpu = task_cpu(t)
rq = per_cpu(prog["runqueues"], cpu)
cfs = rq.cfs

# Task-side view: is it queued, delayed, or currently on a CPU?
print(f"state={hex(t.__state)} on_rq={int(t.on_rq)} "
      f"se.on_rq={int(t.se.on_rq)} sched_delayed={int(t.se.sched_delayed)} "
      f"cpu={cpu} on_cpu={int(t.on_cpu)}")
print(f"vruntime={int(t.se.vruntime)} deadline={int(t.se.deadline)} "
      f"vlag={int(t.se.vlag)}")
# blocked_on only exists on some configs, so probe for it.
if hasattr(t, "blocked_on"):
    print(f"blocked_on={t.blocked_on}")

# Runqueue-side view for the task's CPU.
print(f"rq.curr={rq.curr.comm.string_().decode()} "
      f"nr_running={int(rq.nr_running)} "
      f"cfs.h_nr_queued={int(cfs.h_nr_queued)} "
      f"cfs.h_nr_delayed={int(cfs.h_nr_delayed)} "
      f"min_vruntime={int(cfs.min_vruntime)}")

# Walk throttle hierarchy (needs CONFIG_FAIR_GROUP_SCHED)
c = t.se.cfs_rq
while c:
    print(f"  cfs_rq throttled={int(c.throttled)} "
          f"throttle_count={int(c.throttle_count)}")
    c = c.tg.parent.cfs_rq[cpu] if c.tg.parent else None
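The snippet prints `__state` as a raw hex value. A small decoder makes it easier to compare with the "state:D" shown in the trace. This is a sketch: decode_task_state() is a hypothetical helper, and the bit values are the ones I believe recent include/linux/sched.h uses, so please check them against the kernel the crashdump came from. Bits not in the table are reported as hex rather than guessed.

```python
# Assumed bit values from include/linux/sched.h; verify against the
# crashdump's kernel before relying on them.
TASK_STATE_BITS = {
    0x001: "TASK_INTERRUPTIBLE",
    0x002: "TASK_UNINTERRUPTIBLE",
    0x004: "__TASK_STOPPED",
    0x008: "__TASK_TRACED",
    0x040: "TASK_PARKED",
    0x080: "TASK_DEAD",
    0x100: "TASK_WAKEKILL",
    0x200: "TASK_WAKING",
    0x400: "TASK_NOLOAD",
    0x800: "TASK_NEW",
}


def decode_task_state(state):
    """Return the names of the bits set in a task's __state value."""
    if state == 0:
        return ["TASK_RUNNING"]
    names = []
    remaining = state
    for bit, name in sorted(TASK_STATE_BITS.items()):
        if state & bit:
            names.append(name)
            remaining &= ~bit
    if remaining:
        # Unknown bits are shown as-is instead of being guessed.
        names.append(hex(remaining))
    return names


print(decode_task_state(0x2))
print(decode_task_state(0x402))
```

With drgn this could be called as decode_task_state(int(t.__state)). A worker that is blocked in __mutex_lock would be expected to show TASK_UNINTERRUPTIBLE, while a merely preempted one would show TASK_RUNNING.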
Thanks.
--
tejun