Linux RCU subsystem development
* Re: [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race
@ 2026-04-21 18:22 Tejun Heo
  2026-04-23  9:01 ` [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock Sonam Sanju
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2026-04-21 18:22 UTC (permalink / raw)
  To: Sonam Sanju
  Cc: vineeth, dmaluka, kunwu.chan, kvm, linux-kernel, paulmck,
	pbonzini, rcu, seanjc, stable

Hello, Sonam.

On Tue, Apr 21, 2026 at 10:24:55PM +0530, Sonam Sanju wrote:
> 3. The first stuck worker (kworker/2:0, PID 33) shows the preemption
>    in wq_worker_sleeping:
> 
>    kworker/2:0  state:D  Workqueue: kvm-irqfd-cleanup irqfd_shutdown
>      __schedule+0x87a/0xd60
>      preempt_schedule_irq+0x4a/0x90
>      asm_fred_entrypoint_kernel+0x41/0x70
>      ___ratelimit+0x1a1/0x1f0            <-- inside pr_info_ratelimited
>      wq_worker_sleeping+0x53/0x190       <-- preempted HERE
>      schedule+0x30/0xe0
>      schedule_preempt_disabled+0x10/0x20
>      __mutex_lock+0x413/0xe40
>      irqfd_resampler_shutdown+0x53/0x200
>      irqfd_shutdown+0xfa/0x190
> 
>    This confirms the exact race: a reschedule IPI interrupted
>    wq_worker_sleeping() after worker->sleeping was set to 1 but
>    before pool->nr_running was decremented. The preemption triggered
>    wq_worker_running() which incremented nr_running (1->2), then
>    on resume the decrement brought it back to 1 instead of 0.
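
[AI-generated aside - a toy model of the sequence claimed above, in plain
Python. The split of wq_worker_sleeping() into two halves is a hypothetical
simplification marking the claimed preemption point; names and structure
are illustrative, not the kernel's code.]

```python
class Pool:
    def __init__(self):
        self.nr_running = 1  # one worker accounted as running

class Worker:
    def __init__(self, pool):
        self.pool = pool
        self.sleeping = 0

# Hypothetical split of wq_worker_sleeping() at the claimed preemption
# point: the flag is set, then the decrement is deferred past an IPI.
def sleeping_before_preempt(w):
    w.sleeping = 1

def sleeping_after_resume(w):
    w.pool.nr_running -= 1

def worker_running(w):
    # Mirrors the guard in wq_worker_running(): only act if the
    # sleeping flag is set, then increment and clear it.
    if not w.sleeping:
        return
    w.pool.nr_running += 1
    w.sleeping = 0

pool = Pool()
w = Worker(pool)
sleeping_before_preempt(w)   # sleeping = 1
worker_running(w)            # preemption path: nr_running 1 -> 2
sleeping_after_resume(w)     # resume: nr_running 2 -> 1, never reaches 0
print(pool.nr_running)       # -> 1
```

The claimed end state is nr_running=1 for a worker that is actually
blocked, so the pool never wakes another worker.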

The problem with this theory is that this kworker, while preempted, is still
runnable and should be dispatched to its CPU once it becomes available
again. Workqueue doesn't care whether the task gets preempted or when it
gets the CPU back; it only cares about whether the task enters a blocking
state (!runnable). A task that is preempted, even on its way to blocking,
is still runnable and will be put back on a CPU by the scheduler.

If you can take a crashdump of the deadlocked state, can you see whether the
task is still on the scheduler's runqueue?

[Diagnostic notes below are AI-generated - apply judgment.]

The decisive field is `task->on_rq`:

  - 0: dequeued, truly blocked - your theory requires this. Then look at
    `task->sched_contributes_to_load` (set by block_task), and if
    CONFIG_SCHED_PROXY_EXEC is on, `task->blocked_on` and
    find_proxy_task() behavior.
  - 1: still queued - scheduler should pick it and self-heal the drift,
    so the "never woken up" step doesn't hold. Then the question becomes
    why EEVDF is not picking a queued task. Check `se->sched_delayed`
    first (DELAY_DEQUEUE leaves on_rq=1 but unrunnable until next pick),
    then cfs_rq throttling up the task_group hierarchy, then the rb-tree
    contents (vruntime/deadline/vlag of the stuck se vs others).
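
Folding those checks into one helper (a toy triage function; the field
names mirror the struct fields above, the verdict strings are mine):

```python
def triage(on_rq, sched_delayed=0, throttled=0):
    """Map the fields above to the next diagnostic step (sketch)."""
    if not on_rq:
        return "blocked: check sched_contributes_to_load and blocked_on"
    if sched_delayed:
        return "DELAY_DEQUEUE: on_rq=1 but unrunnable until next pick"
    if throttled:
        return "throttled: walk the cfs_rq hierarchy"
    return "pickable: inspect vruntime/deadline/vlag vs peers"

print(triage(0))
print(triage(1, sched_delayed=1))
print(triage(1))
```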

One snippet covering both branches, for each hung worker and for the
affected CPU's rq:

  from drgn.helpers.linux.sched import task_cpu
  from drgn.helpers.linux.pid import find_task
  from drgn.helpers.linux.percpu import per_cpu

  t = find_task(prog, PID)          # PID of the hung kworker
  cpu = task_cpu(t)
  rq = per_cpu(prog["runqueues"], cpu)
  cfs = rq.cfs

  print(f"state={hex(t.__state)} on_rq={int(t.on_rq)} "
        f"se.on_rq={int(t.se.on_rq)} sched_delayed={int(t.se.sched_delayed)} "
        f"cpu={cpu} on_cpu={int(t.on_cpu)}")
  print(f"vruntime={int(t.se.vruntime)} deadline={int(t.se.deadline)} "
        f"vlag={int(t.se.vlag)}")
  if hasattr(t, "blocked_on"):      # only present on some configs
      print(f"blocked_on={t.blocked_on}")

  print(f"rq.curr={rq.curr.comm.string_().decode()} "
        f"nr_running={int(rq.nr_running)} "
        f"cfs.h_nr_queued={int(cfs.h_nr_queued)} "
        f"cfs.h_nr_delayed={int(cfs.h_nr_delayed)} "
        f"min_vruntime={int(cfs.min_vruntime)}")
  # Walk the throttle hierarchy (needs CONFIG_FAIR_GROUP_SCHED)
  c = t.se.cfs_rq
  while c:
      print(f"  cfs_rq throttled={int(c.throttled)} "
            f"throttle_count={int(c.throttle_count)}")
      c = c.tg.parent.cfs_rq[cpu] if c.tg.parent else None
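
If the verdict is on_rq=1 with nothing delayed or throttled, dumping the
rb-tree shows whether the stuck entity is even among the pick candidates.
A sketch, reusing `cfs` from the snippet above; it assumes the EEVDF-era
sched_entity fields (vlag, sched_delayed) of recent kernels and needs a
crashdump or live kernel under drgn to run:

```python
from drgn.helpers.linux.rbtree import rbtree_inorder_for_each_entry

# cfs_rq->tasks_timeline is an rb_root_cached; runnable entities
# hang off it via sched_entity->run_node.
for se in rbtree_inorder_for_each_entry(
        "struct sched_entity", cfs.tasks_timeline.rb_root, "run_node"):
    print(f"se={hex(se.value_())} vruntime={int(se.vruntime)} "
          f"deadline={int(se.deadline)} vlag={int(se.vlag)} "
          f"delayed={int(se.sched_delayed)}")
```

Compare the stuck worker's se against the others: if it is absent from the
tree, or present but with sched_delayed set, that points back at the
branches above.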

Thanks.

--
tejun

