[PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM

* [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM
@ 2026-06-12  1:33 Wanpeng Li
  2026-06-12  1:33 ` [PATCH v3 01/10] sched/fair: Add EEVDF lag credit primitive for nominated next-buddy Wanpeng Li
                   ` (11 more replies)
  0 siblings, 12 replies; 22+ messages in thread
From: Wanpeng Li @ 2026-06-12  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Richie Buturla

From: Wanpeng Li <wanpengli@tencent.com>

On overcommitted hosts, a spinning vCPU often calls yield_to() to let a
lock holder or IPI receiver run.  The hint can be ineffective for two
independent reasons: the scheduler may fail to select the nominated task,
and KVM may nominate a task that is not the one the spinning vCPU is
waiting for.

This series addresses both sides.  The scheduler side credits bounded
EEVDF lag to the nominated next-buddy so the buddy hint is honored across
the relevant cgroup hierarchy, and forces a local reschedule so the
credited buddy can be selected immediately.  The KVM side tracks recent
unicast fixed IPI sender/receiver pairs and prefers the confirmed receiver
when selecting a directed-yield target.

Problem Statement
-----------------

In overcommitted virtualization scenarios, vCPUs frequently spin on locks
held by other vCPUs that are not currently running, or on IPI responses
from vCPUs that are runnable but not scheduled.  Paravirtual spinlock
support and PLE detect these situations and call yield_to() to let the
other vCPU make progress.

The current implementation has two limitations:

1. Scheduler-side limitation:

   yield_to_task_fair() relies on set_next_buddy() to express a preference
   for the target.  set_next_buddy() nominates the target at every level of
   its cgroup ancestor chain, but pick_eevdf()'s PICK_BUDDY branch only
   returns cfs_rq->next when that entity is already eligible
   (entity_eligible()).  A target with negative vlag at any level of the
   chain is not returned, and the hint is dropped at the first ineligible
   group entity.

   Even when the target is eligible, yield_to() does not by itself force
   the caller off the CPU.  An active RUN_TO_PARITY protect_slice() on the
   local yielder can therefore keep pick_eevdf() returning the yielder
   instead of the target.

   The recent forfeit-on-yield work (commits 79104becf42b "sched/fair:
   Forfeit vruntime on yield" and 127b90315ca0 "sched/proxy: Yield the
   donor task") makes the yielder ineligible.  It does not make an
   ineligible nominated target eligible, does not keep the PICK_BUDDY hint
   from being dropped at the first ineligible group entity, and does not
   cancel an active RUN_TO_PARITY slice on the yielder.  This series builds
   on that behavior by crediting the target and cancelling slice
   protection, so the nominated entity is the one pick_eevdf() returns.

2. KVM-side limitation:

   kvm_vcpu_on_spin() selects a directed-yield target from coarse
   preempted / preempted-in-kernel state.  It cannot distinguish a vCPU
   spinning on an IPI response from a vCPU spinning on a lock.  When a vCPU
   sends an IPI and spins waiting for the response, the heuristic can boost
   an unrelated vCPU and miss the actual IPI receiver.

These effects lengthen lock hold times and increase spin time,
context-switch overhead and cache pressure in overcommitted environments,
especially for workloads with fine-grained synchronization.
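
To illustrate the scheduler-side limitation, the eligibility gate can be
modelled in userspace C.  This is a simplified sketch under stated
assumptions, not kernel code: every toy_* name is hypothetical, and the
in-kernel entity_eligible() and pick_eevdf() differ in detail.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a sched_entity: only what eligibility needs. */
struct toy_entity {
	int64_t vruntime;	/* virtual runtime consumed so far */
	int64_t weight;		/* load weight, assumed > 0 */
};

/* Load-weighted average vruntime over all runnable entities. */
static int64_t toy_avg_vruntime(const struct toy_entity *e, size_t n)
{
	int64_t sum = 0, load = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		sum += e[i].vruntime * e[i].weight;
		load += e[i].weight;
	}
	return load ? sum / load : 0;
}

/* Eligible means non-negative lag: vruntime at or below the average. */
static bool toy_eligible(const struct toy_entity *se, int64_t avg)
{
	return se->vruntime <= avg;
}

/*
 * Model of the PICK_BUDDY gate: the nominated buddy is returned only if
 * it is eligible; NULL means the hint is dropped and the normal pick runs.
 */
static const struct toy_entity *
toy_pick_buddy(const struct toy_entity *e, size_t n,
	       const struct toy_entity *next)
{
	if (next && toy_eligible(next, toy_avg_vruntime(e, n)))
		return next;
	return NULL;
}
```

With three equal-weight entities at vruntime 100, 200 and 300, nominating
the last one returns NULL in this model: the hint is ignored because the
target has negative lag.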

Solution Overview
-----------------

Part 1: Scheduler EEVDF lag credit (patches 1-5)

Rather than penalizing the yielding vCPU, credit the nominated target so
pick_eevdf() honors the buddy hint.

The mechanism is EEVDF-native and cgroup-hierarchy-aware:

- Credit bounded EEVDF lag to the nominated next-buddy so pick_eevdf()'s
  PICK_BUDDY branch returns it.  Walk the same ancestor chain that
  set_next_buddy() nominated and credit each not-yet-eligible level, so the
  hint is not dropped at the first ineligible group entity.

- Credit to a small positive-vlag margin, not merely the vlag = 0
  eligibility boundary, so the target stays eligible across several
  scheduling decisions rather than a single pick.  The margin scales with
  runqueue depth and is clamped to entity_lag()'s legal positive-lag bound,
  preserving EEVDF fairness.

- Handle both the off-tree current entity (shifted in place, carrying any
  vprot window) and a queued (on-tree) entity (repositioned via the
  canonical place_entity()-paired requeue used by requeue_delayed_entity(),
  keeping sum_w_vruntime consistent with entity_key()).

- Force a local reschedule at the end of the credit path: cancel
  RUN_TO_PARITY slice protection along the yielder's sched_entity chain and
  resched_curr() the local CPU.  Only this forced preemption is rate
  limited (once per 6ms per rq) to avoid excessive forced preemption on
  PLE-heavy guests; the lag credit itself runs on every directed yield.

The mechanism is gated by SCHED_FEAT(YIELD_TO_LAG_CREDIT) (default on).
With the feature off, yield_to_task_fair() keeps the existing forfeit-only
behavior.
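
The credit arithmetic described above can be sketched in userspace C.  The
depth tiers (4x, 6x, 8x and 12x the base slice) mirror patch 2 of this
series; everything else is an assumption made for illustration: the toy_*
names are hypothetical, weight scaling via calc_delta_fair() is omitted,
and the lag limit is passed in by the caller instead of being derived from
the runqueue.

```c
#include <stdint.h>

/* Positive-vlag target margin, scaled by runqueue depth (tiers from patch 2). */
static uint64_t toy_margin(uint64_t base_slice, unsigned int depth)
{
	if (depth <= 4)
		return base_slice * 4;
	if (depth <= 8)
		return base_slice * 6;
	if (depth <= 16)
		return base_slice * 8;
	return base_slice * 12;
}

/*
 * Credit needed to lift @vlag to the margin, after clamping the margin to
 * @lag_limit (0 means "no limit known").  Returns 0 if the entity already
 * holds at least the clamped margin.
 */
static uint64_t toy_credit(int64_t vlag, uint64_t margin, uint64_t lag_limit)
{
	if (lag_limit && margin > lag_limit)
		margin = lag_limit;
	if (vlag >= 0) {
		if ((uint64_t)vlag >= margin)
			return 0;
		return margin - (uint64_t)vlag;
	}
	return margin + (uint64_t)(-vlag);
}

/* Hypothetical off-tree current entity: shifted in place by the credit. */
struct toy_se {
	uint64_t vruntime;
	uint64_t deadline;
};

/* Moving vruntime and deadline together preserves the remaining slice. */
static void toy_shift_curr(struct toy_se *se, uint64_t credit)
{
	se->vruntime -= credit;
	se->deadline -= credit;
}
```

In this model an entity with vlag = -300 and an unclamped margin of 1000
needs a credit of 1300, while an entity that already holds the clamped
margin needs none, which is what makes repeated directed yields
idempotent.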

Part 2: KVM IPI-aware directed yield (patches 6-10)

KVM tracks recent unicast fixed IPI sender/receiver relationships and uses
them to prioritize directed-yield targets.

- Record unicast fixed IPIs from both LAPIC delivery paths, the APIC-map
  fast path and the slow fallback, when exactly one destination vCPU
  accepts the interrupt.

- Access the per-vCPU ipi_context state through READ_ONCE()/WRITE_ONCE()
  accessors; the state is only a best-effort scheduling hint.

- Age out stale relationships with a recency window (50ms default), and
  clear state on a matching-vector EOI without dropping unrelated pending
  IPI state.
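
The tracking rules above can be sketched as a small userspace model.  The
struct layout and all toy_* names are hypothetical and are not taken from
the patches; the model also uses plain loads and stores where the series
uses READ_ONCE()/WRITE_ONCE(), and it assumes a monotonic nanosecond clock.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NO_TARGET	(-1)

/* Hypothetical per-sender record of the last unicast fixed IPI. */
struct toy_ipi_ctx {
	int target;		/* receiver vCPU index, or TOY_NO_TARGET */
	uint8_t vector;		/* vector of the tracked IPI */
	uint64_t sent_ns;	/* timestamp of delivery */
};

/* Record a unicast IPI accepted by exactly one destination vCPU. */
static void toy_track_ipi(struct toy_ipi_ctx *ctx, int target,
			  uint8_t vector, uint64_t now_ns)
{
	ctx->target = target;
	ctx->vector = vector;
	ctx->sent_ns = now_ns;
}

/* True if @vcpu is the tracked receiver and the record is still fresh. */
static bool toy_is_recent_receiver(const struct toy_ipi_ctx *ctx, int vcpu,
				   uint64_t now_ns, uint64_t window_ns)
{
	if (ctx->target != vcpu)
		return false;
	return now_ns - ctx->sent_ns <= window_ns;
}

/* Clear only when both receiver and vector match the tracked IPI. */
static void toy_clear_on_eoi(struct toy_ipi_ctx *ctx, int receiver,
			     uint8_t vector)
{
	if (ctx->target == receiver && ctx->vector == vector)
		ctx->target = TOY_NO_TARGET;
}
```

With a 50ms window, a record made at t = 0 still matches at 10ms and no
longer matches at 60ms; an EOI for a different vector leaves the record in
place.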

Directed-yield candidate selection uses the following priority order:

  1. A confirmed recent IPI receiver of the spinning vCPU.
  2. The arch-specific pending-interrupt hint
     (kvm_arch_dy_has_pending_interrupt()).
  3. The existing preempted / preempted-in-kernel heuristic.

If the strict IPI-aware pass finds no eligible candidate, an optional
second pass falls back to a relaxed preempted-only search.  The fallback is
controlled by the enable_relaxed_boost module parameter (default on).
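
The priority order and the optional relaxed pass can be sketched as
follows.  This is a simplified model with hypothetical toy_* names: it
ranks all candidates and picks the best one, whereas the real
kvm_vcpu_on_spin() loop is structured differently, and the exact split
between the strict and relaxed preempted checks is an assumption here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-candidate view used by the selection model. */
struct toy_vcpu {
	bool ipi_receiver;	/* confirmed recent IPI receiver of the spinner */
	bool pending_irq;	/* arch pending-interrupt hint */
	bool preempted;		/* candidate was preempted */
	bool in_kernel;		/* ... while running guest kernel code */
};

/* Strict pass: higher rank wins, 0 means "not eligible". */
static int toy_strict_rank(const struct toy_vcpu *v)
{
	if (v->ipi_receiver)
		return 3;
	if (v->pending_irq)
		return 2;
	if (v->preempted && v->in_kernel)
		return 1;
	return 0;
}

/* Returns the chosen candidate index, or -1 if nothing qualifies. */
static int toy_pick_target(const struct toy_vcpu *v, size_t n, bool relaxed)
{
	int best = -1, best_rank = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		int rank = toy_strict_rank(&v[i]);

		if (rank > best_rank) {
			best_rank = rank;
			best = (int)i;
		}
	}
	if (best >= 0 || !relaxed)
		return best;

	/* Relaxed second pass: accept any preempted candidate. */
	for (i = 0; i < n; i++)
		if (v[i].preempted)
			return (int)i;
	return -1;
}
```

The relaxed flag plays the role of enable_relaxed_boost in this model:
with it cleared, a spinner whose candidates are all merely preempted in
user mode gets no target.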

Runtime controls:
  * /sys/kernel/debug/sched/features (YIELD_TO_LAG_CREDIT)
  * /sys/module/kvm/parameters/ipi_tracking_enabled
  * /sys/module/kvm/parameters/ipi_window_ns
  * /sys/module/kvm/parameters/enable_relaxed_boost

Host-side deployment model
--------------------------

The series is host-side by design.  It requires no guest ABI, paravirtual
driver, negotiated feature bit, or guest kernel change, so existing guests
benefit without coordination between host and guest software.

That deployment model gives the mechanisms broad coverage.  The scheduler
lag credit applies to every yield_to() the host already receives, including
PLE and paravirtual spinlock paths.  The KVM side observes the actual
unicast-IPI sender/receiver relationship at software LAPIC delivery time,
so it covers spin and IPI waits from spinlocks, RCU, smp_call_function()
and IPI-based wakeups rather than a single paravirtualized operation such
as TLB shootdown.

The host-side approach also composes with existing paravirtualization.  If
a guest provides PV TLB shootdown or PV spinlocks, those interfaces reduce
the amount of spinning that reaches the host; this series handles the
residual yield_to() and IPI waits that remain.  It is runtime gated as
described above and can be enabled or disabled per host.

The scheduler side is independent of APICv, IPI virtualization and the
LAPIC delivery path.  The KVM side depends on software LAPIC delivery: when
IPI/EOI virtualization handles the guest's ICR and EOI writes in hardware,
no sender/receiver relationship is recorded, and candidate selection falls
back to the pending-interrupt and preempted heuristics, plus the relaxed
preempted-only pass added in patch 10.  In that configuration the tracking
state stays empty while the scheduler side remains fully active.

The design separates the consumer of the hint from its source.  Software
IPI tracking supplies the confirmed receiver on hosts where software LAPIC
delivery is observable today; a future guest-cooperative scheduling hint
could populate the same slot without changing the priority-ordered
candidate selection.

Performance Results
-------------------

Test environment: a 16-core x86-64 host, 16 vCPUs per guest.  Host CPU
overcommit is varied by co-locating 2, 3 and 4 guests (120 runs per point),
with APICv disabled so the KVM side observes IPI delivery in software.
Dbench reports throughput and reflects the scheduler-side lag credit; the
PARSEC workloads report end-to-end latency reduction under the full series.

Dbench (filesystem metadata operations), throughput improvement:
  2 VMs:  +6.65%
  3 VMs:  +4.80%
  4 VMs:  +7.59%

PARSEC Dedup, simlarge input (IPI-heavy synchronization), latency
reduction:
  2 VMs:  +8.87%
  3 VMs: +10.29%
  4 VMs: +15.60%

PARSEC VIPS, simlarge input (balanced sync and compute), latency
reduction:
  2 VMs: +10.23%
  3 VMs:  +6.63%
  4 VMs:  +4.50%

Analysis:

- Dedup's gains grow with the VM count: as more runnable vCPUs compete for
  each physical CPU, a directed yield is more likely to land on a vCPU that
  is genuinely preempted while an IPI sender spins, so honoring the
  confirmed receiver matters more.

- Dedup, with its IPI-heavy synchronization, benefits most from the
  IPI-aware directed yield.  Preferring the confirmed IPI receiver over the
  generic preempted-lock-holder heuristic shortens IPI response latency.

- VIPS mixes synchronization and compute, so its gains shrink as the VM
  count rises: at higher overcommit more of each run is spent in compute
  that a directed yield cannot accelerate, leaving less spin time to
  recover.

- Dbench benefits primarily from the scheduler-side lag credit; its lock
  patterns involve more direct lock-holder boosting than IPI spinning.

- No configuration regressed; the mechanisms degrade gracefully as
  contention rises.

The gains stem from three factors:

1. Lock holders receive sustained CPU time to complete critical sections,
   reducing lock hold duration and cascading contention.

2. IPI receivers are scheduled promptly when senders spin, reducing IPI
   response latency and wasted spin cycles.

3. Reduced context switching between lock waiters and holders improves
   cache utilization.

Scope of the scheduler-side benefit
-----------------------------------

The lag credit takes effect only when the yielding vCPU and its target
share a runqueue, which happens when runnable vCPUs outnumber the pCPUs
they can run on:

- Under CPU overcommit - co-located guests, or a VM whose vCPUs are pooled
  onto fewer pCPUs than it has vCPUs - the waiter and the lock-holder or
  IPI-receiver land on the same rq, and the buddy hint applies.  The
  results here are from this regime, with guests co-located so their vCPUs
  contend for shared pCPUs.

- Without such contention - 1:1 vCPU:pCPU pinning, or a matched vCPU:pCPU
  count with no intra-VM overcommit - there is no eligible buddy to credit,
  so the path is inert and adds no overhead or regression.

Independent s390 testing (directed yield there uses the diag9c hypercall)
shows the same pattern: under intra-VM vCPU pooling the yield-to hypercall
rate falls by more than half with a few percent throughput gain, while 1:1
pinning and matched vCPU:pCPU configurations show no change either way.

Directed yield is a same-runqueue mechanism and cannot help a waiter whose
target is on a different rq; extending it to cross-runqueue cases is left
as future work.

Patch Organization
------------------

Patches 1-5: Scheduler EEVDF lag credit

  Patch 1: Add the eevdf_credit_entity_vlag() primitive and the
           YIELD_TO_LAG_CREDIT feature.  Handles the off-tree current
           entity and has no functional effect on its own.

  Patch 2: Credit to a persistent, queue-depth-scaled positive-vlag
           margin, clamped to entity_lag()'s legal bound.

  Patch 3: Extend the primitive to a queued (on-tree) entity via the
           canonical place_entity()-paired requeue.

  Patch 4: Wire the credit walk into yield_to_task_fair(), crediting each
           level of the nominated ancestor chain.

  Patch 5: Force a local reschedule (cancel RUN_TO_PARITY slice protection
           and resched_curr()) so the credited buddy can be selected.
           Activation patch; rate-limits only the forced preemption.

Patches 6-10: KVM IPI-aware directed yield

  Patch 6: Add per-vCPU IPI tracking infrastructure, module parameters and
           helper functions.  Candidate selection is unchanged.

  Patch 7: Track unicast fixed IPI delivery from both LAPIC paths.

  Patch 8: Clear IPI tracking on a matching-vector EOI.

  Patch 9: Implement IPI-aware directed-yield candidate selection with the
           priority order above.

  Patch 10: Add the relaxed preempted-only fallback as a safety net.

Testing
-------

Workloads tested:

- Dbench (filesystem metadata stress)
- PARSEC benchmarks (Dedup, VIPS)
- Kernel compilation (make -j16 in each VM)

No regressions were observed in any configuration.  The mechanisms show
neutral to positive impact across diverse workloads.

Rate-limit policy
-----------------

The scheduler-side forced reschedule is rate-limited to bound the cost of
frequent VM exits.  Under the kvm-full profile, PLE-heavy workloads such as
PARSEC VIPS and Dedup take many PAUSE-loop exits; each exit can drive a
yield_to(), and thus a potential forced preemption.  Forcing a reschedule
on every yield_to() would add needless preemption pressure and cache churn.

The series limits only the forced preemption path
(cancel_protect_slice() plus resched_curr()) to once per 6ms per rq.  The
lag credit itself remains unthrottled, so each directed yield refreshes the
buddy hint.  The fixed 6ms interval is intentionally conservative; an
adaptive limit based on the per-rq yield_to()/PLE-exit rate can be explored
separately.
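
A minimal sketch of such a fixed-interval limiter, assuming a monotonic
nanosecond clock and a hypothetical per-rq timestamp (struct toy_rq and
toy_may_force_resched() are illustrative names, not the fields or helpers
used by patch 5):

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_FORCED_RESCHED_INTERVAL_NS	(6ULL * 1000 * 1000)	/* 6ms */

/* Hypothetical per-runqueue limiter state. */
struct toy_rq {
	bool armed;			/* a forced resched has happened before */
	uint64_t last_forced_ns;	/* when it last happened */
};

/*
 * Returns true if a forced preemption is allowed now and records it.
 * Only the forced preemption is gated; the lag credit itself would be
 * applied by the caller regardless of the return value.
 */
static bool toy_may_force_resched(struct toy_rq *rq, uint64_t now_ns)
{
	if (rq->armed &&
	    now_ns - rq->last_forced_ns < TOY_FORCED_RESCHED_INTERVAL_NS)
		return false;

	rq->armed = true;
	rq->last_forced_ns = now_ns;
	return true;
}
```

Calls at 0ms and 6ms are allowed in this model, while calls at 3ms and 7ms
are refused because less than 6ms has passed since the last forced
preemption on that rq.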

Changelog:

v2 -> v3:
- Redesign the scheduler side.  v2 applied a bounded vruntime penalty to
  the yielding vCPU (a "debooster"); v3 instead credits bounded EEVDF lag
  to the nominated next-buddy so pick_eevdf()'s PICK_BUDDY branch returns
  it.  Crediting the target is EEVDF-native, composes cleanly with
  RUN_TO_PARITY, and avoids the fairness reasoning required when shifting
  the yielder's vruntime in a cgroup hierarchy.  The redesign also removes
  the bulk of the v2 machinery:
  * Drop the cgroup LCA finder, reverse-pair debouncing, the per-rq
    penalty tracking and the dedicated debugfs sysctl.  The mechanism is
    now gated by SCHED_FEAT(YIELD_TO_LAG_CREDIT).
  * Credit to a queue-depth-scaled positive-vlag margin clamped to
    entity_lag()'s legal bound, keeping the target eligible across several
    picks while preserving EEVDF fairness.
  * Handle the off-tree current entity (in-place shift) and a queued
    on-tree entity (canonical place_entity()-paired requeue) separately,
    so sum_w_vruntime stays consistent with entity_key().
  * Add an explicit forced local reschedule that cancels RUN_TO_PARITY
    slice protection so the credited buddy can be selected; only the
    forced preemption is rate limited (6ms/rq), the lag credit runs on
    every yield.
- KVM side keeps the v2 design; rebased and reorganized into five patches
  (infrastructure, track delivery, clear-on-EOI, candidate selection,
  relaxed fallback).  Tracking now hooks both the APIC-map fast path and
  the slow fallback, and the EOI clear is vector-matched.
- Rebase onto v7.1-rc7.

v1 -> v2:
- Rebase onto v6.19-rc1 (v1 was based on v6.18-rc4).
- Drop the "KVM: Fix last_boosted_vcpu index assignment bug" patch, as
  v6.19-rc1 already contains the fix.
- Scheduler side (the v2 vruntime debooster, since replaced in v3):
  * Apply the deboost before yield_task_fair() to adapt to v6.19's EEVDF
    forfeit behavior (se->vruntime = se->deadline), which would otherwise
    inflate the yielder's vruntime before the penalty was computed.
  * Use rq->donor instead of rq->curr for correct EEVDF donor tracking.
  * Use h_nr_queued instead of nr_queued for accurate hierarchical task
    counting in the penalty cap.
  * Drop the vlag assignment (recalculated on dequeue/enqueue) and the
    update_min_vruntime() call (the yielder is cfs_rq->curr, off-tree), and
    remove the unnecessary gran_floor safeguard.
  * Rename the debugfs knob to vcpu_debooster_enabled.
- KVM IPI tracking: improve module-parameter documentation and add the
  kvm_vcpu_is_ipi_receiver() declaration to x86.h.

Wanpeng Li (10):
  sched/fair: Add EEVDF lag credit primitive for nominated next-buddy
  sched/fair: Credit a persistent, queue-depth-scaled vlag margin
  sched/fair: Credit queued next-buddy via canonical requeue
  sched/fair: Credit nominated next-buddy in yield_to_task_fair()
  sched/fair: Force a local resched on yield_to() so the buddy is picked
  KVM: x86: Add IPI tracking infrastructure for directed yield
  KVM: x86/lapic: Track unicast fixed IPI delivery
  KVM: x86/lapic: Clear IPI tracking on matching-vector EOI
  KVM: Add IPI-aware directed-yield candidate selection
  KVM: Add relaxed preempted-only fallback for directed yield

 arch/x86/include/asm/kvm_host.h |  19 +++
 arch/x86/kvm/lapic.c            | 234 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   3 +
 arch/x86/kvm/x86.h              |   8 ++
 include/linux/kvm_host.h        |   8 ++
 kernel/sched/fair.c             | 224 +++++++++++++++++++++++++++++-
 kernel/sched/features.h         |   9 ++
 kernel/sched/sched.h            |  10 ++
 virt/kvm/kvm_main.c             |  95 +++++++++++--
 9 files changed, 594 insertions(+), 16 deletions(-)

base-commit: 4549871118cf616eecdd2d939f78e3b9e1dddc48
-- 
2.43.0


* [PATCH v3 01/10] sched/fair: Add EEVDF lag credit primitive for nominated next-buddy
  2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
@ 2026-06-12  1:33 ` Wanpeng Li
  2026-06-12  1:49   ` sashiko-bot
  2026-06-12  5:34   ` K Prateek Nayak
  2026-06-12  1:33 ` [PATCH v3 02/10] sched/fair: Credit a persistent, queue-depth-scaled vlag margin Wanpeng Li
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 22+ messages in thread
From: Wanpeng Li @ 2026-06-12  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Richie Buturla

From: Wanpeng Li <wanpengli@tencent.com>

pick_eevdf()'s PICK_BUDDY path only returns cfs_rq->next when the
entity is eligible. A yield_to() target that is behind avg_vruntime at
any level of its sched_entity hierarchy is skipped, and the
set_next_buddy() hint is lost.

Add eevdf_credit_entity_vlag(), which can credit a nominated entity up to
the eligibility boundary so that pick_eevdf() can honor the buddy hint.
The helper handles cfs_rq->curr, which is off-tree and can be shifted in
place while carrying any active vprot window.

Gate the helper behind SCHED_FEAT(YIELD_TO_LAG_CREDIT). The helper has no
caller in this change, so mark it __maybe_unused; there is no functional
change.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c     | 48 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  9 ++++++++
 2 files changed, 57 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ebec186f982..e7f5ea25fdae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9341,6 +9341,54 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
 	}
 }
 
+/*
+ * eevdf_credit_entity_vlag - credit a nominated next-buddy to eligibility
+ *
+ * Advance @se (already nominated by set_next_buddy(), so cfs_rq->next == se)
+ * just enough negative vlag to reach the eligibility boundary (vlag = 0) so
+ * pick_eevdf()'s PICK_BUDDY branch returns it. cfs_rq->curr is shifted in
+ * place (off-tree, carrying any vprot window). Queued entities are left
+ * unchanged.
+ *
+ * Idempotent: a no-op once @se is already eligible. Caller must hold
+ * rq_of(cfs_rq)->lock with rq_clock up to date.
+ */
+static void __maybe_unused
+eevdf_credit_entity_vlag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	u64 avruntime, credit;
+	s64 vlag;
+
+	/* Callers gate this helper with YIELD_TO_LAG_CREDIT. */
+	if (cfs_rq->nr_queued < 2)
+		return;
+	if (throttled_hierarchy(cfs_rq))
+		return;
+	if (WARN_ON_ONCE(!se->on_rq) || se->sched_delayed)
+		return;
+
+	update_curr(cfs_rq);
+	avruntime = avg_vruntime(cfs_rq);
+	vlag = entity_lag(cfs_rq, se, avruntime);
+
+	/* Already eligible: nothing to do. */
+	if (vlag >= 0)
+		return;
+
+	credit = (u64)(-vlag);
+
+	if (cfs_rq->curr == se) {
+		/* curr is off-tree: in-place shift, carrying any vprot window. */
+		if (protect_slice(se))
+			se->vprot -= credit;
+		se->vruntime -= credit;
+		se->deadline -= credit;
+		return;
+	}
+
+	/* Queued entities are left unchanged by this helper path. */
+}
+
 /*
  * sched_yield() is very simple
  */
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 84c4fe3abd74..65c511c9ca28 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -40,6 +40,15 @@ SCHED_FEAT(NEXT_BUDDY, false)
  */
 SCHED_FEAT(PICK_BUDDY, true)
 
+/*
+ * Let yield_to_task_fair() credit bounded EEVDF lag to the nominated
+ * next-buddy so pick_eevdf() honors the hint even when the target has
+ * negative vlag at some level of its ancestor chain. The credit is bounded
+ * by a queue-depth-scaled margin within entity_lag()'s legal range, so
+ * fairness is preserved.
+ */
+SCHED_FEAT(YIELD_TO_LAG_CREDIT, true)
+
 /*
  * Consider buddies to be cache hot, decreases the likeliness of a
  * cache buddy being migrated away, increases cache locality.
-- 
2.43.0



* [PATCH v3 02/10] sched/fair: Credit a persistent, queue-depth-scaled vlag margin
  2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
  2026-06-12  1:33 ` [PATCH v3 01/10] sched/fair: Add EEVDF lag credit primitive for nominated next-buddy Wanpeng Li
@ 2026-06-12  1:33 ` Wanpeng Li
  2026-06-12  1:53   ` sashiko-bot
  2026-06-12  6:07   ` K Prateek Nayak
  2026-06-12  1:33 ` [PATCH v3 03/10] sched/fair: Credit queued next-buddy via canonical requeue Wanpeng Li
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 22+ messages in thread
From: Wanpeng Li @ 2026-06-12  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Richie Buturla

From: Wanpeng Li <wanpengli@tencent.com>

Crediting only up to vlag = 0 makes the buddy eligible for at most a
single pick_eevdf() pass: the next update_curr() can push it below zero
again before PICK_BUDDY consumes the hint.

Credit to a bounded positive-vlag margin instead, so the buddy stays
eligible across several scheduling decisions. Scale the margin with
runqueue depth, because deeper runqueues dilute eligibility faster, and
clamp it to entity_lag()'s legal positive-lag bound to preserve EEVDF
fairness.

The helper is not called in this change; there is no functional change.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c | 58 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 47 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7f5ea25fdae..c6502db62cd3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9342,21 +9342,45 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
 }
 
 /*
- * eevdf_credit_entity_vlag - credit a nominated next-buddy to eligibility
+ * Positive-vlag target margin for a credited buddy, scaled by runqueue
+ * depth so it stays eligible across several picks. The caller clamps it to
+ * entity_lag()'s legal bound, so EEVDF fairness is preserved.
+ */
+static u64 __maybe_unused
+eevdf_persistent_margin(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	u64 base = sysctl_sched_base_slice;
+	unsigned int n = cfs_rq->h_nr_queued;
+	u64 raw;
+
+	if (n <= 4)
+		raw = base * 4;
+	else if (n <= 8)
+		raw = base * 6;
+	else if (n <= 16)
+		raw = base * 8;
+	else
+		raw = base * 12;
+
+	return calc_delta_fair(raw, se);
+}
+
+/*
+ * eevdf_credit_entity_vlag - credit bounded vlag to a nominated next-buddy
  *
  * Advance @se (already nominated by set_next_buddy(), so cfs_rq->next == se)
- * just enough negative vlag to reach the eligibility boundary (vlag = 0) so
- * pick_eevdf()'s PICK_BUDDY branch returns it. cfs_rq->curr is shifted in
- * place (off-tree, carrying any vprot window). Queued entities are left
- * unchanged.
+ * to a bounded positive-vlag margin so pick_eevdf()'s PICK_BUDDY branch
+ * keeps returning it across several picks, without exceeding entity_lag()'s
+ * legal bound. cfs_rq->curr is shifted in place (off-tree, carrying any
+ * vprot window). Queued entities are left unchanged.
  *
- * Idempotent: a no-op once @se is already eligible. Caller must hold
+ * Idempotent once @se holds the margin. Caller must hold
  * rq_of(cfs_rq)->lock with rq_clock up to date.
  */
 static void __maybe_unused
 eevdf_credit_entity_vlag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	u64 avruntime, credit;
+	u64 avruntime, credit, want, margin, max_slice, lag_limit;
 	s64 vlag;
 
 	/* Callers gate this helper with YIELD_TO_LAG_CREDIT. */
@@ -9371,11 +9395,23 @@ eevdf_credit_entity_vlag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	avruntime = avg_vruntime(cfs_rq);
 	vlag = entity_lag(cfs_rq, se, avruntime);
 
-	/* Already eligible: nothing to do. */
-	if (vlag >= 0)
-		return;
+	/* Clamp the margin to entity_lag()'s bound so place_entity() keeps it. */
+	max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;
+	lag_limit = calc_delta_fair(max_slice, se);
+	margin = eevdf_persistent_margin(cfs_rq, se);
+	if (lag_limit && margin > lag_limit)
+		margin = lag_limit;
+	if (vlag >= 0) {
+		if ((u64)vlag >= margin)
+			return;
+		want = margin - (u64)vlag;
+	} else {
+		want = margin + (u64)(-vlag);
+	}
 
-	credit = (u64)(-vlag);
+	credit = want;
+	if (!credit)
+		return;
 
 	if (cfs_rq->curr == se) {
 		/* curr is off-tree: in-place shift, carrying any vprot window. */
-- 
2.43.0



* [PATCH v3 03/10] sched/fair: Credit queued next-buddy via canonical requeue
  2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
  2026-06-12  1:33 ` [PATCH v3 01/10] sched/fair: Add EEVDF lag credit primitive for nominated next-buddy Wanpeng Li
  2026-06-12  1:33 ` [PATCH v3 02/10] sched/fair: Credit a persistent, queue-depth-scaled vlag margin Wanpeng Li
@ 2026-06-12  1:33 ` Wanpeng Li
  2026-06-12  1:55   ` sashiko-bot
  2026-06-12  1:33 ` [PATCH v3 04/10] sched/fair: Credit nominated next-buddy in yield_to_task_fair() Wanpeng Li
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Wanpeng Li @ 2026-06-12  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Richie Buturla

From: Wanpeng Li <wanpengli@tencent.com>

eevdf_credit_entity_vlag() handles cfs_rq->curr, which is off-tree. Extend
it to a queued entity, which is the common case when yield_to() targets a
task that is runnable but not currently executing.

A queued entity's key contributes to cfs_rq->sum_w_vruntime, so its
vruntime cannot be shifted in place. Use the canonical place_entity()-
paired requeue pattern: snapshot vlag, make the deadline relative to
se->vruntime, dequeue the entity, apply the credit via place_entity(), and
enqueue it again. This keeps sum_w_vruntime consistent with entity_key(),
preserves the slice, and shifts the deadline by the same amount as the
off-tree current-entity path.

The helper is not called in this change; there is no functional change.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6502db62cd3..c28682fedb36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9372,7 +9372,9 @@ eevdf_persistent_margin(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * to a bounded positive-vlag margin so pick_eevdf()'s PICK_BUDDY branch
  * keeps returning it across several picks, without exceeding entity_lag()'s
  * legal bound. cfs_rq->curr is shifted in place (off-tree, carrying any
- * vprot window). Queued entities are left unchanged.
+ * vprot window); a queued entity uses the canonical
+ * place_entity()-paired requeue, keeping sum_w_vruntime consistent with
+ * entity_key().
  *
  * Idempotent once @se holds the margin. Caller must hold
  * rq_of(cfs_rq)->lock with rq_clock up to date.
@@ -9422,7 +9424,22 @@ eevdf_credit_entity_vlag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		return;
 	}
 
-	/* Queued entities are left unchanged by this helper path. */
+	/*
+	 * Canonical place_entity()-paired requeue: see dequeue_entity() and
+	 * requeue_delayed_entity(). place_entity() restores the deadline via
+	 * "se->deadline += se->vruntime", so make the deadline relative to
+	 * se->vruntime here (not avg_vruntime). This preserves the slice and
+	 * shifts the deadline by exactly -credit, mirroring the curr branch
+	 * and keeping deadline > vruntime.
+	 */
+	se->vlag = vlag + (s64)credit;
+	se->deadline -= se->vruntime;
+	se->rel_deadline = 1;
+	cfs_rq->nr_queued--;
+	__dequeue_entity(cfs_rq, se);
+	place_entity(cfs_rq, se, 0);
+	__enqueue_entity(cfs_rq, se);
+	cfs_rq->nr_queued++;
 }
 
 /*
-- 
2.43.0



* [PATCH v3 04/10] sched/fair: Credit nominated next-buddy in yield_to_task_fair()
  2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
                   ` (2 preceding siblings ...)
  2026-06-12  1:33 ` [PATCH v3 03/10] sched/fair: Credit queued next-buddy via canonical requeue Wanpeng Li
@ 2026-06-12  1:33 ` Wanpeng Li
  2026-06-12  1:54   ` sashiko-bot
  2026-06-12  1:33 ` [PATCH v3 05/10] sched/fair: Force a local resched on yield_to() so the buddy is picked Wanpeng Li
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Wanpeng Li @ 2026-06-12  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Richie Buturla

From: Wanpeng Li <wanpengli@tencent.com>

After set_next_buddy() nominates the yield_to() target at every level of
its sched_entity hierarchy, walk that same hierarchy and credit bounded
EEVDF lag to each not-yet-eligible entity. This allows pick_eevdf()'s
PICK_BUDDY path to select the nominated target instead of dropping the
hint at the first ineligible group entity.

Gate the walk with YIELD_TO_LAG_CREDIT. With the feature disabled,
yield_to_task_fair() keeps the existing forfeit-based behavior.
yield_to() holds both rq locks via double_rq_lock(), so touching the
target task's cfs_rqs, including remote cfs_rqs, is safe. Stop the walk
where set_next_buddy() stopped, and skip delayed or throttled entities.

Refresh the target rq clock when it differs from the local rq so the
per-level update_curr() calls observe current rq_clock values. The local
rq still uses the existing yield_task_fair() path in this change.
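
The stop conditions of the walk can be modelled in userspace. The sketch
below is illustrative only: struct toy_se, struct toy_cfs_rq and
toy_credit_walk() are hypothetical names, the "credited" flag stands in
for the call to eevdf_credit_entity_vlag(), and the "throttled" flag
stands in for throttled_hierarchy(). No locking or clock handling is
modelled.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_cfs_rq;

/* Hypothetical stand-in for struct sched_entity. */
struct toy_se {
	struct toy_se *parent;		/* next level up, NULL at the root */
	struct toy_cfs_rq *cfs_rq;	/* queue this entity is queued on */
	bool sched_delayed;
	bool credited;			/* set where the kernel would credit lag */
};

/* Hypothetical stand-in for struct cfs_rq. */
struct toy_cfs_rq {
	struct toy_se *next;		/* nominated next-buddy, if any */
	bool throttled;			/* models throttled_hierarchy() */
};

/*
 * Walk from the leaf upwards and credit every level that is still the
 * nominated buddy. Stop at the first level that was not nominated, is
 * delayed, or sits in a throttled hierarchy. Returns the number of
 * levels credited.
 */
int toy_credit_walk(struct toy_se *se)
{
	int credited = 0;

	for (; se; se = se->parent) {
		struct toy_cfs_rq *cfs_rq = se->cfs_rq;

		if (cfs_rq->next != se)
			break;
		if (se->sched_delayed)
			break;
		if (cfs_rq->throttled)
			break;

		se->credited = true;
		credited++;
	}

	return credited;
}
```

In this model a three-level chain whose top level was never nominated is
credited at the two lower levels only, which mirrors "stop the walk
where set_next_buddy() stopped".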

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c28682fedb36..48f65a4f1923 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9346,8 +9346,8 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
  * depth so it stays eligible across several picks. The caller clamps it to
  * entity_lag()'s legal bound, so EEVDF fairness is preserved.
  */
-static u64 __maybe_unused
-eevdf_persistent_margin(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static u64 eevdf_persistent_margin(struct cfs_rq *cfs_rq,
+				   struct sched_entity *se)
 {
 	u64 base = sysctl_sched_base_slice;
 	unsigned int n = cfs_rq->h_nr_queued;
@@ -9379,7 +9379,7 @@ eevdf_persistent_margin(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Idempotent once @se holds the margin. Caller must hold
  * rq_of(cfs_rq)->lock with rq_clock up to date.
  */
-static void __maybe_unused
+static void
 eevdf_credit_entity_vlag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	u64 avruntime, credit, want, margin, max_slice, lag_limit;
@@ -9488,6 +9488,7 @@ static void yield_task_fair(struct rq *rq)
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
+	struct rq *p_rq = task_rq(p);
 
 	/* !se->on_rq also covers throttled task */
 	if (!se->on_rq)
@@ -9496,6 +9497,43 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 	/* Tell the scheduler that we'd really like se to run next. */
 	set_next_buddy(se);
 
+	/* Without lag credit, keep the existing forfeit-based yield. */
+	if (!sched_feat(YIELD_TO_LAG_CREDIT)) {
+		yield_task_fair(rq);
+		return true;
+	}
+
+	/*
+	 * Walk the ancestor chain set_next_buddy() just nominated and credit
+	 * bounded lag to each not-yet-eligible level so pick_eevdf() returns
+	 * it. yield_to() holds both rq locks via double_rq_lock(), so touching
+	 * p's cfs_rqs (possibly on another CPU) is safe; the primitive is
+	 * idempotent, so no rate limiting is needed.
+	 *
+	 * Only refresh p_rq's clock when it differs from the local rq. A
+	 * remote p_rq must be refreshed so the per-level update_curr() is
+	 * accurate. In the same-rq case we skip it: the credit is a
+	 * best-effort hint and the rq clock is recent enough, while the
+	 * trailing yield_task_fair() would otherwise make this a second
+	 * update_rq_clock() on the same rq and trip
+	 * SCHED_WARN_ON(WARN_DOUBLE_CLOCK).
+	 */
+	if (rq != p_rq)
+		update_rq_clock(p_rq);
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		if (cfs_rq->next != se)
+			break;
+		if (se->sched_delayed)
+			break;
+		if (throttled_hierarchy(cfs_rq))
+			break;
+
+		eevdf_credit_entity_vlag(cfs_rq, se);
+	}
+
 	yield_task_fair(rq);
 
 	return true;
-- 
2.43.0



* [PATCH v3 05/10] sched/fair: Force a local resched on yield_to() so the buddy is picked
  2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
                   ` (3 preceding siblings ...)
  2026-06-12  1:33 ` [PATCH v3 04/10] sched/fair: Credit nominated next-buddy in yield_to_task_fair() Wanpeng Li
@ 2026-06-12  1:33 ` Wanpeng Li
  2026-06-12  1:50   ` sashiko-bot
  2026-06-12  1:33 ` [PATCH v3 06/10] KVM: x86: Add IPI tracking infrastructure for directed yield Wanpeng Li
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Wanpeng Li @ 2026-06-12  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Richie Buturla

From: Wanpeng Li <wanpengli@tencent.com>

Lag credit makes the target eligible for PICK_BUDDY, but yield_to() does
not by itself force the caller off the CPU. An active RUN_TO_PARITY
protect_slice() on the local yielder can therefore keep pick_eevdf()
returning the yielder instead of the credited buddy.

Add yield_to_local_force_resched() for the lag-credit path. It applies
the existing leaf forfeit, cancels slice protection along the yielder's
sched_entity hierarchy, and calls resched_curr() on the local rq.
cancel_protect_slice() is already used by PREEMPT_WAKEUP_SHORT and does
not modify vruntime.

Rate-limit only the forced preemption (cancel_protect_slice() plus
resched_curr()) to once per 6ms per rq. The lag credit itself remains
unthrottled so each directed yield refreshes the scheduling hint, while
compute-bound guests avoid excessive forced preemption on PLE-heavy spin
loops.
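
The rate limit can be illustrated with a small userspace model. This is
a sketch under stated assumptions, not the kernel code: struct toy_rq
and toy_force_resched_rate_limit() are hypothetical names, and clock_ns
stands in for rq_clock(rq). As in the patch, a stored value of 0 means
"never forced", and the comparison is inclusive, so a yield arriving
exactly 6ms after the last forced reschedule is still skipped.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC			1000000ULL
#define FORCE_RESCHED_INTERVAL_NS	(6 * NSEC_PER_MSEC)

/* Hypothetical stand-in for the per-rq state; not the kernel's struct rq. */
struct toy_rq {
	uint64_t clock_ns;			/* models rq_clock(rq) */
	uint64_t yield_to_force_resched_last_ns; /* 0 means "never forced" */
};

/*
 * Returns true when the forced reschedule should be skipped because one
 * already happened within the last 6ms; otherwise records the current
 * time and returns false.
 */
bool toy_force_resched_rate_limit(struct toy_rq *rq)
{
	uint64_t now = rq->clock_ns;
	uint64_t last = rq->yield_to_force_resched_last_ns;

	if (last && (now - last) <= FORCE_RESCHED_INTERVAL_NS)
		return true;

	rq->yield_to_force_resched_last_ns = now;
	return false;
}
```

Only the forced preemption consults this helper; in the series the lag
credit runs before it on every directed yield.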

Dbench (filesystem metadata operations) on 16-vCPU guests under host CPU
overcommit shows the following throughput improvement from the scheduler
side alone:

  2 VMs:  +6.65%
  3 VMs:  +4.80%
  4 VMs:  +7.59%

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c  | 113 +++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  10 ++++
 2 files changed, 108 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48f65a4f1923..e9c5265cf0fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9485,6 +9485,83 @@ static void yield_task_fair(struct rq *rq)
 	}
 }
 
+/*
+ * Rate-limit the forced local reschedule on the yield_to() lag-credit path
+ * to at most once per 6ms per rq.
+ *
+ * Lag credit is intentionally not rate-limited: a contended lock holder
+ * should be credited on every directed yield to keep the scheduling hint
+ * effective. Only the forced preemption needs bounding, as cancelling
+ * RUN_TO_PARITY protection and calling resched_curr() on every PLE-driven
+ * yield_to() can cause excessive preemption on compute-bound guests.
+ *
+ * Returns true if the caller should skip forcing a reschedule because a
+ * recent one already happened on this rq; the credit just applied still
+ * persists, so the buddy can be selected at the next scheduling point.
+ *
+ * Called under rq->lock with rq_clock up to date. yield_to_task_fair()
+ * updates the clock before walking the hierarchy because yield_to() takes
+ * the rq locks without updating the rq clocks.
+ */
+static bool yield_to_force_resched_rate_limit(struct rq *rq)
+{
+	u64 now = rq_clock(rq);
+	u64 last = rq->yield_to_force_resched_last_ns;
+
+	if (last && (now - last) <= 6 * NSEC_PER_MSEC)
+		return true;
+
+	rq->yield_to_force_resched_last_ns = now;
+	return false;
+}
+
+/*
+ * Forfeit the local yielder, cancel its RUN_TO_PARITY slice protection
+ * along the whole sched_entity chain, and force a reschedule.
+ *
+ * yield_to() does not reschedule the caller, and an active protect_slice()
+ * at any level can keep pick_eevdf() returning the yielder instead of the
+ * credited buddy. cancel_protect_slice() is EEVDF-native (also used by
+ * PREEMPT_WAKEUP_SHORT) and does not touch vruntime. Caller holds the
+ * local rq lock via yield_to()'s double_rq_lock().
+ *
+ * Only the forced preemption here is rate-limited (to once per 6ms per rq);
+ * the lag credit applied by the caller runs on every yield_to(). When
+ * throttled, the credited buddy can still be selected at the next natural
+ * scheduling point without tearing down slice protection and forcing an
+ * immediate switch.
+ */
+static void yield_to_local_force_resched(struct rq *rq)
+{
+	struct sched_entity *yse = &rq->donor->se;
+
+	yield_task_fair(rq);
+
+	/*
+	 * If the yielder is the only runnable task on this rq there is nothing
+	 * for resched_curr() to switch to: any credited buddy is on a remote rq
+	 * in this cross-rq case, where yield_to() already issued resched_curr()
+	 * on the target's rq. Skip the forced reschedule: it would be a no-op
+	 * and an unnecessary preemption of an unrelated local task.
+	 * yield_task_fair() also returns early here without updating rq_clock.
+	 */
+	if (rq->nr_running <= 1)
+		return;
+
+	/*
+	 * Rate-limit the forced preemption (cancel_protect_slice + resched_curr)
+	 * to once per 6ms per rq. rq's clock was refreshed by the caller before
+	 * the credit walk, so rq_clock(rq) read here is current.
+	 */
+	if (yield_to_force_resched_rate_limit(rq))
+		return;
+
+	for_each_sched_entity(yse)
+		cancel_protect_slice(yse);
+
+	resched_curr(rq);
+}
+
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
@@ -9504,21 +9581,22 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 	}
 
 	/*
-	 * Walk the ancestor chain set_next_buddy() just nominated and credit
-	 * bounded lag to each not-yet-eligible level so pick_eevdf() returns
-	 * it. yield_to() holds both rq locks via double_rq_lock(), so touching
-	 * p's cfs_rqs (possibly on another CPU) is safe; the primitive is
-	 * idempotent, so no rate limiting is needed.
+	 * Walk the ancestor chain nominated by set_next_buddy() and credit
+	 * bounded lag to each not-yet-eligible level, so pick_eevdf() can
+	 * honor the buddy hint. Lag credit runs on every directed yield; only
+	 * the forced preemption in yield_to_local_force_resched() is
+	 * rate-limited. yield_to() holds both rq locks via double_rq_lock(),
+	 * so touching p's cfs_rqs (possibly on another CPU) is safe.
 	 *
-	 * Only refresh p_rq's clock when it differs from the local rq. A
-	 * remote p_rq must be refreshed so the per-level update_curr() is
-	 * accurate. In the same-rq case we skip it: the credit is a
-	 * best-effort hint and the rq clock is recent enough, while the
-	 * trailing yield_task_fair() would otherwise make this a second
-	 * update_rq_clock() on the same rq and trip
-	 * SCHED_WARN_ON(WARN_DOUBLE_CLOCK).
-	 */
-	if (rq != p_rq)
+	 * Refresh the local rq clock first: yield_to() took the locks without
+	 * updating any clock and the per-level update_curr() below reads
+	 * rq_clock; assert_clock_updated() (default-on, no sched_feat gate)
+	 * fires otherwise. For a remote p_rq refresh it too; in the same-rq
+	 * case the refresh above already covers it (a redundant update is only
+	 * warned about under the default-off WARN_DOUBLE_CLOCK).
+	 */
+	update_rq_clock(rq);
+	if (p_rq != rq)
 		update_rq_clock(p_rq);
 
 	for_each_sched_entity(se) {
@@ -9534,7 +9612,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 		eevdf_credit_entity_vlag(cfs_rq, se);
 	}
 
-	yield_task_fair(rq);
+	/*
+	 * Force the local CPU to reschedule so the credited buddy can be
+	 * selected instead of the protected yielder;
+	 * yield_to_local_force_resched() also does the leaf forfeit.
+	 */
+	yield_to_local_force_resched(rq);
 
 	return true;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d..690a2ab99beb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1316,6 +1316,16 @@ struct rq {
 	unsigned int		ttwu_local;
 #endif
 
+	/*
+	 * Last rq_clock at which the yield_to() lag-credit path forced a local
+	 * reschedule on this rq. Used to rate-limit only the forced preemption
+	 * (cancel_protect_slice + resched_curr) to at most once per 6ms per rq,
+	 * preventing excessive forced preemption on PLE-heavy guests. The lag
+	 * credit itself is not rate-limited. Functional state, not a statistic,
+	 * so kept outside CONFIG_SCHEDSTATS.
+	 */
+	u64			yield_to_force_resched_last_ns;
+
 #ifdef CONFIG_CPU_IDLE
 	/* Must be inspected within a RCU lock section */
 	struct cpuidle_state	*idle_state;
-- 
2.43.0



* [PATCH v3 06/10] KVM: x86: Add IPI tracking infrastructure for directed yield
  2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
                   ` (4 preceding siblings ...)
  2026-06-12  1:33 ` [PATCH v3 05/10] sched/fair: Force a local resched on yield_to() so the buddy is picked Wanpeng Li
@ 2026-06-12  1:33 ` Wanpeng Li
  2026-06-12  1:33 ` [PATCH v3 07/10] KVM: x86/lapic: Track unicast fixed IPI delivery Wanpeng Li
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Wanpeng Li @ 2026-06-12  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Richie Buturla

From: Wanpeng Li <wanpengli@tencent.com>

On overcommitted hosts, a vCPU spinning on an IPI response is difficult
to distinguish from a vCPU spinning on a lock. kvm_vcpu_on_spin() can
therefore yield to an unrelated vCPU based only on coarse preemption
state.

Add per-vCPU IPI tracking for directed yield. struct kvm_vcpu_arch now
records the last sender and receiver vCPU indexes, the vector, a pending
flag, and a monotonic timestamp. Add helpers to record a send, query
whether a vCPU is the recent IPI receiver of another vCPU, and clear or
reset the context. Accesses use READ_ONCE() and WRITE_ONCE() because the
state is only a best-effort scheduling hint.
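
The record and query helpers can be sketched in userspace as follows.
This is an illustrative model only: struct toy_vcpu, toy_track_ipi() and
toy_is_ipi_receiver() are hypothetical names, the timestamp is passed in
explicitly instead of being read from ktime_get_mono_fast_ns(), the
window is a parameter rather than a module parameter, and plain loads
and stores replace READ_ONCE()/WRITE_ONCE().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the ipi_context fields added by this patch. */
struct toy_ipi_context {
	int last_ipi_sender;
	int last_ipi_receiver;
	uint8_t vector;
	bool pending_ipi;
	uint64_t ipi_time_ns;
};

struct toy_vcpu {
	int vcpu_idx;
	struct toy_ipi_context ipi;
};

/* Record a sender -> receiver relationship; self-IPIs are ignored. */
void toy_track_ipi(struct toy_vcpu *sender, struct toy_vcpu *receiver,
		   uint8_t vector, uint64_t now_ns)
{
	if (!sender || !receiver || sender == receiver)
		return;

	sender->ipi.last_ipi_receiver = receiver->vcpu_idx;
	sender->ipi.vector = vector;
	sender->ipi.pending_ipi = true;
	sender->ipi.ipi_time_ns = now_ns;

	receiver->ipi.last_ipi_sender = sender->vcpu_idx;
	receiver->ipi.vector = vector;
}

/*
 * True if @receiver is the pending IPI target of @sender and the send
 * happened no more than @window_ns before @now_ns.
 */
bool toy_is_ipi_receiver(const struct toy_vcpu *sender,
			 const struct toy_vcpu *receiver,
			 uint64_t now_ns, uint64_t window_ns)
{
	if (!sender->ipi.pending_ipi)
		return false;
	if (sender->ipi.last_ipi_receiver != receiver->vcpu_idx)
		return false;

	return now_ns - sender->ipi.ipi_time_ns <= window_ns;
}
```

Note that the relationship is directional in this model: after vCPU 0
sends to vCPU 1, vCPU 1 is a receiver of vCPU 0, but not the other way
round, and the match expires once the window has elapsed.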

Add module parameters to enable tracking and to control the recency
window. Provide a weak generic kvm_vcpu_is_ipi_receiver() stub so
non-x86 builds keep the existing behavior. The state is reset on vCPU
create and destroy, and cleared on INIT.

This adds only state and helpers; directed-yield candidate selection is
unchanged.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/include/asm/kvm_host.h |  19 ++++++
 arch/x86/kvm/lapic.c            | 102 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |   3 +
 arch/x86/kvm/x86.h              |   8 +++
 include/linux/kvm_host.h        |   8 +++
 virt/kvm/kvm_main.c             |   6 ++
 6 files changed, 146 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f14009f25a3b..a26623716a53 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1065,6 +1065,25 @@ struct kvm_vcpu_arch {
 	int pending_external_vector;
 	int highest_stale_pending_ioapic_eoi;
 
+	/*
+	 * IPI tracking for directed-yield optimization.
+	 *
+	 * Populated by kvm_track_ipi_communication() when a unicast fixed
+	 * IPI is delivered, and queried by kvm_vcpu_is_ipi_receiver() from
+	 * kvm_vcpu_on_spin() to prefer the confirmed IPI target before
+	 * generic preempted-lock-holder heuristics.
+	 *
+	 * All accesses are lockless READ_ONCE/WRITE_ONCE; best-effort by
+	 * design (see comment on kvm_vcpu_is_good_yield_candidate()).
+	 */
+	struct {
+		int	last_ipi_sender;	/* vCPU idx of last IPI sender  */
+		int	last_ipi_receiver;	/* vCPU idx of last IPI target  */
+		u8	vector;			/* vector of the pending IPI    */
+		bool	pending_ipi;		/* awaiting IPI response        */
+		u64	ipi_time_ns;		/* mono timestamp of IPI send   */
+	} ipi_context;
+
 	/* be preempted when it's in kernel-mode(cpl=0) */
 	bool preempted_in_kernel;
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 4078e624ca66..515409e0e22c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -78,6 +78,29 @@ module_param(lapic_timer_advance, bool, 0444);
 static bool __read_mostly vector_hashing_enabled = true;
 module_param_named(vector_hashing, vector_hashing_enabled, bool, 0444);
 
+/*
+ * IPI tracking for directed-yield optimization.
+ *
+ * ipi_tracking_enabled  - master switch (default on). When off, the
+ *                         tracking hooks become no-ops and
+ *                         kvm_vcpu_is_ipi_receiver() always returns
+ *                         false, falling back to the legacy
+ *                         preempted-in-kernel heuristic.
+ *
+ * ipi_window_ns         - recency window. An IPI older than this is
+ *                         treated as stale and does not influence
+ *                         directed-yield selection. Long enough to
+ *                         cover typical spin-on-IPI-response periods,
+ *                         short enough to avoid stale state inflating
+ *                         boost priority on throughput-sensitive
+ *                         workloads.
+ */
+static bool ipi_tracking_enabled = true;
+module_param(ipi_tracking_enabled, bool, 0644);
+
+static unsigned long ipi_window_ns = 50 * NSEC_PER_MSEC;
+module_param(ipi_window_ns, ulong, 0644);
+
 static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data);
 static int kvm_lapic_msr_write(struct kvm_lapic *apic, u32 reg, u64 data);
 
@@ -1144,6 +1167,85 @@ static int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
 }
 
+/*
+ * Record a sender -> receiver IPI relationship for directed-yield use.
+ *
+ * Accessed lockless (READ_ONCE/WRITE_ONCE); this is best-effort, racy
+ * information consumed only as a scheduling hint by
+ * kvm_vcpu_on_spin(), so occasional torn or stale reads are harmless.
+ *
+ * Callers should already have filtered out self-IPIs and non-unicast
+ * or non-fixed-mode deliveries; this function only records the state.
+ */
+void kvm_track_ipi_communication(struct kvm_vcpu *sender,
+				 struct kvm_vcpu *receiver, u8 vector)
+{
+	if (!sender || !receiver || sender == receiver)
+		return;
+	if (unlikely(!READ_ONCE(ipi_tracking_enabled)))
+		return;
+
+	WRITE_ONCE(sender->arch.ipi_context.last_ipi_receiver,
+		   receiver->vcpu_idx);
+	WRITE_ONCE(sender->arch.ipi_context.vector, vector);
+	WRITE_ONCE(sender->arch.ipi_context.pending_ipi, true);
+	WRITE_ONCE(sender->arch.ipi_context.ipi_time_ns,
+		   ktime_get_mono_fast_ns());
+
+	WRITE_ONCE(receiver->arch.ipi_context.last_ipi_sender,
+		   sender->vcpu_idx);
+	WRITE_ONCE(receiver->arch.ipi_context.vector, vector);
+}
+
+/*
+ * Return true if @receiver is the confirmed recent IPI target of
+ * @sender, within the configured recency window. Directed yield uses
+ * this as a high-confidence signal that selecting @receiver may
+ * unblock @sender's spin loop.
+ */
+bool kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender,
+			      struct kvm_vcpu *receiver)
+{
+	u64 then, now;
+
+	if (unlikely(!READ_ONCE(ipi_tracking_enabled)))
+		return false;
+
+	if (!READ_ONCE(sender->arch.ipi_context.pending_ipi))
+		return false;
+
+	if (READ_ONCE(sender->arch.ipi_context.last_ipi_receiver) !=
+	    receiver->vcpu_idx)
+		return false;
+
+	then = READ_ONCE(sender->arch.ipi_context.ipi_time_ns);
+	now = ktime_get_mono_fast_ns();
+	return now - then <= READ_ONCE(ipi_window_ns);
+}
+
+/*
+ * Clear the IPI tracking state of a single vCPU, typically when the
+ * associated interrupt has been acknowledged (EOI) or the vCPU has
+ * been reset/destroyed.
+ *
+ * Leaves the monotonic timestamp untouched to keep staleness checks
+ * on other vCPUs that may reference this one well-defined; use
+ * kvm_vcpu_reset_ipi_context() for a hard reset.
+ */
+void kvm_vcpu_clear_ipi_context(struct kvm_vcpu *vcpu)
+{
+	WRITE_ONCE(vcpu->arch.ipi_context.pending_ipi, false);
+	WRITE_ONCE(vcpu->arch.ipi_context.last_ipi_sender, -1);
+	WRITE_ONCE(vcpu->arch.ipi_context.last_ipi_receiver, -1);
+	WRITE_ONCE(vcpu->arch.ipi_context.vector, 0);
+}
+
+void kvm_vcpu_reset_ipi_context(struct kvm_vcpu *vcpu)
+{
+	kvm_vcpu_clear_ipi_context(vcpu);
+	WRITE_ONCE(vcpu->arch.ipi_context.ipi_time_ns, 0);
+}
+
 /* Return true if the interrupt can be handled by using *bitmap as index mask
  * for valid destinations in *dst array.
  * Return false if kvm_apic_map_get_dest_lapic did nothing useful.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0550359ed798..dcedd09bac10 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12907,6 +12907,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 		goto free_guest_fpu;
 
 	kvm_xen_init_vcpu(vcpu);
+	kvm_vcpu_reset_ipi_context(vcpu);
 	vcpu_load(vcpu);
 	kvm_vcpu_after_set_cpuid(vcpu);
 	kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.default_tsc_khz);
@@ -12974,6 +12975,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 	kvm_mmu_destroy(vcpu);
 	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 	free_page((unsigned long)vcpu->arch.pio_data);
+	kvm_vcpu_reset_ipi_context(vcpu);
 	kvfree(vcpu->arch.cpuid_entries);
 }
 
@@ -13050,6 +13052,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 		kvm_leave_nested(vcpu);
 
 	kvm_lapic_reset(vcpu, init_event);
+	kvm_vcpu_clear_ipi_context(vcpu);
 
 	WARN_ON_ONCE(is_guest_mode(vcpu) || is_smm(vcpu));
 	vcpu->arch.hflags = 0;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 38a905fa86de..eb7f50018f78 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -475,6 +475,14 @@ int handle_ud(struct kvm_vcpu *vcpu);
 void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
 				   struct kvm_queued_exception *ex);
 
+/* IPI tracking helpers for directed-yield optimization (see lapic.c). */
+void kvm_track_ipi_communication(struct kvm_vcpu *sender,
+				 struct kvm_vcpu *receiver, u8 vector);
+bool kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender,
+			      struct kvm_vcpu *receiver);
+void kvm_vcpu_clear_ipi_context(struct kvm_vcpu *vcpu);
+void kvm_vcpu_reset_ipi_context(struct kvm_vcpu *vcpu);
+
 int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
 void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4c14aee1fb06..e54e72ae5ebb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1549,6 +1549,14 @@ static inline void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 int kvm_vcpu_yield_to(struct kvm_vcpu *target);
 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool yield_to_kernel_mode);
 
+/*
+ * IPI-aware directed-yield hook. Architectures that support IPI
+ * tracking (currently x86 via arch/x86/kvm/lapic.c) override this;
+ * the generic __weak stub in virt/kvm/kvm_main.c returns false.
+ */
+bool kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender,
+			      struct kvm_vcpu *receiver);
+
 void kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages);
 void kvm_flush_remote_tlbs_memslot(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 881f92d7a469..2e11c6cfc167 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3957,6 +3957,12 @@ bool __weak kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu)
 	return false;
 }
 
+bool __weak kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender,
+				     struct kvm_vcpu *receiver)
+{
+	return false;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
 	int nr_vcpus, start, i, idx, yielded;
-- 
2.43.0



* [PATCH v3 07/10] KVM: x86/lapic: Track unicast fixed IPI delivery
  2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
                   ` (5 preceding siblings ...)
  2026-06-12  1:33 ` [PATCH v3 06/10] KVM: x86: Add IPI tracking infrastructure for directed yield Wanpeng Li
@ 2026-06-12  1:33 ` Wanpeng Li
  2026-06-12  1:33 ` [PATCH v3 08/10] KVM: x86/lapic: Clear IPI tracking on matching-vector EOI Wanpeng Li
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Wanpeng Li @ 2026-06-12  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Richie Buturla

From: Wanpeng Li <wanpengli@tencent.com>

Record IPI sender/receiver relationships from both LAPIC delivery paths:
the APIC-map fast path and the slow fallback.

Only record a directed sender->receiver hint when the IPI originates from
a vCPU, uses APIC_DM_FIXED delivery, has no destination shorthand, is
accepted by exactly one destination vCPU, and is not a self-IPI. The
tracking helper itself returns early when tracking is disabled.
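
The filter above amounts to a single predicate. The sketch below is a
userspace illustration, not the KVM code: the toy_* enums and structs
and toy_should_track_ipi() are hypothetical stand-ins for
APIC_DM_FIXED, APIC_DEST_NOSHORT, struct kvm_lapic_irq and the vCPU
pointers used in the two delivery paths.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the APIC delivery mode and shorthand. */
enum toy_delivery_mode { TOY_DM_FIXED, TOY_DM_LOWEST, TOY_DM_NMI };
enum toy_shorthand {
	TOY_DEST_NOSHORT,
	TOY_DEST_SELF,
	TOY_DEST_ALLINC,
	TOY_DEST_ALLBUT,
};

struct toy_vcpu {
	int vcpu_idx;
};

struct toy_irq {
	enum toy_delivery_mode delivery_mode;
	enum toy_shorthand shorthand;
};

/*
 * True only for a vCPU-originated, fixed-mode, shorthand-free IPI that
 * exactly one other vCPU accepted.
 *
 * @src:         sending vCPU, NULL for non-vCPU injection
 * @ipi_targets: number of vCPUs that accepted the interrupt
 * @ipi_unique:  last accepting vCPU (meaningful when ipi_targets == 1)
 */
bool toy_should_track_ipi(const struct toy_vcpu *src,
			  const struct toy_irq *irq,
			  int ipi_targets,
			  const struct toy_vcpu *ipi_unique)
{
	return src &&
	       irq->delivery_mode == TOY_DM_FIXED &&
	       irq->shorthand == TOY_DEST_NOSHORT &&
	       ipi_targets == 1 && ipi_unique && ipi_unique != src;
}
```

Counting accepted targets during delivery, rather than inspecting the
destination encoding up front, lets the same predicate serve both the
fast and the slow path.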

This change only records the relationship; directed-yield candidate
selection is unchanged.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/lapic.c | 71 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 515409e0e22c..3a5f197eb2c6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1326,7 +1326,8 @@ static bool __kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *s
 	struct kvm_apic_map *map;
 	unsigned long bitmap;
 	struct kvm_lapic **dst = NULL;
-	int i;
+	struct kvm_vcpu *ipi_unique = NULL;
+	int i, ipi_targets = 0;
 	bool ret;
 
 	*r = -1;
@@ -1347,10 +1348,37 @@ static bool __kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *s
 	if (ret) {
 		*r = 0;
 		for_each_set_bit(i, &bitmap, 16) {
+			int delivered;
+
 			if (!dst[i])
 				continue;
-			*r += kvm_apic_set_irq(dst[i]->vcpu, irq, rtc_status);
+			delivered = kvm_apic_set_irq(dst[i]->vcpu, irq, rtc_status);
+			*r += delivered;
+			if (delivered > 0) {
+				ipi_targets++;
+				ipi_unique = dst[i]->vcpu;
+			}
 		}
+
+		/*
+		 * Track unicast fixed IPIs for directed-yield optimization.
+		 *
+		 * Only record when:
+		 *  - the IPI originated from a vCPU (LAPIC write, not kernel
+		 *    injection): src != NULL;
+		 *  - delivery mode is plain fixed: synchronization
+		 *    primitives such as spinlocks, TLB flushes and
+		 *    smp_call_function() use APIC_DM_FIXED;
+		 *  - no shorthand: shorthand encodes broadcasts and
+		 *    self-IPIs, which we explicitly do not track;
+		 *  - exactly one recipient accepted the interrupt, giving a
+		 *    directed sender->receiver relationship.
+		 */
+		if (src && irq->delivery_mode == APIC_DM_FIXED &&
+		    irq->shorthand == APIC_DEST_NOSHORT &&
+		    ipi_targets == 1 && ipi_unique && ipi_unique != src->vcpu)
+			kvm_track_ipi_communication(src->vcpu, ipi_unique,
+						    irq->vector);
 	}
 
 	rcu_read_unlock();
@@ -1443,6 +1471,13 @@ int __kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	struct kvm_vcpu *vcpu, *lowest = NULL;
 	unsigned long i, dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
 	unsigned int dest_vcpus = 0;
+	/*
+	 * Track unicast fixed IPI for directed-yield optimization in this
+	 * slow fallback path (APIC map miss). See the fast-path equivalent in
+	 * __kvm_irq_delivery_to_apic_fast() for the full filtering rationale.
+	 */
+	struct kvm_vcpu *ipi_unique = NULL;
+	int ipi_targets = 0;
 
 	if (__kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r, rtc_status))
 		return r;
@@ -1456,6 +1491,8 @@ int __kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
 
 	kvm_for_each_vcpu(i, vcpu, kvm) {
+		int delivered;
+
 		if (!kvm_apic_present(vcpu))
 			continue;
 
@@ -1466,7 +1503,12 @@ int __kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		if (!kvm_lowest_prio_delivery(irq)) {
 			if (r < 0)
 				r = 0;
-			r += kvm_apic_set_irq(vcpu, irq, rtc_status);
+			delivered = kvm_apic_set_irq(vcpu, irq, rtc_status);
+			r += delivered;
+			if (delivered > 0) {
+				ipi_targets++;
+				ipi_unique = vcpu;
+			}
 		} else if (kvm_apic_sw_enabled(vcpu->arch.apic)) {
 			if (!vector_hashing_enabled) {
 				if (!lowest)
@@ -1487,8 +1529,27 @@ int __kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		lowest = kvm_get_vcpu(kvm, idx);
 	}
 
-	if (lowest)
-		r = kvm_apic_set_irq(lowest, irq, rtc_status);
+	if (lowest) {
+		int delivered = kvm_apic_set_irq(lowest, irq, rtc_status);
+
+		r = delivered;
+		if (delivered > 0) {
+			ipi_targets++;
+			ipi_unique = lowest;
+		}
+	}
+
+	/*
+	 * Record a unicast fixed IPI delivered via this slow path. The fast
+	 * path records the APIC-map-hit case; this covers the fallback
+	 * where kvm_apic_map_get_dest_lapic() missed but delivery still
+	 * resolves to exactly one recipient.
+	 */
+	if (src && irq->delivery_mode == APIC_DM_FIXED &&
+	    irq->shorthand == APIC_DEST_NOSHORT &&
+	    ipi_targets == 1 && ipi_unique && ipi_unique != src->vcpu)
+		kvm_track_ipi_communication(src->vcpu, ipi_unique,
+					    irq->vector);
 
 	return r;
 }
-- 
2.43.0



* [PATCH v3 08/10] KVM: x86/lapic: Clear IPI tracking on matching-vector EOI
  2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
                   ` (6 preceding siblings ...)
  2026-06-12  1:33 ` [PATCH v3 07/10] KVM: x86/lapic: Track unicast fixed IPI delivery Wanpeng Li
@ 2026-06-12  1:33 ` Wanpeng Li
  2026-06-12  3:46   ` sashiko-bot
  2026-06-12  1:33 ` [PATCH v3 09/10] KVM: Add IPI-aware directed-yield candidate selection Wanpeng Li
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Wanpeng Li @ 2026-06-12  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Richie Buturla

From: Wanpeng Li <wanpengli@tencent.com>

Clear IPI tracking from both EOI sites, apic_set_eoi() and
kvm_apic_set_eoi_accelerated(), matched on the EOI'd vector. A receiver
may handle unrelated interrupts between an IPI's arrival and its EOI; a
vector-blind clear could evict the still-pending context and make directed
yield miss the receiver.

When the vector matches, clear the receiver's context unconditionally, and
clear the sender's pending flag only if it still points at this receiver
with the same vector and within the recency window. The existing
trace_kvm_eoi() tracepoint already records the vector, so no new tracepoint
is added.
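
The two-stage, vector-matched clear can be modelled in userspace as
shown below. This is a sketch under stated assumptions: struct toy_vcpu
and the toy_* functions are hypothetical names, the vCPU array lookup
stands in for kvm_get_vcpu(), the current time and the window are passed
in explicitly, and plain loads and stores replace
READ_ONCE()/WRITE_ONCE().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flattened mirror of a vCPU and its ipi_context. */
struct toy_vcpu {
	int idx;
	int last_ipi_sender;
	int last_ipi_receiver;
	uint8_t vector;
	bool pending_ipi;
	uint64_t ipi_time_ns;
};

/* Soft clear: leaves ipi_time_ns untouched. */
void toy_clear_ipi_context(struct toy_vcpu *v)
{
	v->pending_ipi = false;
	v->last_ipi_sender = -1;
	v->last_ipi_receiver = -1;
	v->vector = 0;
}

void toy_clear_ipi_on_eoi(struct toy_vcpu *vcpus, int nr_vcpus,
			  struct toy_vcpu *receiver, int vector,
			  uint64_t now_ns, uint64_t window_ns)
{
	struct toy_vcpu *sender;
	int sender_idx;

	if (vector < 0 || vector > 0xff)
		return;

	/* An EOI for an unrelated vector must not evict the context. */
	if (receiver->vector != (uint8_t)vector)
		return;

	/* Stage 1: the receiver has handled exactly this IPI. */
	sender_idx = receiver->last_ipi_sender;
	toy_clear_ipi_context(receiver);

	if (sender_idx < 0 || sender_idx >= nr_vcpus)
		return;
	sender = &vcpus[sender_idx];

	/* Stage 2: only clear a sender that still refers to this IPI. */
	if (sender->last_ipi_receiver != receiver->idx)
		return;
	if (sender->vector != (uint8_t)vector)
		return;
	if (now_ns - sender->ipi_time_ns <= window_ns)
		sender->pending_ipi = false;
}
```

In this model an EOI for, say, a timer vector leaves both contexts
intact, while the EOI for the recorded vector clears the receiver and
drops the sender's pending flag.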

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/lapic.c | 61 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 3a5f197eb2c6..0925d359384e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1746,6 +1746,65 @@ static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
 #endif
 }
 
+/*
+ * Clear IPI tracking state associated with a just-acknowledged vector.
+ *
+ * Called from both the emulated APIC_EOI write path (apic_set_eoi)
+ * and the accelerated exit path (kvm_apic_set_eoi_accelerated).
+ *
+ * Vector matching is deliberate: the receiver may handle an unrelated
+ * interrupt (timer, device IRQ) between an IPI's arrival and its EOI.
+ * If we cleared unconditionally, such an intermediate EOI would evict
+ * the still-pending IPI context and cause directed yield to miss the
+ * real receiver. We therefore only touch state when the EOI'd vector
+ * matches the one we recorded at send time.
+ *
+ * Two-stage cleanup:
+ *  1. Clear the receiver's context unconditionally when the vector
+ *     matches: it has processed exactly this IPI.
+ *  2. Clear the sender's pending_ipi flag only if the sender still
+ *     points at this receiver, with the same vector, and within the
+ *     configured recency window. This avoids evicting a newer IPI
+ *     that happens to share the vector with a stale one.
+ */
+static void kvm_clear_ipi_on_eoi(struct kvm_lapic *apic, int vector)
+{
+	struct kvm_vcpu *receiver = apic->vcpu;
+	struct kvm_vcpu *sender;
+	int sender_idx;
+	u64 then, now;
+
+	if (unlikely(!READ_ONCE(ipi_tracking_enabled)))
+		return;
+
+	if (vector < 0 || vector > 0xff)
+		return;
+
+	if (READ_ONCE(receiver->arch.ipi_context.vector) != (u8)vector)
+		return;
+
+	sender_idx = READ_ONCE(receiver->arch.ipi_context.last_ipi_sender);
+	kvm_vcpu_clear_ipi_context(receiver);
+
+	if (sender_idx < 0)
+		return;
+
+	sender = kvm_get_vcpu(receiver->kvm, sender_idx);
+	if (!sender)
+		return;
+
+	if (READ_ONCE(sender->arch.ipi_context.last_ipi_receiver) !=
+	    receiver->vcpu_idx)
+		return;
+	if (READ_ONCE(sender->arch.ipi_context.vector) != (u8)vector)
+		return;
+
+	then = READ_ONCE(sender->arch.ipi_context.ipi_time_ns);
+	now = ktime_get_mono_fast_ns();
+	if (now - then <= READ_ONCE(ipi_window_ns))
+		WRITE_ONCE(sender->arch.ipi_context.pending_ipi, false);
+}
+
 static int apic_set_eoi(struct kvm_lapic *apic)
 {
 	int vector = apic_find_highest_isr(apic);
@@ -1766,6 +1825,7 @@ static int apic_set_eoi(struct kvm_lapic *apic)
 		kvm_hv_synic_send_eoi(apic->vcpu, vector);
 
 	kvm_ioapic_send_eoi(apic, vector);
+	kvm_clear_ipi_on_eoi(apic, vector);
 	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
 	return vector;
 }
@@ -1781,6 +1841,7 @@ void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
 	trace_kvm_eoi(apic, vector);
 
 	kvm_ioapic_send_eoi(apic, vector);
+	kvm_clear_ipi_on_eoi(apic, vector);
 	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_set_eoi_accelerated);
-- 
2.43.0



* [PATCH v3 09/10] KVM: Add IPI-aware directed-yield candidate selection
  2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
                   ` (7 preceding siblings ...)
  2026-06-12  1:33 ` [PATCH v3 08/10] KVM: x86/lapic: Clear IPI tracking on matching-vector EOI Wanpeng Li
@ 2026-06-12  1:33 ` Wanpeng Li
  2026-06-12  1:48   ` sashiko-bot
  2026-06-12  1:33 ` [PATCH v3 10/10] KVM: Add relaxed preempted-only fallback for directed yield Wanpeng Li
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Wanpeng Li @ 2026-06-12  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Richie Buturla

From: Wanpeng Li <wanpengli@tencent.com>

kvm_vcpu_on_spin() selects a directed-yield target from coarse preempted
and preempted-in-kernel state. It cannot distinguish a vCPU spinning on an
IPI response from a vCPU spinning on a lock, and can therefore yield to an
unrelated vCPU.

Add kvm_vcpu_is_good_yield_candidate(), a priority-ordered filter. Prefer
a confirmed recent IPI receiver of the spinning vCPU, then an
arch-provided pending-interrupt hint, and finally the existing preempted
heuristic with the optional in-kernel filter.

Use the helper in the strict scan before falling back to the existing
directed-yield eligibility checks.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 virt/kvm/kvm_main.c | 65 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 55 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2e11c6cfc167..84cbd7a6183f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3963,12 +3963,51 @@ bool __weak kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender,
 	return false;
 }
 
+/*
+ * Priority-based candidate filter for directed yield.
+ *
+ *   1) Confirmed IPI receiver of @me within the recency window.
+ *      This is the highest-confidence signal that selecting @vcpu
+ *      may help @me complete its spin-on-IPI-response.
+ *   2) Arch-provided fast pending-interrupt hint (APICv / AVIC /
+ *      arch_dy_has_pending_interrupt()). Covers cases where IPI
+ *      tracking was bypassed by hardware acceleration.
+ *   3) Legacy preempted fallback, with the existing optional
+ *      in-kernel filter when @yield_to_kernel_mode is set.
+ *
+ * Returning false asks kvm_vcpu_on_spin() to skip @vcpu in the
+ * strict (first) round; the relaxed (second) round applies only a
+ * vcpu->preempted check.
+ */
+static bool kvm_vcpu_is_good_yield_candidate(struct kvm_vcpu *me,
+					     struct kvm_vcpu *vcpu,
+					     bool yield_to_kernel_mode)
+{
+	/* Priority 1: confirmed recent IPI receiver. */
+	if (kvm_vcpu_is_ipi_receiver(me, vcpu))
+		return true;
+
+	/* Priority 2: arch-specific pending-interrupt hint. */
+	if (kvm_arch_dy_has_pending_interrupt(vcpu))
+		return true;
+
+	/* Priority 3: preempted, with optional in-kernel requirement. */
+	if (!READ_ONCE(vcpu->preempted))
+		return false;
+
+	if (yield_to_kernel_mode && !kvm_arch_vcpu_preempted_in_kernel(vcpu))
+		return false;
+
+	return true;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
 	int nr_vcpus, start, i, idx, yielded;
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
 	int try = 3;
+	bool first_round = true;
 
 	nr_vcpus = atomic_read(&kvm->online_vcpus);
 	if (nr_vcpus < 2)
@@ -4010,16 +4049,21 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 		if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
 			continue;
 
-		/*
-		 * Treat the target vCPU as being in-kernel if it has a pending
-		 * interrupt, as the vCPU trying to yield may be spinning
-		 * waiting on IPI delivery, i.e. the target vCPU is in-kernel
-		 * for the purposes of directed yield.
-		 */
-		if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
-		    !kvm_arch_dy_has_pending_interrupt(vcpu) &&
-		    !kvm_arch_vcpu_preempted_in_kernel(vcpu))
-			continue;
+		if (first_round) {
+			/* Strict round: IPI-aware and legacy preempted filters. */
+			if (!kvm_vcpu_is_good_yield_candidate(me, vcpu,
+							      yield_to_kernel_mode))
+				continue;
+		} else {
+			/*
+			 * Relaxed round: only require preempted. This is the
+			 * safety net for missed IPI tracking (e.g. APICv) or
+			 * transient runnable-set changes since the strict
+			 * scan.
+			 */
+			if (!READ_ONCE(vcpu->preempted))
+				continue;
+		}
 
 		if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
 			continue;
@@ -4032,6 +4076,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 			break;
 		}
 	}
+
 	kvm_vcpu_set_in_spin_loop(me, false);
 
 	/* Ensure vcpu is not eligible during next spinloop */
-- 
2.43.0



* [PATCH v3 10/10] KVM: Add relaxed preempted-only fallback for directed yield
  2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
                   ` (8 preceding siblings ...)
  2026-06-12  1:33 ` [PATCH v3 09/10] KVM: Add IPI-aware directed-yield candidate selection Wanpeng Li
@ 2026-06-12  1:33 ` Wanpeng Li
  2026-06-12  5:17 ` [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM K Prateek Nayak
  2026-06-12  9:43 ` Shrikanth Hegde
  11 siblings, 0 replies; 22+ messages in thread
From: Wanpeng Li @ 2026-06-12  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Richie Buturla

From: Wanpeng Li <wanpengli@tencent.com>

The strict IPI-aware candidate filter can find no target if IPI tracking
misses the relationship, for example for APICv-delivered IPIs, or if the
runnable set changes during the scan.

If the strict pass yields nothing, run a second relaxed pass gated only by
vcpu->preempted. Control the fallback with the enable_relaxed_boost module
parameter (default on), so it can be disabled at runtime if it causes
over-boosting.

With the full series applied, PARSEC simlarge on 16-vCPU guests under host
CPU overcommit shows the following latency reduction:

  Dedup (IPI-heavy synchronization):
    2 VMs:  +8.87%
    3 VMs: +10.29%
    4 VMs: +15.60%

  VIPS (balanced sync and compute):
    2 VMs: +10.23%
    3 VMs:  +6.63%
    4 VMs:  +4.50%

The IPI-heavy Dedup workload benefits most, as the confirmed IPI receiver
is preferred over the generic preempted lock-holder heuristic.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 virt/kvm/kvm_main.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 84cbd7a6183f..a327acb198de 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -101,6 +101,19 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
 static bool __ro_after_init allow_unsafe_mappings;
 module_param(allow_unsafe_mappings, bool, 0444);
 
+/*
+ * enable_relaxed_boost - second-round safety net for kvm_vcpu_on_spin().
+ *
+ * When on (default), if the strict scan finds no eligible yield target,
+ * fall back to a relaxed scan gated only by vcpu->preempted. This
+ * preserves forward progress if IPI tracking is missed (e.g.
+ * APICv-delivered IPIs) or the runnable set changes mid-scan.
+ *
+ * Disable this at runtime if the relaxed pass causes over-boosting.
+ */
+static bool enable_relaxed_boost = true;
+module_param(enable_relaxed_boost, bool, 0644);
+
 /*
  * Ordering of locks:
  *
@@ -4037,6 +4050,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 	 * they may all try to yield to the same vCPU(s).  But as above, this
 	 * is all best effort due to KVM's lack of visibility into the guest.
 	 */
+retry:
+	yielded = 0;
 	start = READ_ONCE(kvm->last_boosted_vcpu) + 1;
 	for (i = 0; i < nr_vcpus; i++) {
 		idx = (start + i) % nr_vcpus;
@@ -4077,6 +4092,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 		}
 	}
 
+	/*
+	 * Second, relaxed pass if enabled, the strict pass yielded nothing,
+	 * and we still have retry budget for -ESRCH paths.
+	 */
+	if (enable_relaxed_boost && first_round && yielded <= 0 && try > 0) {
+		first_round = false;
+		goto retry;
+	}
+
 	kvm_vcpu_set_in_spin_loop(me, false);
 
 	/* Ensure vcpu is not eligible during next spinloop */
-- 
2.43.0



* Re: [PATCH v3 09/10] KVM: Add IPI-aware directed-yield candidate selection
  2026-06-12  1:33 ` [PATCH v3 09/10] KVM: Add IPI-aware directed-yield candidate selection Wanpeng Li
@ 2026-06-12  1:48   ` sashiko-bot
  0 siblings, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-06-12  1:48 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: kvm

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] The new candidate filter incorrectly excludes woken-up, runnable vCPUs from being yield targets by strictly requiring the `preempted` flag.
--

commit f2a5464aa09eda51a02ed6e5f04bf8ea0072c9ba
Author: Wanpeng Li <wanpengli@tencent.com>

KVM: Add IPI-aware directed-yield candidate selection

This commit introduces a priority-based candidate filter for directed yields
in kvm_vcpu_on_spin(). It prefers recent IPI receivers or vCPUs with pending
interrupts before falling back to the legacy preempted heuristic.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c

[ ... ]

> +static bool kvm_vcpu_is_good_yield_candidate(struct kvm_vcpu *me,
> +					     struct kvm_vcpu *vcpu,
> +					     bool yield_to_kernel_mode)
> +{
> +	/* Priority 1: confirmed recent IPI receiver. */
> +	if (kvm_vcpu_is_ipi_receiver(me, vcpu))
> +		return true;
> +
> +	/* Priority 2: arch-specific pending-interrupt hint. */
> +	if (kvm_arch_dy_has_pending_interrupt(vcpu))
> +		return true;
> +
> +	/* Priority 3: preempted, with optional in-kernel requirement. */
> +	if (!READ_ONCE(vcpu->preempted))
> +		return false;

[Severity: High]
Does this code inadvertently exclude runnable but non-preempted vCPUs
from being yield targets?

In the previous implementation, the preempted check was part of a compound
skip condition:

	if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
	    !kvm_arch_dy_has_pending_interrupt(vcpu) &&
	    !kvm_arch_vcpu_preempted_in_kernel(vcpu))
		continue;

This meant that if a vCPU was not preempted (for instance, it was recently
woken up from a blocking state and is ready to run), it bypassed the skip
condition and was considered a valid target.

By explicitly returning false here when vcpu->preempted is false, this
new filter seems to drop those valid targets.
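
The difference can be reproduced outside the kernel. The following is a
minimal userspace sketch, not kernel code: struct vcpu_state and the
function names old_skips(), new_skips() and example_woken_vcpu() are
hypothetical stand-ins, and each boolean field only models the result of
the kernel helper named in its comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the per-vCPU inputs the two filters read. */
struct vcpu_state {
	bool preempted;		/* models vcpu->preempted */
	bool pending_interrupt;	/* models kvm_arch_dy_has_pending_interrupt() */
	bool in_kernel;		/* models kvm_arch_vcpu_preempted_in_kernel() */
	bool ipi_receiver;	/* models kvm_vcpu_is_ipi_receiver(me, vcpu) */
};

/* Pre-patch compound condition: true means the candidate is skipped. */
static bool old_skips(const struct vcpu_state *v, bool yield_to_kernel_mode)
{
	return v->preempted && yield_to_kernel_mode &&
	       !v->pending_interrupt && !v->in_kernel;
}

/* Post-patch strict round: skipped whenever the new filter returns false. */
static bool new_skips(const struct vcpu_state *v, bool yield_to_kernel_mode)
{
	if (v->ipi_receiver)
		return false;
	if (v->pending_interrupt)
		return false;
	if (!v->preempted)
		return true;
	if (yield_to_kernel_mode && !v->in_kernel)
		return true;
	return false;
}

/* A vCPU that was woken and is runnable, but was never marked preempted. */
void example_woken_vcpu(void)
{
	struct vcpu_state woken = { .preempted = false };

	assert(!old_skips(&woken, true));	/* old code still considered it */
	assert(new_skips(&woken, true));	/* new strict round drops it */
}
```

Under these assumptions, the only inputs on which the two predicates
disagree are the ones where preempted is false and neither the IPI
receiver nor the pending-interrupt signal is set.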

> +
> +	if (yield_to_kernel_mode && !kvm_arch_vcpu_preempted_in_kernel(vcpu))
> +		return false;
> +
> +	return true;
> +}

[ ... ]

>  		if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
>  			continue;
>  
> -		/*
> -		 * Treat the target vCPU as being in-kernel if it has a pending
> -		 * interrupt, as the vCPU trying to yield may be spinning
> -		 * waiting on IPI delivery, i.e. the target vCPU is in-kernel
> -		 * for the purposes of directed yield.
> -		 */
> -		if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
> -		    !kvm_arch_dy_has_pending_interrupt(vcpu) &&
> -		    !kvm_arch_vcpu_preempted_in_kernel(vcpu))
> -			continue;
> +		if (first_round) {
> +			/* Strict round: IPI-aware and legacy preempted filters. */
> +			if (!kvm_vcpu_is_good_yield_candidate(me, vcpu,
> +							      yield_to_kernel_mode))
> +				continue;
> +		} else {
> +			/*
> +			 * Relaxed round: only require preempted. This is the
> +			 * safety net for missed IPI tracking (e.g. APICv) or
> +			 * transient runnable-set changes since the strict
> +			 * scan.
> +			 */
> +			if (!READ_ONCE(vcpu->preempted))
> +				continue;

[Severity: High]
Similarly, does this relaxed round unconditionally skip non-preempted
vCPUs?

If the vCPU is simply runnable but not marked as preempted, this safety net
will completely bypass it.

>  		}
>  
>  		if (!kvm_vcpu_eligible_for_directed_yield(vcpu))

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612013355.59231-1-kernellwp@gmail.com?part=9


* Re: [PATCH v3 01/10] sched/fair: Add EEVDF lag credit primitive for nominated next-buddy
  2026-06-12  1:33 ` [PATCH v3 01/10] sched/fair: Add EEVDF lag credit primitive for nominated next-buddy Wanpeng Li
@ 2026-06-12  1:49   ` sashiko-bot
  2026-06-12  5:34   ` K Prateek Nayak
  1 sibling, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-06-12  1:49 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: kvm

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] eevdf_credit_entity_vlag() inadvertently revives expired slice protection for the current task because it only shifts vprot when protect_slice is true.
--

commit d8a18d46db47fd7df4dc661550f9788e6c195ed7
Author: Wanpeng Li <wanpengli@tencent.com>

sched/fair: Add EEVDF lag credit primitive for nominated next-buddy

This commit adds eevdf_credit_entity_vlag() to allow yield_to() targets
to receive bounded lag credit, enabling pick_eevdf() to honor buddy hints
even for targets with a negative vlag. The helper updates cfs_rq->curr in
place by shifting its vruntime, deadline, and optionally vprot backwards.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3ebec186f9823..e7f5ea25fdae3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c

[ ... ]

> @@ -9341,6 +9341,54 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
> +
> +	if (cfs_rq->curr == se) {
> +		/* curr is off-tree: in-place shift, carrying any vprot window. */
> +		if (protect_slice(se))
> +			se->vprot -= credit;
> +		se->vruntime -= credit;
> +		se->deadline -= credit;

[Severity: High]
Could conditionally updating se->vprot unintentionally revive expired slice
protection?

If protect_slice(se) is currently false (meaning the protected slice has
already expired and se->vprot <= se->vruntime), se->vprot is not shifted.

Since se->vruntime is unconditionally shifted backward by credit, couldn't
se->vruntime fall behind the unmodified se->vprot?

This would cause protect_slice(se) to suddenly evaluate to true again,
inappropriately shielding the task from preemption in update_curr().
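
To make the arithmetic concrete, here is a small userspace model of the
curr branch. It is a sketch under stated assumptions: struct se_model is
a hypothetical reduction of struct sched_entity, protect_slice_model()
assumes protection holds while vprot is ahead of vruntime, and
credit_curr_uniform() is only one possible alternative for comparison,
not a proposed fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct sched_entity. */
struct se_model {
	uint64_t vruntime;
	uint64_t deadline;
	uint64_t vprot;
};

/* Assumed check: the slice is protected while vprot is ahead of vruntime. */
static bool protect_slice_model(const struct se_model *se)
{
	return (int64_t)(se->vprot - se->vruntime) > 0;
}

/* The curr branch as quoted: vprot moves only while still protected. */
static void credit_curr_quoted(struct se_model *se, uint64_t credit)
{
	if (protect_slice_model(se))
		se->vprot -= credit;
	se->vruntime -= credit;
	se->deadline -= credit;
}

/* Comparison variant (an assumption): shift vprot unconditionally. */
static void credit_curr_uniform(struct se_model *se, uint64_t credit)
{
	se->vprot -= credit;
	se->vruntime -= credit;
	se->deadline -= credit;
}

void example_vprot_revival(void)
{
	/* Protection already expired: vprot (900) <= vruntime (1000). */
	struct se_model a = { .vruntime = 1000, .deadline = 4000, .vprot = 900 };
	struct se_model b = a;

	assert(!protect_slice_model(&a));

	credit_curr_quoted(&a, 300);	/* vruntime 700, vprot stays 900 */
	assert(protect_slice_model(&a));	/* protection is back */

	credit_curr_uniform(&b, 300);	/* vruntime 700, vprot 600 */
	assert(!protect_slice_model(&b));	/* still expired */
}
```

In this model the revival happens whenever the credit exceeds the
distance by which vruntime had already passed vprot.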

> +		return;
> +	}
> +
> +	/* Queued entities are left unchanged by this helper path. */
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612013355.59231-1-kernellwp@gmail.com?part=1


* Re: [PATCH v3 05/10] sched/fair: Force a local resched on yield_to() so the buddy is picked
  2026-06-12  1:33 ` [PATCH v3 05/10] sched/fair: Force a local resched on yield_to() so the buddy is picked Wanpeng Li
@ 2026-06-12  1:50   ` sashiko-bot
  0 siblings, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-06-12  1:50 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: kvm

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] Double clock update triggers WARN_DOUBLE_CLOCK on the local runqueue in yield_to_task_fair().
--

commit fb0067f31069bbf43450f383db09dc5489957b42
Author: Wanpeng Li <wanpengli@tencent.com>

sched/fair: Force a local resched on yield_to() so the buddy is picked

This commit modifies yield_to_task_fair() to force a local reschedule when
applying lag credit, cancelling slice protection and calling resched_curr()
on the local runqueue. The forced preemption is rate-limited to at most
once per 6ms per rq.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48f65a4f19230..e9c5265cf0fa9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
[ ... ]
> @@ -9485,6 +9485,83 @@ static void yield_task_fair(struct rq *rq)
[ ... ]
> +static void yield_to_local_force_resched(struct rq *rq)
> +{
> +	struct sched_entity *yse = &rq->donor->se;
> +
> +	yield_task_fair(rq);
[ ... ]
> @@ -9504,21 +9581,22 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
[ ... ]
> -	if (rq != p_rq)
> +	update_rq_clock(rq);
> +	if (p_rq != rq)
>  		update_rq_clock(p_rq);
>  
>  	for_each_sched_entity(se) {
[ ... ]
> -	yield_task_fair(rq);
> +	/*
> +	 * Force the local CPU to reschedule so the credited buddy can be
> +	 * selected instead of the protected yielder;
> +	 * yield_to_local_force_resched() also does the leaf forfeit.
> +	 */
> +	yield_to_local_force_resched(rq);

[Severity: Medium]
Will this sequence trigger WARN_DOUBLE_CLOCK on the local runqueue?

The patch comments note this redundant update is only warned about under
the default-off WARN_DOUBLE_CLOCK.

However, update_rq_clock(rq) is called in yield_to_task_fair() setting
RQCF_UPDATED. Then yield_to_local_force_resched(rq) calls yield_task_fair()
which calls update_rq_clock(rq) a second time.

Since the flag isn't cleared between these calls, this will trigger:
WARN_ON_ONCE(rq->clock_update_flags & RQCF_UPDATED)

Is it okay to knowingly leave this double clock update, or could it cause
spurious stack traces for developers actively debugging scheduler clock
issues?
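
The sequence can be modelled in userspace as follows. This is a sketch
only: struct rq_model and its counter are hypothetical, the RQCF_*
values are assumed to mirror kernel/sched/sched.h, and the
warn_double_clock field stands in for sched_feat(WARN_DOUBLE_CLOCK).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the flag values in kernel/sched/sched.h. */
#define RQCF_REQ_SKIP	0x01
#define RQCF_ACT_SKIP	0x02
#define RQCF_UPDATED	0x04

/* Hypothetical, reduced stand-in for struct rq. */
struct rq_model {
	unsigned int clock_update_flags;
	bool warn_double_clock;		/* models sched_feat(WARN_DOUBLE_CLOCK) */
	unsigned int double_clock_warnings;
};

/* Models only the flag handling of update_rq_clock(). */
static void update_rq_clock_model(struct rq_model *rq)
{
	if (rq->clock_update_flags & RQCF_ACT_SKIP)
		return;
	if (rq->warn_double_clock && (rq->clock_update_flags & RQCF_UPDATED))
		rq->double_clock_warnings++;
	rq->clock_update_flags |= RQCF_UPDATED;
}

void example_double_clock(void)
{
	struct rq_model rq = { .warn_double_clock = true };

	update_rq_clock_model(&rq);	/* in yield_to_task_fair() */
	update_rq_clock_model(&rq);	/* in yield_task_fair(), via the helper */
	assert(rq.double_clock_warnings == 1);
}
```

With warn_double_clock left false the counter stays at zero, which
matches the note that the warning is limited to the default-off feature.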

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612013355.59231-1-kernellwp@gmail.com?part=5


* Re: [PATCH v3 02/10] sched/fair: Credit a persistent, queue-depth-scaled vlag margin
  2026-06-12  1:33 ` [PATCH v3 02/10] sched/fair: Credit a persistent, queue-depth-scaled vlag margin Wanpeng Li
@ 2026-06-12  1:53   ` sashiko-bot
  2026-06-12  6:07   ` K Prateek Nayak
  1 sibling, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-06-12  1:53 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: kvm

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] __enqueue_entity() fails to initialize se->max_slice before adding the entity to the augmented RB-tree, poisoning the tree's max_slice.
--

commit c9a9a8b18145d7a6a58148e991409677da519193
Author: Wanpeng Li <wanpengli@tencent.com>

sched/fair: Credit a persistent, queue-depth-scaled vlag margin

This commit replaces the credit-to-zero buddy system with a bounded
positive-vlag margin. The margin scales with runqueue depth to preserve
the buddy's eligibility across multiple scheduling decisions without
exceeding EEVDF fairness bounds.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e7f5ea25fdae3..c6502db62cd3e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
[ ... ]
> @@ -9371,11 +9395,23 @@ eevdf_credit_entity_vlag(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  	avruntime = avg_vruntime(cfs_rq);
>  	vlag = entity_lag(cfs_rq, se, avruntime);
>  
> -	/* Already eligible: nothing to do. */
> -	if (vlag >= 0)
> -		return;
> +	/* Clamp the margin to entity_lag()'s bound so place_entity() keeps it. */
> +	max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;

[Severity: High]
This isn't a bug introduced by this patch, but does the tree's max_slice
contain poisoned data here?

When an entity is dequeued, rb_erase_augmented_cached() removes the node
but does not clear se->max_slice. When enqueued again, __enqueue_entity()
initializes se->min_vruntime and se->min_slice, but misses se->max_slice:

kernel/sched/fair.c:__enqueue_entity() {
	sum_w_vruntime_add(cfs_rq, se);
	se->min_vruntime = se->vruntime;
	se->min_slice = se->slice;
	// Missing se->max_slice = se->slice; here?
	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
				__entity_less, &min_vruntime_cb);
}

Won't the new leaf node propagate a potentially large, stale max_slice from its
previous subtree up to the root? This would artificially inflate
cfs_rq_max_slice() and corrupt the lag_limit calculated here, potentially
allowing tasks to accumulate massive lag while sleeping and breaking EEVDF
fairness bounds.

> +	lag_limit = calc_delta_fair(max_slice, se);
> +	margin = eevdf_persistent_margin(cfs_rq, se);
> +	if (lag_limit && margin > lag_limit)
> +		margin = lag_limit;
[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612013355.59231-1-kernellwp@gmail.com?part=2


* Re: [PATCH v3 04/10] sched/fair: Credit nominated next-buddy in yield_to_task_fair()
  2026-06-12  1:33 ` [PATCH v3 04/10] sched/fair: Credit nominated next-buddy in yield_to_task_fair() Wanpeng Li
@ 2026-06-12  1:54   ` sashiko-bot
  0 siblings, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-06-12  1:54 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: kvm

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Skipping `update_rq_clock()` for the same-runqueue case in `yield_to_task_fair()` unconditionally triggers a `WARN_ON_ONCE` in `assert_clock_updated()` via `update_curr()`.
--

commit 18fe3f24f01f55a4b16680e6b460846c6d5e6260
Author: Wanpeng Li <wanpengli@tencent.com>

sched/fair: Credit nominated next-buddy in yield_to_task_fair()

This patch credits bounded EEVDF lag to nominated targets during
yield_to(). It walks the sched_entity hierarchy to provide the credit,
avoiding forfeiture of the yield hint when a group entity isn't yet
eligible.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c28682fedb362..48f65a4f19230 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c

[ ... ]

> @@ -9496,6 +9497,43 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
>  	/* Tell the scheduler that we'd really like se to run next. */
>  	set_next_buddy(se);
>  
> +	/* Without lag credit, keep the existing forfeit-based yield. */
> +	if (!sched_feat(YIELD_TO_LAG_CREDIT)) {
> +		yield_task_fair(rq);
> +		return true;
> +	}
> +
> +	/*
> +	 * Walk the ancestor chain set_next_buddy() just nominated and credit
> +	 * bounded lag to each not-yet-eligible level so pick_eevdf() returns
> +	 * it. yield_to() holds both rq locks via double_rq_lock(), so touching
> +	 * p's cfs_rqs (possibly on another CPU) is safe; the primitive is
> +	 * idempotent, so no rate limiting is needed.
> +	 *
> +	 * Only refresh p_rq's clock when it differs from the local rq. A
> +	 * remote p_rq must be refreshed so the per-level update_curr() is
> +	 * accurate. In the same-rq case we skip it: the credit is a
> +	 * best-effort hint and the rq clock is recent enough, while the
> +	 * trailing yield_task_fair() would otherwise make this a second
> +	 * update_rq_clock() on the same rq and trip
> +	 * SCHED_WARN_ON(WARN_DOUBLE_CLOCK).
> +	 */
> +	if (rq != p_rq)
> +		update_rq_clock(p_rq);
> +
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +		if (cfs_rq->next != se)
> +			break;
> +		if (se->sched_delayed)
> +			break;
> +		if (throttled_hierarchy(cfs_rq))
> +			break;
> +
> +		eevdf_credit_entity_vlag(cfs_rq, se);

[Severity: High]
Does skipping update_rq_clock() here trigger a WARN_ON_ONCE() when the target
task is on the same runqueue?

Since yield_to() acquires locks via double_rq_lock(), the clock update flag
is cleared. By skipping update_rq_clock(p_rq) for local tasks, the call to
eevdf_credit_entity_vlag() will invoke update_curr() with an un-updated clock:

eevdf_credit_entity_vlag()
  update_curr()
    update_se()
      rq_clock_task()
        assert_clock_updated() -> WARN_ON_ONCE(rq->clock_update_flags < RQCF_ACT_SKIP)

Can this reliably trigger a warning when yield_to() targets a task on the
same CPU (for instance, via KVM vCPU yielding)?
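
A reduced model of the same-runqueue path, for illustration. The names
struct rq_model, rq_lock_model() and clock_read_would_warn() are
hypothetical, the RQCF_* values are assumed to mirror
kernel/sched/sched.h, and the lock model assumes that taking the rq lock
clears RQCF_UPDATED as described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the flag values in kernel/sched/sched.h. */
#define RQCF_REQ_SKIP	0x01
#define RQCF_ACT_SKIP	0x02
#define RQCF_UPDATED	0x04

/* Hypothetical, reduced stand-in for struct rq. */
struct rq_model {
	unsigned int clock_update_flags;
};

/* Assumption: acquiring the rq lock drops RQCF_UPDATED, keeps skip bits. */
static void rq_lock_model(struct rq_model *rq)
{
	rq->clock_update_flags &= (RQCF_REQ_SKIP | RQCF_ACT_SKIP);
}

static void update_rq_clock_model(struct rq_model *rq)
{
	rq->clock_update_flags |= RQCF_UPDATED;
}

/* Models the condition checked by assert_clock_updated(). */
static bool clock_read_would_warn(const struct rq_model *rq)
{
	return rq->clock_update_flags < RQCF_ACT_SKIP;
}

void example_same_rq_stale_clock(void)
{
	struct rq_model rq = { .clock_update_flags = RQCF_UPDATED };

	rq_lock_model(&rq);		/* lock taken, update flag cleared */

	/* rq == p_rq: the quoted hunk skips update_rq_clock(p_rq). */
	assert(clock_read_would_warn(&rq));

	update_rq_clock_model(&rq);	/* what the remote-rq branch does */
	assert(!clock_read_would_warn(&rq));
}
```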

> +	}
>  
>  	yield_task_fair(rq);
>  
>  	return true;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612013355.59231-1-kernellwp@gmail.com?part=4


* Re: [PATCH v3 03/10] sched/fair: Credit queued next-buddy via canonical requeue
  2026-06-12  1:33 ` [PATCH v3 03/10] sched/fair: Credit queued next-buddy via canonical requeue Wanpeng Li
@ 2026-06-12  1:55   ` sashiko-bot
  0 siblings, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-06-12  1:55 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: kvm

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [High] Unconditional setting of `se->rel_deadline = 1` ignores `sched_feat(PLACE_REL_DEADLINE)`, leaking relative deadline flags and corrupting absolute deadlines on reweighting.

Pre-existing issues:
- [High] `__enqueue_entity` fails to reset `se->max_slice = se->slice`, causing stale subtree maximums to incorrectly inflate the augmented rbtree.
--

commit 61176c863b2b0bee3f465f1148e81251b1cf5bda
Author: Wanpeng Li <wanpengli@tencent.com>

sched/fair: Credit queued next-buddy via canonical requeue

This commit extends eevdf_credit_entity_vlag() to handle a queued
entity. It uses the canonical requeue pattern to keep sum_w_vruntime
consistent with entity_key(), preserving the slice and shifting the
deadline.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c

[ ... ]

> @@ -9422,7 +9424,22 @@ eevdf_credit_entity_vlag(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  		return;
>  	}
>  
> -	/* Queued entities are left unchanged by this helper path. */
> +	/*
> +	 * Canonical place_entity()-paired requeue: see dequeue_entity() and
> +	 * requeue_delayed_entity(). place_entity() restores the deadline via
> +	 * "se->deadline += se->vruntime", so make the deadline relative to
> +	 * se->vruntime here (not avg_vruntime). This preserves the slice and
> +	 * shifts the deadline by exactly -credit, mirroring the curr branch
> +	 * and keeping deadline > vruntime.
> +	 */
> +	se->vlag = vlag + (s64)credit;
> +	se->deadline -= se->vruntime;
> +	se->rel_deadline = 1;

[Severity: High]
Does setting the relative deadline flag unconditionally here cause an issue
if the PLACE_REL_DEADLINE feature is disabled?

Looking at place_entity(), the flag is only cleared if the feature is enabled:

place_entity() {
	...
	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
		se->deadline += se->vruntime;
		se->rel_deadline = 0;
		return;
	}
	...
	se->deadline = se->vruntime + vslice;
}

If the feature is disabled, place_entity() computes an absolute deadline but
leaves se->rel_deadline set to 1. If the task then sleeps and its nice value
changes, rescale_entity() checks se->rel_deadline and will incorrectly apply
relative scaling to the absolute deadline:

rescale_entity() {
	...
	if (se->rel_deadline)
		se->deadline = div64_long(se->deadline * old_weight, weight);
	...
}

Could this arithmetic overflow and corrupt the deadline, potentially breaking
the rbtree priority ordering?
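
The flag leak can be followed with a small userspace model. It is a
sketch under stated assumptions: struct se_model is a hypothetical
reduction of struct sched_entity, place_entity_model() keeps vruntime
fixed and takes the PLACE_REL_DEADLINE state as a parameter, and
rescale_deadline_model() covers only the deadline step quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct sched_entity. */
struct se_model {
	int64_t vruntime;
	int64_t deadline;
	int64_t vslice;
	bool rel_deadline;
};

/* Deadline handling of place_entity() as quoted; vruntime is left as-is. */
static void place_entity_model(struct se_model *se, bool place_rel_deadline)
{
	if (place_rel_deadline && se->rel_deadline) {
		se->deadline += se->vruntime;
		se->rel_deadline = false;
		return;
	}
	se->deadline = se->vruntime + se->vslice;
}

/* Deadline step of rescale_entity() as quoted. */
static void rescale_deadline_model(struct se_model *se, int64_t old_weight,
				   int64_t weight)
{
	if (se->rel_deadline)
		se->deadline = se->deadline * old_weight / weight;
}

void example_rel_deadline_leak(void)
{
	struct se_model se = {
		.vruntime = 1000000, .deadline = 1003000, .vslice = 3000,
	};

	/* Requeue path as quoted: make the deadline relative, set the flag. */
	se.deadline -= se.vruntime;
	se.rel_deadline = true;

	place_entity_model(&se, false);	/* PLACE_REL_DEADLINE disabled */
	assert(se.deadline == 1003000);	/* deadline is absolute again */
	assert(se.rel_deadline);	/* but the flag is still set */

	rescale_deadline_model(&se, 1024, 512);
	assert(se.deadline == 2006000);	/* absolute value scaled as relative */
}
```

With the feature enabled in this model, the flag is cleared by
place_entity_model() and the later rescale leaves the deadline alone.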

> +	cfs_rq->nr_queued--;
> +	__dequeue_entity(cfs_rq, se);
> +	place_entity(cfs_rq, se, 0);
> +	__enqueue_entity(cfs_rq, se);

[Severity: High]
This is a pre-existing issue, but does the call to __enqueue_entity()
expose a stale max_slice issue in the augmented rbtree?

When an entity is inserted as a leaf node, __enqueue_entity() resets the
min_vruntime and min_slice, but appears to leave max_slice untouched:

__enqueue_entity() {
	sum_w_vruntime_add(cfs_rq, se);
	se->min_vruntime = se->vruntime;
	se->min_slice = se->slice;
	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
				__entity_less, &min_vruntime_cb);
}

If the task previously had a large max_slice (for instance, when it was a
parent of tasks with large slices), does that stale max_slice propagate up
the tree during rb_add_augmented_cached()?

Could this artificially inflate cfs_rq_max_slice(cfs_rq) and circumvent the
lag clamping bounds in entity_lag(), allowing tasks to hoard excessive lag?

> +	cfs_rq->nr_queued++;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612013355.59231-1-kernellwp@gmail.com?part=3


* Re: [PATCH v3 08/10] KVM: x86/lapic: Clear IPI tracking on matching-vector EOI
  2026-06-12  1:33 ` [PATCH v3 08/10] KVM: x86/lapic: Clear IPI tracking on matching-vector EOI Wanpeng Li
@ 2026-06-12  3:46   ` sashiko-bot
  0 siblings, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-06-12  3:46 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: kvm

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The EOI handler incorrectly clears the vCPU's outgoing IPI tracking state when acknowledging an incoming IPI due to conflated incoming and outgoing tracking fields.
--

commit f711eb495ad18e2339b40d0c737023fdb3459a71
Author: Wanpeng Li <wanpengli@tencent.com>

KVM: x86/lapic: Clear IPI tracking on matching-vector EOI

This commit updates apic_set_eoi() and kvm_apic_set_eoi_accelerated() to
clear IPI tracking state only when the EOI vector matches the tracked
vector, aiming to prevent eviction of pending contexts when unrelated
interrupts are handled.

> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
[ ... ]
> +static void kvm_clear_ipi_on_eoi(struct kvm_lapic *apic, int vector)
> +{
> +	struct kvm_vcpu *receiver = apic->vcpu;
> +	struct kvm_vcpu *sender;
> +	int sender_idx;
> +	u64 then, now;
> +
> +	if (unlikely(!READ_ONCE(ipi_tracking_enabled)))
> +		return;
> +
> +	if (vector < 0 || vector > 0xff)
> +		return;
> +
> +	if (READ_ONCE(receiver->arch.ipi_context.vector) != (u8)vector)
> +		return;
> +
> +	sender_idx = READ_ONCE(receiver->arch.ipi_context.last_ipi_sender);
> +	kvm_vcpu_clear_ipi_context(receiver);

[Severity: Medium]
Does calling kvm_vcpu_clear_ipi_context() here unintentionally clear the
vCPU's outgoing IPI tracking state?

Looking at kvm_vcpu_clear_ipi_context(), it clears the entire ipi_context
structure, including pending_ipi and last_ipi_receiver. Since a vCPU can
act as both a sender and a receiver concurrently, acknowledging an incoming
IPI here appears to erase the vCPU's state as a sender.

Could this destroy the directed yield hint and cause concurrent IPI sender
and receiver vCPUs to miss the yield optimization?
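
One way to picture the concern is a context whose incoming and outgoing halves are stored separately, so that an EOI only touches the incoming half. The sketch below is a standalone userspace model, not the series' code: the struct layout, field names, and ipi_clear_rx_on_eoi() are all hypothetical, and the real ipi_context may be organized differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: state about IPIs this vCPU has received. */
struct ipi_rx_state {
	int     last_sender;   /* index of the vCPU that sent the tracked IPI */
	uint8_t vector;        /* vector of the tracked incoming IPI */
	bool    valid;
};

/* Hypothetical layout: state about IPIs this vCPU has sent. */
struct ipi_tx_state {
	int  last_receiver;    /* index of the vCPU we are waiting on */
	bool pending;
};

struct ipi_context {
	struct ipi_rx_state rx;
	struct ipi_tx_state tx;
};

/*
 * Clear only the receiver-side state when the EOI vector matches the
 * tracked incoming vector. The sender-side state is left untouched, so
 * a vCPU that is both sender and receiver keeps its outgoing hint.
 * Returns true if the incoming state was cleared.
 */
static bool ipi_clear_rx_on_eoi(struct ipi_context *ctx, int vector)
{
	if (vector < 0 || vector > 0xff)
		return false;
	if (!ctx->rx.valid || ctx->rx.vector != (uint8_t)vector)
		return false;

	ctx->rx.valid = false;
	ctx->rx.last_sender = -1;
	ctx->rx.vector = 0;
	return true;
}
```

With this split, acknowledging an incoming IPI cannot erase pending or last_receiver by construction; whether such a split fits the locking and lifetime rules of the actual patch is a separate question.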

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612013355.59231-1-kernellwp@gmail.com?part=8


* Re: [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM
  2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
                   ` (9 preceding siblings ...)
  2026-06-12  1:33 ` [PATCH v3 10/10] KVM: Add relaxed preempted-only fallback for directed yield Wanpeng Li
@ 2026-06-12  5:17 ` K Prateek Nayak
  2026-06-12  9:43 ` Shrikanth Hegde
  11 siblings, 0 replies; 22+ messages in thread
From: K Prateek Nayak @ 2026-06-12  5:17 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: Christian Borntraeger, Steven Rostedt, Vincent Guittot,
	Juri Lelli, linux-kernel, kvm, Wanpeng Li, Richie Buturla

Hello Wanpeng,

On 6/12/2026 7:03 AM, Wanpeng Li wrote:
> Part 1: Scheduler EEVDF lag credit (patches 1-5)
> 
> Rather than penalizing the yielding vCPU, credit the nominated target so
> pick_eevdf() honors the buddy hint.
> 
> The mechanism is EEVDF-native and cgroup-hierarchy-aware:
> 
> - Credit bounded EEVDF lag to the nominated next-buddy so pick_eevdf()'s
>   PICK_BUDDY branch returns it.  Walk the same ancestor chain that
>   set_next_buddy() nominated and credit each not-yet-eligible level, so the
>   hint is not dropped at the first ineligible group entity.

I believe Peter is planning to flatten the pick by v7.3, so I would
suggest testing the flattened pick series [1], which is available in
the sched/flat branch of Peter's tree [2].

That should get rid of the need to traverse the hierarchy and
should solve one part of your problem of yielding to vCPUs across
different cgroups.

[1] https://lore.kernel.org/lkml/20260605105513.354837583@infradead.org/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/flat

> 
> - Credit to a small positive-vlag margin, not merely the vlag = 0
>   eligibility boundary, so the target stays eligible across several
>   scheduling decisions rather than a single pick.  The margin scales with
>   runqueue depth and is clamped to entity_lag()'s legal positive-lag bound,
>   preserving EEVDF fairness.
> 
> - Handle both the off-tree current entity (shifted in place, carrying any
>   vprot window)


Is it even possible to yield to an off-tree entity? The core bits
in syscalls.c already bails out for:

    if (task_on_cpu(p_rq, p) || !task_is_running(p))
           return;

and the early bits in yield_to_task_fair() bail out for "!se->on_rq",
which makes me wonder when we will ever have p->se as cfs_rq->curr
while holding both the p->pi_lock and the rq_lock?

The task must be on the rq while being preempted for yield_to to work,
no?

>   and a queued (on-tree) entity (repositioned via the
>   canonical place_entity()-paired requeue used by requeue_delayed_entity(),
>   keeping sum_w_vruntime consistent with entity_key()).
> 
> - Force a local reschedule at the end of the credit path: cancel
>   RUN_TO_PARITY slice protection along the yielder's sched_entity chain and
>   resched_curr() the local CPU.  Only this forced preemption is rate
>   limited (once per 6ms per rq) to avoid excessive forced preemption on
>   PLE-heavy guests; the lag credit itself runs on every directed yield.
> 
> The mechanism is gated by SCHED_FEAT(YIELD_TO_LAG_CREDIT) (default on).
> With the feature off, yield_to_task_fair() keeps the existing forfeit-only
> behavior.

[...]

> The gains stem from three factors:
> 
> 1. Lock holders receive sustained CPU time to complete critical sections,
>    reducing lock hold duration and cascading contention.
> 
> 2. IPI receivers are scheduled promptly when senders spin, reducing IPI
>    response latency and wasted spin cycles.

Looking at kvm_smp_send_call_func_ipi() in arch/x86/kernel/kvm.c, there
can be multiple destination vCPUs for the IPI. Why does it make sense
for the sender to yield almost all its time to the first vCPU on the
mask then?

And do all IPIs have to spin? Can't they be async too?

> 
> 3. Reduced context switching between lock waiters and holders improves
>    cache utilization.
> 

[...]

> 
> Patch Organization
> ------------------
> 
> Patches 1-5: Scheduler EEVDF lag credit
> 
>   Patch 1: Add the eevdf_credit_entity_vlag() primitive and the
>            YIELD_TO_LAG_CREDIT feature.  Handles the off-tree current
>            entity and has no functional effect on its own.
> 
>   Patch 2: Credit to a persistent, queue-depth-scaled positive-vlag
>            margin, clamped to entity_lag()'s legal bound.
> 
>   Patch 3: Extend the primitive to a queued (on-tree) entity via the
>            canonical place_entity()-paired requeue.
> 
>   Patch 4: Wire the credit walk into yield_to_task_fair(), crediting each
>            level of the nominated ancestor chain.
> 
>   Patch 5: Force a local reschedule (cancel RUN_TO_PARITY slice protection
>            and resched_curr()) so the credited buddy can be selected.
>            Activation patch; rate-limits only the forced preemption.

I don't know if it is just me, but this structure made the series very
difficult to review: the functions are unused until their callers are
added in Patch 4, so it is hard to understand how it all works before then.

All of this will require rework with flattened pick but I would suggest
adding the simple lag movement bits first and adding the
eevdf_persistent_margin() magic later on top.

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v3 01/10] sched/fair: Add EEVDF lag credit primitive for nominated next-buddy
  2026-06-12  1:33 ` [PATCH v3 01/10] sched/fair: Add EEVDF lag credit primitive for nominated next-buddy Wanpeng Li
  2026-06-12  1:49   ` sashiko-bot
@ 2026-06-12  5:34   ` K Prateek Nayak
  1 sibling, 0 replies; 22+ messages in thread
From: K Prateek Nayak @ 2026-06-12  5:34 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: Christian Borntraeger, Steven Rostedt, Vincent Guittot,
	Juri Lelli, linux-kernel, kvm, Wanpeng Li, Richie Buturla

Hello Wanpeng,

On 6/12/2026 7:03 AM, Wanpeng Li wrote:
> +static void __maybe_unused

I would suggest adding this with the first caller to avoid having to
jump between this and patch 3.

> +eevdf_credit_entity_vlag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +	u64 avruntime, credit;
> +	s64 vlag;
> +
> +	/* Callers gate this helper with YIELD_TO_LAG_CREDIT. */
> +	if (cfs_rq->nr_queued < 2)
> +		return;
> +	if (throttled_hierarchy(cfs_rq))
> +		return;
> +	if (WARN_ON_ONCE(!se->on_rq) || se->sched_delayed)
> +		return;
> +
> +	update_curr(cfs_rq);
> +	avruntime = avg_vruntime(cfs_rq);
> +	vlag = entity_lag(cfs_rq, se, avruntime);
> +
> +	/* Already eligible: nothing to do. */
> +	if (vlag >= 0)
> +		return;
> +
> +	credit = (u64)(-vlag);

What is this random credit and how is it fair? My suggestion on the
previous version was simple:

   Task0 -> yield_to -> Task1

- Task0 is eligible and has +ve lag

- Task0 transfers lag to Task1 and makes itself ineligible

- Task1 picks up the proportional lag; it may become eligible and get to
  run, but there is no guarantee, because that is what is fair.

There is nothing fair about simply making a task eligible just because
someone else decided to call yield_to() on it.
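
The transfer described above can be sketched as a toy userspace model. Everything here is an assumption made for illustration: struct toy_entity, the wlag field (weighted lag, i.e. weight times the distance from the zero-lag point), and both helpers are hypothetical and do not correspond to kernel structures. The yielder is left at the zero-lag point; pushing it strictly below zero would need an extra debit that this sketch leaves out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy entity: lag is stored pre-multiplied by weight so transfers are exact. */
struct toy_entity {
	int64_t weight;
	int64_t wlag;	/* weighted lag: weight * (V - vruntime) in this model */
};

/* In this model an entity is eligible when its lag is non-negative. */
static bool toy_eligible(const struct toy_entity *se)
{
	return se->wlag >= 0;
}

/* Lag in virtual-time units, as seen by an entity of this weight. */
static int64_t toy_vlag(const struct toy_entity *se)
{
	return se->wlag / se->weight;
}

/*
 * Move the yielder's positive weighted lag to the target. The sum of
 * weighted lag over both entities is unchanged, so the yielder pays for
 * whatever the target gains. Returns the amount moved (0 if the yielder
 * had nothing to give).
 */
static int64_t toy_transfer_lag(struct toy_entity *yielder,
				struct toy_entity *target)
{
	int64_t moved = yielder->wlag;

	if (moved <= 0)
		return 0;

	yielder->wlag = 0;
	target->wlag += moved;
	return moved;
}
```

Because the target only receives what the yielder gives up, a target that is deep in negative lag can remain ineligible after the transfer, which is the "no guarantee" part of the suggestion.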

> +
> +	if (cfs_rq->curr == se) {

As mentioned in my reply on the cover, I'm having a hard time
understanding how this can arise when we make sure "p" is a TASK_RUNNING
that is also "!p->on_cpu && p->se.on_rq" while holding p->pi_lock and the
rq_lock.

> +		/* curr is off-tree: in-place shift, carrying any vprot window. */
> +		if (protect_slice(se))
> +			se->vprot -= credit;
> +		se->vruntime -= credit;
> +		se->deadline -= credit;
> +		return;
> +	}
> +
> +	/* Queued entities are left unchanged by this helper path. */
> +}
-- 
Thanks and Regards,
Prateek



* Re: [PATCH v3 02/10] sched/fair: Credit a persistent, queue-depth-scaled vlag margin
  2026-06-12  1:33 ` [PATCH v3 02/10] sched/fair: Credit a persistent, queue-depth-scaled vlag margin Wanpeng Li
  2026-06-12  1:53   ` sashiko-bot
@ 2026-06-12  6:07   ` K Prateek Nayak
  1 sibling, 0 replies; 22+ messages in thread
From: K Prateek Nayak @ 2026-06-12  6:07 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: Christian Borntraeger, Steven Rostedt, Vincent Guittot,
	Juri Lelli, linux-kernel, kvm, Wanpeng Li, Richie Buturla

Hello Wanpeng,

On 6/12/2026 7:03 AM, Wanpeng Li wrote:
> -	/* Already eligible: nothing to do. */
> -	if (vlag >= 0)
> -		return;
> +	/* Clamp the margin to entity_lag()'s bound so place_entity() keeps it. */
> +	max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;
> +	lag_limit = calc_delta_fair(max_slice, se);
> +	margin = eevdf_persistent_margin(cfs_rq, se);
> +	if (lag_limit && margin > lag_limit)
> +		margin = lag_limit;
> +	if (vlag >= 0) {
> +		if ((u64)vlag >= margin)
> +			return;
> +		want = margin - (u64)vlag;

And we are making an eligible task more eligible but *where* is this
credit coming from? Someone should be paying the price right?

Just adding an arbitrary credit and then moving the task doing the
yield_to() to the 0-lag point is simply not fair IMO.

> +	} else {
> +		want = margin + (u64)(-vlag);
> +	}
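
For reference, the arithmetic of the quoted branches can be restated as a standalone function. This is only a userspace restatement of the hunk as quoted, with the margin assumed to be already clamped; credit_wanted() is a hypothetical name and the kernel types are replaced by stdint ones.

```c
#include <stdint.h>

/*
 * Restates the quoted branches: how much credit is requested for a
 * target with the given vlag so that it ends up at +margin.
 * "margin" is assumed to be already clamped to the lag limit.
 */
static uint64_t credit_wanted(int64_t vlag, uint64_t margin)
{
	if (vlag >= 0) {
		/* Already at or beyond the margin: nothing requested. */
		if ((uint64_t)vlag >= margin)
			return 0;
		/* Eligible but below the margin: top up to the margin. */
		return margin - (uint64_t)vlag;
	}

	/* Ineligible: cover the deficit and then add the full margin. */
	return margin + ((uint64_t)0 - (uint64_t)vlag);
}
```

For example, with a margin of 100 a target at vlag = -300 requests 400 and a target at vlag = 40 requests 60. In both cases the target lands on +margin, and the hunk as quoted shows no matching debit against any other entity.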

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM
  2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
                   ` (10 preceding siblings ...)
  2026-06-12  5:17 ` [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM K Prateek Nayak
@ 2026-06-12  9:43 ` Shrikanth Hegde
  11 siblings, 0 replies; 22+ messages in thread
From: Shrikanth Hegde @ 2026-06-12  9:43 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Richie Buturla, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson

Hi Wanpeng.

On 6/12/26 7:03 AM, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> On overcommitted hosts, a spinning vCPU often calls yield_to() to let a
> lock holder or IPI receiver run.  The hint can be ineffective for two
> independent reasons: the scheduler may fail to select the nominated task,
> and KVM may nominate a task that is not the one the spinning vCPU is
> waiting for.
> 

I was curious to try this as we have been working on paravirt scheduling
which aims at reducing the vCPU preemption based on the steal time.
https://lore.kernel.org/all/20260514152204.481115-1-sshegde@linux.ibm.com/

I gave the patch series a try. With both a real-life workload (daytrader) and
a micro-benchmark (hackbench), I don't see any benefit. Note that the same
workloads are running in all VMs at the same time.


I had a few questions about this series:
- Do you observe steal time when you run your workload? If so, reducing the
   number of vCPUs will naturally move IPIs to the reduced set, and with less
   vCPU preemption the yield may not be necessary either.

- Is there vCPU pinning? Is the workload pinned separately to explicit vCPUs?

- Are only a few vCPUs taking the locks in the workload you mentioned in the
   cover letter?


Thread overview: 22+ messages
2026-06-12  1:33 [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM Wanpeng Li
2026-06-12  1:33 ` [PATCH v3 01/10] sched/fair: Add EEVDF lag credit primitive for nominated next-buddy Wanpeng Li
2026-06-12  1:49   ` sashiko-bot
2026-06-12  5:34   ` K Prateek Nayak
2026-06-12  1:33 ` [PATCH v3 02/10] sched/fair: Credit a persistent, queue-depth-scaled vlag margin Wanpeng Li
2026-06-12  1:53   ` sashiko-bot
2026-06-12  6:07   ` K Prateek Nayak
2026-06-12  1:33 ` [PATCH v3 03/10] sched/fair: Credit queued next-buddy via canonical requeue Wanpeng Li
2026-06-12  1:55   ` sashiko-bot
2026-06-12  1:33 ` [PATCH v3 04/10] sched/fair: Credit nominated next-buddy in yield_to_task_fair() Wanpeng Li
2026-06-12  1:54   ` sashiko-bot
2026-06-12  1:33 ` [PATCH v3 05/10] sched/fair: Force a local resched on yield_to() so the buddy is picked Wanpeng Li
2026-06-12  1:50   ` sashiko-bot
2026-06-12  1:33 ` [PATCH v3 06/10] KVM: x86: Add IPI tracking infrastructure for directed yield Wanpeng Li
2026-06-12  1:33 ` [PATCH v3 07/10] KVM: x86/lapic: Track unicast fixed IPI delivery Wanpeng Li
2026-06-12  1:33 ` [PATCH v3 08/10] KVM: x86/lapic: Clear IPI tracking on matching-vector EOI Wanpeng Li
2026-06-12  3:46   ` sashiko-bot
2026-06-12  1:33 ` [PATCH v3 09/10] KVM: Add IPI-aware directed-yield candidate selection Wanpeng Li
2026-06-12  1:48   ` sashiko-bot
2026-06-12  1:33 ` [PATCH v3 10/10] KVM: Add relaxed preempted-only fallback for directed yield Wanpeng Li
2026-06-12  5:17 ` [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM K Prateek Nayak
2026-06-12  9:43 ` Shrikanth Hegde
