public inbox for kvm@vger.kernel.org
* [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
@ 2025-12-19  3:53 Wanpeng Li
  2025-12-19  3:53 ` [PATCH v2 1/9] sched: Add vCPU debooster infrastructure Wanpeng Li
                   ` (12 more replies)
  0 siblings, 13 replies; 19+ messages in thread
From: Wanpeng Li @ 2025-12-19  3:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

This series addresses long-standing yield_to() inefficiencies in
virtualized environments through two complementary mechanisms: a vCPU
debooster in the scheduler and IPI-aware directed yield in KVM.

Problem Statement
-----------------

In overcommitted virtualization scenarios, vCPUs frequently spin on locks
held by other vCPUs that are not currently running. The kernel's
paravirtual spinlock support detects these situations and calls yield_to()
to boost the lock holder, allowing it to run and release the lock.

However, the current implementation has two critical limitations:

1. Scheduler-side limitation:

   yield_to_task_fair() relies solely on set_next_buddy() to provide
   preference to the target vCPU. This buddy mechanism only offers
   immediate, transient preference. Once the buddy hint expires (typically
   after one scheduling decision), the yielding vCPU may preempt the target
   again, especially in nested cgroup hierarchies where vruntime domains
   differ.

   This creates a ping-pong effect: the lock holder runs briefly, gets
   preempted before completing critical sections, and the yielding vCPU
   spins again, triggering another futile yield_to() cycle. The overhead
   accumulates rapidly in workloads with high lock contention.

2. KVM-side limitation:

   kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
   directed yield candidate selection. However, it lacks awareness of IPI
   communication patterns. When a vCPU sends an IPI and spins waiting for
   a response (common in inter-processor synchronization), the current
   heuristics often fail to identify the IPI receiver as the yield target.

   Instead, the code may boost an unrelated vCPU based on coarse-grained
   preemption state, missing opportunities to accelerate actual IPI
   response handling. This is particularly problematic when the IPI
   receiver is runnable but not scheduled, as lock-holder-detection logic
   doesn't capture the IPI dependency relationship.

Combined, these issues cause excessive lock hold times, cache thrashing,
and degraded throughput in overcommitted environments, particularly
affecting workloads with fine-grained synchronization patterns.

Solution Overview
-----------------

The series introduces two independent but mutually reinforcing improvements:

Part 1: Scheduler vCPU Debooster (patches 1-5)

Augment yield_to_task_fair() with bounded vruntime penalties to provide
sustained preference beyond the buddy mechanism. When a vCPU yields to a
target, apply a carefully tuned vruntime penalty to the yielding vCPU,
ensuring the target maintains scheduling advantage for longer periods.

The mechanism is EEVDF-aware and cgroup-hierarchy-aware:

- Locate the lowest common ancestor (LCA) in the cgroup hierarchy where
  both the yielding and target tasks coexist. This ensures vruntime
  adjustments occur at the correct hierarchy level, maintaining fairness
  across cgroup boundaries.

- Update EEVDF scheduler fields (vruntime, deadline) atomically to keep
  the scheduler state consistent. Note that vlag is intentionally not
  modified as it will be recalculated on dequeue/enqueue cycles. The
  penalty shifts the yielding task's virtual deadline forward, allowing
  the target to run.

- Apply queue-size-adaptive penalties that scale from 6.0x scheduling
  granularity for 2-task scenarios (strong preference) down to 1.0x for
  large queues (>12 tasks), balancing preference against starvation risks.

- Implement reverse-pair debouncing: when task A yields to B, then B yields
  to A within a short window (~600us), downscale the penalty to prevent
  ping-pong oscillation.

- Rate-limit penalty application to 6ms intervals to prevent pathological
  overhead when yields occur at very high frequency.
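
As a rough illustration, the queue-size-adaptive cap and the reverse-pair
debounce described above can be modeled in plain C. This is a simplified
userspace sketch of the logic (names such as penalty_cap_mult_x10 and
yield_pair_state are illustrative, not the kernel implementation):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/*
 * Queue-size-adaptive cap, expressed as a multiple of the scheduling
 * granularity scaled by 10 (6.0x -> 60) to stay in integer arithmetic.
 */
static unsigned int penalty_cap_mult_x10(unsigned int h_nr_queued)
{
	if (h_nr_queued <= 2)
		return 60;	/* 6.0x: strong push for 2-task ping-pong */
	if (h_nr_queued == 3)
		return 40;	/* 4.0x */
	if (h_nr_queued <= 6)
		return 25;	/* 2.5x */
	if (h_nr_queued <= 8)
		return 20;	/* 2.0x */
	if (h_nr_queued <= 12)
		return 15;	/* 1.5x */
	return 10;		/* 1.0x: minimal push, avoids starvation */
}

/* Reverse-pair debounce: A->B then B->A within ~600us downscales the penalty. */
struct yield_pair_state {
	int last_src_pid, last_dst_pid;
	uint64_t last_pair_time_ns;
};

static uint64_t apply_debounce(struct yield_pair_state *st, uint64_t now,
			       int src, int dst, uint64_t penalty,
			       uint64_t need, uint64_t gran)
{
	if (st->last_src_pid == dst && st->last_dst_pid == src &&
	    now - st->last_pair_time_ns <= 600 * NSEC_PER_USEC) {
		uint64_t alt = need > gran ? need : gran;

		if (penalty > alt)
			penalty = alt;	/* downscale to prevent oscillation */
	}
	st->last_src_pid = src;
	st->last_dst_pid = dst;
	st->last_pair_time_ns = now;
	return penalty;
}
```

The kernel version (patch 4) performs the same two steps under rq->lock,
with the pair state held in the per-rq fields added by patch 1.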

The debooster works *with* the buddy mechanism rather than replacing it:
set_next_buddy() provides immediate preference for the next scheduling
decision, while the vruntime penalty sustains that preference over
subsequent decisions. This dual approach proves especially effective in
nested cgroup scenarios where buddy hints alone are insufficient.

Part 2: KVM IPI-Aware Directed Yield (patches 6-9)

Enhance kvm_vcpu_on_spin() with lightweight IPI tracking to improve
directed yield candidate selection. Track sender/receiver relationships
when IPIs are delivered and use this information to prioritize yield
targets.

The tracking mechanism:

- Hooks into kvm_irq_delivery_to_apic() to detect unicast fixed IPIs (the
  common case for inter-processor synchronization). When exactly one
  destination vCPU receives an IPI, record the sender->receiver relationship
  with a monotonic timestamp.

  In high VM density scenarios, software-based IPI tracking through
  interrupt delivery interception is particularly valuable: it captures
  the precise sender/receiver relationship that scheduling decisions can
  act on. In overcommitted environments these scheduling benefits can
  complement, and in some cases exceed, what hardware-accelerated
  interrupt delivery alone provides.

- Uses lockless READ_ONCE/WRITE_ONCE accessors for minimal overhead. The
  per-vCPU ipi_context structure is carefully designed to avoid cache line
  bouncing.

- Implements a short recency window (50ms default) to avoid stale IPI
  information inflating boost priority on throughput-sensitive workloads.
  Old IPI relationships are naturally aged out.

- Clears IPI context on EOI with two-stage precision: unconditionally clear
  the receiver's context (it processed the interrupt), but only clear the
  sender's pending flag if the receiver matches and the IPI is recent. This
  prevents unrelated EOIs from prematurely clearing valid IPI state.
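
A minimal model of the recency window and the two-stage EOI clearing can
be sketched as follows. The struct layout and field names here are
hypothetical; they stand in for the per-vCPU ipi_context described above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IPI_WINDOW_NS 50000000ULL	/* 50ms default recency window */

/* Illustrative per-vCPU IPI context (field names are hypothetical). */
struct ipi_context {
	int pending_receiver;	/* vCPU id we last sent an IPI to, -1 if none */
	int last_sender;	/* vCPU id that last sent us an IPI, -1 if none */
	uint64_t sent_time_ns;	/* monotonic timestamp of the send */
};

/* Record a unicast fixed IPI: sender->receiver with a timestamp. */
static void record_unicast_ipi(struct ipi_context *snd, struct ipi_context *rcv,
			       int sender_id, int receiver_id, uint64_t now)
{
	snd->pending_receiver = receiver_id;
	snd->sent_time_ns = now;
	rcv->last_sender = sender_id;
}

/* Stale relationships age out naturally once the window has elapsed. */
static bool ipi_recent(const struct ipi_context *snd, uint64_t now)
{
	return snd->pending_receiver >= 0 &&
	       now - snd->sent_time_ns <= IPI_WINDOW_NS;
}

/*
 * Two-stage EOI clearing: always clear the receiver's context (it has
 * processed the interrupt), but only clear the sender's pending flag when
 * the pairing matches and the IPI is still recent, so unrelated EOIs do
 * not wipe valid state.
 */
static void clear_on_eoi(struct ipi_context *rcv_ctx, int receiver_id,
			 struct ipi_context *snd_ctx, uint64_t now)
{
	rcv_ctx->last_sender = -1;			/* stage 1: unconditional */
	if (snd_ctx->pending_receiver == receiver_id &&	/* stage 2: guarded */
	    ipi_recent(snd_ctx, now))
		snd_ctx->pending_receiver = -1;
}
```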

The candidate selection follows a priority hierarchy:

  Priority 1: Confirmed IPI receiver
    If the spinning vCPU recently sent an IPI to another vCPU and that IPI
    is still pending (within the recency window), unconditionally boost the
    receiver. This directly addresses the "spinning on IPI response" case.

  Priority 2: Fast pending interrupt
    Leverage arch-specific kvm_arch_dy_has_pending_interrupt() for
    compatibility with existing optimizations.

  Priority 3: Preempted in kernel mode
    Fall back to traditional preemption-based logic when yield_to_kernel_mode
    is requested, ensuring compatibility with existing workloads.

A two-round fallback mechanism provides a safety net: if the first round
with strict IPI-aware selection finds no eligible candidate (e.g., due to
missed IPI context or transient runnable set changes), a second round
applies relaxed selection gated only by preemption state. This is
controlled by the enable_relaxed_boost module parameter (default on).
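
The priority hierarchy and two-round fallback can be sketched as a
simplified selection loop. This is an illustrative model only (the real
kvm_vcpu_on_spin() also handles last-boosted rotation and arch hooks);
the struct and function names below are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative per-vCPU state for candidate selection. */
struct vcpu_state {
	bool runnable;
	bool is_confirmed_ipi_receiver;	/* priority 1 */
	bool has_pending_interrupt;	/* priority 2 */
	bool preempted_in_kernel;	/* priority 3 */
};

static bool eligible_strict(const struct vcpu_state *v, bool yield_to_kernel_mode)
{
	if (!v->runnable)
		return false;
	if (v->is_confirmed_ipi_receiver)	/* priority 1: boost unconditionally */
		return true;
	if (v->has_pending_interrupt)		/* priority 2 */
		return true;
	if (yield_to_kernel_mode && v->preempted_in_kernel)	/* priority 3 */
		return true;
	return false;
}

/*
 * Two-round selection: a strict IPI-aware pass runs first; if it finds
 * no candidate, a relaxed pass gated only by preemption state acts as
 * the safety net (enable_relaxed_boost).
 */
static int pick_yield_target(const struct vcpu_state *vcpus, int n,
			     bool yield_to_kernel_mode, bool enable_relaxed_boost)
{
	for (int i = 0; i < n; i++)
		if (eligible_strict(&vcpus[i], yield_to_kernel_mode))
			return i;
	if (enable_relaxed_boost)
		for (int i = 0; i < n; i++)
			if (vcpus[i].runnable && vcpus[i].preempted_in_kernel)
				return i;
	return -1;	/* no eligible candidate */
}
```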

Implementation Details
----------------------

Both mechanisms are designed for minimal overhead and runtime control:

- All locking occurs under existing rq->lock or per-vCPU locks; no new
  lock contention is introduced.

- Penalty calculations use integer arithmetic with overflow protection.

- IPI tracking uses monotonic timestamps (ktime_get_mono_fast_ns()) for
  efficient, race-free recency checks.

Advantages over paravirtualization approaches:

- No guest OS modification required: This solution operates entirely within
  the host kernel, providing transparent optimization without guest kernel
  changes or recompilation.

- Guest OS agnostic: Works uniformly across Linux, Windows, and other guest
  operating systems, unlike PV TLB shootdown which requires guest-side
  paravirtual driver support.

- Broader applicability: Captures IPI patterns from all synchronization
  primitives (spinlocks, RCU, smp_call_function, etc.), not limited to
  specific paravirtualized operations like TLB shootdown.

- Deployment simplicity: Existing VM images benefit immediately without
  guest kernel updates, critical for production environments with diverse
  guest OS versions and configurations.

- Runtime controls allow disabling features if needed:
  * /sys/kernel/debug/sched/vcpu_debooster_enabled
  * /sys/module/kvm/parameters/ipi_tracking_enabled
  * /sys/module/kvm/parameters/enable_relaxed_boost

- The infrastructure is incrementally introduced: early patches add inert
  scaffolding that can be verified for zero performance impact before
  activation.

Performance Results
-------------------

Test environment: Intel Xeon, 16 physical cores, 16 vCPUs per VM

Dbench 16 clients per VM (filesystem metadata operations):
  2 VMs: +14.4% throughput (lock contention reduction)
  3 VMs:  +9.8% throughput
  4 VMs:  +6.7% throughput

PARSEC Dedup benchmark, simlarge input (memory-intensive):
  2 VMs: +47.1% throughput (IPI-heavy synchronization)
  3 VMs: +28.1% throughput
  4 VMs:  +1.7% throughput

PARSEC VIPS benchmark, simlarge input (compute-intensive):
  2 VMs: +26.2% throughput (balanced sync and compute)
  3 VMs: +12.7% throughput
  4 VMs:  +6.0% throughput

Analysis:

- Gains are most pronounced at moderate overcommit (2-3 VMs). At this level,
  contention is significant enough to benefit from better yield behavior,
  but context switch overhead remains manageable.

- Dedup shows the strongest improvement (+47.1% at 2 VMs) due to its
  IPI-heavy synchronization patterns. The IPI-aware directed yield
  precisely targets the bottleneck.

- At 4 VMs (heavier overcommit), gains diminish as general CPU contention
  dominates. However, performance never regresses, indicating that the
  mechanisms degrade gracefully.

- In some high-density, resource-overcommitted deployments, the benefits
  of APICv are constrained by scheduling and contention patterns.
  Software-based IPI tracking then serves as a complementary optimization
  path, supplying targeted scheduling hints without requiring APICv to be
  disabled. Which combination works best depends on workload
  characteristics and platform configuration.

- Dbench benefits primarily from the scheduler-side debooster, as its lock
  patterns involve less IPI spinning and more direct lock holder boosting.

The performance gains stem from three factors:

1. Lock holders receive sustained CPU time to complete critical sections,
   reducing overall lock hold duration and cascading contention.

2. IPI receivers are promptly scheduled when senders spin, minimizing IPI
   response latency and reducing wasted spin cycles.

3. Better cache utilization results from reduced context switching between
   lock waiters and holders.

Patch Organization
------------------

The series is organized for incremental review and bisectability:

Patches 1-5: Scheduler vCPU debooster

  Patch 1: Add infrastructure (per-rq tracking, sysctl, debugfs entry)
           Infrastructure is inert; no functional change.

  Patch 2: Add rate-limiting and validation helpers
           Static functions with comprehensive safety checks.

  Patch 3: Add cgroup LCA finder for hierarchical yield
           Implements CONFIG_FAIR_GROUP_SCHED-aware LCA location.

  Patch 4: Add penalty calculation and application logic
           Core algorithms with queue-size adaptation and debouncing.

  Patch 5: Wire up yield deboost in yield_to_task_fair()
           Activation patch. Includes Dbench performance data.

Patches 6-9: KVM IPI-aware directed yield

  Patch 6: Add IPI tracking infrastructure
           Per-vCPU context, module parameters, helper functions.
           Infrastructure is inert until activated.

  Patch 7: Integrate IPI tracking with LAPIC interrupt delivery
           Hook into kvm_irq_delivery_to_apic() and EOI handling.

  Patch 8: Implement IPI-aware directed yield candidate selection
           Replace candidate selection logic with priority-based approach.
           Includes PARSEC performance data.

  Patch 9: Add relaxed boost as safety net
           Two-round fallback mechanism for robustness.

Each patch compiles and boots independently. Performance data is presented
where the relevant mechanism becomes active (patches 5 and 8).

Testing
-------

Workloads tested:

- Dbench (filesystem metadata stress)
- PARSEC benchmarks (Dedup, VIPS, Ferret, Blackscholes)
- Kernel compilation (make -j16 in each VM)

No regressions observed on any configuration. The mechanisms show neutral
to positive impact across diverse workloads.

Future Work
-----------

Potential extensions beyond this series:

- Adaptive recency window: dynamically adjust ipi_window_ns based on
  observed workload patterns.

- Extended tracking: consider multi-round IPI patterns (A->B->C->A).

- Cross-NUMA awareness: penalty scaling based on NUMA distances.

These are intentionally deferred to keep this series focused and reviewable.

v1 -> v2:
- Rebase onto v6.19-rc1 (v1 was based on v6.18-rc4)
- Drop "KVM: Fix last_boosted_vcpu index assignment bug" patch as v6.19-rc1
  already contains this fix
- Scheduler debooster changes:
  * Adapt to v6.19's EEVDF forfeit behavior: yield_to_deboost() must be
    called BEFORE yield_task_fair() to preserve fairness gap calculation.
    In v6.19+, yield_task_fair() performs forfeit (se->vruntime =
    se->deadline), which would inflate the yielding entity's vruntime
    before penalty calculation, causing need=0 and only baseline penalty
    being applied.
  * Change from rq->curr to rq->donor for correct EEVDF donor tracking
  * Change from nr_queued to h_nr_queued for accurate hierarchical task
    counting in penalty cap calculation
  * Remove vlag assignment as it will be recalculated on dequeue/enqueue
    and modifying it for on-rq entity is incorrect
  * Remove update_min_vruntime() call: in EEVDF the yielding entity is
    always cfs_rq->curr (dequeued from RB-tree), so modifying its vruntime
    does not affect min_vruntime calculation
  * Remove unnecessary gran_floor safeguard (calc_delta_fair already
    handles edge cases correctly)
  * Rename debugfs entry from sched_vcpu_debooster_enabled to
    vcpu_debooster_enabled for consistency
- KVM IPI tracking changes:
  * Improve documentation for module parameters
  * Add kvm_vcpu_is_ipi_receiver() declaration to x86.h header
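
The forfeit-ordering note above can be illustrated with a minimal model
(hypothetical, heavily simplified EEVDF fields): once yield_task_fair()
forfeits by setting vruntime = deadline, the fairness gap the penalty is
based on collapses to zero, which is why yield_to_deboost() must run
first.

```c
#include <assert.h>
#include <stdint.h>

/* Toy entity with just the two fields relevant to the ordering issue. */
struct ent {
	uint64_t vruntime;
	uint64_t deadline;
};

/* Fairness gap: how far the yielding entity trails the target. */
static uint64_t fairness_need(const struct ent *yielding, const struct ent *target)
{
	return target->vruntime > yielding->vruntime ?
	       target->vruntime - yielding->vruntime : 0;
}

/* Models the v6.19 yield_task_fair() forfeit: vruntime jumps to deadline. */
static void forfeit(struct ent *se)
{
	se->vruntime = se->deadline;
}
```

Computing the gap before the forfeit yields a meaningful penalty; after
the forfeit the gap is zero and only the baseline penalty would apply.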

Wanpeng Li (9):
  sched: Add vCPU debooster infrastructure
  sched/fair: Add rate-limiting and validation helpers
  sched/fair: Add cgroup LCA finder for hierarchical yield
  sched/fair: Add penalty calculation and application logic
  sched/fair: Wire up yield deboost in yield_to_task_fair()
  KVM: x86: Add IPI tracking infrastructure
  KVM: x86/lapic: Integrate IPI tracking with interrupt delivery
  KVM: Implement IPI-aware directed yield candidate selection
  KVM: Relaxed boost as safety net

 arch/x86/include/asm/kvm_host.h |  12 ++
 arch/x86/kvm/lapic.c            | 166 ++++++++++++++++-
 arch/x86/kvm/x86.c              |   3 +
 arch/x86/kvm/x86.h              |   8 +
 include/linux/kvm_host.h        |   3 +
 kernel/sched/core.c             |   9 +-
 kernel/sched/debug.c            |   2 +
 kernel/sched/fair.c             | 305 ++++++++++++++++++++++++++++++++
 kernel/sched/sched.h            |  12 ++
 virt/kvm/kvm_main.c             |  74 +++++++-
 10 files changed, 579 insertions(+), 15 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/9] sched: Add vCPU debooster infrastructure
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
@ 2025-12-19  3:53 ` Wanpeng Li
  2025-12-19  3:53 ` [PATCH v2 2/9] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Wanpeng Li @ 2025-12-19  3:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

Introduce foundational infrastructure for the vCPU debooster mechanism
to improve yield_to() effectiveness in virtualization workloads.

Add per-rq tracking fields for rate limiting (yield_deboost_last_time_ns)
and debouncing (yield_deboost_last_src/dst_pid, last_pair_time_ns).
Introduce global sysctl knob sysctl_sched_vcpu_debooster_enabled for
runtime control, defaulting to enabled. Add debugfs interface for
observability and initialization in sched_init().

The infrastructure is inert at this stage as no deboost logic is
implemented yet, allowing independent verification that existing
behavior remains unchanged.

v1 -> v2:
- Rename debugfs entry from sched_vcpu_debooster_enabled to
  vcpu_debooster_enabled for consistency with other sched debugfs entries
- Add explicit initialization of yield_deboost_last_time_ns to 0 in
  sched_init() for clarity
- Improve comments to follow kernel documentation style

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/core.c  |  9 +++++++--
 kernel/sched/debug.c |  2 ++
 kernel/sched/fair.c  |  7 +++++++
 kernel/sched/sched.h | 12 ++++++++++++
 4 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41ba0be16911..9f0936b9c1c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8606,9 +8606,14 @@ void __init sched_init(void)
 #endif /* CONFIG_CGROUP_SCHED */
 
 	for_each_possible_cpu(i) {
-		struct rq *rq;
+		struct rq *rq = cpu_rq(i);
+
+		/* Initialize vCPU debooster per-rq state */
+		rq->yield_deboost_last_time_ns = 0;
+		rq->yield_deboost_last_src_pid = -1;
+		rq->yield_deboost_last_dst_pid = -1;
+		rq->yield_deboost_last_pair_time_ns = 0;
 
-		rq = cpu_rq(i);
 		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680..13e67617549d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -508,6 +508,8 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("tunable_scaling", 0644, debugfs_sched, NULL, &sched_scaling_fops);
 	debugfs_create_u32("migration_cost_ns", 0644, debugfs_sched, &sysctl_sched_migration_cost);
 	debugfs_create_u32("nr_migrate", 0644, debugfs_sched, &sysctl_sched_nr_migrate);
+	debugfs_create_u32("vcpu_debooster_enabled", 0644, debugfs_sched,
+			   &sysctl_sched_vcpu_debooster_enabled);
 
 	sched_domains_mutex_lock();
 	update_sched_domain_debugfs();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..87c30db2c853 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -81,6 +81,13 @@ static unsigned int normalized_sysctl_sched_base_slice	= 700000ULL;
 
 __read_mostly unsigned int sysctl_sched_migration_cost	= 500000UL;
 
+/*
+ * vCPU debooster: runtime toggle for yield_to() vruntime penalty mechanism.
+ * When enabled (default), yield_to() applies bounded vruntime penalties to
+ * improve lock holder scheduling in virtualized environments.
+ */
+unsigned int sysctl_sched_vcpu_debooster_enabled __read_mostly = 1;
+
 static int __init setup_sched_thermal_decay_shift(char *str)
 {
 	pr_warn("Ignoring the deprecated sched_thermal_decay_shift= option\n");
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..b7aa0d35c793 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1294,6 +1294,16 @@ struct rq {
 	unsigned int		push_busy;
 	struct cpu_stop_work	push_work;
 
+	/*
+	 * vCPU debooster: per-rq state for yield_to() optimization.
+	 * Used to rate-limit and debounce vruntime penalties applied
+	 * when a vCPU yields to a lock holder.
+	 */
+	u64			yield_deboost_last_time_ns;
+	pid_t			yield_deboost_last_src_pid;
+	pid_t			yield_deboost_last_dst_pid;
+	u64			yield_deboost_last_pair_time_ns;
+
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
 	struct rq		*core;
@@ -2958,6 +2968,8 @@ extern int sysctl_resched_latency_warn_once;
 
 extern unsigned int sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_sched_vcpu_debooster_enabled;
+
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
-- 
2.43.0



* [PATCH v2 2/9] sched/fair: Add rate-limiting and validation helpers
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
  2025-12-19  3:53 ` [PATCH v2 1/9] sched: Add vCPU debooster infrastructure Wanpeng Li
@ 2025-12-19  3:53 ` Wanpeng Li
  2025-12-22 21:12   ` kernel test robot
  2026-01-04  4:09   ` Hillf Danton
  2025-12-19  3:53 ` [PATCH v2 3/9] sched/fair: Add cgroup LCA finder for hierarchical yield Wanpeng Li
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 19+ messages in thread
From: Wanpeng Li @ 2025-12-19  3:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

Implement core safety mechanisms for yield deboost operations.

Add yield_deboost_rate_limit() for high-frequency gating to prevent
excessive overhead on compute-intensive workloads. The 6ms threshold
balances responsiveness with overhead reduction.

Add yield_deboost_validate_tasks() for comprehensive validation ensuring
both tasks are valid and distinct, both belong to fair_sched_class,
target is on the same runqueue, and tasks are runnable.

The rate limiter prevents pathological high-frequency cases while
validation ensures only appropriate task pairs proceed. Both functions
are static and will be integrated in subsequent patches.

v1 -> v2:
- Remove unnecessary READ_ONCE/WRITE_ONCE for per-rq fields accessed
  under rq->lock
- Change rq->clock to rq_clock(rq) helper for consistency
- Change yield_deboost_rate_limit() signature from (rq, now_ns) to (rq),
  obtaining time internally via rq_clock()
- Remove redundant sched_class check for p_yielding (already implied by
  rq->donor being fair)
- Simplify task_rq check to only verify p_target
- Change rq->curr to rq->donor for correct EEVDF donor tracking
- Move sysctl_sched_vcpu_debooster_enabled and NULL checks to caller
  (yield_to_deboost) for early exit before update_rq_clock()
- Simplify function signature by returning p_yielding directly instead
  of using output pointer parameters
- Add documentation explaining the 6ms rate limit threshold

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 87c30db2c853..2f327882bf4d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9040,6 +9040,68 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
 	}
 }
 
+/*
+ * Rate-limit yield deboost operations to prevent excessive overhead.
+ * Returns true if the operation should be skipped due to rate limiting.
+ *
+ * The 6ms threshold balances responsiveness with overhead reduction:
+ * - Short enough to allow timely yield boosting for lock contention
+ * - Long enough to prevent pathological high-frequency penalty application
+ *
+ * Called under rq->lock, so direct field access is safe.
+ */
+static bool yield_deboost_rate_limit(struct rq *rq)
+{
+	u64 now = rq_clock(rq);
+	u64 last = rq->yield_deboost_last_time_ns;
+
+	if (last && (now - last) <= 6 * NSEC_PER_MSEC)
+		return true;
+
+	rq->yield_deboost_last_time_ns = now;
+	return false;
+}
+
+/*
+ * Validate tasks for yield deboost operation.
+ * Returns the yielding task on success, NULL on validation failure.
+ *
+ * Checks: feature enabled, valid target, same runqueue, target is fair class,
+ * both on_rq. Called under rq->lock.
+ *
+ * Note: p_yielding (rq->donor) is guaranteed to be fair class by the caller
+ * (yield_to_task_fair is only called when curr->sched_class == p->sched_class).
+ */
+static struct task_struct __maybe_unused *
+yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target)
+{
+	struct task_struct *p_yielding;
+
+	if (!sysctl_sched_vcpu_debooster_enabled)
+		return NULL;
+
+	if (!p_target)
+		return NULL;
+
+	if (yield_deboost_rate_limit(rq))
+		return NULL;
+
+	p_yielding = rq->donor;
+	if (!p_yielding || p_yielding == p_target)
+		return NULL;
+
+	if (p_target->sched_class != &fair_sched_class)
+		return NULL;
+
+	if (task_rq(p_target) != rq)
+		return NULL;
+
+	if (!p_target->se.on_rq || !p_yielding->se.on_rq)
+		return NULL;
+
+	return p_yielding;
+}
+
 /*
  * sched_yield() is very simple
  */
-- 
2.43.0



* [PATCH v2 3/9] sched/fair: Add cgroup LCA finder for hierarchical yield
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
  2025-12-19  3:53 ` [PATCH v2 1/9] sched: Add vCPU debooster infrastructure Wanpeng Li
  2025-12-19  3:53 ` [PATCH v2 2/9] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
@ 2025-12-19  3:53 ` Wanpeng Li
  2025-12-19  3:53 ` [PATCH v2 4/9] sched/fair: Add penalty calculation and application logic Wanpeng Li
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Wanpeng Li @ 2025-12-19  3:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

Implement yield_deboost_find_lca() to locate the lowest common ancestor
(LCA) in the cgroup hierarchy for EEVDF-aware yield operations.

The LCA represents the appropriate hierarchy level where vruntime
adjustments should be applied to ensure fairness is maintained across
cgroup boundaries. This is critical for virtualization workloads where
vCPUs may be organized in nested cgroups.

Key aspects:
- For CONFIG_FAIR_GROUP_SCHED: Walk up both entity hierarchies by
  aligning depths, then ascending together until common cfs_rq found
- For flat hierarchy: Simply verify both entities share the same cfs_rq
- Validate that meaningful contention exists (h_nr_queued > 1)
- Ensure yielding entity has non-zero slice for safe penalty calculation

Function operates under rq->lock protection. Static helper integrated
in subsequent patches.

v1 -> v2:
- Change nr_queued to h_nr_queued for accurate hierarchical task
  counting that includes tasks in child cgroups
- Improve comments to clarify the LCA algorithm

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f327882bf4d..39dbdd222687 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9102,6 +9102,36 @@ yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target)
 	return p_yielding;
 }
 
+/*
+ * Find the lowest common ancestor (LCA) in the cgroup hierarchy.
+ * Uses find_matching_se() to locate sibling entities at the same level,
+ * then returns their common cfs_rq for vruntime adjustments.
+ *
+ * Returns true if a valid LCA with meaningful contention (h_nr_queued > 1)
+ * is found, storing the LCA entities and common cfs_rq in output parameters.
+ */
+static bool __maybe_unused
+yield_deboost_find_lca(struct sched_entity *se_y, struct sched_entity *se_t,
+		       struct sched_entity **se_y_lca_out,
+		       struct sched_entity **se_t_lca_out,
+		       struct cfs_rq **cfs_rq_out)
+{
+	struct sched_entity *se_y_lca = se_y;
+	struct sched_entity *se_t_lca = se_t;
+	struct cfs_rq *cfs_rq;
+
+	find_matching_se(&se_y_lca, &se_t_lca);
+
+	cfs_rq = cfs_rq_of(se_y_lca);
+	if (cfs_rq->h_nr_queued <= 1)
+		return false;
+
+	*se_y_lca_out = se_y_lca;
+	*se_t_lca_out = se_t_lca;
+	*cfs_rq_out = cfs_rq;
+	return true;
+}
+
 /*
  * sched_yield() is very simple
  */
-- 
2.43.0



* [PATCH v2 4/9] sched/fair: Add penalty calculation and application logic
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (2 preceding siblings ...)
  2025-12-19  3:53 ` [PATCH v2 3/9] sched/fair: Add cgroup LCA finder for hierarchical yield Wanpeng Li
@ 2025-12-19  3:53 ` Wanpeng Li
  2025-12-22 23:36   ` kernel test robot
  2025-12-19  3:53 ` [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 19+ messages in thread
From: Wanpeng Li @ 2025-12-19  3:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

Implement core penalty calculation and application mechanisms for
yield deboost operations.

yield_deboost_apply_debounce(): Reverse-pair debouncing prevents
ping-pong. When A->B then B->A within ~600us, penalty is downscaled.

yield_deboost_calculate_penalty(): Calculate vruntime penalty based on:
- Fairness gap (vruntime delta between yielding and target tasks)
- Scheduling granularity based on yielding entity's weight
- Queue-size-based caps (2 tasks: 6.0x gran, 3: 4.0x, 4-6: 2.5x,
  7-8: 2.0x, 9-12: 1.5x, >12: 1.0x)
- Special handling for zero gap with refined multipliers
- 10% weighting on positive gaps (alpha=1.10)

yield_deboost_apply_penalty(): Apply calculated penalty to EEVDF
state, updating vruntime and deadline atomically.

The penalty mechanism provides sustained scheduling preference beyond
the transient buddy hint, critical for lock holder boosting in
virtualized environments.

v1 -> v2:
- Change nr_queued to h_nr_queued for accurate hierarchical task
  counting in penalty cap calculation
- Remove vlag assignment as it will be recalculated on dequeue/enqueue
  and modifying it for on-rq entity is incorrect
- Remove update_min_vruntime() call: in EEVDF the yielding entity is
  always cfs_rq->curr (dequeued from RB-tree), so modifying its vruntime
  does not affect min_vruntime calculation
- Remove unnecessary gran_floor safeguard (calc_delta_fair already
  handles edge cases correctly)
- Change rq->curr to rq->donor for correct EEVDF donor tracking
- Simplify debounce function signature

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c | 155 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 155 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39dbdd222687..8738cfc3109c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9132,6 +9132,161 @@ yield_deboost_find_lca(struct sched_entity *se_y, struct sched_entity *se_t,
 	return true;
 }
 
+/*
+ * Apply debounce for reverse yield pairs to reduce ping-pong effects.
+ * When A yields to B, then B yields back to A within ~600us, downscale
+ * the penalty to prevent oscillation.
+ *
+ * The 600us threshold is chosen to be:
+ * - Long enough to catch rapid back-and-forth yields
+ * - Short enough to not affect legitimate sequential yields
+ *
+ * Returns the (possibly reduced) penalty value.
+ */
+static u64 yield_deboost_apply_debounce(struct rq *rq, struct task_struct *p_target,
+					u64 penalty, u64 need, u64 gran)
+{
+	u64 now = rq_clock(rq);
+	struct task_struct *p_yielding = rq->donor;
+	pid_t src_pid, dst_pid;
+	pid_t last_src, last_dst;
+	u64 last_ns;
+
+	if (!p_yielding || !p_target)
+		return penalty;
+
+	src_pid = p_yielding->pid;
+	dst_pid = p_target->pid;
+	last_src = rq->yield_deboost_last_src_pid;
+	last_dst = rq->yield_deboost_last_dst_pid;
+	last_ns = rq->yield_deboost_last_pair_time_ns;
+
+	/* Detect reverse pair: previous was target->source */
+	if (last_src == dst_pid && last_dst == src_pid &&
+	    (now - last_ns) <= 600 * NSEC_PER_USEC) {
+		u64 alt = max(need, gran);
+
+		if (penalty > alt)
+			penalty = alt;
+	}
+
+	/* Update tracking state */
+	rq->yield_deboost_last_src_pid = src_pid;
+	rq->yield_deboost_last_dst_pid = dst_pid;
+	rq->yield_deboost_last_pair_time_ns = now;
+
+	return penalty;
+}
+
+/*
+ * Calculate vruntime penalty for yield deboost.
+ *
+ * The penalty is based on:
+ * - Fairness gap: vruntime difference between yielding and target tasks
+ * - Scheduling granularity: base unit for penalty calculation
+ * - Queue size: adaptive caps to prevent starvation in larger queues
+ *
+ * Queue-size-based caps (multiplier of granularity):
+ *   2 tasks:  6.0x - Strongest push for 2-task ping-pong scenarios
+ *   3 tasks:  4.0x
+ *   4-6:      2.5x
+ *   7-8:      2.0x
+ *   9-12:     1.5x
+ *   >12:      1.0x - Minimal push to avoid starvation
+ *
+ * Returns the calculated penalty value.
+ */
+static u64 __maybe_unused
+yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
+				struct sched_entity *se_t_lca,
+				struct task_struct *p_target, int h_nr_queued)
+{
+	u64 gran, need, penalty, maxp;
+	u64 weighted_need, base;
+
+	gran = calc_delta_fair(sysctl_sched_base_slice, se_y_lca);
+
+	/* Calculate fairness gap */
+	need = 0;
+	if (se_t_lca->vruntime > se_y_lca->vruntime)
+		need = se_t_lca->vruntime - se_y_lca->vruntime;
+
+	/* Base penalty is granularity plus 110% of fairness gap */
+	penalty = gran;
+	if (need) {
+		weighted_need = need + need / 10;
+		if (weighted_need > U64_MAX - penalty)
+			weighted_need = U64_MAX - penalty;
+		penalty += weighted_need;
+	}
+
+	/* Apply debounce to reduce ping-pong */
+	penalty = yield_deboost_apply_debounce(rq, p_target, penalty, need, gran);
+
+	/* Queue-size-based upper bound */
+	if (h_nr_queued == 2)
+		maxp = gran * 6;
+	else if (h_nr_queued == 3)
+		maxp = gran * 4;
+	else if (h_nr_queued <= 6)
+		maxp = (gran * 5) / 2;
+	else if (h_nr_queued <= 8)
+		maxp = gran * 2;
+	else if (h_nr_queued <= 12)
+		maxp = (gran * 3) / 2;
+	else
+		maxp = gran;
+
+	penalty = clamp(penalty, gran, maxp);
+
+	/* Baseline push when no fairness gap exists */
+	if (need == 0) {
+		if (h_nr_queued == 3)
+			base = (gran * 15) / 16;
+		else if (h_nr_queued >= 4 && h_nr_queued <= 6)
+			base = (gran * 5) / 8;
+		else if (h_nr_queued >= 7 && h_nr_queued <= 8)
+			base = gran / 2;
+		else if (h_nr_queued >= 9 && h_nr_queued <= 12)
+			base = (gran * 3) / 8;
+		else if (h_nr_queued > 12)
+			base = gran / 4;
+		else
+			base = gran;
+
+		if (penalty < base)
+			penalty = base;
+	}
+
+	return penalty;
+}
+
+/*
+ * Apply vruntime penalty and update EEVDF fields for consistency.
+ * Updates vruntime and deadline; vlag is not modified as it will be
+ * recalculated when the entity is dequeued/enqueued.
+ *
+ * Caller must call update_curr(cfs_rq) before invoking this function
+ * to ensure accounting is up-to-date before modifying vruntime.
+ */
+static void __maybe_unused
+yield_deboost_apply_penalty(struct sched_entity *se_y_lca,
+			    struct cfs_rq *cfs_rq, u64 penalty)
+{
+	u64 new_vruntime;
+
+	/* Overflow protection */
+	if (se_y_lca->vruntime > U64_MAX - penalty)
+		return;
+
+	new_vruntime = se_y_lca->vruntime + penalty;
+	if (new_vruntime <= se_y_lca->vruntime)
+		return;
+
+	se_y_lca->vruntime = new_vruntime;
+	se_y_lca->deadline = new_vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
+}
+
 /*
  * sched_yield() is very simple
  */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair()
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (3 preceding siblings ...)
  2025-12-19  3:53 ` [PATCH v2 4/9] sched/fair: Add penalty calculation and application logic Wanpeng Li
@ 2025-12-19  3:53 ` Wanpeng Li
  2025-12-22  7:06   ` kernel test robot
  2025-12-22  9:31   ` kernel test robot
  2025-12-19  3:53 ` [PATCH v2 6/9] KVM: x86: Add IPI tracking infrastructure Wanpeng Li
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 19+ messages in thread
From: Wanpeng Li @ 2025-12-19  3:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

Integrate yield_to_deboost() into yield_to_task_fair() to activate the
vCPU debooster mechanism.

The integration works in concert with the existing buddy mechanism:
set_next_buddy() provides immediate preference, yield_to_deboost()
applies bounded vruntime penalty based on the fairness gap, and
yield_task_fair() completes the standard yield path including the
EEVDF forfeit operation.

Note: yield_to_deboost() must be called BEFORE yield_task_fair()
because v6.19+ kernels perform forfeit (se->vruntime = se->deadline)
in yield_task_fair(). If deboost runs after forfeit, the fairness
gap calculation would see the already-inflated vruntime, resulting
in need=0 and only baseline penalty being applied.
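The ordering constraint can be seen with a minimal userspace model (the
values are illustrative; 'forfeit' mimics the v6.19+
se->vruntime = se->deadline step in yield_task_fair()):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of a sched entity's EEVDF timeline state. */
struct entity {
	uint64_t vruntime;
	uint64_t deadline;
};

/* v6.19+ yield_task_fair() forfeit: jump vruntime to the deadline. */
static void forfeit(struct entity *se)
{
	se->vruntime = se->deadline;
}

/* Fairness gap as used by the penalty calculation: how far the target's
 * vruntime is ahead of the yielding entity's (0 if not ahead). */
static uint64_t fairness_gap(const struct entity *yielding,
			     const struct entity *target)
{
	if (target->vruntime > yielding->vruntime)
		return target->vruntime - yielding->vruntime;
	return 0;
}
```

Running deboost first sees the real gap; running it after forfeit sees
the already-inflated vruntime and computes need = 0.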

Performance testing (16 pCPUs host, 16 vCPUs/VM):

Dbench 16 clients per VM:
  2 VMs: +14.4% throughput
  3 VMs:  +9.8% throughput
  4 VMs:  +6.7% throughput

Gains stem from sustained lock holder preference reducing ping-pong
between yielding vCPUs and lock holders. The gains are most pronounced
at moderate overcommit, where contention reduction outweighs
context-switch cost.

v1 -> v2:
- Move sysctl_sched_vcpu_debooster_enabled check to yield_to_deboost()
  entry point for early exit before update_rq_clock()
- Restore conditional update_curr() check (se_y_lca != cfs_rq->curr)
  to avoid unnecessary accounting updates
- Keep yield_task_fair() unchanged (no for_each_sched_entity loop)
  to avoid double-penalizing the yielding task
- Move yield_to_deboost() BEFORE yield_task_fair() to preserve fairness
  gap calculation (v6.19+ forfeit would otherwise inflate vruntime
  before penalty calculation)
- Improve function documentation

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c | 67 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 59 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8738cfc3109c..9e0991f0c618 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9066,23 +9066,19 @@ static bool yield_deboost_rate_limit(struct rq *rq)
  * Validate tasks for yield deboost operation.
  * Returns the yielding task on success, NULL on validation failure.
  *
- * Checks: feature enabled, valid target, same runqueue, target is fair class,
- * both on_rq. Called under rq->lock.
+ * Checks: valid target, same runqueue, target is fair class,
+ * both on_rq, rate limiting. Called under rq->lock.
  *
  * Note: p_yielding (rq->donor) is guaranteed to be fair class by the caller
  * (yield_to_task_fair is only called when curr->sched_class == p->sched_class).
+ * Note: sysctl_sched_vcpu_debooster_enabled is checked by caller before
+ * update_rq_clock() to avoid unnecessary clock updates.
  */
 static struct task_struct __maybe_unused *
 yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target)
 {
 	struct task_struct *p_yielding;
 
-	if (!sysctl_sched_vcpu_debooster_enabled)
-		return NULL;
-
-	if (!p_target)
-		return NULL;
-
 	if (yield_deboost_rate_limit(rq))
 		return NULL;
 
@@ -9287,6 +9283,57 @@ yield_deboost_apply_penalty(struct sched_entity *se_y_lca,
 	se_y_lca->deadline = new_vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
 }
 
+/*
+ * yield_to_deboost - Apply vruntime penalty to favor the target task
+ * @rq: runqueue containing both tasks (rq->lock must be held)
+ * @p_target: task to favor in scheduling
+ *
+ * Cooperates with yield_to_task_fair(): set_next_buddy() provides immediate
+ * preference; this routine applies a bounded vruntime penalty at the cgroup
+ * LCA so the target maintains scheduling advantage beyond the buddy effect.
+ *
+ * Only operates on tasks resident on the same rq. Penalty is bounded by
+ * granularity and queue-size caps to prevent starvation.
+ */
+static void yield_to_deboost(struct rq *rq, struct task_struct *p_target)
+{
+	struct task_struct *p_yielding;
+	struct sched_entity *se_y, *se_t, *se_y_lca, *se_t_lca;
+	struct cfs_rq *cfs_rq_common;
+	u64 penalty;
+
+	/* Quick validation before updating clock */
+	if (!sysctl_sched_vcpu_debooster_enabled)
+		return;
+
+	if (!p_target)
+		return;
+
+	/* Update clock - rate limiting and debounce use rq_clock() */
+	update_rq_clock(rq);
+
+	/* Full validation including rate limiting */
+	p_yielding = yield_deboost_validate_tasks(rq, p_target);
+	if (!p_yielding)
+		return;
+
+	se_y = &p_yielding->se;
+	se_t = &p_target->se;
+
+	/* Find LCA in cgroup hierarchy */
+	if (!yield_deboost_find_lca(se_y, se_t, &se_y_lca, &se_t_lca, &cfs_rq_common))
+		return;
+
+	/* Update current accounting before modifying vruntime */
+	if (se_y_lca != cfs_rq_common->curr)
+		update_curr(cfs_rq_common);
+
+	/* Calculate and apply penalty */
+	penalty = yield_deboost_calculate_penalty(rq, se_y_lca, se_t_lca,
+						  p_target, cfs_rq_common->h_nr_queued);
+	yield_deboost_apply_penalty(se_y_lca, cfs_rq_common, penalty);
+}
+
 /*
  * sched_yield() is very simple
  */
@@ -9341,6 +9388,10 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 	/* Tell the scheduler that we'd really like se to run next. */
 	set_next_buddy(se);
 
+	/* Apply deboost BEFORE forfeit to preserve fairness gap calculation */
+	yield_to_deboost(rq, p);
+
+	/* Complete the standard yield path (includes forfeit in v6.19+) */
 	yield_task_fair(rq);
 
 	return true;
-- 
2.43.0



* [PATCH v2 6/9] KVM: x86: Add IPI tracking infrastructure
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (4 preceding siblings ...)
  2025-12-19  3:53 ` [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
@ 2025-12-19  3:53 ` Wanpeng Li
  2025-12-19  3:53 ` [PATCH v2 7/9] KVM: x86/lapic: Integrate IPI tracking with interrupt delivery Wanpeng Li
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Wanpeng Li @ 2025-12-19  3:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

Add foundational infrastructure for tracking IPI sender/receiver
relationships to improve directed yield candidate selection.

Introduce per-vCPU ipi_context structure containing:
- last_ipi_sender: vCPU index that sent the last IPI to this vCPU
- last_ipi_receiver: vCPU index that received the last IPI from this vCPU
- pending_ipi: flag indicating an unacknowledged IPI
- ipi_time_ns: timestamp of the last IPI send

Add module parameters for runtime control:
- ipi_tracking_enabled (default: true): master switch for IPI tracking
- ipi_window_ns (default: 50ms): recency window for IPI validity

Implement helper functions:
- kvm_track_ipi_communication(): record a unicast IPI sender/receiver pair
- kvm_vcpu_is_ipi_receiver(): determine if a vCPU is a recent IPI target
- kvm_vcpu_clear_ipi_context()/kvm_vcpu_reset_ipi_context(): clear state
  on EOI, reset, creation and destruction

The infrastructure is inert until integrated with interrupt delivery
in subsequent patches.
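A userspace sketch of the recency-window check (field names mirror the
patch; a plain parameter replaces ktime_get_mono_fast_ns(), and the
module-parameter window is a fixed constant here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IPI_WINDOW_NS (50ULL * 1000 * 1000)	/* default 50ms window */

/* Userspace model of the per-vCPU ipi_context from the patch. */
struct ipi_context {
	int last_ipi_sender;
	int last_ipi_receiver;
	bool pending_ipi;
	uint64_t ipi_time_ns;
};

/* Model of kvm_vcpu_is_ipi_receiver(): the target qualifies only if the
 * sender has an unacknowledged IPI to exactly this vCPU index and the
 * send happened within the recency window. */
static bool is_recent_ipi_receiver(const struct ipi_context *sender,
				   int receiver_idx, uint64_t now_ns)
{
	return sender->pending_ipi &&
	       sender->last_ipi_receiver == receiver_idx &&
	       now_ns - sender->ipi_time_ns <= IPI_WINDOW_NS;
}
```

All three conditions must hold: an expired window, a different receiver
index, or an already-acknowledged IPI each disqualify the target.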

v1 -> v2:
- Improve documentation for module parameters explaining the 50ms
  window rationale
- Add kvm_vcpu_is_ipi_receiver() declaration to x86.h header
- Add weak function annotation comment in kvm_host.h

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/include/asm/kvm_host.h | 12 ++++++
 arch/x86/kvm/lapic.c            | 76 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |  3 ++
 arch/x86/kvm/x86.h              |  8 ++++
 include/linux/kvm_host.h        |  3 ++
 virt/kvm/kvm_main.c             |  6 +++
 6 files changed, 108 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5a3bfa293e8b..2464c310f0a2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1052,6 +1052,18 @@ struct kvm_vcpu_arch {
 	int pending_external_vector;
 	int highest_stale_pending_ioapic_eoi;
 
+	/*
+	 * IPI tracking for directed yield optimization.
+	 * Records sender/receiver relationships when IPIs are delivered
+	 * to enable IPI-aware vCPU scheduling decisions.
+	 */
+	struct {
+		int last_ipi_sender;	/* vCPU index of last IPI sender */
+		int last_ipi_receiver;	/* vCPU index of last IPI receiver */
+		bool pending_ipi;	/* Awaiting IPI response */
+		u64 ipi_time_ns;	/* Timestamp when IPI was sent */
+	} ipi_context;
+
 	/* be preempted when it's in kernel-mode(cpl=0) */
 	bool preempted_in_kernel;
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 1597dd0b0cc6..23f247a3b127 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -75,6 +75,19 @@ module_param(lapic_timer_advance, bool, 0444);
 /* step-by-step approximation to mitigate fluctuation */
 #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
 
+/*
+ * IPI tracking for directed yield optimization.
+ * - ipi_tracking_enabled: global toggle (default on)
+ * - ipi_window_ns: recency window for IPI validity (default 50ms)
+ *   The 50ms window is chosen to be long enough to capture IPI response
+ *   patterns while short enough to avoid stale information affecting
+ *   scheduling decisions in throughput-sensitive workloads.
+ */
+static bool ipi_tracking_enabled = true;
+static unsigned long ipi_window_ns = 50 * NSEC_PER_MSEC;
+module_param(ipi_tracking_enabled, bool, 0644);
+module_param(ipi_window_ns, ulong, 0644);
+
 static bool __read_mostly vector_hashing_enabled = true;
 module_param_named(vector_hashing, vector_hashing_enabled, bool, 0444);
 
@@ -1113,6 +1126,69 @@ static int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
 }
 
+/*
+ * Track IPI communication for directed yield optimization.
+ * Records sender/receiver relationship when a unicast IPI is delivered.
+ * Only tracks when a unique receiver exists; ignores self-IPI.
+ */
+void kvm_track_ipi_communication(struct kvm_vcpu *sender, struct kvm_vcpu *receiver)
+{
+	if (!sender || !receiver || sender == receiver)
+		return;
+	if (unlikely(!READ_ONCE(ipi_tracking_enabled)))
+		return;
+
+	WRITE_ONCE(sender->arch.ipi_context.last_ipi_receiver, receiver->vcpu_idx);
+	WRITE_ONCE(sender->arch.ipi_context.pending_ipi, true);
+	WRITE_ONCE(sender->arch.ipi_context.ipi_time_ns, ktime_get_mono_fast_ns());
+
+	WRITE_ONCE(receiver->arch.ipi_context.last_ipi_sender, sender->vcpu_idx);
+}
+
+/*
+ * Check if 'receiver' is the recent IPI target of 'sender'.
+ *
+ * Rationale:
+ * - Use a short window to avoid stale IPI inflating boost priority
+ *   on throughput-sensitive workloads.
+ */
+bool kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender, struct kvm_vcpu *receiver)
+{
+	u64 then, now;
+
+	if (unlikely(!READ_ONCE(ipi_tracking_enabled)))
+		return false;
+
+	then = READ_ONCE(sender->arch.ipi_context.ipi_time_ns);
+	now = ktime_get_mono_fast_ns();
+	if (READ_ONCE(sender->arch.ipi_context.pending_ipi) &&
+	    READ_ONCE(sender->arch.ipi_context.last_ipi_receiver) ==
+	    receiver->vcpu_idx &&
+	    now - then <= ipi_window_ns)
+		return true;
+
+	return false;
+}
+
+/*
+ * Clear IPI context for a vCPU (e.g., on EOI or reset).
+ */
+void kvm_vcpu_clear_ipi_context(struct kvm_vcpu *vcpu)
+{
+	WRITE_ONCE(vcpu->arch.ipi_context.pending_ipi, false);
+	WRITE_ONCE(vcpu->arch.ipi_context.last_ipi_sender, -1);
+	WRITE_ONCE(vcpu->arch.ipi_context.last_ipi_receiver, -1);
+}
+
+/*
+ * Reset IPI context completely (e.g., on vCPU creation/destruction).
+ */
+void kvm_vcpu_reset_ipi_context(struct kvm_vcpu *vcpu)
+{
+	kvm_vcpu_clear_ipi_context(vcpu);
+	WRITE_ONCE(vcpu->arch.ipi_context.ipi_time_ns, 0);
+}
+
 /* Return true if the interrupt can be handled by using *bitmap as index mask
  * for valid destinations in *dst array.
  * Return false if kvm_apic_map_get_dest_lapic did nothing useful.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c6d899d53dd..d4c401ef04ca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12728,6 +12728,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 		goto free_guest_fpu;
 
 	kvm_xen_init_vcpu(vcpu);
+	kvm_vcpu_reset_ipi_context(vcpu);
 	vcpu_load(vcpu);
 	kvm_vcpu_after_set_cpuid(vcpu);
 	kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.default_tsc_khz);
@@ -12795,6 +12796,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 	kvm_mmu_destroy(vcpu);
 	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 	free_page((unsigned long)vcpu->arch.pio_data);
+	kvm_vcpu_reset_ipi_context(vcpu);
 	kvfree(vcpu->arch.cpuid_entries);
 }
 
@@ -12871,6 +12873,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 		kvm_leave_nested(vcpu);
 
 	kvm_lapic_reset(vcpu, init_event);
+	kvm_vcpu_clear_ipi_context(vcpu);
 
 	WARN_ON_ONCE(is_guest_mode(vcpu) || is_smm(vcpu));
 	vcpu->arch.hflags = 0;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index fdab0ad49098..cfc24fb207e0 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -466,6 +466,14 @@ fastpath_t handle_fastpath_wrmsr_imm(struct kvm_vcpu *vcpu, u32 msr, int reg);
 fastpath_t handle_fastpath_hlt(struct kvm_vcpu *vcpu);
 fastpath_t handle_fastpath_invd(struct kvm_vcpu *vcpu);
 
+/* IPI tracking helpers for directed yield */
+void kvm_track_ipi_communication(struct kvm_vcpu *sender,
+				 struct kvm_vcpu *receiver);
+bool kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender,
+			      struct kvm_vcpu *receiver);
+void kvm_vcpu_clear_ipi_context(struct kvm_vcpu *vcpu);
+void kvm_vcpu_reset_ipi_context(struct kvm_vcpu *vcpu);
+
 extern struct kvm_caps kvm_caps;
 extern struct kvm_host_values kvm_host;
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d93f75b05ae2..f42315d341b3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1535,6 +1535,9 @@ static inline void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 int kvm_vcpu_yield_to(struct kvm_vcpu *target);
 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool yield_to_kernel_mode);
 
+/* Weak function, overridden by arch/x86/kvm for IPI-aware directed yield */
+bool kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender, struct kvm_vcpu *receiver);
+
 void kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages);
 void kvm_flush_remote_tlbs_memslot(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5fcd401a5897..ff771a872c6d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3964,6 +3964,12 @@ bool __weak kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu)
 	return false;
 }
 
+bool __weak kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender,
+				     struct kvm_vcpu *receiver)
+{
+	return false;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
 	int nr_vcpus, start, i, idx, yielded;
-- 
2.43.0



* [PATCH v2 7/9] KVM: x86/lapic: Integrate IPI tracking with interrupt delivery
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (5 preceding siblings ...)
  2025-12-19  3:53 ` [PATCH v2 6/9] KVM: x86: Add IPI tracking infrastructure Wanpeng Li
@ 2025-12-19  3:53 ` Wanpeng Li
  2025-12-19  3:53 ` [PATCH v2 8/9] KVM: Implement IPI-aware directed yield candidate selection Wanpeng Li
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Wanpeng Li @ 2025-12-19  3:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

Hook IPI tracking into the LAPIC interrupt delivery path to capture
sender/receiver relationships for directed yield optimization.

Call kvm_track_ipi_communication() from kvm_irq_delivery_to_apic() and
kvm_irq_delivery_to_apic_fast() when a unicast fixed IPI is detected
(exactly one destination). Record sender vCPU index, receiver vCPU
index, and timestamp using lockless WRITE_ONCE for minimal overhead.

Implement kvm_clear_ipi_on_eoi(), called from
kvm_apic_set_eoi_accelerated() and the APIC_EOI register write path, to
clear IPI context when interrupts are acknowledged. Use two-stage
clearing:
1. Unconditionally clear the receiver's context (it processed the IPI)
2. Conditionally clear the sender's pending flag only when the sender
   exists, last_ipi_receiver matches, and the IPI is recent

Use lockless accessors for minimal overhead. The tracking only
activates for unicast fixed IPIs where directed yield provides value.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/lapic.c | 90 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 86 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 23f247a3b127..d4fb6f49390b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1270,6 +1270,9 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
 	struct kvm_lapic **dst = NULL;
 	int i;
 	bool ret;
+	int targets = 0;
+	int delivered;
+	struct kvm_vcpu *unique = NULL;
 
 	*r = -1;
 
@@ -1291,8 +1294,22 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
 		for_each_set_bit(i, &bitmap, 16) {
 			if (!dst[i])
 				continue;
-			*r += kvm_apic_set_irq(dst[i]->vcpu, irq, dest_map);
+			delivered = kvm_apic_set_irq(dst[i]->vcpu, irq, dest_map);
+			*r += delivered;
+			if (delivered > 0) {
+				targets++;
+				unique = dst[i]->vcpu;
+			}
 		}
+
+		/*
+		 * Track IPI for directed yield: only for LAPIC-originated
+		 * APIC_DM_FIXED without shorthand, with exactly one recipient.
+		 */
+		if (src && irq->delivery_mode == APIC_DM_FIXED &&
+		    irq->shorthand == APIC_DEST_NOSHORT &&
+		    targets == 1 && unique && unique != src->vcpu)
+			kvm_track_ipi_communication(src->vcpu, unique);
 	}
 
 	rcu_read_unlock();
@@ -1377,6 +1394,9 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	struct kvm_vcpu *vcpu, *lowest = NULL;
 	unsigned long i, dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
 	unsigned int dest_vcpus = 0;
+	int targets = 0;
+	int delivered;
+	struct kvm_vcpu *unique = NULL;
 
 	if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r, dest_map))
 		return r;
@@ -1400,7 +1420,12 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		if (!kvm_lowest_prio_delivery(irq)) {
 			if (r < 0)
 				r = 0;
-			r += kvm_apic_set_irq(vcpu, irq, dest_map);
+			delivered = kvm_apic_set_irq(vcpu, irq, dest_map);
+			r += delivered;
+			if (delivered > 0) {
+				targets++;
+				unique = vcpu;
+			}
 		} else if (kvm_apic_sw_enabled(vcpu->arch.apic)) {
 			if (!vector_hashing_enabled) {
 				if (!lowest)
@@ -1421,8 +1446,23 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		lowest = kvm_get_vcpu(kvm, idx);
 	}
 
-	if (lowest)
-		r = kvm_apic_set_irq(lowest, irq, dest_map);
+	if (lowest) {
+		delivered = kvm_apic_set_irq(lowest, irq, dest_map);
+		r = delivered;
+		if (delivered > 0) {
+			targets = 1;
+			unique = lowest;
+		}
+	}
+
+	/*
+	 * Track IPI for directed yield: only for LAPIC-originated
+	 * APIC_DM_FIXED without shorthand, with exactly one recipient.
+	 */
+	if (src && irq->delivery_mode == APIC_DM_FIXED &&
+	    irq->shorthand == APIC_DEST_NOSHORT &&
+	    targets == 1 && unique && unique != src->vcpu)
+		kvm_track_ipi_communication(src->vcpu, unique);
 
 	return r;
 }
@@ -1608,6 +1648,45 @@ static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
 #endif
 }
 
+/*
+ * Clear IPI context on EOI to prevent stale boost decisions.
+ *
+ * Two-stage cleanup:
+ * 1. Always clear receiver's IPI context (it processed the interrupt)
+ * 2. Conditionally clear sender's pending flag only when:
+ *    - Sender vCPU exists and is valid
+ *    - Sender's last_ipi_receiver matches this receiver
+ *    - IPI was sent recently (within window)
+ */
+static void kvm_clear_ipi_on_eoi(struct kvm_lapic *apic)
+{
+	struct kvm_vcpu *receiver = apic->vcpu;
+	int sender_idx;
+	u64 then, now;
+
+	if (unlikely(!READ_ONCE(ipi_tracking_enabled)))
+		return;
+
+	sender_idx = READ_ONCE(receiver->arch.ipi_context.last_ipi_sender);
+
+	/* Step 1: Always clear receiver's IPI context */
+	kvm_vcpu_clear_ipi_context(receiver);
+
+	/* Step 2: Conditionally clear sender's pending flag */
+	if (sender_idx >= 0) {
+		struct kvm_vcpu *sender = kvm_get_vcpu(receiver->kvm, sender_idx);
+
+		if (sender &&
+		    READ_ONCE(sender->arch.ipi_context.last_ipi_receiver) ==
+		    receiver->vcpu_idx) {
+			then = READ_ONCE(sender->arch.ipi_context.ipi_time_ns);
+			now = ktime_get_mono_fast_ns();
+			if (now - then <= ipi_window_ns)
+				WRITE_ONCE(sender->arch.ipi_context.pending_ipi, false);
+		}
+	}
+}
+
 static int apic_set_eoi(struct kvm_lapic *apic)
 {
 	int vector = apic_find_highest_isr(apic);
@@ -1643,6 +1722,7 @@ void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
 	trace_kvm_eoi(apic, vector);
 
 	kvm_ioapic_send_eoi(apic, vector);
+	kvm_clear_ipi_on_eoi(apic);
 	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_set_eoi_accelerated);
@@ -2453,6 +2533,8 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 
 	case APIC_EOI:
 		apic_set_eoi(apic);
+		/* Precise cleanup for IPI-aware boost */
+		kvm_clear_ipi_on_eoi(apic);
 		break;
 
 	case APIC_LDR:
-- 
2.43.0



* [PATCH v2 8/9] KVM: Implement IPI-aware directed yield candidate selection
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (6 preceding siblings ...)
  2025-12-19  3:53 ` [PATCH v2 7/9] KVM: x86/lapic: Integrate IPI tracking with interrupt delivery Wanpeng Li
@ 2025-12-19  3:53 ` Wanpeng Li
  2025-12-19  3:53 ` [PATCH v2 9/9] KVM: Relaxed boost as safety net Wanpeng Li
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Wanpeng Li @ 2025-12-19  3:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

Integrate IPI tracking with directed yield to improve scheduling when
vCPUs spin waiting for IPI responses.

Implement priority-based candidate selection in kvm_vcpu_on_spin()
with three tiers:

Priority 1: Use kvm_vcpu_is_ipi_receiver() to identify confirmed IPI
targets within the recency window, addressing senders that spin
waiting for IPI acknowledgment.

Priority 2: Leverage existing kvm_arch_dy_has_pending_interrupt() for
compatibility with arch-specific fast paths.

Priority 3: Fall back to conventional preemption-based logic when
yield_to_kernel_mode is requested, providing a safety net for non-IPI
scenarios.

Add kvm_vcpu_is_good_yield_candidate() helper to consolidate these
checks, preventing over-aggressive boosting while enabling targeted
optimization when IPI patterns are detected.

Performance testing (16 pCPUs host, 16 vCPUs/VM):

Dedup (simlarge):
  2 VMs: +47.1% throughput
  3 VMs: +28.1% throughput
  4 VMs:  +1.7% throughput

VIPS (simlarge):
  2 VMs: +26.2% throughput
  3 VMs: +12.7% throughput
  4 VMs:  +6.0% throughput

Gains stem from effective directed yield when vCPUs spin on IPI
delivery, reducing synchronization overhead. The improvement is most
pronounced at moderate overcommit (2-3 VMs) where contention reduction
outweighs context switching cost.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 virt/kvm/kvm_main.c | 46 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 37 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff771a872c6d..45ede950314b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3970,6 +3970,41 @@ bool __weak kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender,
 	return false;
 }
 
+/*
+ * IPI-aware candidate selection for directed yield.
+ *
+ * Priority order:
+ *  1) Confirmed IPI receiver of 'me' within recency window (always boost)
+ *  2) Arch-provided fast pending interrupt hint (user-mode boost)
+ *  3) Kernel-mode yield: preempted-in-kernel vCPU (traditional boost)
+ *  4) Otherwise, be conservative and skip
+ */
+static bool kvm_vcpu_is_good_yield_candidate(struct kvm_vcpu *me,
+					     struct kvm_vcpu *vcpu,
+					     bool yield_to_kernel_mode)
+{
+	/* Priority 1: recently targeted IPI receiver */
+	if (kvm_vcpu_is_ipi_receiver(me, vcpu))
+		return true;
+
+	/* Priority 2: fast pending-interrupt hint (arch-specific) */
+	if (kvm_arch_dy_has_pending_interrupt(vcpu))
+		return true;
+
+	/*
+	 * Minimal preempted gate for remaining cases:
+	 * Require that the target has been preempted, and if yielding to
+	 * kernel mode, additionally require preempted-in-kernel.
+	 */
+	if (!READ_ONCE(vcpu->preempted))
+		return false;
+
+	if (yield_to_kernel_mode && !kvm_arch_vcpu_preempted_in_kernel(vcpu))
+		return false;
+
+	return true;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
 	int nr_vcpus, start, i, idx, yielded;
@@ -4017,15 +4052,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 		if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
 			continue;
 
-		/*
-		 * Treat the target vCPU as being in-kernel if it has a pending
-		 * interrupt, as the vCPU trying to yield may be spinning
-		 * waiting on IPI delivery, i.e. the target vCPU is in-kernel
-		 * for the purposes of directed yield.
-		 */
-		if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
-		    !kvm_arch_dy_has_pending_interrupt(vcpu) &&
-		    !kvm_arch_vcpu_preempted_in_kernel(vcpu))
+		/* IPI-aware candidate selection */
+		if (!kvm_vcpu_is_good_yield_candidate(me, vcpu, yield_to_kernel_mode))
 			continue;
 
 		if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
-- 
2.43.0



* [PATCH v2 9/9] KVM: Relaxed boost as safety net
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (7 preceding siblings ...)
  2025-12-19  3:53 ` [PATCH v2 8/9] KVM: Implement IPI-aware directed yield candidate selection Wanpeng Li
@ 2025-12-19  3:53 ` Wanpeng Li
  2026-01-04  2:40 ` [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Wanpeng Li @ 2025-12-19  3:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

Add a minimal two-round fallback mechanism in kvm_vcpu_on_spin() to
avoid pathological stalls when the first round finds no eligible
target.

Round 1 applies strict IPI-aware candidate selection (existing
behavior). Round 2 provides a relaxed scan gated only by preempted
state as a safety net, addressing cases where IPI context is missed or
the runnable set is transient.

The second round is controlled by the module parameter
enable_relaxed_boost (bool, 0644, default on) so that distributions can
easily disable it if needed.

Introduce the enable_relaxed_boost parameter and add a first_round
flag, a retry label, and a reset of the yielded counter. Round 1 keeps
the IPI-aware gate; round 2 falls back to preempted-only gating. Churn
is kept minimal by reusing the same scan logic and preserving all
existing heuristics, tracing, and bookkeeping.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 virt/kvm/kvm_main.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 45ede950314b..662a907a79e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -102,6 +102,9 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
 static bool allow_unsafe_mappings;
 module_param(allow_unsafe_mappings, bool, 0444);
 
+static bool enable_relaxed_boost = true;
+module_param(enable_relaxed_boost, bool, 0644);
+
 /*
  * Ordering of locks:
  *
@@ -4011,6 +4014,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
 	int try = 3;
+	bool first_round = true;
 
 	nr_vcpus = atomic_read(&kvm->online_vcpus);
 	if (nr_vcpus < 2)
@@ -4021,6 +4025,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 
 	kvm_vcpu_set_in_spin_loop(me, true);
 
+retry:
+	yielded = 0;
+
 	/*
 	 * The current vCPU ("me") is spinning in kernel mode, i.e. is likely
 	 * waiting for a resource to become available.  Attempt to yield to a
@@ -4052,8 +4059,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 		if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
 			continue;
 
-		/* IPI-aware candidate selection */
-		if (!kvm_vcpu_is_good_yield_candidate(me, vcpu, yield_to_kernel_mode))
+		/* IPI-aware candidate selection in first round */
+		if (first_round &&
+		    !kvm_vcpu_is_good_yield_candidate(me, vcpu, yield_to_kernel_mode))
+			continue;
+
+		/* Minimal preempted gate for second round */
+		if (!first_round && !READ_ONCE(vcpu->preempted))
 			continue;
 
 		if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
@@ -4067,6 +4079,16 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 			break;
 		}
 	}
+
+	/*
+	 * Second round: relaxed boost as safety net, with preempted gate.
+	 * Only execute when enabled and when the first round yielded nothing.
+	 */
+	if (enable_relaxed_boost && first_round && yielded <= 0) {
+		first_round = false;
+		goto retry;
+	}
+
 	kvm_vcpu_set_in_spin_loop(me, false);
 
 	/* Ensure vcpu is not eligible during next spinloop */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair()
  2025-12-19  3:53 ` [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
@ 2025-12-22  7:06   ` kernel test robot
  2025-12-22  9:31   ` kernel test robot
  1 sibling, 0 replies; 19+ messages in thread
From: kernel test robot @ 2025-12-22  7:06 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: oe-kbuild-all, K Prateek Nayak, Christian Borntraeger,
	Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

Hi Wanpeng,

kernel test robot noticed the following build errors:

[auto build test ERROR on kvm/queue]
[also build test ERROR on kvm/next tip/sched/core peterz-queue/sched/core tip/master linus/master v6.19-rc2 next-20251219]
[cannot apply to kvm/linux-next tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Wanpeng-Li/sched-fair-Add-rate-limiting-and-validation-helpers/20251219-125353
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link:    https://lore.kernel.org/r/20251219035334.39790-6-kernellwp%40gmail.com
patch subject: [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair()
config: m68k-allnoconfig (https://download.01.org/0day-ci/archive/20251222/202512221456.139kcj5R-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251222/202512221456.139kcj5R-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512221456.139kcj5R-lkp@intel.com/

All errors (new ones prefixed by >>):

   m68k-linux-ld: kernel/sched/fair.o: in function `yield_deboost_calculate_penalty':
>> fair.c:(.text+0x278c): undefined reference to `__udivdi3'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair()
  2025-12-19  3:53 ` [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
  2025-12-22  7:06   ` kernel test robot
@ 2025-12-22  9:31   ` kernel test robot
  1 sibling, 0 replies; 19+ messages in thread
From: kernel test robot @ 2025-12-22  9:31 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: oe-kbuild-all, K Prateek Nayak, Christian Borntraeger,
	Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

Hi Wanpeng,

kernel test robot noticed the following build errors:

[auto build test ERROR on kvm/queue]
[also build test ERROR on kvm/next tip/sched/core peterz-queue/sched/core tip/master linus/master v6.19-rc2 next-20251219]
[cannot apply to kvm/linux-next tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Wanpeng-Li/sched-fair-Add-rate-limiting-and-validation-helpers/20251219-125353
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link:    https://lore.kernel.org/r/20251219035334.39790-6-kernellwp%40gmail.com
patch subject: [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair()
config: i386-randconfig-003-20251222 (https://download.01.org/0day-ci/archive/20251222/202512221734.SzKNollL-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.4.0-5) 12.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251222/202512221734.SzKNollL-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512221734.SzKNollL-lkp@intel.com/

All errors (new ones prefixed by >>):

   ld: kernel/sched/fair.o: in function `yield_deboost_calculate_penalty':
>> kernel/sched/fair.c:9213:(.text+0x278d): undefined reference to `__udivdi3'


vim +9213 kernel/sched/fair.c

9a700d273608de Wanpeng Li 2025-12-19  9176  
9a700d273608de Wanpeng Li 2025-12-19  9177  /*
9a700d273608de Wanpeng Li 2025-12-19  9178   * Calculate vruntime penalty for yield deboost.
9a700d273608de Wanpeng Li 2025-12-19  9179   *
9a700d273608de Wanpeng Li 2025-12-19  9180   * The penalty is based on:
9a700d273608de Wanpeng Li 2025-12-19  9181   * - Fairness gap: vruntime difference between yielding and target tasks
9a700d273608de Wanpeng Li 2025-12-19  9182   * - Scheduling granularity: base unit for penalty calculation
9a700d273608de Wanpeng Li 2025-12-19  9183   * - Queue size: adaptive caps to prevent starvation in larger queues
9a700d273608de Wanpeng Li 2025-12-19  9184   *
9a700d273608de Wanpeng Li 2025-12-19  9185   * Queue-size-based caps (multiplier of granularity):
9a700d273608de Wanpeng Li 2025-12-19  9186   *   2 tasks:  6.0x - Strongest push for 2-task ping-pong scenarios
9a700d273608de Wanpeng Li 2025-12-19  9187   *   3 tasks:  4.0x
9a700d273608de Wanpeng Li 2025-12-19  9188   *   4-6:      2.5x
9a700d273608de Wanpeng Li 2025-12-19  9189   *   7-8:      2.0x
9a700d273608de Wanpeng Li 2025-12-19  9190   *   9-12:     1.5x
9a700d273608de Wanpeng Li 2025-12-19  9191   *   >12:      1.0x - Minimal push to avoid starvation
9a700d273608de Wanpeng Li 2025-12-19  9192   *
9a700d273608de Wanpeng Li 2025-12-19  9193   * Returns the calculated penalty value.
9a700d273608de Wanpeng Li 2025-12-19  9194   */
9a700d273608de Wanpeng Li 2025-12-19  9195  static u64 __maybe_unused
9a700d273608de Wanpeng Li 2025-12-19  9196  yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
9a700d273608de Wanpeng Li 2025-12-19  9197  				struct sched_entity *se_t_lca,
9a700d273608de Wanpeng Li 2025-12-19  9198  				struct task_struct *p_target, int h_nr_queued)
9a700d273608de Wanpeng Li 2025-12-19  9199  {
9a700d273608de Wanpeng Li 2025-12-19  9200  	u64 gran, need, penalty, maxp;
9a700d273608de Wanpeng Li 2025-12-19  9201  	u64 weighted_need, base;
9a700d273608de Wanpeng Li 2025-12-19  9202  
9a700d273608de Wanpeng Li 2025-12-19  9203  	gran = calc_delta_fair(sysctl_sched_base_slice, se_y_lca);
9a700d273608de Wanpeng Li 2025-12-19  9204  
9a700d273608de Wanpeng Li 2025-12-19  9205  	/* Calculate fairness gap */
9a700d273608de Wanpeng Li 2025-12-19  9206  	need = 0;
9a700d273608de Wanpeng Li 2025-12-19  9207  	if (se_t_lca->vruntime > se_y_lca->vruntime)
9a700d273608de Wanpeng Li 2025-12-19  9208  		need = se_t_lca->vruntime - se_y_lca->vruntime;
9a700d273608de Wanpeng Li 2025-12-19  9209  
9a700d273608de Wanpeng Li 2025-12-19  9210  	/* Base penalty is granularity plus 110% of fairness gap */
9a700d273608de Wanpeng Li 2025-12-19  9211  	penalty = gran;
9a700d273608de Wanpeng Li 2025-12-19  9212  	if (need) {
9a700d273608de Wanpeng Li 2025-12-19 @9213  		weighted_need = need + need / 10;
9a700d273608de Wanpeng Li 2025-12-19  9214  		if (weighted_need > U64_MAX - penalty)
9a700d273608de Wanpeng Li 2025-12-19  9215  			weighted_need = U64_MAX - penalty;
9a700d273608de Wanpeng Li 2025-12-19  9216  		penalty += weighted_need;
9a700d273608de Wanpeng Li 2025-12-19  9217  	}
9a700d273608de Wanpeng Li 2025-12-19  9218  
9a700d273608de Wanpeng Li 2025-12-19  9219  	/* Apply debounce to reduce ping-pong */
9a700d273608de Wanpeng Li 2025-12-19  9220  	penalty = yield_deboost_apply_debounce(rq, p_target, penalty, need, gran);
9a700d273608de Wanpeng Li 2025-12-19  9221  
9a700d273608de Wanpeng Li 2025-12-19  9222  	/* Queue-size-based upper bound */
9a700d273608de Wanpeng Li 2025-12-19  9223  	if (h_nr_queued == 2)
9a700d273608de Wanpeng Li 2025-12-19  9224  		maxp = gran * 6;
9a700d273608de Wanpeng Li 2025-12-19  9225  	else if (h_nr_queued == 3)
9a700d273608de Wanpeng Li 2025-12-19  9226  		maxp = gran * 4;
9a700d273608de Wanpeng Li 2025-12-19  9227  	else if (h_nr_queued <= 6)
9a700d273608de Wanpeng Li 2025-12-19  9228  		maxp = (gran * 5) / 2;
9a700d273608de Wanpeng Li 2025-12-19  9229  	else if (h_nr_queued <= 8)
9a700d273608de Wanpeng Li 2025-12-19  9230  		maxp = gran * 2;
9a700d273608de Wanpeng Li 2025-12-19  9231  	else if (h_nr_queued <= 12)
9a700d273608de Wanpeng Li 2025-12-19  9232  		maxp = (gran * 3) / 2;
9a700d273608de Wanpeng Li 2025-12-19  9233  	else
9a700d273608de Wanpeng Li 2025-12-19  9234  		maxp = gran;
9a700d273608de Wanpeng Li 2025-12-19  9235  
9a700d273608de Wanpeng Li 2025-12-19  9236  	penalty = clamp(penalty, gran, maxp);
9a700d273608de Wanpeng Li 2025-12-19  9237  
9a700d273608de Wanpeng Li 2025-12-19  9238  	/* Baseline push when no fairness gap exists */
9a700d273608de Wanpeng Li 2025-12-19  9239  	if (need == 0) {
9a700d273608de Wanpeng Li 2025-12-19  9240  		if (h_nr_queued == 3)
9a700d273608de Wanpeng Li 2025-12-19  9241  			base = (gran * 15) / 16;
9a700d273608de Wanpeng Li 2025-12-19  9242  		else if (h_nr_queued >= 4 && h_nr_queued <= 6)
9a700d273608de Wanpeng Li 2025-12-19  9243  			base = (gran * 5) / 8;
9a700d273608de Wanpeng Li 2025-12-19  9244  		else if (h_nr_queued >= 7 && h_nr_queued <= 8)
9a700d273608de Wanpeng Li 2025-12-19  9245  			base = gran / 2;
9a700d273608de Wanpeng Li 2025-12-19  9246  		else if (h_nr_queued >= 9 && h_nr_queued <= 12)
9a700d273608de Wanpeng Li 2025-12-19  9247  			base = (gran * 3) / 8;
9a700d273608de Wanpeng Li 2025-12-19  9248  		else if (h_nr_queued > 12)
9a700d273608de Wanpeng Li 2025-12-19  9249  			base = gran / 4;
9a700d273608de Wanpeng Li 2025-12-19  9250  		else
9a700d273608de Wanpeng Li 2025-12-19  9251  			base = gran;
9a700d273608de Wanpeng Li 2025-12-19  9252  
9a700d273608de Wanpeng Li 2025-12-19  9253  		if (penalty < base)
9a700d273608de Wanpeng Li 2025-12-19  9254  			penalty = base;
9a700d273608de Wanpeng Li 2025-12-19  9255  	}
9a700d273608de Wanpeng Li 2025-12-19  9256  
9a700d273608de Wanpeng Li 2025-12-19  9257  	return penalty;
9a700d273608de Wanpeng Li 2025-12-19  9258  }
9a700d273608de Wanpeng Li 2025-12-19  9259  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/9] sched/fair: Add rate-limiting and validation helpers
  2025-12-19  3:53 ` [PATCH v2 2/9] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
@ 2025-12-22 21:12   ` kernel test robot
  2026-01-04  4:09   ` Hillf Danton
  1 sibling, 0 replies; 19+ messages in thread
From: kernel test robot @ 2025-12-22 21:12 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: oe-kbuild-all, K Prateek Nayak, Christian Borntraeger,
	Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

Hi Wanpeng,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kvm/queue]
[also build test WARNING on kvm/next tip/sched/core peterz-queue/sched/core tip/master linus/master v6.19-rc2 next-20251219]
[cannot apply to kvm/linux-next tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Wanpeng-Li/sched-fair-Add-rate-limiting-and-validation-helpers/20251219-125353
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link:    https://lore.kernel.org/r/20251219035334.39790-3-kernellwp%40gmail.com
patch subject: [PATCH v2 2/9] sched/fair: Add rate-limiting and validation helpers
config: openrisc-randconfig-r122-20251221 (https://download.01.org/0day-ci/archive/20251223/202512230415.0RatyaQF-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251223/202512230415.0RatyaQF-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512230415.0RatyaQF-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   kernel/sched/fair.c:1158:49: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *running @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:1158:49: sparse:     expected struct task_struct *running
   kernel/sched/fair.c:1158:49: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:1194:33: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected struct sched_entity *se @@     got struct sched_entity [noderef] __rcu * @@
   kernel/sched/fair.c:1194:33: sparse:     expected struct sched_entity *se
   kernel/sched/fair.c:1194:33: sparse:     got struct sched_entity [noderef] __rcu *
   kernel/sched/fair.c:1250:34: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_entity const *se @@     got struct sched_entity [noderef] __rcu * @@
   kernel/sched/fair.c:1250:34: sparse:     expected struct sched_entity const *se
   kernel/sched/fair.c:1250:34: sparse:     got struct sched_entity [noderef] __rcu *
   kernel/sched/fair.c:12991:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:12991:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:12991:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:8354:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:8354:20: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:8354:20: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:8558:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:8558:9: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/fair.c:8558:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:8757:39: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *donor @@     got struct task_struct [noderef] __rcu *donor @@
   kernel/sched/fair.c:8757:39: sparse:     expected struct task_struct *donor
   kernel/sched/fair.c:8757:39: sparse:     got struct task_struct [noderef] __rcu *donor
   kernel/sched/fair.c:8784:37: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct *tsk @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:8784:37: sparse:     expected struct task_struct *tsk
   kernel/sched/fair.c:8784:37: sparse:     got struct task_struct [noderef] __rcu *curr
>> kernel/sched/fair.c:9089:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct task_struct *p_yielding @@     got struct task_struct [noderef] __rcu *donor @@
   kernel/sched/fair.c:9089:20: sparse:     expected struct task_struct *p_yielding
   kernel/sched/fair.c:9089:20: sparse:     got struct task_struct [noderef] __rcu *donor
   kernel/sched/fair.c:9110:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *donor @@
   kernel/sched/fair.c:9110:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:9110:38: sparse:     got struct task_struct [noderef] __rcu *donor
   kernel/sched/fair.c:10146:40: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *child @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/fair.c:10146:40: sparse:     expected struct sched_domain *child
   kernel/sched/fair.c:10146:40: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/fair.c:10774:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/fair.c:10774:22: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/fair.c:10774:22: sparse:    struct task_struct *
   kernel/sched/fair.c:12246:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:12246:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:12246:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:11884:44: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *sd_parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:11884:44: sparse:     expected struct sched_domain *sd_parent
   kernel/sched/fair.c:11884:44: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:12359:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:12359:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:12359:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:6688:35: sparse: sparse: marked inline, but without a definition
   kernel/sched/fair.c: note: in included file:
   kernel/sched/sched.h:2647:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2647:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2647:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2314:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct *
   kernel/sched/sched.h:2303:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2303:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2303:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2314:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct *
   kernel/sched/sched.h:2314:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct *
   kernel/sched/sched.h:2314:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct *

vim +9089 kernel/sched/fair.c

  9064	
  9065	/*
  9066	 * Validate tasks for yield deboost operation.
  9067	 * Returns the yielding task on success, NULL on validation failure.
  9068	 *
  9069	 * Checks: feature enabled, valid target, same runqueue, target is fair class,
  9070	 * both on_rq. Called under rq->lock.
  9071	 *
  9072	 * Note: p_yielding (rq->donor) is guaranteed to be fair class by the caller
  9073	 * (yield_to_task_fair is only called when curr->sched_class == p->sched_class).
  9074	 */
  9075	static struct task_struct __maybe_unused *
  9076	yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target)
  9077	{
  9078		struct task_struct *p_yielding;
  9079	
  9080		if (!sysctl_sched_vcpu_debooster_enabled)
  9081			return NULL;
  9082	
  9083		if (!p_target)
  9084			return NULL;
  9085	
  9086		if (yield_deboost_rate_limit(rq))
  9087			return NULL;
  9088	
> 9089		p_yielding = rq->donor;
  9090		if (!p_yielding || p_yielding == p_target)
  9091			return NULL;
  9092	
  9093		if (p_target->sched_class != &fair_sched_class)
  9094			return NULL;
  9095	
  9096		if (task_rq(p_target) != rq)
  9097			return NULL;
  9098	
  9099		if (!p_target->se.on_rq || !p_yielding->se.on_rq)
  9100			return NULL;
  9101	
  9102		return p_yielding;
  9103	}
  9104	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/9] sched/fair: Add penalty calculation and application logic
  2025-12-19  3:53 ` [PATCH v2 4/9] sched/fair: Add penalty calculation and application logic Wanpeng Li
@ 2025-12-22 23:36   ` kernel test robot
  0 siblings, 0 replies; 19+ messages in thread
From: kernel test robot @ 2025-12-22 23:36 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: oe-kbuild-all, K Prateek Nayak, Christian Borntraeger,
	Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

Hi Wanpeng,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kvm/queue]
[also build test WARNING on kvm/next tip/sched/core peterz-queue/sched/core tip/master linus/master v6.19-rc2 next-20251219]
[cannot apply to kvm/linux-next tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Wanpeng-Li/sched-fair-Add-rate-limiting-and-validation-helpers/20251219-125353
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link:    https://lore.kernel.org/r/20251219035334.39790-5-kernellwp%40gmail.com
patch subject: [PATCH v2 4/9] sched/fair: Add penalty calculation and application logic
config: openrisc-randconfig-r122-20251221 (https://download.01.org/0day-ci/archive/20251223/202512230746.EpT2QbVU-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251223/202512230746.EpT2QbVU-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512230746.EpT2QbVU-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   kernel/sched/fair.c:1158:49: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *running @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:1158:49: sparse:     expected struct task_struct *running
   kernel/sched/fair.c:1158:49: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:1194:33: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected struct sched_entity *se @@     got struct sched_entity [noderef] __rcu * @@
   kernel/sched/fair.c:1194:33: sparse:     expected struct sched_entity *se
   kernel/sched/fair.c:1194:33: sparse:     got struct sched_entity [noderef] __rcu *
   kernel/sched/fair.c:1250:34: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_entity const *se @@     got struct sched_entity [noderef] __rcu * @@
   kernel/sched/fair.c:1250:34: sparse:     expected struct sched_entity const *se
   kernel/sched/fair.c:1250:34: sparse:     got struct sched_entity [noderef] __rcu *
   kernel/sched/fair.c:13176:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:13176:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:13176:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:8354:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:8354:20: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:8354:20: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:8558:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:8558:9: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/fair.c:8558:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:8757:39: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *donor @@     got struct task_struct [noderef] __rcu *donor @@
   kernel/sched/fair.c:8757:39: sparse:     expected struct task_struct *donor
   kernel/sched/fair.c:8757:39: sparse:     got struct task_struct [noderef] __rcu *donor
   kernel/sched/fair.c:8784:37: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct *tsk @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:8784:37: sparse:     expected struct task_struct *tsk
   kernel/sched/fair.c:8784:37: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:9089:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct task_struct *p_yielding @@     got struct task_struct [noderef] __rcu *donor @@
   kernel/sched/fair.c:9089:20: sparse:     expected struct task_struct *p_yielding
   kernel/sched/fair.c:9089:20: sparse:     got struct task_struct [noderef] __rcu *donor
>> kernel/sched/fair.c:9150:44: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *p_yielding @@     got struct task_struct [noderef] __rcu *donor @@
   kernel/sched/fair.c:9150:44: sparse:     expected struct task_struct *p_yielding
   kernel/sched/fair.c:9150:44: sparse:     got struct task_struct [noderef] __rcu *donor
   kernel/sched/fair.c:9295:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *donor @@
   kernel/sched/fair.c:9295:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:9295:38: sparse:     got struct task_struct [noderef] __rcu *donor
   kernel/sched/fair.c:10331:40: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *child @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/fair.c:10331:40: sparse:     expected struct sched_domain *child
   kernel/sched/fair.c:10331:40: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/fair.c:10959:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/fair.c:10959:22: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/fair.c:10959:22: sparse:    struct task_struct *
   kernel/sched/fair.c:12431:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:12431:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:12431:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:12069:44: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *sd_parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:12069:44: sparse:     expected struct sched_domain *sd_parent
   kernel/sched/fair.c:12069:44: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:12544:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:12544:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:12544:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:6688:35: sparse: sparse: marked inline, but without a definition
   kernel/sched/fair.c: note: in included file:
   kernel/sched/sched.h:2647:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2647:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2647:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2314:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct *
   kernel/sched/sched.h:2303:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2303:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2303:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2314:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct *
   kernel/sched/sched.h:2314:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct *
   kernel/sched/sched.h:2314:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2314:26: sparse:    struct task_struct *

vim +9150 kernel/sched/fair.c

  9134	
  9135	/*
  9136	 * Apply debounce for reverse yield pairs to reduce ping-pong effects.
  9137	 * When A yields to B, then B yields back to A within ~600us, downscale
  9138	 * the penalty to prevent oscillation.
  9139	 *
  9140	 * The 600us threshold is chosen to be:
  9141	 * - Long enough to catch rapid back-and-forth yields
  9142	 * - Short enough to not affect legitimate sequential yields
  9143	 *
  9144	 * Returns the (possibly reduced) penalty value.
  9145	 */
  9146	static u64 yield_deboost_apply_debounce(struct rq *rq, struct task_struct *p_target,
  9147						u64 penalty, u64 need, u64 gran)
  9148	{
  9149		u64 now = rq_clock(rq);
> 9150		struct task_struct *p_yielding = rq->donor;
  9151		pid_t src_pid, dst_pid;
  9152		pid_t last_src, last_dst;
  9153		u64 last_ns;
  9154	
  9155		if (!p_yielding || !p_target)
  9156			return penalty;
  9157	
  9158		src_pid = p_yielding->pid;
  9159		dst_pid = p_target->pid;
  9160		last_src = rq->yield_deboost_last_src_pid;
  9161		last_dst = rq->yield_deboost_last_dst_pid;
  9162		last_ns = rq->yield_deboost_last_pair_time_ns;
  9163	
  9164		/* Detect reverse pair: previous was target->source */
  9165		if (last_src == dst_pid && last_dst == src_pid &&
  9166		    (now - last_ns) <= 600 * NSEC_PER_USEC) {
  9167			u64 alt = max(need, gran);
  9168	
  9169			if (penalty > alt)
  9170				penalty = alt;
  9171		}
  9172	
  9173		/* Update tracking state */
  9174		rq->yield_deboost_last_src_pid = src_pid;
  9175		rq->yield_deboost_last_dst_pid = dst_pid;
  9176		rq->yield_deboost_last_pair_time_ns = now;
  9177	
  9178		return penalty;
  9179	}
  9180	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (8 preceding siblings ...)
  2025-12-19  3:53 ` [PATCH v2 9/9] KVM: Relaxed boost as safety net Wanpeng Li
@ 2026-01-04  2:40 ` Wanpeng Li
  2026-01-05  6:26 ` K Prateek Nayak
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Wanpeng Li @ 2026-01-04  2:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li

ping, :)
On Fri, 19 Dec 2025 at 11:53, Wanpeng Li <kernellwp@gmail.com> wrote:
>
> From: Wanpeng Li <wanpengli@tencent.com>
>
> This series addresses long-standing yield_to() inefficiencies in
> virtualized environments through two complementary mechanisms: a vCPU
> debooster in the scheduler and IPI-aware directed yield in KVM.
>
> Problem Statement
> -----------------
>
> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> held by other vCPUs that are not currently running. The kernel's
> paravirtual spinlock support detects these situations and calls yield_to()
> to boost the lock holder, allowing it to run and release the lock.
>
> However, the current implementation has two critical limitations:
>
> 1. Scheduler-side limitation:
>
>    yield_to_task_fair() relies solely on set_next_buddy() to provide
>    preference to the target vCPU. This buddy mechanism only offers
>    immediate, transient preference. Once the buddy hint expires (typically
>    after one scheduling decision), the yielding vCPU may preempt the target
>    again, especially in nested cgroup hierarchies where vruntime domains
>    differ.
>
>    This creates a ping-pong effect: the lock holder runs briefly, gets
>    preempted before completing critical sections, and the yielding vCPU
>    spins again, triggering another futile yield_to() cycle. The overhead
>    accumulates rapidly in workloads with high lock contention.
>
> 2. KVM-side limitation:
>
>    kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
>    directed yield candidate selection. However, it lacks awareness of IPI
>    communication patterns. When a vCPU sends an IPI and spins waiting for
>    a response (common in inter-processor synchronization), the current
>    heuristics often fail to identify the IPI receiver as the yield target.
>
>    Instead, the code may boost an unrelated vCPU based on coarse-grained
>    preemption state, missing opportunities to accelerate actual IPI
>    response handling. This is particularly problematic when the IPI
>    receiver is runnable but not scheduled, as lock-holder-detection logic
>    doesn't capture the IPI dependency relationship.
>
> Combined, these issues cause excessive lock hold times, cache thrashing,
> and degraded throughput in overcommitted environments, particularly
> affecting workloads with fine-grained synchronization patterns.
>
> Solution Overview
> -----------------
>
> The series introduces two orthogonal improvements that work synergistically:
>
> Part 1: Scheduler vCPU Debooster (patches 1-5)
>
> Augment yield_to_task_fair() with bounded vruntime penalties to provide
> sustained preference beyond the buddy mechanism. When a vCPU yields to a
> target, apply a carefully tuned vruntime penalty to the yielding vCPU,
> ensuring the target maintains scheduling advantage for longer periods.
>
> The mechanism is EEVDF-aware and cgroup-hierarchy-aware:
>
> - Locate the lowest common ancestor (LCA) in the cgroup hierarchy where
>   both the yielding and target tasks coexist. This ensures vruntime
>   adjustments occur at the correct hierarchy level, maintaining fairness
>   across cgroup boundaries.
>
> - Update EEVDF scheduler fields (vruntime, deadline) atomically to keep
>   the scheduler state consistent. Note that vlag is intentionally not
>   modified as it will be recalculated on dequeue/enqueue cycles. The
>   penalty shifts the yielding task's virtual deadline forward, allowing
>   the target to run.
>
> - Apply queue-size-adaptive penalties that scale from 6.0x scheduling
>   granularity for 2-task scenarios (strong preference) down to 1.0x for
>   large queues (>12 tasks), balancing preference against starvation risks.
>
> - Implement reverse-pair debouncing: when task A yields to B, then B yields
>   to A within a short window (~600us), downscale the penalty to prevent
>   ping-pong oscillation.
>
> - Rate-limit penalty application to 6ms intervals to prevent pathological
>   overhead when yields occur at very high frequency.
>
> The debooster works *with* the buddy mechanism rather than replacing it:
> set_next_buddy() provides immediate preference for the next scheduling
> decision, while the vruntime penalty sustains that preference over
> subsequent decisions. This dual approach proves especially effective in
> nested cgroup scenarios where buddy hints alone are insufficient.
>
> Part 2: KVM IPI-Aware Directed Yield (patches 6-9)
>
> Enhance kvm_vcpu_on_spin() with lightweight IPI tracking to improve
> directed yield candidate selection. Track sender/receiver relationships
> when IPIs are delivered and use this information to prioritize yield
> targets.
>
> The tracking mechanism:
>
> - Hooks into kvm_irq_delivery_to_apic() to detect unicast fixed IPIs (the
>   common case for inter-processor synchronization). When exactly one
>   destination vCPU receives an IPI, record the sender->receiver relationship
>   with a monotonic timestamp.
>
>   In high VM density scenarios, software-based IPI tracking through
>   interrupt delivery interception becomes particularly valuable. It
>   captures precise sender/receiver relationships that can be leveraged
>   for intelligent scheduling decisions, providing performance benefits
>   that complement or even exceed hardware-accelerated interrupt delivery
>   in overcommitted environments.
>
> - Uses lockless READ_ONCE/WRITE_ONCE accessors for minimal overhead. The
>   per-vCPU ipi_context structure is carefully designed to avoid cache line
>   bouncing.
>
> - Implements a short recency window (50ms default) to avoid stale IPI
>   information inflating boost priority on throughput-sensitive workloads.
>   Old IPI relationships are naturally aged out.
>
> - Clears IPI context on EOI with two-stage precision: unconditionally clear
>   the receiver's context (it processed the interrupt), but only clear the
>   sender's pending flag if the receiver matches and the IPI is recent. This
>   prevents unrelated EOIs from prematurely clearing valid IPI state.
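[Editor's note: the recency-window check described above can be sketched as below. Structure layout, field names, and the helper name are illustrative; the series reads the per-vCPU context with READ_ONCE() and uses ktime_get_mono_fast_ns() timestamps.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of deciding whether a recorded sender->receiver IPI
 * relationship is still recent enough to justify a directed-yield
 * boost. Stale entries age out naturally: they simply stop passing
 * this check, so no explicit invalidation pass is needed.
 */
struct ipi_ctx {
	int pending_receiver;	/* assumed -1 when no IPI outstanding */
	uint64_t sent_ns;	/* monotonic timestamp of last sent IPI */
};

#define IPI_WINDOW_NS (50ULL * 1000 * 1000)	/* 50ms default window */

static bool ipi_target_is(const struct ipi_ctx *ctx, int vcpu_id,
			  uint64_t now_ns)
{
	if (ctx->pending_receiver != vcpu_id)
		return false;
	/* Reject stale relationships instead of boosting on them. */
	return now_ns - ctx->sent_ns <= IPI_WINDOW_NS;
}
```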
>
> The candidate selection follows a priority hierarchy:
>
>   Priority 1: Confirmed IPI receiver
>     If the spinning vCPU recently sent an IPI to another vCPU and that IPI
>     is still pending (within the recency window), unconditionally boost the
>     receiver. This directly addresses the "spinning on IPI response" case.
>
>   Priority 2: Fast pending interrupt
>     Leverage arch-specific kvm_arch_dy_has_pending_interrupt() for
>     compatibility with existing optimizations.
>
>   Priority 3: Preempted in kernel mode
>     Fall back to traditional preemption-based logic when yield_to_kernel_mode
>     is requested, ensuring compatibility with existing workloads.
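[Editor's note: the three-level priority hierarchy above can be condensed into a per-candidate classifier like the sketch below. All names are illustrative; in the series the equivalent checks are applied while iterating candidates inside kvm_vcpu_on_spin().]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the per-candidate filter: confirmed IPI receivers win
 * unconditionally, then arch-reported pending interrupts, then the
 * traditional preemption-based fallback (which only constrains the
 * candidate when yield_to_kernel_mode is requested).
 */
enum boost_reason { NO_BOOST, IPI_RECEIVER, PENDING_INTR, PREEMPTED };

static enum boost_reason classify(bool confirmed_ipi_receiver,
				  bool has_pending_interrupt,
				  bool preempted_in_kernel,
				  bool yield_to_kernel_mode)
{
	if (confirmed_ipi_receiver)		/* Priority 1 */
		return IPI_RECEIVER;
	if (has_pending_interrupt)		/* Priority 2 */
		return PENDING_INTR;
	if (!yield_to_kernel_mode || preempted_in_kernel)
		return PREEMPTED;		/* Priority 3 fallback */
	return NO_BOOST;
}
```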
>
> A two-round fallback mechanism provides a safety net: if the first round
> with strict IPI-aware selection finds no eligible candidate (e.g., due to
> missed IPI context or transient runnable set changes), a second round
> applies relaxed selection gated only by preemption state. This is
> controlled by the enable_relaxed_boost module parameter (default on).
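[Editor's note: the two-round fallback can be sketched as a simple outer loop over selection modes, as below. The boolean-array interface is a deliberate simplification of walking the VM's vCPU list; names are illustrative.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of two-round candidate selection: round 0 applies the strict
 * IPI-aware filter; round 1 runs only if round 0 found nothing and
 * relaxed boost is enabled, using the preemption-gated relaxed filter.
 */
static int pick_candidate(const bool *strict_ok, const bool *relaxed_ok,
			  size_t n, bool enable_relaxed_boost)
{
	for (int round = 0; round < 2; round++) {
		if (round == 1 && !enable_relaxed_boost)
			break;
		for (size_t i = 0; i < n; i++) {
			if (round == 0 ? strict_ok[i] : relaxed_ok[i])
				return (int)i;
		}
	}
	return -1;	/* no eligible yield target */
}
```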
>
> Implementation Details
> ----------------------
>
> Both mechanisms are designed for minimal overhead and runtime control:
>
> - All locking occurs under existing rq->lock or per-vCPU locks; no new
>   lock contention is introduced.
>
> - Penalty calculations use integer arithmetic with overflow protection.
>
> - IPI tracking uses monotonic timestamps (ktime_get_mono_fast_ns()) for
>   efficient, race-free recency checks.
>
> Advantages over paravirtualization approaches:
>
> - No guest OS modification required: This solution operates entirely within
>   the host kernel, providing transparent optimization without guest kernel
>   changes or recompilation.
>
> - Guest OS agnostic: Works uniformly across Linux, Windows, and other guest
>   operating systems, unlike PV TLB shootdown which requires guest-side
>   paravirtual driver support.
>
> - Broader applicability: Captures IPI patterns from all synchronization
>   primitives (spinlocks, RCU, smp_call_function, etc.), not limited to
>   specific paravirtualized operations like TLB shootdown.
>
> - Deployment simplicity: Existing VM images benefit immediately without
>   guest kernel updates, critical for production environments with diverse
>   guest OS versions and configurations.
>
> - Runtime controls allow disabling features if needed:
>   * /sys/kernel/debug/sched/vcpu_debooster_enabled
>   * /sys/module/kvm/parameters/ipi_tracking_enabled
>   * /sys/module/kvm/parameters/enable_relaxed_boost
>
> - The infrastructure is incrementally introduced: early patches add inert
>   scaffolding that can be verified for zero performance impact before
>   activation.
>
> Performance Results
> -------------------
>
> Test environment: Intel Xeon, 16 physical cores, 16 vCPUs per VM
>
> Dbench 16 clients per VM (filesystem metadata operations):
>   2 VMs: +14.4% throughput (lock contention reduction)
>   3 VMs:  +9.8% throughput
>   4 VMs:  +6.7% throughput
>
> PARSEC Dedup benchmark, simlarge input (memory-intensive):
>   2 VMs: +47.1% throughput (IPI-heavy synchronization)
>   3 VMs: +28.1% throughput
>   4 VMs:  +1.7% throughput
>
> PARSEC VIPS benchmark, simlarge input (compute-intensive):
>   2 VMs: +26.2% throughput (balanced sync and compute)
>   3 VMs: +12.7% throughput
>   4 VMs:  +6.0% throughput
>
> Analysis:
>
> - Gains are most pronounced at moderate overcommit (2-3 VMs). At this level,
>   contention is significant enough to benefit from better yield behavior,
>   but context switch overhead remains manageable.
>
> - Dedup shows the strongest improvement (+47.1% at 2 VMs) due to its
>   IPI-heavy synchronization patterns. The IPI-aware directed yield
>   precisely targets the bottleneck.
>
> - At 4 VMs (heavier overcommit), gains diminish as general CPU contention
>   dominates. However, performance never regresses, indicating the mechanisms
>   gracefully degrade.
>
> - In certain high-density, resource overcommitted deployment scenarios, the
>   performance benefits of APICv can be constrained by scheduling and
>   contention patterns. In such cases, software-based IPI tracking serves as
>   a complementary optimization path, offering targeted scheduling hints
>   without relying on disabling APICv. The practical choice should be
>   evaluated and balanced against workload characteristics and platform
>   configuration.
>
> - Dbench benefits primarily from the scheduler-side debooster, as its lock
>   patterns involve less IPI spinning and more direct lock holder boosting.
>
> The performance gains stem from three factors:
>
> 1. Lock holders receive sustained CPU time to complete critical sections,
>    reducing overall lock hold duration and cascading contention.
>
> 2. IPI receivers are promptly scheduled when senders spin, minimizing IPI
>    response latency and reducing wasted spin cycles.
>
> 3. Better cache utilization results from reduced context switching between
>    lock waiters and holders.
>
> Patch Organization
> ------------------
>
> The series is organized for incremental review and bisectability:
>
> Patches 1-5: Scheduler vCPU debooster
>
>   Patch 1: Add infrastructure (per-rq tracking, sysctl, debugfs entry)
>            Infrastructure is inert; no functional change.
>
>   Patch 2: Add rate-limiting and validation helpers
>            Static functions with comprehensive safety checks.
>
>   Patch 3: Add cgroup LCA finder for hierarchical yield
>            Implements CONFIG_FAIR_GROUP_SCHED-aware LCA location.
>
>   Patch 4: Add penalty calculation and application logic
>            Core algorithms with queue-size adaptation and debouncing.
>
>   Patch 5: Wire up yield deboost in yield_to_task_fair()
>            Activation patch. Includes Dbench performance data.
>
> Patches 6-9: KVM IPI-aware directed yield
>
>   Patch 6: Add IPI tracking infrastructure
>            Per-vCPU context, module parameters, helper functions.
>            Infrastructure is inert until activated.
>
>   Patch 7: Integrate IPI tracking with LAPIC interrupt delivery
>            Hook into kvm_irq_delivery_to_apic() and EOI handling.
>
>   Patch 8: Implement IPI-aware directed yield candidate selection
>            Replace candidate selection logic with priority-based approach.
>            Includes PARSEC performance data.
>
>   Patch 9: Add relaxed boost as safety net
>            Two-round fallback mechanism for robustness.
>
> Each patch compiles and boots independently. Performance data is presented
> where the relevant mechanism becomes active (patches 5 and 8).
>
> Testing
> -------
>
> Workloads tested:
>
> - Dbench (filesystem metadata stress)
> - PARSEC benchmarks (Dedup, VIPS, Ferret, Blackscholes)
> - Kernel compilation (make -j16 in each VM)
>
> No regressions observed on any configuration. The mechanisms show neutral
> to positive impact across diverse workloads.
>
> Future Work
> -----------
>
> Potential extensions beyond this series:
>
> - Adaptive recency window: dynamically adjust ipi_window_ns based on
>   observed workload patterns.
>
> - Extended tracking: consider multi-round IPI patterns (A->B->C->A).
>
> - Cross-NUMA awareness: penalty scaling based on NUMA distances.
>
> These are intentionally deferred to keep this series focused and reviewable.
>
> v1 -> v2:
> - Rebase onto v6.19-rc1 (v1 was based on v6.18-rc4)
> - Drop "KVM: Fix last_boosted_vcpu index assignment bug" patch as v6.19-rc1
>   already contains this fix
> - Scheduler debooster changes:
>   * Adapt to v6.19's EEVDF forfeit behavior: yield_to_deboost() must be
>     called BEFORE yield_task_fair() to preserve fairness gap calculation.
>     In v6.19+, yield_task_fair() performs forfeit (se->vruntime =
>     se->deadline), which would inflate the yielding entity's vruntime
>     before penalty calculation, causing need=0 and only baseline penalty
>     being applied.
>   * Change from rq->curr to rq->donor for correct EEVDF donor tracking
>   * Change from nr_queued to h_nr_queued for accurate hierarchical task
>     counting in penalty cap calculation
>   * Remove vlag assignment as it will be recalculated on dequeue/enqueue
>     and modifying it for on-rq entity is incorrect
>   * Remove update_min_vruntime() call: in EEVDF the yielding entity is
>     always cfs_rq->curr (dequeued from RB-tree), so modifying its vruntime
>     does not affect min_vruntime calculation
>   * Remove unnecessary gran_floor safeguard (calc_delta_fair already
>     handles edge cases correctly)
>   * Rename debugfs entry from sched_vcpu_debooster_enabled to
>     vcpu_debooster_enabled for consistency
> - KVM IPI tracking changes:
>   * Improve documentation for module parameters
>   * Add kvm_vcpu_is_ipi_receiver() declaration to x86.h header
>
> Wanpeng Li (9):
>   sched: Add vCPU debooster infrastructure
>   sched/fair: Add rate-limiting and validation helpers
>   sched/fair: Add cgroup LCA finder for hierarchical yield
>   sched/fair: Add penalty calculation and application logic
>   sched/fair: Wire up yield deboost in yield_to_task_fair()
>   KVM: x86: Add IPI tracking infrastructure
>   KVM: x86/lapic: Integrate IPI tracking with interrupt delivery
>   KVM: Implement IPI-aware directed yield candidate selection
>   KVM: Relaxed boost as safety net
>
>  arch/x86/include/asm/kvm_host.h |  12 ++
>  arch/x86/kvm/lapic.c            | 166 ++++++++++++++++-
>  arch/x86/kvm/x86.c              |   3 +
>  arch/x86/kvm/x86.h              |   8 +
>  include/linux/kvm_host.h        |   3 +
>  kernel/sched/core.c             |   9 +-
>  kernel/sched/debug.c            |   2 +
>  kernel/sched/fair.c             | 305 ++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h            |  12 ++
>  virt/kvm/kvm_main.c             |  74 +++++++-
>  10 files changed, 579 insertions(+), 15 deletions(-)
>
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/9] sched/fair: Add rate-limiting and validation helpers
  2025-12-19  3:53 ` [PATCH v2 2/9] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
  2025-12-22 21:12   ` kernel test robot
@ 2026-01-04  4:09   ` Hillf Danton
  1 sibling, 0 replies; 19+ messages in thread
From: Hillf Danton @ 2026-01-04  4:09 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Paolo Bonzini, Sean Christopherson,
	K Prateek Nayak, linux-kernel, kvm, Wanpeng Li

Hi Wanpeng 

On Fri, 19 Dec 2025 11:53:26 +0800
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> Implement core safety mechanisms for yield deboost operations.
> 
> Add yield_deboost_rate_limit() for high-frequency gating to prevent
> excessive overhead on compute-intensive workloads. The 6ms threshold
> balances responsiveness with overhead reduction.
> 
> Add yield_deboost_validate_tasks() for comprehensive validation ensuring
> both tasks are valid and distinct, both belong to fair_sched_class,
> target is on the same runqueue, and tasks are runnable.
> 
Given IPI in subsequent patches, why is same rq required?

> The rate limiter prevents pathological high-frequency cases while
> validation ensures only appropriate task pairs proceed. Both functions
> are static and will be integrated in subsequent patches.
> 
> v1 -> v2:
> - Remove unnecessary READ_ONCE/WRITE_ONCE for per-rq fields accessed
>   under rq->lock
> - Change rq->clock to rq_clock(rq) helper for consistency
> - Change yield_deboost_rate_limit() signature from (rq, now_ns) to (rq),
>   obtaining time internally via rq_clock()
> - Remove redundant sched_class check for p_yielding (already implied by
>   rq->donor being fair)
> - Simplify task_rq check to only verify p_target
> - Change rq->curr to rq->donor for correct EEVDF donor tracking
> - Move sysctl_sched_vcpu_debooster_enabled and NULL checks to caller
>   (yield_to_deboost) for early exit before update_rq_clock()
> - Simplify function signature by returning p_yielding directly instead
>   of using output pointer parameters
> - Add documentation explaining the 6ms rate limit threshold
> 
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> ---
>  kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 62 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 87c30db2c853..2f327882bf4d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9040,6 +9040,68 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
>  	}
>  }
>  
> +/*
> + * Rate-limit yield deboost operations to prevent excessive overhead.
> + * Returns true if the operation should be skipped due to rate limiting.
> + *
> + * The 6ms threshold balances responsiveness with overhead reduction:
> + * - Short enough to allow timely yield boosting for lock contention
> + * - Long enough to prevent pathological high-frequency penalty application
> + *
> + * Called under rq->lock, so direct field access is safe.
> + */
> +static bool yield_deboost_rate_limit(struct rq *rq)
> +{
> +	u64 now = rq_clock(rq);
> +	u64 last = rq->yield_deboost_last_time_ns;
> +
> +	if (last && (now - last) <= 6 * NSEC_PER_MSEC)
> +		return true;
> +
> +	rq->yield_deboost_last_time_ns = now;
> +	return false;
> +}
> +
> +/*
> + * Validate tasks for yield deboost operation.
> + * Returns the yielding task on success, NULL on validation failure.
> + *
> + * Checks: feature enabled, valid target, same runqueue, target is fair class,
> + * both on_rq. Called under rq->lock.
> + *
> + * Note: p_yielding (rq->donor) is guaranteed to be fair class by the caller
> + * (yield_to_task_fair is only called when curr->sched_class == p->sched_class).
> + */
> +static struct task_struct __maybe_unused *
> +yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target)
> +{
> +	struct task_struct *p_yielding;
> +
> +	if (!sysctl_sched_vcpu_debooster_enabled)
> +		return NULL;
> +
> +	if (!p_target)
> +		return NULL;
> +
> +	if (yield_deboost_rate_limit(rq))
> +		return NULL;
> +
> +	p_yielding = rq->donor;
> +	if (!p_yielding || p_yielding == p_target)
> +		return NULL;
> +
> +	if (p_target->sched_class != &fair_sched_class)
> +		return NULL;
> +
> +	if (task_rq(p_target) != rq)
> +		return NULL;
> +
> +	if (!p_target->se.on_rq || !p_yielding->se.on_rq)
> +		return NULL;
> +
> +	return p_yielding;
> +}
> +
>  /*
>   * sched_yield() is very simple
>   */
> -- 
> 2.43.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (9 preceding siblings ...)
  2026-01-04  2:40 ` [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
@ 2026-01-05  6:26 ` K Prateek Nayak
  2026-03-13  1:13 ` Sean Christopherson
  2026-03-26 14:41 ` Christian Borntraeger
  12 siblings, 0 replies; 19+ messages in thread
From: K Prateek Nayak @ 2026-01-05  6:26 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: Christian Borntraeger, Steven Rostedt, Vincent Guittot,
	Juri Lelli, linux-kernel, kvm, Wanpeng Li

Hello Wanpeng,

On 12/19/2025 9:23 AM, Wanpeng Li wrote:
> Part 1: Scheduler vCPU Debooster (patches 1-5)
> 
> Augment yield_to_task_fair() with bounded vruntime penalties to provide
> sustained preference beyond the buddy mechanism. When a vCPU yields to a
> target, apply a carefully tuned vruntime penalty to the yielding vCPU,
> ensuring the target maintains scheduling advantage for longer periods.

Do you still see the problem after the fixes in commits:

127b90315ca0 ("sched/proxy: Yield the donor task")
79104becf42b ("sched/fair: Forfeit vruntime on yield")

Starting 79104becf42b, we push the vruntime on yield too which should
prevent the yield loop between vCPUs in the same cgroup on the same CPU.

If you have the following cgroup hierarchy:

           root
          /    \
         /      \
        /        \
       A          B
      / \         |
     /   \        |
  vCPU0  vCPU1  vCPU0

and vCPU0(A) yields to vCPU1(A) in the same cgroup, vCPU1 should start
running after vCPU0 has pushed its vruntime enough to make it
ineligible.

If you have vCPUs across different cgroups with CPU controllers enabled,
I hope you have a very good reason to have such a setup because
otherwise, this is just too much complexity for some theoretical,
insane deployment.

> 
> The mechanism is EEVDF-aware and cgroup-hierarchy-aware:
> 
> - Locate the lowest common ancestor (LCA) in the cgroup hierarchy where
>   both the yielding and target tasks coexist. This ensures vruntime
>   adjustments occur at the correct hierarchy level, maintaining fairness
>   across cgroup boundaries.
> 
> - Update EEVDF scheduler fields (vruntime, deadline) atomically to keep
>   the scheduler state consistent. Note that vlag is intentionally not
>   modified as it will be recalculated on dequeue/enqueue cycles. The
>   penalty shifts the yielding task's virtual deadline forward, allowing
>   the target to run.
> 
> - Apply queue-size-adaptive penalties that scale from 6.0x scheduling
>   granularity for 2-task scenarios (strong preference) down to 1.0x for
>   large queues (>12 tasks), balancing preference against starvation risks.
> 
> - Implement reverse-pair debouncing: when task A yields to B, then B yields
>   to A within a short window (~600us), downscale the penalty to prevent
>   ping-pong oscillation.
> 
> - Rate-limit penalty application to 6ms intervals to prevent pathological
>   overhead when yields occur at very high frequency.

I still don't like all this complexity. How much better is it than doing
something like this:

  (Only build tested)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7377f9117501..fbb263ea7d5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9079,6 +9079,7 @@ static void yield_task_fair(struct rq *rq)
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
+	unsigned long weight;
 
 	/* !se->on_rq also covers throttled task */
 	if (!se->on_rq)
@@ -9089,6 +9090,32 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 
 	yield_task_fair(rq);
 
+	se = &rq->donor->se;
+	weight = se->load.weight;
+
+	/* Proportionally yield the hierarchy. */
+	while ((se = parent_entity(se))) {
+		unsigned long gcfs_rq_weight = group_cfs_rq(se)->load.weight;
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		WARN_ON_ONCE(se != cfs_rq->curr);
+		update_curr(cfs_rq);
+
+		/* Don't yield beyond the point of ineligibility. */
+		if (!entity_eligible(cfs_rq, se))
+			break;
+		/*
+		 * Proportionally increase the vruntime based on the slice
+		 * and the weight of the yielding subtree.
+		 */
+		se->vruntime += div_u64(calc_delta_fair(se->slice, se) * weight, gcfs_rq_weight);
+		update_deadline(cfs_rq, se);
+
+	/* Update the proportional weight of the task on the parent hierarchy. */
+		weight = (se->load.weight * weight) / gcfs_rq_weight;
+		if (!weight)
+			break;
+	}
 	return true;
 }
 
base-commit: 6ab7973f254071faf20fe5fcc502a3fe9ca14a47
---

Prepared on top of tip:sched/core. I don't like the above either and I'm
90% sure commit 79104becf42b ("sched/fair: Forfeit vruntime on yield")
will solve the problem you are seeing.

> Performance Results
> -------------------
> 
> Test environment: Intel Xeon, 16 physical cores, 16 vCPUs per VM
> 
> Dbench 16 clients per VM (filesystem metadata operations):
>   2 VMs: +14.4% throughput (lock contention reduction)
>   3 VMs:  +9.8% throughput
>   4 VMs:  +6.7% throughput
> 

And what does the cgroup hierarchy look like for these tests?

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (10 preceding siblings ...)
  2026-01-05  6:26 ` K Prateek Nayak
@ 2026-03-13  1:13 ` Sean Christopherson
  2026-03-26 14:41 ` Christian Borntraeger
  12 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2026-03-13  1:13 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	K Prateek Nayak, Christian Borntraeger, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li

On Fri, Dec 19, 2025, Wanpeng Li wrote:
> Part 2: KVM IPI-Aware Directed Yield (patches 6-9)
> 
> Enhance kvm_vcpu_on_spin() with lightweight IPI tracking to improve
> directed yield candidate selection. Track sender/receiver relationships
> when IPIs are delivered and use this information to prioritize yield
> targets.
> 
> The tracking mechanism:
> 
> - Hooks into kvm_irq_delivery_to_apic() to detect unicast fixed IPIs (the
>   common case for inter-processor synchronization). When exactly one
>   destination vCPU receives an IPI, record the sender->receiver relationship
>   with a monotonic timestamp.
> 
>   In high VM density scenarios, software-based IPI tracking through
>   interrupt delivery interception becomes particularly valuable. It
>   captures precise sender/receiver relationships that can be leveraged
>   for intelligent scheduling decisions, providing performance benefits
>   that complement or even exceed hardware-accelerated interrupt delivery
>   in overcommitted environments.
> 
> - Uses lockless READ_ONCE/WRITE_ONCE accessors for minimal overhead. The
>   per-vCPU ipi_context structure is carefully designed to avoid cache line
>   bouncing.
> 
> - Implements a short recency window (50ms default) to avoid stale IPI
>   information inflating boost priority on throughput-sensitive workloads.
>   Old IPI relationships are naturally aged out.
> 
> - Clears IPI context on EOI with two-stage precision: unconditionally clear
>   the receiver's context (it processed the interrupt), but only clear the
>   sender's pending flag if the receiver matches and the IPI is recent. This
>   prevents unrelated EOIs from prematurely clearing valid IPI state.

That all relies on lack of IPI and EOI virtualization, which seems very
counter-productive given the way hardware is headed.

My reaction to all of this is that in the long run, we'd be far better off getting
the guest to "cooperate" in the sense of communicating intent, status, etc.  As
a stop-gap for older hardware, this obviously is beneficial.  But AFAICT the IPI
tracking is going to be dead weight in the near future.

And there are many, many use cases that want PV scheduling, i.e. if people band
together, I suspect it's feasible to get Linux-as-a-guest to provide hints to the
host that can be used to make scheduling decisions.

> Performance Results
> -------------------
> 
> Test environment: Intel Xeon, 16 physical cores, 16 vCPUs per VM

What generation of CPU?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (11 preceding siblings ...)
  2026-03-13  1:13 ` Sean Christopherson
@ 2026-03-26 14:41 ` Christian Borntraeger
  12 siblings, 0 replies; 19+ messages in thread
From: Christian Borntraeger @ 2026-03-26 14:41 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: K Prateek Nayak, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li, richie

On 19.12.25 at 04:53, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> This series addresses long-standing yield_to() inefficiencies in
> virtualized environments through two complementary mechanisms: a vCPU
> debooster in the scheduler and IPI-aware directed yield in KVM.
> 
> Problem Statement
> -----------------
> 
> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> held by other vCPUs that are not currently running. The kernel's
> paravirtual spinlock support detects these situations and calls yield_to()
> to boost the lock holder, allowing it to run and release the lock.
> 
> However, the current implementation has two critical limitations:
> 
> 1. Scheduler-side limitation:
> 
>     yield_to_task_fair() relies solely on set_next_buddy() to provide
>     preference to the target vCPU. This buddy mechanism only offers
>     immediate, transient preference. Once the buddy hint expires (typically
>     after one scheduling decision), the yielding vCPU may preempt the target
>     again, especially in nested cgroup hierarchies where vruntime domains
>     differ.
> 
>     This creates a ping-pong effect: the lock holder runs briefly, gets
>     preempted before completing critical sections, and the yielding vCPU
>     spins again, triggering another futile yield_to() cycle. The overhead
>     accumulates rapidly in workloads with high lock contention.

Wanpeng,

late but not forgotten.

So Richie Buturla gave this a try on s390 with some variations but still
without cgroup support (next step).
The numbers look very promising (diag 9c is our yieldto hypercall). With
super high overcommitment the benefit shrinks again, but results are still
positive. We are probably running into other limits.

2:1 Overcommit Ratio:
diag9c calls:                       225,804,073 → 213,913,266 (-5.3%)
Dbench thrpt (per-run mean):        +1.3%
Dbench thrpt (per-run median):      +0.8%
Dbench thrpt (total across runs):   +1.3%
Dbench thrpt (avg/VM):              +1.3%

4:1:
diag9c calls:                       833,455,152 → 556,597,627 (-33.2%)
Dbench thrpt (per-run mean):        +7.2%
Dbench thrpt (per-run median):      +8.5%
Dbench thrpt (total across runs):   +7.2%
Dbench thrpt (avg/VM):              +7.2%

6:1:
diag9c calls:                       967,501,378 → 737,178,419 (-23.8%)
Dbench thrpt (per-run mean):        +5.1%
Dbench thrpt (per-run median):      +4.8%
Dbench thrpt (total across runs):   +5.1%
Dbench thrpt (avg/VM):              +5.1%

8:1:
diag9c calls:                       872,165,596 → 653,481,530 (-25.1%)
Dbench thrpt (per-run mean):        +11.5%
Dbench thrpt (per-run median):      +11.4%
Dbench thrpt (total across runs):   +11.5%
Dbench thrpt (avg/VM):              +11.5%

9:1:
diag9c calls:                       809,384,976 → 587,597,163 (-27.4%)
Dbench thrpt (per-run mean):        +4.5%
Dbench thrpt (per-run median):      +4.0%
Dbench thrpt (total across runs):   +4.5%
Dbench thrpt (avg/VM):              +4.5%

10:1:
diag9c calls:                       711,772,971 → 477,448,374 (-32.9%)
Dbench thrpt (per-run mean):        +3.6%
Dbench thrpt (per-run median):      +1.6%
Dbench thrpt (total across runs):   +3.6%
Dbench thrpt (avg/VM):              +3.6%

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2026-03-26 14:42 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-19  3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
2025-12-19  3:53 ` [PATCH v2 1/9] sched: Add vCPU debooster infrastructure Wanpeng Li
2025-12-19  3:53 ` [PATCH v2 2/9] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
2025-12-22 21:12   ` kernel test robot
2026-01-04  4:09   ` Hillf Danton
2025-12-19  3:53 ` [PATCH v2 3/9] sched/fair: Add cgroup LCA finder for hierarchical yield Wanpeng Li
2025-12-19  3:53 ` [PATCH v2 4/9] sched/fair: Add penalty calculation and application logic Wanpeng Li
2025-12-22 23:36   ` kernel test robot
2025-12-19  3:53 ` [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
2025-12-22  7:06   ` kernel test robot
2025-12-22  9:31   ` kernel test robot
2025-12-19  3:53 ` [PATCH v2 6/9] KVM: x86: Add IPI tracking infrastructure Wanpeng Li
2025-12-19  3:53 ` [PATCH v2 7/9] KVM: x86/lapic: Integrate IPI tracking with interrupt delivery Wanpeng Li
2025-12-19  3:53 ` [PATCH v2 8/9] KVM: Implement IPI-aware directed yield candidate selection Wanpeng Li
2025-12-19  3:53 ` [PATCH v2 9/9] KVM: Relaxed boost as safety net Wanpeng Li
2026-01-04  2:40 ` [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
2026-01-05  6:26 ` K Prateek Nayak
2026-03-13  1:13 ` Sean Christopherson
2026-03-26 14:41 ` Christian Borntraeger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox