[PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
@ 2025-11-10  3:32 Wanpeng Li
  2025-11-10  3:32 ` [PATCH 01/10] sched: Add vCPU debooster infrastructure Wanpeng Li
                   ` (11 more replies)
  0 siblings, 12 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-10  3:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

This series addresses long-standing yield_to() inefficiencies in
virtualized environments through two complementary mechanisms: a vCPU
debooster in the scheduler and IPI-aware directed yield in KVM.

Problem Statement
-----------------

In overcommitted virtualization scenarios, vCPUs frequently spin on locks
held by other vCPUs that are not currently running. The kernel's
paravirtual spinlock support detects these situations and calls yield_to()
to boost the lock holder, allowing it to run and release the lock.

However, the current implementation has two critical limitations:

1. Scheduler-side limitation:

   yield_to_task_fair() relies solely on set_next_buddy() to provide
   preference to the target vCPU. This buddy mechanism only offers
   immediate, transient preference. Once the buddy hint expires (typically
   after one scheduling decision), the yielding vCPU may preempt the target
   again, especially in nested cgroup hierarchies where vruntime domains
   differ.

   This creates a ping-pong effect: the lock holder runs briefly, gets
   preempted before completing critical sections, and the yielding vCPU
   spins again, triggering another futile yield_to() cycle. The overhead
   accumulates rapidly in workloads with high lock contention.

2. KVM-side limitation:

   kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
   directed yield candidate selection. However, it lacks awareness of IPI
   communication patterns. When a vCPU sends an IPI and spins waiting for
   a response (common in inter-processor synchronization), the current
   heuristics often fail to identify the IPI receiver as the yield target.

   Instead, the code may boost an unrelated vCPU based on coarse-grained
   preemption state, missing opportunities to accelerate actual IPI
   response handling. This is particularly problematic when the IPI receiver
   is runnable but not scheduled, as lock-holder-detection logic doesn't
   capture the IPI dependency relationship.

Combined, these issues cause excessive lock hold times, cache thrashing,
and degraded throughput in overcommitted environments, particularly
affecting workloads with fine-grained synchronization patterns.

Solution Overview
-----------------

The series introduces two orthogonal improvements that work synergistically:

Part 1: Scheduler vCPU Debooster (patches 1-5)

Augment yield_to_task_fair() with bounded vruntime penalties to provide
sustained preference beyond the buddy mechanism. When a vCPU yields to a
target, apply a carefully tuned vruntime penalty to the yielding vCPU,
ensuring the target maintains scheduling advantage for longer periods.

The mechanism is EEVDF-aware and cgroup-hierarchy-aware:

- Locate the lowest common ancestor (LCA) in the cgroup hierarchy where
  both the yielding and target tasks coexist. This ensures vruntime
  adjustments occur at the correct hierarchy level, maintaining fairness
  across cgroup boundaries.

- Update EEVDF scheduler fields (vruntime, deadline, vlag) atomically to
  keep the scheduler state consistent. The penalty shifts the yielding
  task's virtual deadline forward, allowing the target to run.

- Apply queue-size-adaptive penalties that scale from 6.0× scheduling
  granularity for 2-task scenarios (strong preference) down to 1.0× for
  large queues (>12 tasks), balancing preference against starvation risks.

- Implement reverse-pair debouncing: when task A yields to B, then B yields
  to A within a short window (~600us), downscale the penalty to prevent
  ping-pong oscillation.

- Rate-limit penalty application to 6ms intervals to prevent pathological
  overhead when yields occur at very high frequency.

The debooster works *with* the buddy mechanism rather than replacing it:
set_next_buddy() provides immediate preference for the next scheduling
decision, while the vruntime penalty sustains that preference over
subsequent decisions. This dual approach proves especially effective in
nested cgroup scenarios where buddy hints alone are insufficient.

Part 2: KVM IPI-Aware Directed Yield (patches 6-10)

Enhance kvm_vcpu_on_spin() with lightweight IPI tracking to improve
directed yield candidate selection. Track sender/receiver relationships
when IPIs are delivered and use this information to prioritize yield
targets.

The tracking mechanism:

- Hooks into kvm_irq_delivery_to_apic() to detect unicast fixed IPIs (the
  common case for inter-processor synchronization). When exactly one
  destination vCPU receives an IPI, record the sender→receiver relationship
  with a monotonic timestamp.

  In high VM density scenarios, software-based IPI tracking through interrupt
  delivery interception becomes particularly valuable. It captures precise
  sender/receiver relationships that can be leveraged for intelligent
  scheduling decisions, providing performance benefits that complement or
  even exceed hardware-accelerated interrupt delivery in overcommitted
  environments.

- Uses lockless READ_ONCE/WRITE_ONCE accessors for minimal overhead. The
  per-vCPU ipi_context structure is carefully designed to avoid cache line
  bouncing.

- Implements a short recency window (50ms default) to avoid stale IPI
  information inflating boost priority on throughput-sensitive workloads.
  Old IPI relationships are naturally aged out.

- Clears IPI context on EOI with two-stage precision: unconditionally clear
  the receiver's context (it processed the interrupt), but only clear the
  sender's pending flag if the receiver matches and the IPI is recent. This
  prevents unrelated EOIs from prematurely clearing valid IPI state.

The candidate selection follows a priority hierarchy:

  Priority 1: Confirmed IPI receiver
    If the spinning vCPU recently sent an IPI to another vCPU and that IPI
    is still pending (within the recency window), unconditionally boost the
    receiver. This directly addresses the "spinning on IPI response" case.

  Priority 2: Fast pending interrupt
    Leverage arch-specific kvm_arch_dy_has_pending_interrupt() for
    compatibility with existing optimizations.

  Priority 3: Preempted in kernel mode
    Fall back to traditional preemption-based logic when yield_to_kernel_mode
    is requested, ensuring compatibility with existing workloads.

A two-round fallback mechanism provides a safety net: if the first round
with strict IPI-aware selection finds no eligible candidate (e.g., due to
missed IPI context or transient runnable set changes), a second round
applies relaxed selection gated only by preemption state. This is
controlled by the enable_relaxed_boost module parameter (default on).

Implementation Details
----------------------

Both mechanisms are designed for minimal overhead and runtime control:

- All locking occurs under existing rq->lock or per-vCPU locks; no new
  lock contention is introduced.

- Penalty calculations use integer arithmetic with overflow protection.

- IPI tracking uses monotonic timestamps (ktime_get_mono_fast_ns()) for
  efficient, race-free recency checks.

Advantages over paravirtualization approaches:

- No guest OS modification required: This solution operates entirely within
  the host kernel, providing transparent optimization without guest kernel
  changes or recompilation.

- Guest OS agnostic: Works uniformly across Linux, Windows, and other guest
  operating systems, unlike PV TLB shootdown which requires guest-side
  paravirtual driver support.

- Broader applicability: Captures IPI patterns from all synchronization
  primitives (spinlocks, RCU, smp_call_function, etc.), not limited to
  specific paravirtualized operations like TLB shootdown.

- Deployment simplicity: Existing VM images benefit immediately without
  guest kernel updates, critical for production environments with diverse
  guest OS versions and configurations.

- Runtime controls allow disabling features if needed:
  * /sys/kernel/debug/sched/sched_vcpu_debooster_enabled
  * /sys/module/kvm/parameters/ipi_tracking_enabled
  * /sys/module/kvm/parameters/enable_relaxed_boost

- The infrastructure is incrementally introduced: early patches add inert
  scaffolding that can be verified for zero performance impact before
  activation.

Performance Results
-------------------

Test environment: Intel Xeon, 16 physical cores, 16 vCPUs per VM

Dbench 16 clients per VM (filesystem metadata operations):
  2 VMs: +14.4% throughput (lock contention reduction)
  3 VMs:  +9.8% throughput
  4 VMs:  +6.7% throughput

PARSEC Dedup benchmark, simlarge input (memory-intensive):
  2 VMs: +47.1% throughput (IPI-heavy synchronization)
  3 VMs: +28.1% throughput
  4 VMs:  +1.7% throughput

PARSEC VIPS benchmark, simlarge input (compute-intensive):
  2 VMs: +26.2% throughput (balanced sync and compute)
  3 VMs: +12.7% throughput
  4 VMs:  +6.0% throughput

Analysis:

- Gains are most pronounced at moderate overcommit (2-3 VMs). At this level,
  contention is significant enough to benefit from better yield behavior,
  but context switch overhead remains manageable.

- Dedup shows the strongest improvement (+47.1% at 2 VMs) due to its
  IPI-heavy synchronization patterns. The IPI-aware directed yield
  precisely targets the bottleneck.

- At 4 VMs (heavier overcommit), gains diminish as general CPU contention
  dominates. However, performance never regresses, indicating the mechanisms
  gracefully degrade.

- In certain high-density, resource overcommitted deployment scenarios, the 
  performance benefits of APICv can be constrained by scheduling and contention 
  patterns. In such cases, software-based IPI tracking serves as a complementary 
  optimization path, offering targeted scheduling hints without relying on disabling 
  APICv. The practical choice should be evaluated and balanced against workload 
  characteristics and platform configuration.

- Dbench benefits primarily from the scheduler-side debooster, as its lock
  patterns involve less IPI spinning and more direct lock holder boosting.

The performance gains stem from three factors:

1. Lock holders receive sustained CPU time to complete critical sections,
   reducing overall lock hold duration and cascading contention.

2. IPI receivers are promptly scheduled when senders spin, minimizing IPI
   response latency and reducing wasted spin cycles.

3. Better cache utilization results from reduced context switching between
   lock waiters and holders.

Patch Organization
------------------

The series is organized for incremental review and bisectability:

Patches 1-5: Scheduler vCPU debooster

  Patch 1: Add infrastructure (per-rq tracking, sysctl, debugfs entry)
           Infrastructure is inert; no functional change.

  Patch 2: Add rate-limiting and validation helpers
           Static functions with comprehensive safety checks.

  Patch 3: Add cgroup LCA finder for hierarchical yield
           Implements CONFIG_FAIR_GROUP_SCHED-aware LCA location.

  Patch 4: Add penalty calculation and application logic
           Core algorithms with queue-size adaptation and debouncing.

  Patch 5: Wire up yield deboost in yield_to_task_fair()
           Activation patch. Includes Dbench performance data.

Patches 6-10: KVM IPI-aware directed yield

  Patch 6: Fix last_boosted_vcpu index assignment bug
           Standalone bugfix for existing code.

  Patch 7: Add IPI tracking infrastructure
           Per-vCPU context, module parameters, helper functions.
           Infrastructure is inert until activated.

  Patch 8: Integrate IPI tracking with LAPIC interrupt delivery
           Hook into kvm_irq_delivery_to_apic() and EOI handling.

  Patch 9: Implement IPI-aware directed yield candidate selection
           Replace candidate selection logic with priority-based approach.
           Includes PARSEC performance data.

  Patch 10: Add relaxed boost as safety net
            Two-round fallback mechanism for robustness.

Each patch compiles and boots independently. Performance data is presented
where the relevant mechanism becomes active (patches 5 and 9).

Testing
-------

Workloads tested:

- Dbench (filesystem metadata stress)
- PARSEC benchmarks (Dedup, VIPS, Ferret, Blackscholes)
- Kernel compilation (make -j16 in each VM)

No regressions observed on any configuration. The mechanisms show neutral
to positive impact across diverse workloads.

Future Work
-----------

Potential extensions beyond this series:

- Adaptive recency window: dynamically adjust ipi_window_ns based on
  observed workload patterns.

- Extended tracking: consider multi-round IPI patterns (A→B→C→A).

- Cross-NUMA awareness: penalty scaling based on NUMA distances.

These are intentionally deferred to keep this series focused and reviewable.

Wanpeng Li (10):
  sched: Add vCPU debooster infrastructure
  sched/fair: Add rate-limiting and validation helpers
  sched/fair: Add cgroup LCA finder for hierarchical yield
  sched/fair: Add penalty calculation and application logic
  sched/fair: Wire up yield deboost in yield_to_task_fair()
  KVM: Fix last_boosted_vcpu index assignment bug
  KVM: x86: Add IPI tracking infrastructure
  KVM: x86/lapic: Integrate IPI tracking with interrupt delivery
  KVM: Implement IPI-aware directed yield candidate selection
  KVM: Relaxed boost as safety net

 arch/x86/include/asm/kvm_host.h |   8 +
 arch/x86/kvm/lapic.c            | 172 +++++++++++++++-
 arch/x86/kvm/x86.c              |   6 +
 arch/x86/kvm/x86.h              |   4 +
 include/linux/kvm_host.h        |   1 +
 kernel/sched/core.c             |   7 +-
 kernel/sched/debug.c            |   3 +
 kernel/sched/fair.c             | 336 ++++++++++++++++++++++++++++++++
 kernel/sched/sched.h            |   9 +
 virt/kvm/kvm_main.c             |  81 +++++++-
 10 files changed, 611 insertions(+), 16 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 01/10] sched: Add vCPU debooster infrastructure
  2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
@ 2025-11-10  3:32 ` Wanpeng Li
  2025-11-10  3:32 ` [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-10  3:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

From: Wanpeng Li <wanpengli@tencent.com>

Introduce foundational infrastructure for the vCPU debooster mechanism
to improve yield_to() effectiveness in virtualization workloads.

Add per-rq tracking fields for rate limiting (yield_deboost_last_time_ns)
and debouncing (yield_deboost_last_src/dst_pid, last_pair_time_ns).
Introduce global sysctl knob sysctl_sched_vcpu_debooster_enabled for
runtime control, defaulting to enabled. Add debugfs interface for
observability and initialization in sched_init().

The infrastructure is inert at this stage as no deboost logic is
implemented yet, allowing independent verification that existing
behavior remains unchanged.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/core.c  | 7 +++++--
 kernel/sched/debug.c | 3 +++
 kernel/sched/fair.c  | 5 +++++
 kernel/sched/sched.h | 9 +++++++++
 4 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f754a60de848..03380790088b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8706,9 +8706,12 @@ void __init sched_init(void)
 #endif /* CONFIG_CGROUP_SCHED */
 
 	for_each_possible_cpu(i) {
-		struct rq *rq;
+		struct rq *rq = cpu_rq(i);
+		/* init per-rq debounce tracking */
+		rq->yield_deboost_last_src_pid = -1;
+		rq->yield_deboost_last_dst_pid = -1;
+		rq->yield_deboost_last_pair_time_ns = 0;
 
-		rq = cpu_rq(i);
 		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 02e16b70a790..905f303af752 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -508,6 +508,9 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("tunable_scaling", 0644, debugfs_sched, NULL, &sched_scaling_fops);
 	debugfs_create_u32("migration_cost_ns", 0644, debugfs_sched, &sysctl_sched_migration_cost);
 	debugfs_create_u32("nr_migrate", 0644, debugfs_sched, &sysctl_sched_nr_migrate);
+	debugfs_create_u32("sched_vcpu_debooster_enabled", 0644, debugfs_sched,
+		&sysctl_sched_vcpu_debooster_enabled);
+
 
 	sched_domains_mutex_lock();
 	update_sched_domain_debugfs();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b752324270b..5b7fcc86ccff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -81,6 +81,11 @@ static unsigned int normalized_sysctl_sched_base_slice	= 700000ULL;
 
 __read_mostly unsigned int sysctl_sched_migration_cost	= 500000UL;
 
+/*
+ * vCPU debooster sysctl control
+ */
+unsigned int sysctl_sched_vcpu_debooster_enabled __read_mostly = 1;
+
 static int __init setup_sched_thermal_decay_shift(char *str)
 {
 	pr_warn("Ignoring the deprecated sched_thermal_decay_shift= option\n");
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adfb6e3409d7..e9b4be024f89 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1292,6 +1292,13 @@ struct rq {
 	unsigned int		push_busy;
 	struct cpu_stop_work	push_work;
 
+	/* vCPU debooster rate-limit */
+	u64			yield_deboost_last_time_ns;
+	/* per-rq debounce state to avoid cross-CPU races */
+	pid_t			yield_deboost_last_src_pid;
+	pid_t			yield_deboost_last_dst_pid;
+	u64			yield_deboost_last_pair_time_ns;
+
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
 	struct rq		*core;
@@ -2816,6 +2823,8 @@ extern int sysctl_resched_latency_warn_once;
 
 extern unsigned int sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_sched_vcpu_debooster_enabled;
+
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers
  2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
  2025-11-10  3:32 ` [PATCH 01/10] sched: Add vCPU debooster infrastructure Wanpeng Li
@ 2025-11-10  3:32 ` Wanpeng Li
  2025-11-12  6:40   ` K Prateek Nayak
  2025-11-10  3:32 ` [PATCH 03/10] sched/fair: Add cgroup LCA finder for hierarchical yield Wanpeng Li
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Wanpeng Li @ 2025-11-10  3:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

From: Wanpeng Li <wanpengli@tencent.com>

Implement core safety mechanisms for yield deboost operations.

Add yield_deboost_rate_limit() for high-frequency gating to prevent
excessive overhead on compute-intensive workloads. Use 6ms threshold
with lockless READ_ONCE/WRITE_ONCE to minimize cache line contention
while providing effective rate limiting.

Add yield_deboost_validate_tasks() for comprehensive validation
ensuring feature is enabled via sysctl, both tasks are valid and
distinct, both belong to fair_sched_class, entities are on the same
runqueue, and tasks are runnable.

The rate limiter prevents pathological high-frequency cases while
validation ensures only appropriate task pairs proceed. Both functions
are static and will be integrated in subsequent patches.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b7fcc86ccff..a7dc21c2dbdb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8990,6 +8990,74 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
 	}
 }
 
+/*
+ * High-frequency yield gating to reduce overhead on compute-intensive workloads.
+ * Returns true if the yield should be skipped due to frequency limits.
+ *
+ * Optimized: single threshold with READ_ONCE/WRITE_ONCE, refresh timestamp on every call.
+ */
+static bool yield_deboost_rate_limit(struct rq *rq, u64 now_ns)
+{
+	u64 last = READ_ONCE(rq->yield_deboost_last_time_ns);
+	bool limited = false;
+
+	if (last) {
+		u64 delta = now_ns - last;
+		limited = (delta <= 6000ULL * NSEC_PER_USEC);
+	}
+
+	WRITE_ONCE(rq->yield_deboost_last_time_ns, now_ns);
+	return limited;
+}
+
+/*
+ * Validate tasks and basic parameters for yield deboost operation.
+ * Performs comprehensive safety checks including feature enablement,
+ * NULL pointer validation, task state verification, and same-rq requirement.
+ * Returns false with appropriate debug logging if any validation fails,
+ * ensuring only safe and meaningful yield operations proceed.
+ */
+static bool __maybe_unused yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target,
+					  struct task_struct **p_yielding_out,
+					  struct sched_entity **se_y_out,
+					  struct sched_entity **se_t_out)
+{
+	struct task_struct *p_yielding;
+	struct sched_entity *se_y, *se_t;
+	u64 now_ns;
+
+	if (!sysctl_sched_vcpu_debooster_enabled)
+		return false;
+
+	if (!rq || !p_target)
+		return false;
+
+	now_ns = rq->clock;
+
+	if (yield_deboost_rate_limit(rq, now_ns))
+		return false;
+
+	p_yielding = rq->curr;
+	if (!p_yielding || p_yielding == p_target ||
+	    p_target->sched_class != &fair_sched_class ||
+	    p_yielding->sched_class != &fair_sched_class)
+		return false;
+
+	se_y = &p_yielding->se;
+	se_t = &p_target->se;
+
+	if (!se_t || !se_y || !se_t->on_rq || !se_y->on_rq)
+		return false;
+
+	if (task_rq(p_yielding) != rq || task_rq(p_target) != rq)
+		return false;
+
+	*p_yielding_out = p_yielding;
+	*se_y_out = se_y;
+	*se_t_out = se_t;
+	return true;
+}
+
 /*
  * sched_yield() is very simple
  */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers
  2025-11-10  3:32 ` [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
@ 2025-11-12  6:40   ` K Prateek Nayak
  2025-11-12  6:44     ` K Prateek Nayak
  2025-11-13 12:00     ` Wanpeng Li
  0 siblings, 2 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-12  6:40 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

Hello Wanpeng,

On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> +/*
> + * High-frequency yield gating to reduce overhead on compute-intensive workloads.
> + * Returns true if the yield should be skipped due to frequency limits.
> + *
> + * Optimized: single threshold with READ_ONCE/WRITE_ONCE, refresh timestamp on every call.
> + */
> +static bool yield_deboost_rate_limit(struct rq *rq, u64 now_ns)
> +{
> +	u64 last = READ_ONCE(rq->yield_deboost_last_time_ns);
> +	bool limited = false;
> +
> +	if (last) {
> +		u64 delta = now_ns - last;
> +		limited = (delta <= 6000ULL * NSEC_PER_USEC);
> +	}
> +
> +	WRITE_ONCE(rq->yield_deboost_last_time_ns, now_ns);

We only look at local rq so READ_ONCE()/WRITE_ONCE() seems
unnecessary.

> +	return limited;
> +}
> +
> +/*
> + * Validate tasks and basic parameters for yield deboost operation.
> + * Performs comprehensive safety checks including feature enablement,
> + * NULL pointer validation, task state verification, and same-rq requirement.
> + * Returns false with appropriate debug logging if any validation fails,
> + * ensuring only safe and meaningful yield operations proceed.
> + */
> +static bool __maybe_unused yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target,
> +					  struct task_struct **p_yielding_out,
> +					  struct sched_entity **se_y_out,
> +					  struct sched_entity **se_t_out)
> +{
> +	struct task_struct *p_yielding;
> +	struct sched_entity *se_y, *se_t;
> +	u64 now_ns;
> +
> +	if (!sysctl_sched_vcpu_debooster_enabled)
> +		return false;
> +
> +	if (!rq || !p_target)
> +		return false;
> +
> +	now_ns = rq->clock;

Brief look at Patch 5 suggests we are under the rq_lock so might
as well use the rq_clock(rq) helper. Also, you have to do a
update_rq_clock() since it isn't done until yield_task_fair().

> +
> +	if (yield_deboost_rate_limit(rq, now_ns))
> +		return false;
> +
> +	p_yielding = rq->curr;
> +	if (!p_yielding || p_yielding == p_target ||
> +	    p_target->sched_class != &fair_sched_class ||
> +	    p_yielding->sched_class != &fair_sched_class)
> +		return false;

yield_to() in syscall.c has already checked for the sched
class matching under double_rq_lock. That cannot change by the
time we are here.

> +
> +	se_y = &p_yielding->se;
> +	se_t = &p_target->se;
> +
> +	if (!se_t || !se_y || !se_t->on_rq || !se_y->on_rq)
> +		return false;
> +
> +	if (task_rq(p_yielding) != rq || task_rq(p_target) != rq)

yield_to() has already checked for this under double_rq_lock()
so this too should be unnecessary.

> +		return false;
> +
> +	*p_yielding_out = p_yielding;
> +	*se_y_out = se_y;
> +	*se_t_out = se_t;

Why do we need these pointers? Can't the caller simply do:

    if (!yield_deboost_validate_tasks(rq, target))
        return;

    p_yielding = rq->donor;
    se_y_out = &p_yielding->se;
    se_t = &target->se;

That reminds me - now that we have proxy execution, you need
to re-evaluate the usage of rq->curr (running context) vs
rq->donor (vruntime context) when looking at all this.

> +	return true;
> +}
> +
>  /*
>   * sched_yield() is very simple
>   */

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers
  2025-11-12  6:40   ` K Prateek Nayak
@ 2025-11-12  6:44     ` K Prateek Nayak
  2025-11-13 13:36       ` Wanpeng Li
  2025-11-13 12:00     ` Wanpeng Li
  1 sibling, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-12  6:44 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

On 11/12/2025 12:10 PM, K Prateek Nayak wrote:
>> +	if (task_rq(p_yielding) != rq || task_rq(p_target) != rq)
> 
> yield_to() has already checked for this under double_rq_lock()
> so this too should be unnecessary.

nvm! We only check if the task_rq(p_target) is stable under the
rq_lock or not. Just checking "task_rq(p_target) != rq" should
be sufficient here.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers
  2025-11-12  6:44     ` K Prateek Nayak
@ 2025-11-13 13:36       ` Wanpeng Li
  0 siblings, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-13 13:36 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hi Prateek，

On Wed, 12 Nov 2025 at 14:44, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> On 11/12/2025 12:10 PM, K Prateek Nayak wrote:
> >> +    if (task_rq(p_yielding) != rq || task_rq(p_target) != rq)
> >
> > yield_to() has already checked for this under double_rq_lock()
> > so this too should be unnecessary.
>
> nvm! We only check if the task_rq(p_target) is stable under the
> rq_lock or not. Just checking "task_rq(p_target) != rq" should
> be sufficient here.

You're right! Since yield_to() passes rq = this_rq() , the yielding
task is guaranteed on rq . But p_target may be on a different CPU
(yield_to supports cross-CPU). Our deboost only works for same-rq
tasks, so checking only task_rq(p_target) != rq is sufficient. I'll
remove the redundant task_rq(p_yielding) != rq check. Thanks!

Regards,
Wanpeng

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers
  2025-11-12  6:40   ` K Prateek Nayak
  2025-11-12  6:44     ` K Prateek Nayak
@ 2025-11-13 12:00     ` Wanpeng Li
  1 sibling, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-13 12:00 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hi Prateek，

On Wed, 12 Nov 2025 at 14:40, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> > +/*
> > + * High-frequency yield gating to reduce overhead on compute-intensive workloads.
> > + * Returns true if the yield should be skipped due to frequency limits.
> > + *
> > + * Optimized: single threshold with READ_ONCE/WRITE_ONCE, refresh timestamp on every call.
> > + */
> > +static bool yield_deboost_rate_limit(struct rq *rq, u64 now_ns)
> > +{
> > +     u64 last = READ_ONCE(rq->yield_deboost_last_time_ns);
> > +     bool limited = false;
> > +
> > +     if (last) {
> > +             u64 delta = now_ns - last;
> > +             limited = (delta <= 6000ULL * NSEC_PER_USEC);
> > +     }
> > +
> > +     WRITE_ONCE(rq->yield_deboost_last_time_ns, now_ns);
>
> We only look at local rq so READ_ONCE()/WRITE_ONCE() seems
> unnecessary.

You're right. Since we're under rq->lock and only accessing the local
rq's fields, READ_ONCE()/WRITE_ONCE() provide no benefit here. Will
simplify to direct access.

>
> > +     return limited;
> > +}
> > +
> > +/*
> > + * Validate tasks and basic parameters for yield deboost operation.
> > + * Performs comprehensive safety checks including feature enablement,
> > + * NULL pointer validation, task state verification, and same-rq requirement.
> > + * Returns false with appropriate debug logging if any validation fails,
> > + * ensuring only safe and meaningful yield operations proceed.
> > + */
> > +static bool __maybe_unused yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target,
> > +                                       struct task_struct **p_yielding_out,
> > +                                       struct sched_entity **se_y_out,
> > +                                       struct sched_entity **se_t_out)
> > +{
> > +     struct task_struct *p_yielding;
> > +     struct sched_entity *se_y, *se_t;
> > +     u64 now_ns;
> > +
> > +     if (!sysctl_sched_vcpu_debooster_enabled)
> > +             return false;
> > +
> > +     if (!rq || !p_target)
> > +             return false;
> > +
> > +     now_ns = rq->clock;
>
> Brief look at Patch 5 suggests we are under the rq_lock so might
> as well use the rq_clock(rq) helper. Also, you have to do a
> update_rq_clock() since it isn't done until yield_task_fair().

Good catch. Since yield_to() holds rq_lock but doesn't call
update_rq_clock() before invoking yield_to_task(), I need to call
update_rq_clock(rq) at the start of yield_to_deboost() and use
rq_clock(rq) instead of direct rq->clock access. This ensures the
clock is current before rate limiting checks.

>
> > +
> > +     if (yield_deboost_rate_limit(rq, now_ns))
> > +             return false;
> > +
> > +     p_yielding = rq->curr;
> > +     if (!p_yielding || p_yielding == p_target ||
> > +         p_target->sched_class != &fair_sched_class ||
> > +         p_yielding->sched_class != &fair_sched_class)
> > +             return false;
>
> yield_to() in syscall.c has already checked for the sched
> class matching under double_rq_lock. That cannot change by the
> time we are here.

Correct. The sched_class checks are redundant since yield_to() already
validates curr->sched_class == p->sched_class under double_rq_lock(),
and sched_class cannot change while holding the lock. Will remove.

>
> > +
> > +     se_y = &p_yielding->se;
> > +     se_t = &p_target->se;
> > +
> > +     if (!se_t || !se_y || !se_t->on_rq || !se_y->on_rq)
> > +             return false;
> > +
> > +     if (task_rq(p_yielding) != rq || task_rq(p_target) != rq)
>
> yield_to() has already checked for this under double_rq_lock()
> so this too should be unnecessary.

Right. yield_to() already ensures both tasks are on their expected run
queues under double_rq_lock(), so the task_rq(p_yielding) != rq ||
task_rq(p_target) != rq check is redundant. Will remove.

>
> > +             return false;
> > +
> > +     *p_yielding_out = p_yielding;
> > +     *se_y_out = se_y;
> > +     *se_t_out = se_t;
>
> Why do we need these pointers? Can't the caller simply do:
>
>     if (!yield_deboost_validate_tasks(rq, target))
>         return;
>
>     p_yielding = rq->donor;
>     se_y_out = &p_yielding->se;
>     se_t = &target->se;

You're right, the output parameters are unnecessary. The caller can
derive them directly:
   p_yielding = rq->donor (accounting for proxy exec)
   se_y = &p_yielding->se
   se_t = &target->se
I'll simplify yield_deboost_validate_tasks() to just return bool and
let the caller obtain these pointers.

>
> That reminds me - now that we have proxy execution, you need
> to re-evaluate the usage of rq->curr (running context) vs
> rq->donor (vruntime context) when looking at all this.

Good catch. Since we're manipulating vruntime/deadline/vlag, I should
use rq->donor (scheduling context) instead of rq->curr (execution
context). In the yield_to() path, curr should equal donor (the
yielding task is running), but using donor makes the vruntime
semantics clearer and consistent with
update_curr_fair()/check_preempt_wakeup_fair().

Regards,
Wanpeng

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 03/10] sched/fair: Add cgroup LCA finder for hierarchical yield
  2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
  2025-11-10  3:32 ` [PATCH 01/10] sched: Add vCPU debooster infrastructure Wanpeng Li
  2025-11-10  3:32 ` [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
@ 2025-11-10  3:32 ` Wanpeng Li
  2025-11-12  6:50   ` K Prateek Nayak
  2025-11-10  3:32 ` [PATCH 04/10] sched/fair: Add penalty calculation and application logic Wanpeng Li
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Wanpeng Li @ 2025-11-10  3:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

From: Wanpeng Li <wanpengli@tencent.com>

Implement yield_deboost_find_lca() to locate the lowest common ancestor
(LCA) in the cgroup hierarchy for EEVDF-aware yield operations.

The LCA represents the appropriate hierarchy level where vruntime
adjustments should be applied to ensure fairness is maintained across
cgroup boundaries. This is critical for virtualization workloads where
vCPUs may be organized in nested cgroups.

For CONFIG_FAIR_GROUP_SCHED, walk up both entity hierarchies by
aligning depths, then ascend together until a common cfs_rq is found.
For flat hierarchy, verify both entities share the same cfs_rq.
Validate that meaningful contention exists (nr_queued > 1) and ensure
the yielding entity has non-zero slice for safe penalty calculation.

The function operates under rq->lock protection. This static helper
will be integrated in subsequent patches.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c | 60 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a7dc21c2dbdb..740c002b8f1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9058,6 +9058,66 @@ static bool __maybe_unused yield_deboost_validate_tasks(struct rq *rq, struct ta
 	return true;
 }
 
+/*
+ * Find the lowest common ancestor (LCA) in the cgroup hierarchy for EEVDF.
+ * We walk up both entity hierarchies under rq->lock protection.
+ * Task migration requires task_rq_lock, ensuring parent chains remain stable.
+ * We locate the first common cfs_rq where both entities coexist, representing
+ * the appropriate level for vruntime adjustments and EEVDF field updates
+ * (deadline, vlag) to maintain scheduler consistency.
+ */
+static bool __maybe_unused yield_deboost_find_lca(struct sched_entity *se_y, struct sched_entity *se_t,
+				    struct sched_entity **se_y_lca_out,
+				    struct sched_entity **se_t_lca_out,
+				    struct cfs_rq **cfs_rq_common_out)
+{
+	struct sched_entity *se_y_lca, *se_t_lca;
+	struct cfs_rq *cfs_rq_common;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	se_t_lca = se_t;
+	se_y_lca = se_y;
+
+	while (se_t_lca && se_y_lca && se_t_lca->depth != se_y_lca->depth) {
+		if (se_t_lca->depth > se_y_lca->depth)
+			se_t_lca = se_t_lca->parent;
+		else
+			se_y_lca = se_y_lca->parent;
+	}
+
+	while (se_t_lca && se_y_lca) {
+		if (cfs_rq_of(se_t_lca) == cfs_rq_of(se_y_lca)) {
+			cfs_rq_common = cfs_rq_of(se_t_lca);
+			goto found_lca;
+		}
+		se_t_lca = se_t_lca->parent;
+		se_y_lca = se_y_lca->parent;
+	}
+	return false;
+#else
+	if (cfs_rq_of(se_y) != cfs_rq_of(se_t))
+		return false;
+	cfs_rq_common = cfs_rq_of(se_y);
+	se_y_lca = se_y;
+	se_t_lca = se_t;
+#endif
+
+found_lca:
+	if (!se_y_lca || !se_t_lca)
+		return false;
+
+	if (cfs_rq_common->nr_queued <= 1)
+		return false;
+
+	if (!se_y_lca->slice)
+		return false;
+
+	*se_y_lca_out = se_y_lca;
+	*se_t_lca_out = se_t_lca;
+	*cfs_rq_common_out = cfs_rq_common;
+	return true;
+}
+
 /*
  * sched_yield() is very simple
  */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/10] sched/fair: Add cgroup LCA finder for hierarchical yield
  2025-11-10  3:32 ` [PATCH 03/10] sched/fair: Add cgroup LCA finder for hierarchical yield Wanpeng Li
@ 2025-11-12  6:50   ` K Prateek Nayak
  2025-11-13  8:59     ` Wanpeng Li
  0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-12  6:50 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

Hello Wanpeng,

On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> +/*
> + * Find the lowest common ancestor (LCA) in the cgroup hierarchy for EEVDF.
> + * We walk up both entity hierarchies under rq->lock protection.
> + * Task migration requires task_rq_lock, ensuring parent chains remain stable.
> + * We locate the first common cfs_rq where both entities coexist, representing
> + * the appropriate level for vruntime adjustments and EEVDF field updates
> + * (deadline, vlag) to maintain scheduler consistency.
> + */
> +static bool __maybe_unused yield_deboost_find_lca(struct sched_entity *se_y, struct sched_entity *se_t,
> +				    struct sched_entity **se_y_lca_out,
> +				    struct sched_entity **se_t_lca_out,
> +				    struct cfs_rq **cfs_rq_common_out)
> +{
> +	struct sched_entity *se_y_lca, *se_t_lca;
> +	struct cfs_rq *cfs_rq_common;
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +	se_t_lca = se_t;
> +	se_y_lca = se_y;
> +
> +	while (se_t_lca && se_y_lca && se_t_lca->depth != se_y_lca->depth) {
> +		if (se_t_lca->depth > se_y_lca->depth)
> +			se_t_lca = se_t_lca->parent;
> +		else
> +			se_y_lca = se_y_lca->parent;
> +	}
> +
> +	while (se_t_lca && se_y_lca) {
> +		if (cfs_rq_of(se_t_lca) == cfs_rq_of(se_y_lca)) {
> +			cfs_rq_common = cfs_rq_of(se_t_lca);
> +			goto found_lca;
> +		}
> +		se_t_lca = se_t_lca->parent;
> +		se_y_lca = se_y_lca->parent;
> +	}
> +	return false;
> +#else
> +	if (cfs_rq_of(se_y) != cfs_rq_of(se_t))
> +		return false;
> +	cfs_rq_common = cfs_rq_of(se_y);
> +	se_y_lca = se_y;
> +	se_t_lca = se_t;
> +#endif
> +
> +found_lca:
> +	if (!se_y_lca || !se_t_lca)
> +		return false;

Can that even happen? They should meet at the root cfs_rq.
Also all of this seems to be just find_matching_se() from
fair.c. Can't we just reuse that?

> +
> +	if (cfs_rq_common->nr_queued <= 1)
> +		return false;
> +
> +	if (!se_y_lca->slice)
> +		return false;

Is that even possible?

> +
> +	*se_y_lca_out = se_y_lca;
> +	*se_t_lca_out = se_t_lca;
> +	*cfs_rq_common_out = cfs_rq_common;

Again, find_matching_se() does pretty much similar thing
and you can just use cfs_rq_of(se) to get the common cfs_rq.

> +	return true;
> +}
> +
>  /*
>   * sched_yield() is very simple
>   */

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/10] sched/fair: Add cgroup LCA finder for hierarchical yield
  2025-11-12  6:50   ` K Prateek Nayak
@ 2025-11-13  8:59     ` Wanpeng Li
  0 siblings, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-13  8:59 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hi Prateek，

On Wed, 12 Nov 2025 at 14:50, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> > +/*
> > + * Find the lowest common ancestor (LCA) in the cgroup hierarchy for EEVDF.
> > + * We walk up both entity hierarchies under rq->lock protection.
> > + * Task migration requires task_rq_lock, ensuring parent chains remain stable.
> > + * We locate the first common cfs_rq where both entities coexist, representing
> > + * the appropriate level for vruntime adjustments and EEVDF field updates
> > + * (deadline, vlag) to maintain scheduler consistency.
> > + */
> > +static bool __maybe_unused yield_deboost_find_lca(struct sched_entity *se_y, struct sched_entity *se_t,
> > +                                 struct sched_entity **se_y_lca_out,
> > +                                 struct sched_entity **se_t_lca_out,
> > +                                 struct cfs_rq **cfs_rq_common_out)
> > +{
> > +     struct sched_entity *se_y_lca, *se_t_lca;
> > +     struct cfs_rq *cfs_rq_common;
> > +
> > +#ifdef CONFIG_FAIR_GROUP_SCHED
> > +     se_t_lca = se_t;
> > +     se_y_lca = se_y;
> > +
> > +     while (se_t_lca && se_y_lca && se_t_lca->depth != se_y_lca->depth) {
> > +             if (se_t_lca->depth > se_y_lca->depth)
> > +                     se_t_lca = se_t_lca->parent;
> > +             else
> > +                     se_y_lca = se_y_lca->parent;
> > +     }
> > +
> > +     while (se_t_lca && se_y_lca) {
> > +             if (cfs_rq_of(se_t_lca) == cfs_rq_of(se_y_lca)) {
> > +                     cfs_rq_common = cfs_rq_of(se_t_lca);
> > +                     goto found_lca;
> > +             }
> > +             se_t_lca = se_t_lca->parent;
> > +             se_y_lca = se_y_lca->parent;
> > +     }
> > +     return false;
> > +#else
> > +     if (cfs_rq_of(se_y) != cfs_rq_of(se_t))
> > +             return false;
> > +     cfs_rq_common = cfs_rq_of(se_y);
> > +     se_y_lca = se_y;
> > +     se_t_lca = se_t;
> > +#endif
> > +
> > +found_lca:
> > +     if (!se_y_lca || !se_t_lca)
> > +             return false;
>
> Can that even happen? They should meet at the root cfs_rq.

You're right. Tasks on the same rq will always meet at root cfs_rq at
worst, so the !se_y_lca || !se_t_lca check is indeed redundant.

> Also all of this seems to be just find_matching_se() from
> fair.c. Can't we just reuse that?

Yes, it does exactly what we need. The existing code duplicates its
depth-alignment and parent-walking logic. I'll replace our custom
LCA-finding with a call to find_matching_se(&se_y_lca, &se_t_lca) ,
then use cfs_rq_of(se_y_lca) to get the common cfs_rq.

>
> > +
> > +     if (cfs_rq_common->nr_queued <= 1)
> > +             return false;
> > +
> > +     if (!se_y_lca->slice)
> > +             return false;
>
> Is that even possible?

No, it's not possible. The check was defensive but unnecessary. As you
noted in question above, entities on the same rq must meet at root
cfs_rq at the latest, and the while loop condition se_t_lca &&
se_y_lca already ensures both are non-NULL before the goto found_lca .
Will remove this check.

>
> > +
> > +     *se_y_lca_out = se_y_lca;
> > +     *se_t_lca_out = se_t_lca;
> > +     *cfs_rq_common_out = cfs_rq_common;
>
> Again, find_matching_se() does pretty much similar thing
> and you can just use cfs_rq_of(se) to get the common cfs_rq.

Agreed. :)

Regards,
Wanpeng

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 04/10] sched/fair: Add penalty calculation and application logic
  2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (2 preceding siblings ...)
  2025-11-10  3:32 ` [PATCH 03/10] sched/fair: Add cgroup LCA finder for hierarchical yield Wanpeng Li
@ 2025-11-10  3:32 ` Wanpeng Li
  2025-11-12  7:25   ` K Prateek Nayak
  2025-11-10  3:32 ` [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Wanpeng Li @ 2025-11-10  3:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

From: Wanpeng Li <wanpengli@tencent.com>

Implement core penalty calculation and application mechanisms for
yield deboost operations.

Add yield_deboost_apply_debounce() for reverse-pair debouncing to
prevent ping-pong behavior. When A→B then B→A occurs within ~600us,
downscale the penalty.

Add yield_deboost_calculate_penalty() to calculate vruntime penalty
based on the fairness gap (vruntime delta between yielding and target
tasks), scheduling granularity with safety floor for abnormal values,
and queue-size-based caps (2 tasks: 6.0×gran, 3: 4.0×, 4-6: 2.5×,
7-8: 2.0×, 9-12: 1.5×, >12: 1.0×). Apply special handling for zero
gap with refined multipliers and 10% boost weighting on positive gaps.

Add yield_deboost_apply_penalty() to apply the penalty with overflow
protection and update EEVDF fields (deadline, vlag) and min_vruntime.

The penalty is tuned to provide meaningful preference while avoiding
starvation, scales with queue depth, and prevents oscillation through
debouncing. These static functions will be integrated in the next
patch.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c | 153 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 740c002b8f1c..4bad324f3662 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9118,6 +9118,159 @@ static bool __maybe_unused yield_deboost_find_lca(struct sched_entity *se_y, str
 	return true;
 }
 
+/*
+ * Apply debounce for reverse pair within ~600us to reduce ping-pong.
+ * Downscales penalty to max(need, gran) when the previous pair was target->source,
+ * and updates per-rq debounce tracking fields to avoid cross-CPU races.
+ */
+static u64 yield_deboost_apply_debounce(struct rq *rq, struct sched_entity *se_t,
+					u64 penalty, u64 need, u64 gran)
+{
+	u64 now_ns = rq->clock;
+	struct task_struct *p_yielding = rq->curr;
+	struct task_struct *p_target = task_of(se_t);
+
+	if (p_yielding && p_target) {
+		pid_t src_pid = p_yielding->pid;
+		pid_t dst_pid = p_target->pid;
+		pid_t last_src = rq->yield_deboost_last_src_pid;
+		pid_t last_dst = rq->yield_deboost_last_dst_pid;
+		u64  last_ns  = rq->yield_deboost_last_pair_time_ns;
+
+		if (last_src == dst_pid && last_dst == src_pid &&
+		    (now_ns - last_ns) <= (600ULL * NSEC_PER_USEC)) {
+			u64 alt = need;
+			if (alt < gran)
+				alt = gran;
+			if (penalty > alt)
+				penalty = alt;
+		}
+
+		/* Update per-rq tracking */
+		rq->yield_deboost_last_src_pid = src_pid;
+		rq->yield_deboost_last_dst_pid = dst_pid;
+		rq->yield_deboost_last_pair_time_ns = now_ns;
+	}
+
+	return penalty;
+}
+
+/*
+ * Calculate penalty with debounce logic for EEVDF yield deboost.
+ * Computes vruntime penalty based on fairness gap (need) plus granularity,
+ * applies queue-size-based caps to prevent excessive penalties in small queues,
+ * and implements reverse-pair debounce (~300us) to reduce ping-pong effects.
+ * Returns 0 if no penalty needed, otherwise returns clamped penalty value.
+ */
+static u64 __maybe_unused yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
+				    struct sched_entity *se_t_lca, struct sched_entity *se_t,
+				    int nr_queued)
+{
+	u64 gran, need, penalty, maxp;
+	u64 gran_floor;
+	u64 weighted_need, base;
+
+	gran = calc_delta_fair(sysctl_sched_base_slice, se_y_lca);
+	/* Low-bound safeguard for gran when slice is abnormally small */
+	gran_floor = calc_delta_fair(sysctl_sched_base_slice >> 1, se_y_lca);
+	if (gran < gran_floor)
+		gran = gran_floor;
+
+	need = 0;
+	if (se_t_lca->vruntime > se_y_lca->vruntime)
+		need = se_t_lca->vruntime - se_y_lca->vruntime;
+
+	/* Apply 10% boost to need when positive (weighted_need = need * 1.10) */
+	penalty = gran;
+	if (need) {
+		/* weighted_need = need + 10% */
+		weighted_need = need + need / 10;
+		/* clamp to avoid overflow when adding to gran (still capped later) */
+		if (weighted_need > U64_MAX - penalty)
+			weighted_need = U64_MAX - penalty;
+		penalty += weighted_need;
+	}
+
+	/* Apply debounce via helper to avoid ping-pong */
+	penalty = yield_deboost_apply_debounce(rq, se_t, penalty, need, gran);
+
+	/* Upper bound (cap): slightly more aggressive for mid-size queues */
+	if (nr_queued == 2)
+		maxp = gran * 6;		/* Strongest push for 2-task ping-pong */
+	else if (nr_queued == 3)
+		maxp = gran * 4;		/* 4.0 * gran */
+	else if (nr_queued <= 6)
+		maxp = (gran * 5) / 2;		/* 2.5 * gran */
+	else if (nr_queued <= 8)
+		maxp = gran * 2;		/* 2.0 * gran */
+	else if (nr_queued <= 12)
+		maxp = (gran * 3) / 2;		/* 1.5 * gran */
+	else
+		maxp = gran;			/* 1.0 * gran */
+
+	if (penalty < gran)
+		penalty = gran;
+	if (penalty > maxp)
+		penalty = maxp;
+
+	/* If no need, apply refined baseline push (low risk + mid risk combined). */
+	if (need == 0) {
+		/*
+		 * Baseline multiplier for need==0:
+		 *   2        -> 1.00 * gran
+		 *   3        -> 0.9375 * gran
+		 *   4–6      -> 0.625 * gran
+		 *   7–8      -> 0.50  * gran
+		 *   9–12     -> 0.375 * gran
+		 *   >12      -> 0.25  * gran
+		 */
+		base = gran;
+		if (nr_queued == 3)
+			base = (gran * 15) / 16;	/* 0.9375 */
+		else if (nr_queued >= 4 && nr_queued <= 6)
+			base = (gran * 5) / 8;		/* 0.625 */
+		else if (nr_queued >= 7 && nr_queued <= 8)
+			base = gran / 2;		/* 0.5 */
+		else if (nr_queued >= 9 && nr_queued <= 12)
+			base = (gran * 3) / 8;		/* 0.375 */
+		else if (nr_queued > 12)
+			base = gran / 4;		/* 0.25 */
+
+		if (penalty < base)
+			penalty = base;
+	}
+
+	return penalty;
+}
+
+/*
+ * Apply penalty and update EEVDF fields for scheduler consistency.
+ * Safely applies vruntime penalty with overflow protection, then updates
+ * EEVDF-specific fields (deadline, vlag) and cfs_rq min_vruntime to maintain
+ * scheduler state consistency. Returns true on successful application,
+ * false if penalty cannot be safely applied.
+ */
+static void __maybe_unused yield_deboost_apply_penalty(struct rq *rq, struct sched_entity *se_y_lca,
+				 struct cfs_rq *cfs_rq_common, u64 penalty)
+{
+	u64 new_vruntime;
+
+	/* Overflow protection */
+	if (se_y_lca->vruntime > (U64_MAX - penalty))
+		return;
+
+	new_vruntime = se_y_lca->vruntime + penalty;
+
+	/* Validity check */
+	if (new_vruntime <= se_y_lca->vruntime)
+		return;
+
+	se_y_lca->vruntime = new_vruntime;
+	se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
+	se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;
+	update_min_vruntime(cfs_rq_common);
+}
+
 /*
  * sched_yield() is very simple
  */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/10] sched/fair: Add penalty calculation and application logic
  2025-11-10  3:32 ` [PATCH 04/10] sched/fair: Add penalty calculation and application logic Wanpeng Li
@ 2025-11-12  7:25   ` K Prateek Nayak
  2025-11-13 13:25     ` Wanpeng Li
  0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-12  7:25 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

Hello Wanpeng,

On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> +/*
> + * Calculate penalty with debounce logic for EEVDF yield deboost.
> + * Computes vruntime penalty based on fairness gap (need) plus granularity,
> + * applies queue-size-based caps to prevent excessive penalties in small queues,
> + * and implements reverse-pair debounce (~300us) to reduce ping-pong effects.
> + * Returns 0 if no penalty needed, otherwise returns clamped penalty value.
> + */
> +static u64 __maybe_unused yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
> +				    struct sched_entity *se_t_lca, struct sched_entity *se_t,
> +				    int nr_queued)
> +{
> +	u64 gran, need, penalty, maxp;
> +	u64 gran_floor;
> +	u64 weighted_need, base;
> +
> +	gran = calc_delta_fair(sysctl_sched_base_slice, se_y_lca);
> +	/* Low-bound safeguard for gran when slice is abnormally small */
> +	gran_floor = calc_delta_fair(sysctl_sched_base_slice >> 1, se_y_lca);
> +	if (gran < gran_floor)

Is this even possible?

> +		gran = gran_floor;
> +
> +	need = 0;
> +	if (se_t_lca->vruntime > se_y_lca->vruntime)
> +		need = se_t_lca->vruntime - se_y_lca->vruntime;

So I'm assuming you want the yielding task's vruntime to
cross the target's vruntime simply because one task somewhere
down the hierarchy said so.

> +
> +	/* Apply 10% boost to need when positive (weighted_need = need * 1.10) */
> +	penalty = gran;

So at the very least I see it getting weighted(base_slice / 2) penalty
... 

> +	if (need) {
> +		/* weighted_need = need + 10% */
> +		weighted_need = need + need / 10;
> +		/* clamp to avoid overflow when adding to gran (still capped later) */
> +		if (weighted_need > U64_MAX - penalty)
> +			weighted_need = U64_MAX - penalty;
> +		penalty += weighted_need;

... if not more ...

> +	}
> +
> +	/* Apply debounce via helper to avoid ping-pong */
> +	penalty = yield_deboost_apply_debounce(rq, se_t, penalty, need, gran);

... since without debounce, penalty remains same.

> +
> +	/* Upper bound (cap): slightly more aggressive for mid-size queues */
> +	if (nr_queued == 2)
> +		maxp = gran * 6;		/* Strongest push for 2-task ping-pong */
> +	else if (nr_queued == 3)
> +		maxp = gran * 4;		/* 4.0 * gran */
> +	else if (nr_queued <= 6)
> +		maxp = (gran * 5) / 2;		/* 2.5 * gran */
> +	else if (nr_queued <= 8)
> +		maxp = gran * 2;		/* 2.0 * gran */
> +	else if (nr_queued <= 12)
> +		maxp = (gran * 3) / 2;		/* 1.5 * gran */
> +	else
> +		maxp = gran;			/* 1.0 * gran */

And all the nr_queued calculations are based on the entities queued
and not the "h_nr_queued" so we can have a boat load of tasks to
run above but since one task decided to call yield_to() let us make
them all starve a little?

> +
> +	if (penalty < gran)
> +		penalty = gran;
> +	if (penalty > maxp)
> +		penalty = maxp;
> +
> +	/* If no need, apply refined baseline push (low risk + mid risk combined). */
> +	if (need == 0) {
> +		/*
> +		 * Baseline multiplier for need==0:
> +		 *   2        -> 1.00 * gran
> +		 *   3        -> 0.9375 * gran
> +		 *   4–6      -> 0.625 * gran
> +		 *   7–8      -> 0.50  * gran
> +		 *   9–12     -> 0.375 * gran
> +		 *   >12      -> 0.25  * gran
> +		 */
> +		base = gran;
> +		if (nr_queued == 3)
> +			base = (gran * 15) / 16;	/* 0.9375 */
> +		else if (nr_queued >= 4 && nr_queued <= 6)
> +			base = (gran * 5) / 8;		/* 0.625 */
> +		else if (nr_queued >= 7 && nr_queued <= 8)
> +			base = gran / 2;		/* 0.5 */
> +		else if (nr_queued >= 9 && nr_queued <= 12)
> +			base = (gran * 3) / 8;		/* 0.375 */
> +		else if (nr_queued > 12)
> +			base = gran / 4;		/* 0.25 */
> +
> +		if (penalty < base)
> +			penalty = base;
> +	}
> +
> +	return penalty;
> +}
> +
> +/*
> + * Apply penalty and update EEVDF fields for scheduler consistency.
> + * Safely applies vruntime penalty with overflow protection, then updates
> + * EEVDF-specific fields (deadline, vlag) and cfs_rq min_vruntime to maintain
> + * scheduler state consistency. Returns true on successful application,
> + * false if penalty cannot be safely applied.
> + */
> +static void __maybe_unused yield_deboost_apply_penalty(struct rq *rq, struct sched_entity *se_y_lca,
> +				 struct cfs_rq *cfs_rq_common, u64 penalty)
> +{
> +	u64 new_vruntime;
> +
> +	/* Overflow protection */
> +	if (se_y_lca->vruntime > (U64_MAX - penalty))
> +		return;
> +
> +	new_vruntime = se_y_lca->vruntime + penalty;
> +
> +	/* Validity check */
> +	if (new_vruntime <= se_y_lca->vruntime)
> +		return;
> +
> +	se_y_lca->vruntime = new_vruntime;
> +	se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);

And with that we update vruntime to an arbitrary value simply
because one task in the hierarchy decided to call yield_to().

Since we are on the topic, you are also missing an update_curr()
which is only done in yield_task_fair() so you are actually
looking at old vruntime for the yielding entity.

> +	se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;
> +	update_min_vruntime(cfs_rq_common);
> +}
> +
>  /*
>   * sched_yield() is very simple
>   */

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/10] sched/fair: Add penalty calculation and application logic
  2025-11-12  7:25   ` K Prateek Nayak
@ 2025-11-13 13:25     ` Wanpeng Li
  0 siblings, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-13 13:25 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hi Prateek，

On Wed, 12 Nov 2025 at 15:25, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> > +/*
> > + * Calculate penalty with debounce logic for EEVDF yield deboost.
> > + * Computes vruntime penalty based on fairness gap (need) plus granularity,
> > + * applies queue-size-based caps to prevent excessive penalties in small queues,
> > + * and implements reverse-pair debounce (~300us) to reduce ping-pong effects.
> > + * Returns 0 if no penalty needed, otherwise returns clamped penalty value.
> > + */
> > +static u64 __maybe_unused yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
> > +                                 struct sched_entity *se_t_lca, struct sched_entity *se_t,
> > +                                 int nr_queued)
> > +{
> > +     u64 gran, need, penalty, maxp;
> > +     u64 gran_floor;
> > +     u64 weighted_need, base;
> > +
> > +     gran = calc_delta_fair(sysctl_sched_base_slice, se_y_lca);
> > +     /* Low-bound safeguard for gran when slice is abnormally small */
> > +     gran_floor = calc_delta_fair(sysctl_sched_base_slice >> 1, se_y_lca);
> > +     if (gran < gran_floor)
>
> Is this even possible?

No. Both use the same weight denominator in calc_delta_fair(), the
check is redundant. Will remove.

>
> > +             gran = gran_floor;
> > +
> > +     need = 0;
> > +     if (se_t_lca->vruntime > se_y_lca->vruntime)
> > +             need = se_t_lca->vruntime - se_y_lca->vruntime;
>
> So I'm assuming you want the yielding task's vruntime to
> cross the target's vruntime simply because one task somewhere
> down the hierarchy said so.

Yes, this is a known tradeoff. We apply the penalty at the LCA where A
and B compete to make B schedulable immediately. Side effect:
independent task C in CG0 loses CPU time. In practice, VMs place all
vCPUs in one cgroup (no independent C). If C exists and shares the
lock, the penalty helps. If C is truly independent, it loses ~one
scheduling slice. Your natural convergence approach avoids this but
needs multiple yield cycles before B gets sustained preference.

>
> > +
> > +     /* Apply 10% boost to need when positive (weighted_need = need * 1.10) */
> > +     penalty = gran;
>
> So at the very least I see it getting weighted(base_slice / 2) penalty
> ...
>
> > +     if (need) {
> > +             /* weighted_need = need + 10% */
> > +             weighted_need = need + need / 10;
> > +             /* clamp to avoid overflow when adding to gran (still capped later) */
> > +             if (weighted_need > U64_MAX - penalty)
> > +                     weighted_need = U64_MAX - penalty;
> > +             penalty += weighted_need;
>
> ... if not more ...

Yes, the floor is gran (weighted ~700µs). Empirically, smaller values
didn't sustain preference—the yielder would re-preempt the target
within 1-2 decisions in dbench testing. This is a workload-specific
heuristic. If too aggressive for general use, I can lower it or tie it
to h_nr_queued . Thoughts?

>
> > +     }
> > +
> > +     /* Apply debounce via helper to avoid ping-pong */
> > +     penalty = yield_deboost_apply_debounce(rq, se_t, penalty, need, gran);
>
> ... since without debounce, penalty remains same.
>
> > +
> > +     /* Upper bound (cap): slightly more aggressive for mid-size queues */
> > +     if (nr_queued == 2)
> > +             maxp = gran * 6;                /* Strongest push for 2-task ping-pong */
> > +     else if (nr_queued == 3)
> > +             maxp = gran * 4;                /* 4.0 * gran */
> > +     else if (nr_queued <= 6)
> > +             maxp = (gran * 5) / 2;          /* 2.5 * gran */
> > +     else if (nr_queued <= 8)
> > +             maxp = gran * 2;                /* 2.0 * gran */
> > +     else if (nr_queued <= 12)
> > +             maxp = (gran * 3) / 2;          /* 1.5 * gran */
> > +     else
> > +             maxp = gran;                    /* 1.0 * gran */
>
> And all the nr_queued calculations are based on the entities queued
> and not the "h_nr_queued" so we can have a boat load of tasks to
> run above but since one task decided to call yield_to() let us make
> them all starve a little?

You're absolutely right. Using nr_queued (entity count) instead of
h_nr_queued (hierarchical task count) is wrong:
CG0 (nr_queued=2, h_nr_queued=100)
  ├─ CG1 (50 tasks)
  └─ CG2 (50 tasks)
My code sees 2 entities and applies maxp = 6×gran (strongest penalty),
but 100 tasks are competing. This starves unrelated tasks. Will switch
to cfs_rq_common->h_nr_queued . The caps should reflect actual task
count, not group count.

>
> > +
> > +     if (penalty < gran)
> > +             penalty = gran;
> > +     if (penalty > maxp)
> > +             penalty = maxp;
> > +
> > +     /* If no need, apply refined baseline push (low risk + mid risk combined). */
> > +     if (need == 0) {
> > +             /*
> > +              * Baseline multiplier for need==0:
> > +              *   2        -> 1.00 * gran
> > +              *   3        -> 0.9375 * gran
> > +              *   4–6      -> 0.625 * gran
> > +              *   7–8      -> 0.50  * gran
> > +              *   9–12     -> 0.375 * gran
> > +              *   >12      -> 0.25  * gran
> > +              */
> > +             base = gran;
> > +             if (nr_queued == 3)
> > +                     base = (gran * 15) / 16;        /* 0.9375 */
> > +             else if (nr_queued >= 4 && nr_queued <= 6)
> > +                     base = (gran * 5) / 8;          /* 0.625 */
> > +             else if (nr_queued >= 7 && nr_queued <= 8)
> > +                     base = gran / 2;                /* 0.5 */
> > +             else if (nr_queued >= 9 && nr_queued <= 12)
> > +                     base = (gran * 3) / 8;          /* 0.375 */
> > +             else if (nr_queued > 12)
> > +                     base = gran / 4;                /* 0.25 */
> > +
> > +             if (penalty < base)
> > +                     penalty = base;
> > +     }
> > +
> > +     return penalty;
> > +}
> > +
> > +/*
> > + * Apply penalty and update EEVDF fields for scheduler consistency.
> > + * Safely applies vruntime penalty with overflow protection, then updates
> > + * EEVDF-specific fields (deadline, vlag) and cfs_rq min_vruntime to maintain
> > + * scheduler state consistency. Returns true on successful application,
> > + * false if penalty cannot be safely applied.
> > + */
> > +static void __maybe_unused yield_deboost_apply_penalty(struct rq *rq, struct sched_entity *se_y_lca,
> > +                              struct cfs_rq *cfs_rq_common, u64 penalty)
> > +{
> > +     u64 new_vruntime;
> > +
> > +     /* Overflow protection */
> > +     if (se_y_lca->vruntime > (U64_MAX - penalty))
> > +             return;
> > +
> > +     new_vruntime = se_y_lca->vruntime + penalty;
> > +
> > +     /* Validity check */
> > +     if (new_vruntime <= se_y_lca->vruntime)
> > +             return;
> > +
> > +     se_y_lca->vruntime = new_vruntime;
> > +     se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
>
> And with that we update vruntime to an arbitrary value simply
> because one task in the hierarchy decided to call yield_to().

Yes, modifying vruntime at se_y_lca affects the entire hierarchy
beneath it, not just the calling task. This is the cost of making
yield_to() work in hierarchical scheduling. Is it worth it? We believe
yes, because:
1. Yield_to() is already a hierarchy-wide decision: When vCPU-A yields
to vCPU-B, it's not just task-A helping task-B—it's the entire VM (the
hierarchy) requesting another vCPU to make progress. Lock-holder
scenarios are VM-wide problems, not individual task problems.
2. The alternative is broken semantics: Without hierarchy-level
adjustment, yield_to() silently fails in cgroup configurations. Users
call yield_to() expecting it to work, but it doesn't—that's worse than
documented unfairness.
3. Bounded impact: The penalty scales conservatively with h_nr_queued
(larger hierarchies get 1.0× gran, not 6.0×), limiting blast radius.
If the position is that hierarchy-wide vruntime perturbation is never
acceptable regardless of use case, then yield_to() should explicitly
fail or be disabled in cgroup configurations rather than pretending to
work.

>
> Since we are on the topic, you are also missing an update_curr()
> which is only done in yield_task_fair() so you are actually
> looking at old vruntime for the yielding entity.

 Will fix it.

Regards,
Wanpeng

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair()
  2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (3 preceding siblings ...)
  2025-11-10  3:32 ` [PATCH 04/10] sched/fair: Add penalty calculation and application logic Wanpeng Li
@ 2025-11-10  3:32 ` Wanpeng Li
  2025-11-10  5:16   ` kernel test robot
  2025-11-10  5:16   ` kernel test robot
  2025-11-10  3:32 ` [PATCH 06/10] KVM: Fix last_boosted_vcpu index assignment bug Wanpeng Li
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-10  3:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

From: Wanpeng Li <wanpengli@tencent.com>

Integrate the yield deboost mechanism into yield_to_task_fair() to
improve yield_to() effectiveness for virtualization workloads.

Add yield_to_deboost() as the main entry point that validates tasks,
finds cgroup LCA, updates rq clock and accounting, calculates penalty,
and applies EEVDF field adjustments.

The integration point after set_next_buddy() and before yield_task_fair()
works in concert with the existing buddy mechanism: set_next_buddy()
provides immediate preference, yield_to_deboost() applies bounded
vruntime penalty for sustained advantage, and yield_task_fair()
completes the standard yield path.

This is particularly beneficial for vCPU workloads where lock holder
detection triggers yield_to(), the holder needs sustained preference
to make progress, vCPUs may be organized in nested cgroups,
high-frequency yields require rate limiting, and ping-pong patterns
need debouncing.

Operation occurs under rq->lock with bounded penalties. The feature
can be disabled at runtime via
/sys/kernel/debug/sched/sched_vcpu_debooster_enabled.

Dbench workload in a virtualized environment (16 pCPUs host, 16 vCPUs
per VM running dbench-16 benchmark) shows consistent gains:
  2 VMs: +14.4% throughput
  3 VMs:  +9.8% throughput
  4 VMs:  +6.7% throughput

Performance gains stem from more effective yield_to() behavior,
enabling lock holders to make faster progress and reducing contention
overhead in overcommitted scenarios.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c | 58 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 54 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4bad324f3662..619af60b7ce6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9017,7 +9017,7 @@ static bool yield_deboost_rate_limit(struct rq *rq, u64 now_ns)
  * Returns false with appropriate debug logging if any validation fails,
  * ensuring only safe and meaningful yield operations proceed.
  */
-static bool __maybe_unused yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target,
+static bool yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target,
 					  struct task_struct **p_yielding_out,
 					  struct sched_entity **se_y_out,
 					  struct sched_entity **se_t_out)
@@ -9066,7 +9066,7 @@ static bool __maybe_unused yield_deboost_validate_tasks(struct rq *rq, struct ta
  * the appropriate level for vruntime adjustments and EEVDF field updates
  * (deadline, vlag) to maintain scheduler consistency.
  */
-static bool __maybe_unused yield_deboost_find_lca(struct sched_entity *se_y, struct sched_entity *se_t,
+static bool yield_deboost_find_lca(struct sched_entity *se_y, struct sched_entity *se_t,
 				    struct sched_entity **se_y_lca_out,
 				    struct sched_entity **se_t_lca_out,
 				    struct cfs_rq **cfs_rq_common_out)
@@ -9162,7 +9162,7 @@ static u64 yield_deboost_apply_debounce(struct rq *rq, struct sched_entity *se_t
  * and implements reverse-pair debounce (~300us) to reduce ping-pong effects.
  * Returns 0 if no penalty needed, otherwise returns clamped penalty value.
  */
-static u64 __maybe_unused yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
+static u64 yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
 				    struct sched_entity *se_t_lca, struct sched_entity *se_t,
 				    int nr_queued)
 {
@@ -9250,7 +9250,7 @@ static u64 __maybe_unused yield_deboost_calculate_penalty(struct rq *rq, struct
  * scheduler state consistency. Returns true on successful application,
  * false if penalty cannot be safely applied.
  */
-static void __maybe_unused yield_deboost_apply_penalty(struct rq *rq, struct sched_entity *se_y_lca,
+static void yield_deboost_apply_penalty(struct rq *rq, struct sched_entity *se_y_lca,
 				 struct cfs_rq *cfs_rq_common, u64 penalty)
 {
 	u64 new_vruntime;
@@ -9303,6 +9303,52 @@ static void yield_task_fair(struct rq *rq)
 	se->deadline += calc_delta_fair(se->slice, se);
 }
 
+/*
+ * yield_to_deboost - deboost the yielding task to favor the target on the same rq
+ * @rq: runqueue containing both tasks; rq->lock must be held
+ * @p_target: task to favor in scheduling
+ *
+ * Cooperates with yield_to_task_fair(): buddy provides immediate preference;
+ * this routine applies a bounded vruntime penalty at the cgroup LCA so the
+ * target keeps advantage beyond the buddy effect. EEVDF fields are updated
+ * to keep scheduler state consistent.
+ *
+ * Only operates on tasks resident on the same rq; throttled hierarchies are
+ * rejected early. Penalty is bounded by granularity and queue-size caps.
+ *
+ * Intended primarily for virtualization workloads where a yielding vCPU
+ * should defer to a target vCPU within the same runqueue.
+ * Does not change runnable order directly; complements buddy selection with
+ * a bounded fairness adjustment.
+ */
+static void yield_to_deboost(struct rq *rq, struct task_struct *p_target)
+{
+	struct task_struct *p_yielding;
+	struct sched_entity *se_y, *se_t, *se_y_lca, *se_t_lca;
+	struct cfs_rq *cfs_rq_common;
+	u64 penalty;
+
+	/* Step 1: validate tasks and inputs */
+	if (!yield_deboost_validate_tasks(rq, p_target, &p_yielding, &se_y, &se_t))
+		return;
+
+	/* Step 2: find LCA in cgroup hierarchy */
+	if (!yield_deboost_find_lca(se_y, se_t, &se_y_lca, &se_t_lca, &cfs_rq_common))
+		return;
+
+	/* Step 3: update clock and current accounting */
+	update_rq_clock(rq);
+	if (se_y_lca != cfs_rq_common->curr)
+		update_curr(cfs_rq_common);
+
+	/* Step 4: calculate penalty (caps + debounce) */
+	penalty = yield_deboost_calculate_penalty(rq, se_y_lca, se_t_lca, se_t,
+						  cfs_rq_common->nr_queued);
+
+	/* Step 5: apply penalty and update EEVDF fields */
+	yield_deboost_apply_penalty(rq, se_y_lca, cfs_rq_common, penalty);
+}
+
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
@@ -9314,6 +9360,10 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 	/* Tell the scheduler that we'd really like se to run next. */
 	set_next_buddy(se);
 
+	/* Apply deboost under rq lock. */
+	yield_to_deboost(rq, p);
+
+	/* Complete the standard yield path. */
 	yield_task_fair(rq);
 
 	return true;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair()
  2025-11-10  3:32 ` [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
@ 2025-11-10  5:16   ` kernel test robot
  2025-11-10  5:16   ` kernel test robot
  1 sibling, 0 replies; 36+ messages in thread
From: kernel test robot @ 2025-11-10  5:16 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: oe-kbuild-all, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hi Wanpeng,

kernel test robot noticed the following build errors:

[auto build test ERROR on kvm/queue]
[also build test ERROR on kvm/next tip/sched/core tip/master linus/master v6.18-rc5 next-20251107]
[cannot apply to kvm/linux-next tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Wanpeng-Li/sched-Add-vCPU-debooster-infrastructure/20251110-114219
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link:    https://lore.kernel.org/r/20251110033232.12538-6-kernellwp%40gmail.com
patch subject: [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair()
config: xtensa-allnoconfig (https://download.01.org/0day-ci/archive/20251110/202511101310.HuFb12n3-lkp@intel.com/config)
compiler: xtensa-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251110/202511101310.HuFb12n3-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511101310.HuFb12n3-lkp@intel.com/

All errors (new ones prefixed by >>):

   xtensa-linux-ld: kernel/sched/fair.o: in function `detach_entity_load_avg.constprop.0':
   fair.c:(.text+0xc84): undefined reference to `__udivdi3'
>> xtensa-linux-ld: fair.c:(.text+0xd06): undefined reference to `__udivdi3'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair()
  2025-11-10  3:32 ` [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
  2025-11-10  5:16   ` kernel test robot
@ 2025-11-10  5:16   ` kernel test robot
  1 sibling, 0 replies; 36+ messages in thread
From: kernel test robot @ 2025-11-10  5:16 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: oe-kbuild-all, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hi Wanpeng,

kernel test robot noticed the following build errors:

[auto build test ERROR on kvm/queue]
[also build test ERROR on kvm/next tip/sched/core tip/master linus/master v6.18-rc5 next-20251107]
[cannot apply to kvm/linux-next tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Wanpeng-Li/sched-Add-vCPU-debooster-infrastructure/20251110-114219
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link:    https://lore.kernel.org/r/20251110033232.12538-6-kernellwp%40gmail.com
patch subject: [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair()
config: m68k-allnoconfig (https://download.01.org/0day-ci/archive/20251110/202511101338.8AICyae8-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251110/202511101338.8AICyae8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511101338.8AICyae8-lkp@intel.com/

All errors (new ones prefixed by >>):

   m68k-linux-ld: kernel/sched/fair.o: in function `yield_deboost_calculate_penalty':
>> fair.c:(.text+0x1588): undefined reference to `__udivdi3'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 06/10] KVM: Fix last_boosted_vcpu index assignment bug
  2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (4 preceding siblings ...)
  2025-11-10  3:32 ` [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
@ 2025-11-10  3:32 ` Wanpeng Li
  2025-11-21  0:35   ` Sean Christopherson
  2025-11-10  3:32 ` [PATCH 07/10] KVM: x86: Add IPI tracking infrastructure Wanpeng Li
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Wanpeng Li @ 2025-11-10  3:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

From: Wanpeng Li <wanpengli@tencent.com>

In kvm_vcpu_on_spin(), the loop counter 'i' is incorrectly written to
last_boosted_vcpu instead of the actual vCPU index 'idx'. This causes
last_boosted_vcpu to store the loop iteration count rather than the
vCPU index, leading to incorrect round-robin behavior in subsequent
directed yield operations.

Fix this by using 'idx' instead of 'i' in the assignment.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b7a0ae2a7b20..cde1eddbaa91 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4026,7 +4026,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 
 		yielded = kvm_vcpu_yield_to(vcpu);
 		if (yielded > 0) {
-			WRITE_ONCE(kvm->last_boosted_vcpu, i);
+			WRITE_ONCE(kvm->last_boosted_vcpu, idx);
 			break;
 		} else if (yielded < 0 && !--try) {
 			break;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/10] KVM: Fix last_boosted_vcpu index assignment bug
  2025-11-10  3:32 ` [PATCH 06/10] KVM: Fix last_boosted_vcpu index assignment bug Wanpeng Li
@ 2025-11-21  0:35   ` Sean Christopherson
  2025-11-21  0:38     ` Sean Christopherson
  2025-11-21 11:46     ` Wanpeng Li
  0 siblings, 2 replies; 36+ messages in thread
From: Sean Christopherson @ 2025-11-21  0:35 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

On Mon, Nov 10, 2025, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> From: Wanpeng Li <wanpengli@tencent.com>

Something might be off in your email scripts.  Speaking of email, mostly as an
FYI, your @tencent email was bouncing as of last year, and prompted commit
b018589013d6 ("MAINTAINERS: Drop Wanpeng Li as a Reviewer for KVM Paravirt support").

> In kvm_vcpu_on_spin(), the loop counter 'i' is incorrectly written to
> last_boosted_vcpu instead of the actual vCPU index 'idx'. This causes
> last_boosted_vcpu to store the loop iteration count rather than the
> vCPU index, leading to incorrect round-robin behavior in subsequent
> directed yield operations.
> 
> Fix this by using 'idx' instead of 'i' in the assignment.

Fixes: 7e513617da71 ("KVM: Rework core loop of kvm_vcpu_on_spin() to use a single for-loop")
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>

Please, please don't bury fixes like this in a large-ish series, especially in a
series that's going to be quite contentious and thus likely to linger on-list for
quite some time.  It's pretty much dumb luck on my end that I saw this.

That said, thank you for fixing my goof :-)

Paolo, do you want to grab this for 6.19?  Or just wait for 6.20?

> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> ---
>  virt/kvm/kvm_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b7a0ae2a7b20..cde1eddbaa91 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4026,7 +4026,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>  
>  		yielded = kvm_vcpu_yield_to(vcpu);
>  		if (yielded > 0) {
> -			WRITE_ONCE(kvm->last_boosted_vcpu, i);
> +			WRITE_ONCE(kvm->last_boosted_vcpu, idx);
>  			break;
>  		} else if (yielded < 0 && !--try) {
>  			break;
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/10] KVM: Fix last_boosted_vcpu index assignment bug
  2025-11-21  0:35   ` Sean Christopherson
@ 2025-11-21  0:38     ` Sean Christopherson
  2025-11-21 11:46     ` Wanpeng Li
  1 sibling, 0 replies; 36+ messages in thread
From: Sean Christopherson @ 2025-11-21  0:38 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

On Thu, Nov 20, 2025, Sean Christopherson wrote:
> On Mon, Nov 10, 2025, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> > 
> > From: Wanpeng Li <wanpengli@tencent.com>
> 
> Something might be off in your email scripts.  Speaking of email, mostly as an
> FYI, your @tencent email was bouncing as of last year, and prompted commit
> b018589013d6 ("MAINTAINERS: Drop Wanpeng Li as a Reviewer for KVM Paravirt support").
> 
> > In kvm_vcpu_on_spin(), the loop counter 'i' is incorrectly written to
> > last_boosted_vcpu instead of the actual vCPU index 'idx'. This causes
> > last_boosted_vcpu to store the loop iteration count rather than the
> > vCPU index, leading to incorrect round-robin behavior in subsequent
> > directed yield operations.
> > 
> > Fix this by using 'idx' instead of 'i' in the assignment.
> 
> Fixes: 7e513617da71 ("KVM: Rework core loop of kvm_vcpu_on_spin() to use a single for-loop")
> Cc: stable@vger.kernel.org
> Reviewed-by: Sean Christopherson <seanjc@google.com>
> 
> Please, please don't bury fixes like this in a large-ish series, especially in a
> series that's going to be quite contentious and thus likely to linger on-list for
> quite some time.  It's pretty much dumb luck on my end that I saw this.
> 
> That said, thank you for fixing my goof :-)
> 
> Paolo, do you want to grab this for 6.19?  Or just wait for 6.20?

Err, off-by-one.  6.18 and 6.19....

> > Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> > ---
> >  virt/kvm/kvm_main.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index b7a0ae2a7b20..cde1eddbaa91 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4026,7 +4026,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> >  
> >  		yielded = kvm_vcpu_yield_to(vcpu);
> >  		if (yielded > 0) {
> > -			WRITE_ONCE(kvm->last_boosted_vcpu, i);
> > +			WRITE_ONCE(kvm->last_boosted_vcpu, idx);
> >  			break;
> >  		} else if (yielded < 0 && !--try) {
> >  			break;
> > -- 
> > 2.43.0
> > 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/10] KVM: Fix last_boosted_vcpu index assignment bug
  2025-11-21  0:35   ` Sean Christopherson
  2025-11-21  0:38     ` Sean Christopherson
@ 2025-11-21 11:46     ` Wanpeng Li
  1 sibling, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-21 11:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm

On Fri, 21 Nov 2025 at 08:35, Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Nov 10, 2025, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > From: Wanpeng Li <wanpengli@tencent.com>
>
> Something might be off in your email scripts.  Speaking of email, mostly as an
> FYI, your @tencent email was bouncing as of last year, and prompted commit
> b018589013d6 ("MAINTAINERS: Drop Wanpeng Li as a Reviewer for KVM Paravirt support").

Hi Paolo and Sean,

Regarding commit b018589013d6 — I'm back to active KVM development and
ready to resume reviewing. Please update my entry to Wanpeng Li
<kernellwp@gmail.com>. My recent patch series reflects the level of
engagement you can expect going forward.

>
> > In kvm_vcpu_on_spin(), the loop counter 'i' is incorrectly written to
> > last_boosted_vcpu instead of the actual vCPU index 'idx'. This causes
> > last_boosted_vcpu to store the loop iteration count rather than the
> > vCPU index, leading to incorrect round-robin behavior in subsequent
> > directed yield operations.
> >
> > Fix this by using 'idx' instead of 'i' in the assignment.
>
> Fixes: 7e513617da71 ("KVM: Rework core loop of kvm_vcpu_on_spin() to use a single for-loop")
> Cc: stable@vger.kernel.org
> Reviewed-by: Sean Christopherson <seanjc@google.com>
>
> Please, please don't bury fixes like this in a large-ish series, especially in a
> series that's going to be quite contentious and thus likely to linger on-list for
> quite some time.  It's pretty much dumb luck on my end that I saw this.

Good point about fixed visibility — it makes sense to keep them separate.

>
> That said, thank you for fixing my goof :-)

:)

Regards,
Wanpeng

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 07/10] KVM: x86: Add IPI tracking infrastructure
  2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (5 preceding siblings ...)
  2025-11-10  3:32 ` [PATCH 06/10] KVM: Fix last_boosted_vcpu index assignment bug Wanpeng Li
@ 2025-11-10  3:32 ` Wanpeng Li
  2025-11-10  3:32 ` [PATCH 08/10] KVM: x86/lapic: Integrate IPI tracking with interrupt delivery Wanpeng Li
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-10  3:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

From: Wanpeng Li <wanpengli@tencent.com>

Introduce IPI tracking infrastructure for directed yield optimization.

Add per-vCPU IPI tracking context in kvm_vcpu_arch with
last_ipi_sender/receiver to track IPI communication pairs, pending_ipi
flag to indicate awaiting IPI response, and ipi_time_ns monotonic
timestamp for recency validation.

Add module parameters ipi_tracking_enabled (global toggle, default
true) and ipi_window_ns (recency window, default 50ms).

Add core helper functions: kvm_track_ipi_communication() to record
sender/receiver pairs, kvm_vcpu_is_ipi_receiver() to validate recent
IPI relationship, and kvm_vcpu_clear/reset_ipi_context() for lifecycle
management.

Use lockless READ_ONCE/WRITE_ONCE for minimal overhead. The short time
window prevents stale IPI information from affecting throughput
workloads.

The infrastructure is inert until integrated with interrupt delivery in
subsequent patches.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/include/asm/kvm_host.h |  8 ++++
 arch/x86/kvm/lapic.c            | 65 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |  6 +++
 arch/x86/kvm/x86.h              |  4 ++
 include/linux/kvm_host.h        |  1 +
 virt/kvm/kvm_main.c             |  5 +++
 6 files changed, 89 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 48598d017d6f..b5bdc115ff45 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1052,6 +1052,14 @@ struct kvm_vcpu_arch {
 	int pending_external_vector;
 	int highest_stale_pending_ioapic_eoi;
 
+	/* IPI tracking for directed yield (x86 only) */
+	struct {
+		int last_ipi_sender;    /* vCPU ID of last IPI sender */
+		int last_ipi_receiver;  /* vCPU ID of last IPI receiver */
+		bool pending_ipi;       /* Pending IPI response */
+		u64 ipi_time_ns;        /* Monotonic ns when IPI was sent */
+	} ipi_context;
+
 	/* be preempted when it's in kernel-mode(cpl=0) */
 	bool preempted_in_kernel;
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0ae7f913d782..98ec2b18b02c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -75,6 +75,12 @@ module_param(lapic_timer_advance, bool, 0444);
 /* step-by-step approximation to mitigate fluctuation */
 #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
 
+/* IPI tracking window and runtime toggle (runtime-adjustable) */
+static bool ipi_tracking_enabled = true;
+static unsigned long ipi_window_ns = 50 * NSEC_PER_MSEC;
+module_param(ipi_tracking_enabled, bool, 0644);
+module_param(ipi_window_ns, ulong, 0644);
+
 static bool __read_mostly vector_hashing_enabled = true;
 module_param_named(vector_hashing, vector_hashing_enabled, bool, 0444);
 
@@ -1113,6 +1119,65 @@ static int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
 }
 
+/*
+ * Track IPI communication for directed yield when a unique receiver exists.
+ * This only writes sender/receiver context and timestamp; ignores self-IPI.
+ */
+void kvm_track_ipi_communication(struct kvm_vcpu *sender, struct kvm_vcpu *receiver)
+{
+	if (!sender || !receiver || sender == receiver)
+		return;
+	if (unlikely(!READ_ONCE(ipi_tracking_enabled)))
+		return;
+
+	WRITE_ONCE(sender->arch.ipi_context.last_ipi_receiver, receiver->vcpu_idx);
+	WRITE_ONCE(sender->arch.ipi_context.pending_ipi, true);
+	WRITE_ONCE(sender->arch.ipi_context.ipi_time_ns, ktime_get_mono_fast_ns());
+
+	WRITE_ONCE(receiver->arch.ipi_context.last_ipi_sender, sender->vcpu_idx);
+}
+
+/*
+ * Check if 'receiver' is the recent IPI target of 'sender'.
+ *
+ * Rationale:
+ * - Use a short window to avoid stale IPI inflating boost priority
+ *   on throughput-sensitive workloads.
+ */
+bool kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender, struct kvm_vcpu *receiver)
+{
+	u64 then, now;
+
+	if (unlikely(!READ_ONCE(ipi_tracking_enabled)))
+		return false;
+
+	then = READ_ONCE(sender->arch.ipi_context.ipi_time_ns);
+	now = ktime_get_mono_fast_ns();
+	if (READ_ONCE(sender->arch.ipi_context.pending_ipi) &&
+	    READ_ONCE(sender->arch.ipi_context.last_ipi_receiver) ==
+	    receiver->vcpu_idx &&
+	    now - then <= ipi_window_ns)
+		return true;
+
+	return false;
+}
+
+void kvm_vcpu_clear_ipi_context(struct kvm_vcpu *vcpu)
+{
+	WRITE_ONCE(vcpu->arch.ipi_context.pending_ipi, false);
+	WRITE_ONCE(vcpu->arch.ipi_context.last_ipi_sender, -1);
+	WRITE_ONCE(vcpu->arch.ipi_context.last_ipi_receiver, -1);
+}
+
+/*
+ * Reset helper: clear ipi_context and zero ipi_time for hard reset paths.
+ */
+void kvm_vcpu_reset_ipi_context(struct kvm_vcpu *vcpu)
+{
+	kvm_vcpu_clear_ipi_context(vcpu);
+	WRITE_ONCE(vcpu->arch.ipi_context.ipi_time_ns, 0);
+}
+
 /* Return true if the interrupt can be handled by using *bitmap as index mask
  * for valid destinations in *dst array.
  * Return false if kvm_apic_map_get_dest_lapic did nothing useful.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b4b5d2d09634..649e016c131f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12708,6 +12708,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 		goto free_guest_fpu;
 
 	kvm_xen_init_vcpu(vcpu);
+	/* Initialize IPI tracking */
+	kvm_vcpu_reset_ipi_context(vcpu);
 	vcpu_load(vcpu);
 	kvm_vcpu_after_set_cpuid(vcpu);
 	kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.default_tsc_khz);
@@ -12781,6 +12783,8 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 	kvm_mmu_destroy(vcpu);
 	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 	free_page((unsigned long)vcpu->arch.pio_data);
+	/* Clear IPI tracking context */
+	kvm_vcpu_reset_ipi_context(vcpu);
 	kvfree(vcpu->arch.cpuid_entries);
 }
 
@@ -12846,6 +12850,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 		kvm_leave_nested(vcpu);
 
 	kvm_lapic_reset(vcpu, init_event);
+	/* Clear IPI tracking context on reset */
+	kvm_vcpu_clear_ipi_context(vcpu);
 
 	WARN_ON_ONCE(is_guest_mode(vcpu) || is_smm(vcpu));
 	vcpu->arch.hflags = 0;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f3dc77f006f9..86a10c653eac 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -451,6 +451,10 @@ fastpath_t handle_fastpath_wrmsr(struct kvm_vcpu *vcpu);
 fastpath_t handle_fastpath_wrmsr_imm(struct kvm_vcpu *vcpu, u32 msr, int reg);
 fastpath_t handle_fastpath_hlt(struct kvm_vcpu *vcpu);
 fastpath_t handle_fastpath_invd(struct kvm_vcpu *vcpu);
+void kvm_track_ipi_communication(struct kvm_vcpu *sender,
+				struct kvm_vcpu *receiver);
+void kvm_vcpu_clear_ipi_context(struct kvm_vcpu *vcpu);
+void kvm_vcpu_reset_ipi_context(struct kvm_vcpu *vcpu);
 
 extern struct kvm_caps kvm_caps;
 extern struct kvm_host_values kvm_host;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5bd76cf394fa..5ae8327fdf21 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1532,6 +1532,7 @@ static inline void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 }
 #endif
 
+bool kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender, struct kvm_vcpu *receiver);
 int kvm_vcpu_yield_to(struct kvm_vcpu *target);
 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool yield_to_kernel_mode);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cde1eddbaa91..495e769c7ddf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3963,6 +3963,11 @@ bool __weak kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu)
 	return false;
 }
 
+bool __weak kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender, struct kvm_vcpu *receiver)
+{
+	return false;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
 	int nr_vcpus, start, i, idx, yielded;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 08/10] KVM: x86/lapic: Integrate IPI tracking with interrupt delivery
  2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (6 preceding siblings ...)
  2025-11-10  3:32 ` [PATCH 07/10] KVM: x86: Add IPI tracking infrastructure Wanpeng Li
@ 2025-11-10  3:32 ` Wanpeng Li
  2025-11-10  3:32 ` [PATCH 09/10] KVM: Implement IPI-aware directed yield candidate selection Wanpeng Li
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-10  3:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

From: Wanpeng Li <wanpengli@tencent.com>

Integrate IPI tracking with LAPIC interrupt delivery and EOI handling.

Hook into kvm_irq_delivery_to_apic() after destination resolution to
record sender/receiver pairs when the interrupt is LAPIC-originated,
APIC_DM_FIXED mode, with exactly one destination vCPU. Use counting
for efficient single-destination detection.

Add kvm_clear_ipi_on_eoi() called from both EOI paths to ensure
complete IPI context cleanup:

1. apic_set_eoi(): Software-emulated EOI path (traditional/non-APICv)
2. kvm_apic_set_eoi_accelerated(): Hardware-accelerated EOI path
   (APICv/AVIC)

Without dual-path cleanup, APICv/AVIC-enabled guests would retain
stale IPI state, causing directed yield to rely on obsolete sender/
receiver information and potentially boosting the wrong vCPU. Both
paths must call kvm_clear_ipi_on_eoi() to maintain consistency across
different virtual interrupt delivery modes.

The cleanup implements two-stage logic to avoid premature clearing:
unconditionally clear the receiver's IPI context, and conditionally
clear the sender's pending flag only when the sender exists,
last_ipi_receiver matches, and the IPI is recent. This prevents
unrelated EOIs from disrupting valid IPI tracking state.

Use lockless accessors for minimal overhead. The tracking only
activates for unicast fixed IPIs where directed yield provides value.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/lapic.c | 107 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 103 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 98ec2b18b02c..d38e64691b78 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1178,6 +1178,47 @@ void kvm_vcpu_reset_ipi_context(struct kvm_vcpu *vcpu)
 	WRITE_ONCE(vcpu->arch.ipi_context.ipi_time_ns, 0);
 }
 
+/*
+ * Clear IPI context on EOI at receiver side; clear sender's pending
+ * only when matches and is fresh.
+ *
+ * This function implements precise cleanup to avoid stale IPI boosts:
+ * 1) Always clear the receiver's IPI context (unconditional cleanup)
+ * 2) Conditionally clear the sender's pending flag only when:
+ *    - The sender vCPU still exists and is valid
+ *    - The sender's last_ipi_receiver matches this receiver
+ *    - The IPI was sent recently (within ~window)
+ */
+static void kvm_clear_ipi_on_eoi(struct kvm_lapic *apic)
+{
+	struct kvm_vcpu *receiver;
+	int sender_idx;
+	u64 then, now;
+
+	if (unlikely(!READ_ONCE(ipi_tracking_enabled)))
+		return;
+
+	receiver = apic->vcpu;
+	sender_idx = READ_ONCE(receiver->arch.ipi_context.last_ipi_sender);
+
+	/* Step 1: Always clear receiver's IPI context */
+	kvm_vcpu_clear_ipi_context(receiver);
+
+	/* Step 2: Conditionally clear sender's pending flag */
+	if (sender_idx >= 0) {
+		struct kvm_vcpu *sender = kvm_get_vcpu(receiver->kvm, sender_idx);
+
+		if (sender &&
+		    READ_ONCE(sender->arch.ipi_context.last_ipi_receiver) ==
+		    receiver->vcpu_idx) {
+			then = READ_ONCE(sender->arch.ipi_context.ipi_time_ns);
+			now = ktime_get_mono_fast_ns();
+			if (now - then <= ipi_window_ns)
+				WRITE_ONCE(sender->arch.ipi_context.pending_ipi, false);
+		}
+	}
+}
+
 /* Return true if the interrupt can be handled by using *bitmap as index mask
  * for valid destinations in *dst array.
  * Return false if kvm_apic_map_get_dest_lapic did nothing useful.
@@ -1259,6 +1300,10 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
 	struct kvm_lapic **dst = NULL;
 	int i;
 	bool ret;
+	/* Count actual delivered targets to identify a unique recipient. */
+	int targets = 0;
+	int delivered = 0;
+	struct kvm_vcpu *unique = NULL;
 
 	*r = -1;
 
@@ -1280,8 +1325,26 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
 		for_each_set_bit(i, &bitmap, 16) {
 			if (!dst[i])
 				continue;
-			*r += kvm_apic_set_irq(dst[i]->vcpu, irq, dest_map);
+			delivered = kvm_apic_set_irq(dst[i]->vcpu, irq, dest_map);
+			*r += delivered;
+			/* Fast path may still fan out; count delivered targets. */
+			if (delivered > 0) {
+				targets++;
+				unique = dst[i]->vcpu;
+			}
 		}
+
+		/*
+		 * Record unique recipient for IPI-aware boost:
+		 * only for LAPIC-originated APIC_DM_FIXED without
+		 * shorthand, and when exactly one recipient was
+		 * delivered; ignore self-IPI.
+		 */
+		if (src &&
+		    irq->delivery_mode == APIC_DM_FIXED &&
+		    irq->shorthand == APIC_DEST_NOSHORT &&
+		    targets == 1 && unique && unique != src->vcpu)
+			kvm_track_ipi_communication(src->vcpu, unique);
 	}
 
 	rcu_read_unlock();
@@ -1366,6 +1429,13 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	struct kvm_vcpu *vcpu, *lowest = NULL;
 	unsigned long i, dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
 	unsigned int dest_vcpus = 0;
+	/*
+	 * Count actual delivered targets to identify a unique recipient
+	 * for IPI tracking in the slow path.
+	 */
+	int targets = 0;
+	int delivered = 0;
+	struct kvm_vcpu *unique = NULL;
 
 	if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r, dest_map))
 		return r;
@@ -1389,7 +1459,13 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		if (!kvm_lowest_prio_delivery(irq)) {
 			if (r < 0)
 				r = 0;
-			r += kvm_apic_set_irq(vcpu, irq, dest_map);
+			delivered = kvm_apic_set_irq(vcpu, irq, dest_map);
+			r += delivered;
+			/* Slow path can deliver to multiple vCPUs; count them. */
+			if (delivered > 0) {
+				targets++;
+				unique = vcpu;
+			}
 		} else if (kvm_apic_sw_enabled(vcpu->arch.apic)) {
 			if (!vector_hashing_enabled) {
 				if (!lowest)
@@ -1410,8 +1486,28 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		lowest = kvm_get_vcpu(kvm, idx);
 	}
 
-	if (lowest)
-		r = kvm_apic_set_irq(lowest, irq, dest_map);
+	if (lowest) {
+		delivered = kvm_apic_set_irq(lowest, irq, dest_map);
+		r = delivered;
+		/*
+		 * Lowest-priority / vector-hashing paths ultimately deliver to
+		 * a single vCPU.
+		 */
+		if (delivered > 0) {
+			targets = 1;
+			unique = lowest;
+		}
+	}
+
+	/*
+	 * Record unique recipient for IPI-aware boost only for LAPIC-
+	 * originated APIC_DM_FIXED without shorthand, and when exactly
+	 * one recipient was delivered; ignore self-IPI.
+	 */
+	if (src && irq->delivery_mode == APIC_DM_FIXED &&
+	    irq->shorthand == APIC_DEST_NOSHORT &&
+	    targets == 1 && unique && unique != src->vcpu)
+		kvm_track_ipi_communication(src->vcpu, unique);
 
 	return r;
 }
@@ -1632,6 +1728,7 @@ void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
 	trace_kvm_eoi(apic, vector);
 
 	kvm_ioapic_send_eoi(apic, vector);
+	kvm_clear_ipi_on_eoi(apic);
 	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_set_eoi_accelerated);
@@ -2424,6 +2521,8 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 
 	case APIC_EOI:
 		apic_set_eoi(apic);
+		/* Precise cleanup for IPI-aware boost */
+		kvm_clear_ipi_on_eoi(apic);
 		break;
 
 	case APIC_LDR:
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 09/10] KVM: Implement IPI-aware directed yield candidate selection
  2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (7 preceding siblings ...)
  2025-11-10  3:32 ` [PATCH 08/10] KVM: x86/lapic: Integrate IPI tracking with interrupt delivery Wanpeng Li
@ 2025-11-10  3:32 ` Wanpeng Li
  2025-11-10  3:39 ` [PATCH 10/10] KVM: Relaxed boost as safety net Wanpeng Li
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-10  3:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

Integrate IPI tracking with directed yield to improve scheduling when
vCPUs spin waiting for IPI responses.

Implement priority-based candidate selection in kvm_vcpu_on_spin()
with three tiers: Priority 1 uses kvm_vcpu_is_ipi_receiver() to
identify confirmed IPI targets within the recency window, addressing
lock holders spinning on IPI acknowledgment. Priority 2 leverages
existing kvm_arch_dy_has_pending_interrupt() for compatibility with
arch-specific fast paths. Priority 3 falls back to conventional
preemption-based logic when yield_to_kernel_mode is requested,
providing a safety net for non-IPI scenarios.

Add kvm_vcpu_is_good_yield_candidate() helper to consolidate these
checks, preventing over-aggressive boosting while enabling targeted
optimization when IPI patterns are detected.

Performance testing (16 pCPUs host, 16 vCPUs/VM):

Dedup (simlarge):
  2 VMs: +47.1% throughput
  3 VMs: +28.1% throughput
  4 VMs:  +1.7% throughput

VIPS (simlarge):
  2 VMs: +26.2% throughput
  3 VMs: +12.7% throughput
  4 VMs:  +6.0% throughput

Gains stem from effective directed yield when vCPUs spin on IPI
delivery, reducing synchronization overhead. The improvement is most
pronounced at moderate overcommit (2-3 VMs) where contention reduction
outweighs context switching cost.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 virt/kvm/kvm_main.c | 52 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 495e769c7ddf..9cf44b6b396d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3968,6 +3968,47 @@ bool __weak kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender, struct kvm_vcpu *r
 	return false;
 }
 
+/*
+ * IPI-aware candidate selection for directed yield
+ *
+ * Priority order:
+ *  1) Confirmed IPI receiver of 'me' within a short window (always boost)
+ *  2) Arch-provided fast pending interrupt (user-mode boost)
+ *  3) Kernel-mode yield: preempted-in-kernel vCPU (traditional boost)
+ *  4) Otherwise, be conservative
+ */
+static bool kvm_vcpu_is_good_yield_candidate(struct kvm_vcpu *me, struct kvm_vcpu *vcpu,
+					     bool yield_to_kernel_mode)
+{
+	/* Priority 1: recently targeted IPI receiver */
+	if (kvm_vcpu_is_ipi_receiver(me, vcpu))
+		return true;
+
+	/* Priority 2: fast pending-interrupt hint (arch-specific). */
+	if (kvm_arch_dy_has_pending_interrupt(vcpu))
+		return true;
+
+	/*
+	 * Minimal preempted gate for remaining cases:
+	 * - If the target is neither a confirmed IPI receiver nor has a fast
+	 *   pending interrupt, require that the target has been preempted.
+	 * - If yielding to kernel mode is requested, additionally require
+	 *   that the target was preempted while in kernel mode.
+	 *
+	 * This avoids expanding the candidate set too aggressively and helps
+	 * prevent overboost in workloads where the IPI context is not
+	 * involved.
+	 */
+	if (!READ_ONCE(vcpu->preempted))
+		return false;
+
+	if (yield_to_kernel_mode &&
+	    !kvm_arch_vcpu_preempted_in_kernel(vcpu))
+		return false;
+
+	return true;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
 	int nr_vcpus, start, i, idx, yielded;
@@ -4015,15 +4056,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 		if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
 			continue;
 
-		/*
-		 * Treat the target vCPU as being in-kernel if it has a pending
-		 * interrupt, as the vCPU trying to yield may be spinning
-		 * waiting on IPI delivery, i.e. the target vCPU is in-kernel
-		 * for the purposes of directed yield.
-		 */
-		if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
-		    !kvm_arch_dy_has_pending_interrupt(vcpu) &&
-		    !kvm_arch_vcpu_preempted_in_kernel(vcpu))
+		/* IPI-aware candidate selection */
+		if (!kvm_vcpu_is_good_yield_candidate(me, vcpu, yield_to_kernel_mode))
 			continue;
 
 		if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 10/10] KVM: Relaxed boost as safety net
  2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (8 preceding siblings ...)
  2025-11-10  3:32 ` [PATCH 09/10] KVM: Implement IPI-aware directed yield candidate selection Wanpeng Li
@ 2025-11-10  3:39 ` Wanpeng Li
  2025-11-10 12:02 ` [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Christian Borntraeger
  2025-11-11  6:28 ` K Prateek Nayak
  11 siblings, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-10  3:39 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

From: Wanpeng Li <wanpengli@tencent.com>

Add a minimal two-round fallback mechanism in kvm_vcpu_on_spin() to
avoid pathological stalls when the first round finds no eligible
target.

Round 1 applies strict IPI-aware candidate selection (existing
behavior). Round 2 provides a relaxed scan gated only by preempted
state as a safety net, addressing cases where IPI context is missed or
the runnable set is transient.

The second round is controlled by module parameter enable_relaxed_boost
(bool, 0644, default on) to allow easy disablement by distributions if
needed.

Introduce the enable_relaxed_boost parameter, add a first_round flag,
retry label, and reset of yielded counter. Gate the IPI-aware check in
round 1 and use preempted-only gating in round 2. Keep churn minimal
by reusing the same scan logic while preserving all existing
heuristics, tracing, and bookkeeping.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 virt/kvm/kvm_main.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9cf44b6b396d..b03be8d9ae4c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -101,6 +101,9 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
 static bool allow_unsafe_mappings;
 module_param(allow_unsafe_mappings, bool, 0444);
 
+static bool enable_relaxed_boost = true;
+module_param(enable_relaxed_boost, bool, 0644);
+
 /*
  * Ordering of locks:
  *
@@ -4015,6 +4018,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
 	int try = 3;
+	bool first_round = true;
 
 	nr_vcpus = atomic_read(&kvm->online_vcpus);
 	if (nr_vcpus < 2)
@@ -4025,6 +4029,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 
 	kvm_vcpu_set_in_spin_loop(me, true);
 
+retry:
+	yielded = 0;
+
 	/*
 	 * The current vCPU ("me") is spinning in kernel mode, i.e. is likely
 	 * waiting for a resource to become available.  Attempt to yield to a
@@ -4057,7 +4064,12 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 			continue;
 
 		/* IPI-aware candidate selection */
-		if (!kvm_vcpu_is_good_yield_candidate(me, vcpu, yield_to_kernel_mode))
+		if (first_round &&
+			!kvm_vcpu_is_good_yield_candidate(me, vcpu, yield_to_kernel_mode))
+			continue;
+
+		/* Minimal preempted gate for second round */
+		if (!first_round && !READ_ONCE(vcpu->preempted))
 			continue;
 
 		if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
@@ -4071,6 +4083,16 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 			break;
 		}
 	}
+
+	/*
+	 * Second round: relaxed boost as safety net, with preempted gate.
+	 * Only execute when enabled and when the first round yielded nothing.
+	 */
+	if (enable_relaxed_boost && first_round && yielded <= 0) {
+		first_round = false;
+		goto retry;
+	}
+
 	kvm_vcpu_set_in_spin_loop(me, false);
 
 	/* Ensure vcpu is not eligible during next spinloop */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (9 preceding siblings ...)
  2025-11-10  3:39 ` [PATCH 10/10] KVM: Relaxed boost as safety net Wanpeng Li
@ 2025-11-10 12:02 ` Christian Borntraeger
  2025-11-12  5:01   ` Wanpeng Li
  2025-11-11  6:28 ` K Prateek Nayak
  11 siblings, 1 reply; 36+ messages in thread
From: Christian Borntraeger @ 2025-11-10 12:02 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li, Ilya Leoshkevich, Mete Durlu

Am 10.11.25 um 04:32 schrieb Wanpeng Li:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> This series addresses long-standing yield_to() inefficiencies in
> virtualized environments through two complementary mechanisms: a vCPU
> debooster in the scheduler and IPI-aware directed yield in KVM.
> 
> Problem Statement
> -----------------
> 
> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> held by other vCPUs that are not currently running. The kernel's
> paravirtual spinlock support detects these situations and calls yield_to()
> to boost the lock holder, allowing it to run and release the lock.
> 
> However, the current implementation has two critical limitations:
> 
> 1. Scheduler-side limitation:
> 
>     yield_to_task_fair() relies solely on set_next_buddy() to provide
>     preference to the target vCPU. This buddy mechanism only offers
>     immediate, transient preference. Once the buddy hint expires (typically
>     after one scheduling decision), the yielding vCPU may preempt the target
>     again, especially in nested cgroup hierarchies where vruntime domains
>     differ.
> 
>     This creates a ping-pong effect: the lock holder runs briefly, gets
>     preempted before completing critical sections, and the yielding vCPU
>     spins again, triggering another futile yield_to() cycle. The overhead
>     accumulates rapidly in workloads with high lock contention.

I can certainly confirm that on s390 we do see that yield_to does not always
work as expected. Our spinlock code is lock holder aware so our KVM always yield
correctly but often enought the hint is ignored our bounced back as you describe.
So I am certainly interested in that part.

I need to look more closely into the other part.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-11-10 12:02 ` [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Christian Borntraeger
@ 2025-11-12  5:01   ` Wanpeng Li
  2025-11-18  8:11     ` Christian Borntraeger
  0 siblings, 1 reply; 36+ messages in thread
From: Wanpeng Li @ 2025-11-12  5:01 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Ilya Leoshkevich, Mete Durlu

Hi Christian,

On Mon, 10 Nov 2025 at 20:02, Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
> Am 10.11.25 um 04:32 schrieb Wanpeng Li:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > This series addresses long-standing yield_to() inefficiencies in
> > virtualized environments through two complementary mechanisms: a vCPU
> > debooster in the scheduler and IPI-aware directed yield in KVM.
> >
> > Problem Statement
> > -----------------
> >
> > In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> > held by other vCPUs that are not currently running. The kernel's
> > paravirtual spinlock support detects these situations and calls yield_to()
> > to boost the lock holder, allowing it to run and release the lock.
> >
> > However, the current implementation has two critical limitations:
> >
> > 1. Scheduler-side limitation:
> >
> >     yield_to_task_fair() relies solely on set_next_buddy() to provide
> >     preference to the target vCPU. This buddy mechanism only offers
> >     immediate, transient preference. Once the buddy hint expires (typically
> >     after one scheduling decision), the yielding vCPU may preempt the target
> >     again, especially in nested cgroup hierarchies where vruntime domains
> >     differ.
> >
> >     This creates a ping-pong effect: the lock holder runs briefly, gets
> >     preempted before completing critical sections, and the yielding vCPU
> >     spins again, triggering another futile yield_to() cycle. The overhead
> >     accumulates rapidly in workloads with high lock contention.
>
> I can certainly confirm that on s390 we do see that yield_to does not always
> work as expected. Our spinlock code is lock holder aware so our KVM always yield
> correctly but often enought the hint is ignored our bounced back as you describe.
> So I am certainly interested in that part.
>
> I need to look more closely into the other part.

Thanks for the confirmation and interest! It's valuable to hear that
s390 observes similar yield_to() behavior where the hint gets ignored
or bounced back despite correct lock holder identification.

Since your spinlock code is already lock-holder-aware and KVM yields
to the correct target, the scheduler-side improvements (patches 1-5)
should directly address the ping-pong issue you're seeing. The
vruntime penalties are designed to sustain the preference beyond the
transient buddy hint, which should reduce the bouncing effect.

Best regards,
Wanpeng

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-11-12  5:01   ` Wanpeng Li
@ 2025-11-18  8:11     ` Christian Borntraeger
  2025-11-18 14:19       ` Wanpeng Li
  0 siblings, 1 reply; 36+ messages in thread
From: Christian Borntraeger @ 2025-11-18  8:11 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Ilya Leoshkevich, Mete Durlu, Axel Busch

Am 12.11.25 um 06:01 schrieb Wanpeng Li:
> Hi Christian,
> 
> On Mon, 10 Nov 2025 at 20:02, Christian Borntraeger
> <borntraeger@linux.ibm.com> wrote:
>>
>> Am 10.11.25 um 04:32 schrieb Wanpeng Li:
>>> From: Wanpeng Li <wanpengli@tencent.com>
>>>
>>> This series addresses long-standing yield_to() inefficiencies in
>>> virtualized environments through two complementary mechanisms: a vCPU
>>> debooster in the scheduler and IPI-aware directed yield in KVM.
>>>
>>> Problem Statement
>>> -----------------
>>>
>>> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
>>> held by other vCPUs that are not currently running. The kernel's
>>> paravirtual spinlock support detects these situations and calls yield_to()
>>> to boost the lock holder, allowing it to run and release the lock.
>>>
>>> However, the current implementation has two critical limitations:
>>>
>>> 1. Scheduler-side limitation:
>>>
>>>      yield_to_task_fair() relies solely on set_next_buddy() to provide
>>>      preference to the target vCPU. This buddy mechanism only offers
>>>      immediate, transient preference. Once the buddy hint expires (typically
>>>      after one scheduling decision), the yielding vCPU may preempt the target
>>>      again, especially in nested cgroup hierarchies where vruntime domains
>>>      differ.
>>>
>>>      This creates a ping-pong effect: the lock holder runs briefly, gets
>>>      preempted before completing critical sections, and the yielding vCPU
>>>      spins again, triggering another futile yield_to() cycle. The overhead
>>>      accumulates rapidly in workloads with high lock contention.
>>
>> I can certainly confirm that on s390 we do see that yield_to does not always
>> work as expected. Our spinlock code is lock holder aware so our KVM always yield
>> correctly but often enought the hint is ignored our bounced back as you describe.
>> So I am certainly interested in that part.
>>
>> I need to look more closely into the other part.
> 
> Thanks for the confirmation and interest! It's valuable to hear that
> s390 observes similar yield_to() behavior where the hint gets ignored
> or bounced back despite correct lock holder identification.
> 
> Since your spinlock code is already lock-holder-aware and KVM yields
> to the correct target, the scheduler-side improvements (patches 1-5)
> should directly address the ping-pong issue you're seeing. The
> vruntime penalties are designed to sustain the preference beyond the
> transient buddy hint, which should reduce the bouncing effect.

So we will play a bit with the first patches and check for performance improvements.

I am curious, I did a quick unit test with 2 CPUs ping ponging on a counter. And I do
see "more than count" numbers of the yield hypercalls with that testcase (as before).
Something like 40060000 yields instead of 4000000 for a perfect ping pong. If I comment
out your rate limit code I hit exactly the 4000000.
Can you maybe outline a bit why the rate limit is important and needed?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-11-18  8:11     ` Christian Borntraeger
@ 2025-11-18 14:19       ` Wanpeng Li
  0 siblings, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-18 14:19 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Ilya Leoshkevich, Mete Durlu, Axel Busch

Hi Christian，

On Tue, 18 Nov 2025 at 16:12, Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
> Am 12.11.25 um 06:01 schrieb Wanpeng Li:
> > Hi Christian,
> >
> > On Mon, 10 Nov 2025 at 20:02, Christian Borntraeger
> > <borntraeger@linux.ibm.com> wrote:
> >>
> >> Am 10.11.25 um 04:32 schrieb Wanpeng Li:
> >>> From: Wanpeng Li <wanpengli@tencent.com>
> >>>
> >>> This series addresses long-standing yield_to() inefficiencies in
> >>> virtualized environments through two complementary mechanisms: a vCPU
> >>> debooster in the scheduler and IPI-aware directed yield in KVM.
> >>>
> >>> Problem Statement
> >>> -----------------
> >>>
> >>> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> >>> held by other vCPUs that are not currently running. The kernel's
> >>> paravirtual spinlock support detects these situations and calls yield_to()
> >>> to boost the lock holder, allowing it to run and release the lock.
> >>>
> >>> However, the current implementation has two critical limitations:
> >>>
> >>> 1. Scheduler-side limitation:
> >>>
> >>>      yield_to_task_fair() relies solely on set_next_buddy() to provide
> >>>      preference to the target vCPU. This buddy mechanism only offers
> >>>      immediate, transient preference. Once the buddy hint expires (typically
> >>>      after one scheduling decision), the yielding vCPU may preempt the target
> >>>      again, especially in nested cgroup hierarchies where vruntime domains
> >>>      differ.
> >>>
> >>>      This creates a ping-pong effect: the lock holder runs briefly, gets
> >>>      preempted before completing critical sections, and the yielding vCPU
> >>>      spins again, triggering another futile yield_to() cycle. The overhead
> >>>      accumulates rapidly in workloads with high lock contention.
> >>
> >> I can certainly confirm that on s390 we do see that yield_to does not always
> >> work as expected. Our spinlock code is lock holder aware so our KVM always yield
> >> correctly but often enought the hint is ignored our bounced back as you describe.
> >> So I am certainly interested in that part.
> >>
> >> I need to look more closely into the other part.
> >
> > Thanks for the confirmation and interest! It's valuable to hear that
> > s390 observes similar yield_to() behavior where the hint gets ignored
> > or bounced back despite correct lock holder identification.
> >
> > Since your spinlock code is already lock-holder-aware and KVM yields
> > to the correct target, the scheduler-side improvements (patches 1-5)
> > should directly address the ping-pong issue you're seeing. The
> > vruntime penalties are designed to sustain the preference beyond the
> > transient buddy hint, which should reduce the bouncing effect.
>
> So we will play a bit with the first patches and check for performance improvements.
>
> I am curious, I did a quick unit test with 2 CPUs ping ponging on a counter. And I do
> see "more than count" numbers of the yield hypercalls with that testcase (as before).
> Something like 40060000 yields instead of 4000000 for a perfect ping pong. If I comment
> out your rate limit code I hit exactly the 4000000.
> Can you maybe outline a bit why the rate limit is important and needed?

Good catch! The 10× inflation is actually expected behavior. The key
insight is that rate limit filters penalty applications, not yield
hypercalls. In your ping-pong test with 4M counter increments, PLE
hardware fires multiple times per lock acquisition (roughly 10 times
based on your numbers), and each triggers kvm_vcpu_on_spin() . Without
rate limit, every yield immediately applies vruntime penalty. In tight
ping-pong, this causes over-penalization where the skip vCPU becomes
so deprioritized it effectively starves, which paradoxically
neutralizes the debooster effect. You see "exactly 4M" not because
it's working optimally, but because excessive penalties create a
pathological equilibrium where subsequent yields are suppressed by
starvation. With a 6ms rate limit, all 40M hypercalls still occur (PLE
still fires), but only the first yield in each burst applies a penalty
while subsequent ones are filtered. This gives you roughly 4M
penalties (one per actual lock acquisition) instead of 40M, providing
sustained advantage without over-penalization. The 6ms threshold was
empirically tuned as roughly 2× typical timeslice to filter intra-lock
PLE bursts while preserving responsiveness to legitimate contention.
Your test validates the design by showing rate limit prevents penalty
amplification even in the tightest ping-pong scenario.

I'll post v2 after the merge window with code comments addressing this
and other review feedback, which should be more suitable for
performance evaluation.

Wanpeng

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
                   ` (10 preceding siblings ...)
  2025-11-10 12:02 ` [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Christian Borntraeger
@ 2025-11-11  6:28 ` K Prateek Nayak
  2025-11-12  4:54   ` Wanpeng Li
  11 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-11  6:28 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, Sean Christopherson
  Cc: Steven Rostedt, Vincent Guittot, Juri Lelli, linux-kernel, kvm,
	Wanpeng Li

Hello Wanpeng,

I haven't looked at the entire series and the penalty calculation math
but I've a few questions looking at the cover-letter.

On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> This series addresses long-standing yield_to() inefficiencies in
> virtualized environments through two complementary mechanisms: a vCPU
> debooster in the scheduler and IPI-aware directed yield in KVM.
> 
> Problem Statement
> -----------------
> 
> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> held by other vCPUs that are not currently running. The kernel's
> paravirtual spinlock support detects these situations and calls yield_to()
> to boost the lock holder, allowing it to run and release the lock.
> 
> However, the current implementation has two critical limitations:
> 
> 1. Scheduler-side limitation:
> 
>    yield_to_task_fair() relies solely on set_next_buddy() to provide
>    preference to the target vCPU. This buddy mechanism only offers
>    immediate, transient preference. Once the buddy hint expires (typically
>    after one scheduling decision), the yielding vCPU may preempt the target
>    again, especially in nested cgroup hierarchies where vruntime domains
>    differ.

So what you are saying is there are configurations out there where vCPUs
of same guest are put in different cgroups? Why? Does the use case
warrant enabling the cpu controller for the subtree? Are you running
with the "NEXT_BUDDY" sched feat enabled?

If they are in the same cgroup, the recent optimizations/fixes to
yield_task_fair() in queue:sched/core should help remedy some of the
problems you might be seeing.

For multiple cgroups, perhaps you can extend yield_task_fair() to do:

( Only build and boot tested on top of
    git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
  at commit f82a0f91493f "sched/deadline: Minor cleanup in
  select_task_rq_dl()" )

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4617d631549..87560f5a18b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
 	 * which yields immediately again; without the condition the vruntime
 	 * ends up quickly running away.
 	 */
-	if (entity_eligible(cfs_rq, se)) {
+	do {
+		cfs_rq = cfs_rq_of(se);
+
+		/*
+		 * Another entity will be selected at next pick.
+		 * Single entity on cfs_rq can never be ineligible.
+		 */
+		if (!entity_eligible(cfs_rq, se))
+			break;
+
 		se->vruntime = se->deadline;
 		se->deadline += calc_delta_fair(se->slice, se);
-	}
+
+		/*
+		 * If we have more than one runnable task queued below
+		 * this cfs_rq, the next pick will likely go for a
+		 * different entity now that we have advanced the
+		 * vruntime and the deadline of the running entity.
+		 */
+		if (cfs_rq->h_nr_runnable > 1)
+			break;
+	} while ((se = parent_entity(se)));
 }
 
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
---

With that, I'm pretty sure there is a good chance we'll not select the
hierarchy that did a yield_to() unless there is a large discrepancy in
their weights and just advancing se->vruntime to se->deadline once isn't
enough to make it ineligible and you'll have to do it multiple time (at
which point that cgroup hierarchy needs to be studied).

As for the problem that NEXT_BUDDY hint is used only once, you can
perhaps reintroduce LAST_BUDDY which sets does a set_next_buddy() for
the "prev" task during schedule?

> 
>    This creates a ping-pong effect: the lock holder runs briefly, gets
>    preempted before completing critical sections, and the yielding vCPU
>    spins again, triggering another futile yield_to() cycle. The overhead
>    accumulates rapidly in workloads with high lock contention.
> 
> 2. KVM-side limitation:
> 
>    kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
>    directed yield candidate selection. However, it lacks awareness of IPI
>    communication patterns. When a vCPU sends an IPI and spins waiting for
>    a response (common in inter-processor synchronization), the current
>    heuristics often fail to identify the IPI receiver as the yield target.

Can't that be solved on the KVM end? Also shouldn't Patch 6 be on top
with a "Fixes:" tag.

> 
>    Instead, the code may boost an unrelated vCPU based on coarse-grained
>    preemption state, missing opportunities to accelerate actual IPI
>    response handling. This is particularly problematic when the IPI receiver
>    is runnable but not scheduled, as lock-holder-detection logic doesn't
>    capture the IPI dependency relationship.

Are you saying the yield_to() is called with an incorrect target vCPU?

> 
> Combined, these issues cause excessive lock hold times, cache thrashing,
> and degraded throughput in overcommitted environments, particularly
> affecting workloads with fine-grained synchronization patterns.
> 
-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-11-11  6:28 ` K Prateek Nayak
@ 2025-11-12  4:54   ` Wanpeng Li
  2025-11-12  6:07     ` K Prateek Nayak
  2025-11-13  4:42     ` K Prateek Nayak
  0 siblings, 2 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-12  4:54 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hi Prateek,

On Tue, 11 Nov 2025 at 14:28, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> I haven't looked at the entire series and the penalty calculation math
> but I've a few questions looking at the cover-letter.

Thanks for the review and the thoughtful questions.

>
> On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > This series addresses long-standing yield_to() inefficiencies in
> > virtualized environments through two complementary mechanisms: a vCPU
> > debooster in the scheduler and IPI-aware directed yield in KVM.
> >
> > Problem Statement
> > -----------------
> >
> > In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> > held by other vCPUs that are not currently running. The kernel's
> > paravirtual spinlock support detects these situations and calls yield_to()
> > to boost the lock holder, allowing it to run and release the lock.
> >
> > However, the current implementation has two critical limitations:
> >
> > 1. Scheduler-side limitation:
> >
> >    yield_to_task_fair() relies solely on set_next_buddy() to provide
> >    preference to the target vCPU. This buddy mechanism only offers
> >    immediate, transient preference. Once the buddy hint expires (typically
> >    after one scheduling decision), the yielding vCPU may preempt the target
> >    again, especially in nested cgroup hierarchies where vruntime domains
> >    differ.
>
> So what you are saying is there are configurations out there where vCPUs
> of same guest are put in different cgroups? Why? Does the use case
> warrant enabling the cpu controller for the subtree? Are you running

You're right to question this. The problematic scenario occurs with
nested cgroup hierarchies, which is common when VMs are deployed with
cgroup-based resource management. Even when all vCPUs of a single
guest are in the same leaf cgroup, that leaf sits under parent cgroups
with their own vruntime domains.

The issue manifests when:
   - set_next_buddy() provides preference at the leaf level
   - But vruntime competition happens at parent levels
   - The buddy hint gets "diluted" when pick_task_fair() walks up the hierarchy

The cpu controller is typically enabled in these deployments for quota
enforcement and weight-based sharing. That said, the debooster
mechanism is designed to be general-purpose: it handles any scenario
where yield_to() crosses cgroup boundaries, whether due to nested
hierarchies or sibling cgroups.

> with the "NEXT_BUDDY" sched feat enabled?

Yes, NEXT_BUDDY is enabled. The problem is that set_next_buddy()
provides only immediate, transient preference. Once the buddy hint is
consumed (typically after one pick_next_task_fair() call), the
yielding vCPU can preempt the target again if their vruntime values
haven't diverged sufficiently.

>
> If they are in the same cgroup, the recent optimizations/fixes to
> yield_task_fair() in queue:sched/core should help remedy some of the
> problems you might be seeing.

Agreed - the recent yield_task_fair() improvements in queue:sched/core
(EEVDF-based vruntime = deadline with hierarchical walk) are valuable.
However, our patchset focuses on yield_to() rather than yield(), which
has different semantics:
   - yield_task_fair(): "I voluntarily give up CPU, pick someone else"
→ Recent improvements handle this well with hierarchical walk
   - yield_to_task_fair(): "I want *this specific task* to run
instead" → Requires finding the LCA of yielder and target, then
applying penalties at that level to influence their relative
competition

The debooster extends yield_to() to handle cross-cgroup scenarios
where the yielder and target may be in different subtrees.

>
> For multiple cgroups, perhaps you can extend yield_task_fair() to do:

Thanks for the suggestion. Your hierarchical walk approach shares
similarities with our implementation. A few questions on the details:

>
> ( Only build and boot tested on top of
>     git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
>   at commit f82a0f91493f "sched/deadline: Minor cleanup in
>   select_task_rq_dl()" )
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b4617d631549..87560f5a18b3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
>          * which yields immediately again; without the condition the vruntime
>          * ends up quickly running away.
>          */
> -       if (entity_eligible(cfs_rq, se)) {
> +       do {
> +               cfs_rq = cfs_rq_of(se);
> +
> +               /*
> +                * Another entity will be selected at next pick.
> +                * Single entity on cfs_rq can never be ineligible.
> +                */
> +               if (!entity_eligible(cfs_rq, se))
> +                       break;
> +
>                 se->vruntime = se->deadline;

Setting vruntime = deadline zeros out lag. Does this cause fairness
drift with repeated yields? We explicitly recalculate vlag after
adjustment to preserve EEVDF invariants.

>                 se->deadline += calc_delta_fair(se->slice, se);
> -       }
> +
> +               /*
> +                * If we have more than one runnable task queued below
> +                * this cfs_rq, the next pick will likely go for a
> +                * different entity now that we have advanced the
> +                * vruntime and the deadline of the running entity.
> +                */
> +               if (cfs_rq->h_nr_runnable > 1)

Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
correctly. Shouldn't the penalty apply at the LCA of yielder and
target? Otherwise the vruntime adjustment might not affect the level
where they actually compete.

> +                       break;
> +       } while ((se = parent_entity(se)));
>  }
>
>  static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
> ---

Fixed one-slice penalties underperformed in our testing (dbench:
+14.4%/+9.8%/+6.7% for 2/3/4 VMs). We found adaptive scaling (6.0×
down to 1.0× based on queue size) necessary to balance effectiveness
against starvation.

>
> With that, I'm pretty sure there is a good chance we'll not select the
> hierarchy that did a yield_to() unless there is a large discrepancy in
> their weights and just advancing se->vruntime to se->deadline once isn't
> enough to make it ineligible and you'll have to do it multiple time (at
> which point that cgroup hierarchy needs to be studied).
>
> As for the problem that NEXT_BUDDY hint is used only once, you can
> perhaps reintroduce LAST_BUDDY which sets does a set_next_buddy() for
> the "prev" task during schedule?

That's an interesting idea. However, LAST_BUDDY was removed from the
scheduler due to concerns about fairness and latency regressions in
general workloads. Reintroducing it globally might regress non-vCPU
workloads.

Our approach is more targeted: apply vruntime penalties specifically
in the yield_to() path (controlled by debugfs flag), avoiding impact
on general scheduling. The debooster is inert unless explicitly
enabled and rate-limited to prevent pathological overhead.

>
> >
> >    This creates a ping-pong effect: the lock holder runs briefly, gets
> >    preempted before completing critical sections, and the yielding vCPU
> >    spins again, triggering another futile yield_to() cycle. The overhead
> >    accumulates rapidly in workloads with high lock contention.
> >
> > 2. KVM-side limitation:
> >
> >    kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
> >    directed yield candidate selection. However, it lacks awareness of IPI
> >    communication patterns. When a vCPU sends an IPI and spins waiting for
> >    a response (common in inter-processor synchronization), the current
> >    heuristics often fail to identify the IPI receiver as the yield target.
>
> Can't that be solved on the KVM end?

Yes, the IPI tracking is entirely KVM-side (patches 6-10). The
scheduler-side debooster (patches 1-5) and KVM-side IPI tracking are
orthogonal mechanisms:
   - Debooster: sustains yield_to() preference regardless of *who* is
yielding to whom
   - IPI tracking: improves *which* target is selected when a vCPU spins

Both showed independent gains in our testing, and combined effects
were approximately additive.

> Also shouldn't Patch 6 be on top with a "Fixes:" tag.

You're right. Patch 6 (last_boosted_vcpu bug fix) is a standalone
bugfix and should be at the top with a Fixes tag. I'll reorder it in
v2 with:
Fixes: 7e513617da71 ("KVM: Rework core loop of kvm_vcpu_on_spin() to
use a single for-loop")

>
> >
> >    Instead, the code may boost an unrelated vCPU based on coarse-grained
> >    preemption state, missing opportunities to accelerate actual IPI
> >    response handling. This is particularly problematic when the IPI receiver
> >    is runnable but not scheduled, as lock-holder-detection logic doesn't
> >    capture the IPI dependency relationship.
>
> Are you saying the yield_to() is called with an incorrect target vCPU?

Yes - more precisely, the issue is in kvm_vcpu_on_spin()'s target
selection logic before yield_to() is called. Without IPI tracking, it
relies on preemption state, which doesn't capture "vCPU waiting for
IPI response from specific other vCPU."

The IPI tracking records sender→receiver relationships at interrupt
delivery time (patch 8), enabling kvm_vcpu_on_spin() to directly boost
the IPI receiver when the sender spins (patch 9). This addresses
scenarios where the spinning vCPU is waiting for IPI acknowledgment
rather than lock release.

Performance (16 pCPU host, 16 vCPUs/VM, PARSEC workloads):
   - Dedup: +47.1%/+28.1%/+1.7% for 2/3/4 VMs
   - VIPS: +26.2%/+12.7%/+6.0% for 2/3/4 VMs

Gains are most pronounced at moderate overcommit where the IPI
receiver is often runnable but not scheduled.

Thanks again for the review and suggestions.

Best regards,
Wanpeng

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-11-12  4:54   ` Wanpeng Li
@ 2025-11-12  6:07     ` K Prateek Nayak
  2025-11-13  5:37       ` Wanpeng Li
  2025-11-13  4:42     ` K Prateek Nayak
  1 sibling, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-12  6:07 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hello Wanpeng,

On 11/12/2025 10:24 AM, Wanpeng Li wrote:
>>> Problem Statement
>>> -----------------
>>>
>>> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
>>> held by other vCPUs that are not currently running. The kernel's
>>> paravirtual spinlock support detects these situations and calls yield_to()
>>> to boost the lock holder, allowing it to run and release the lock.
>>>
>>> However, the current implementation has two critical limitations:
>>>
>>> 1. Scheduler-side limitation:
>>>
>>>    yield_to_task_fair() relies solely on set_next_buddy() to provide
>>>    preference to the target vCPU. This buddy mechanism only offers
>>>    immediate, transient preference. Once the buddy hint expires (typically
>>>    after one scheduling decision), the yielding vCPU may preempt the target
>>>    again, especially in nested cgroup hierarchies where vruntime domains
>>>    differ.
>>
>> So what you are saying is there are configurations out there where vCPUs
>> of same guest are put in different cgroups? Why? Does the use case
>> warrant enabling the cpu controller for the subtree? Are you running
> 
> You're right to question this. The problematic scenario occurs with
> nested cgroup hierarchies, which is common when VMs are deployed with
> cgroup-based resource management. Even when all vCPUs of a single
> guest are in the same leaf cgroup, that leaf sits under parent cgroups
> with their own vruntime domains.
> 
> The issue manifests when:
>    - set_next_buddy() provides preference at the leaf level
>    - But vruntime competition happens at parent levels

If that is the case, then NEXT_BUDDY is in-eligible as a result of its
vruntime being higher that the weighted averages of other entity.
Won't this break fairness?

Let me go look at the series and come back.

>    - The buddy hint gets "diluted" when pick_task_fair() walks up the hierarchy
> 
-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-11-12  6:07     ` K Prateek Nayak
@ 2025-11-13  5:37       ` Wanpeng Li
  0 siblings, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-13  5:37 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hi Prateek,

On Wed, 12 Nov 2025 at 14:07, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> On 11/12/2025 10:24 AM, Wanpeng Li wrote:
> >>> Problem Statement
> >>> -----------------
> >>>
> >>> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> >>> held by other vCPUs that are not currently running. The kernel's
> >>> paravirtual spinlock support detects these situations and calls yield_to()
> >>> to boost the lock holder, allowing it to run and release the lock.
> >>>
> >>> However, the current implementation has two critical limitations:
> >>>
> >>> 1. Scheduler-side limitation:
> >>>
> >>>    yield_to_task_fair() relies solely on set_next_buddy() to provide
> >>>    preference to the target vCPU. This buddy mechanism only offers
> >>>    immediate, transient preference. Once the buddy hint expires (typically
> >>>    after one scheduling decision), the yielding vCPU may preempt the target
> >>>    again, especially in nested cgroup hierarchies where vruntime domains
> >>>    differ.
> >>
> >> So what you are saying is there are configurations out there where vCPUs
> >> of same guest are put in different cgroups? Why? Does the use case
> >> warrant enabling the cpu controller for the subtree? Are you running
> >
> > You're right to question this. The problematic scenario occurs with
> > nested cgroup hierarchies, which is common when VMs are deployed with
> > cgroup-based resource management. Even when all vCPUs of a single
> > guest are in the same leaf cgroup, that leaf sits under parent cgroups
> > with their own vruntime domains.
> >
> > The issue manifests when:
> >    - set_next_buddy() provides preference at the leaf level
> >    - But vruntime competition happens at parent levels
>
> If that is the case, then NEXT_BUDDY is in-eligible as a result of its
> vruntime being higher that the weighted averages of other entity.
> Won't this break fairness?

Yes, it does break strict vruntime fairness temporarily. That's
intentional. The problem: buddy expires after one pick, then vruntime
wins → ping-pong. The spinning vCPU wastes CPU while the lock holder
stays preempted. The fix applies a bounded vruntime penalty to the
yielder at the cgroup LCA level:

Bounds:
  * Rate limited: 6ms minimum interval between deboosting
  * Queue-adaptive caps: 6.0× gran for 2-task ping-pong, decays to
1.0× gran for large queues (12+)
  * Debounce: 600µs window detects A→B→A reverse patterns and reduces penalty
  * Hierarchy-aware: Applied at LCA, so same-cgroup yields have localized impact

Why acceptable: Current behavior is already unfair—wasting CPU on
spinning instead of productive work. Bounded vruntime penalty lets the
lock holder complete faster, reducing overall waste. The scheduler
still converges to fairness—the penalty just gives the boosted task
sustained advantage until it finishes the critical section. Runtime
toggle available via
/sys/kernel/debug/sched/sched_vcpu_debooster_enabled if degradation
observed. Dbench results show net throughput wins (+6-14%) outweigh
the temporary fairness deviation.

Regards,
Wanpeng

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-11-12  4:54   ` Wanpeng Li
  2025-11-12  6:07     ` K Prateek Nayak
@ 2025-11-13  4:42     ` K Prateek Nayak
  2025-11-13  8:33       ` Wanpeng Li
  1 sibling, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-13  4:42 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hello Wanpeng,

On 11/12/2025 10:24 AM, Wanpeng Li wrote:
>>
>> ( Only build and boot tested on top of
>>     git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
>>   at commit f82a0f91493f "sched/deadline: Minor cleanup in
>>   select_task_rq_dl()" )
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b4617d631549..87560f5a18b3 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
>>          * which yields immediately again; without the condition the vruntime
>>          * ends up quickly running away.
>>          */
>> -       if (entity_eligible(cfs_rq, se)) {
>> +       do {
>> +               cfs_rq = cfs_rq_of(se);
>> +
>> +               /*
>> +                * Another entity will be selected at next pick.
>> +                * Single entity on cfs_rq can never be ineligible.
>> +                */
>> +               if (!entity_eligible(cfs_rq, se))
>> +                       break;
>> +
>>                 se->vruntime = se->deadline;
> 
> Setting vruntime = deadline zeros out lag. Does this cause fairness
> drift with repeated yields? We explicitly recalculate vlag after
> adjustment to preserve EEVDF invariants.

We only push deadline when the entity is eligible. Ineligible entity
will break out above. Also I don't get how adding a penalty to an
entity in the cgroup hierarchy of the yielding task when there are
other runnable tasks considered as "preserve(ing) EEVDF invariants".

> 
>>                 se->deadline += calc_delta_fair(se->slice, se);
>> -       }
>> +
>> +               /*
>> +                * If we have more than one runnable task queued below
>> +                * this cfs_rq, the next pick will likely go for a
>> +                * different entity now that we have advanced the
>> +                * vruntime and the deadline of the running entity.
>> +                */
>> +               if (cfs_rq->h_nr_runnable > 1)
> 
> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> correctly. Shouldn't the penalty apply at the LCA of yielder and
> target? Otherwise the vruntime adjustment might not affect the level
> where they actually compete.

So here is the case I'm going after - consider the following
hierarchy:

     root
    /    \
  CG0   CG1
   |     |
   A     B

  CG* are cgroups and, [A-Z]* are tasks

A decides to yield to B, and advances its deadline on CG0's timeline.
Currently, if CG0 is eligible and CG1 isn't, pick will still select
CG0 which will in-turn select task A and it'll yield again. This
cycle repeates until vruntime of CG0 turns large enough to make itself
ineligible and route the EEVDF pick to CG1.

Now consider:


       root
      /    \
    CG0   CG1
   /   \   |
  A     C  B

Same scenario: A yields to B. A advances its vruntime and deadline
as a prt of yield. Now, why should CG0 sacrifice its fair share of
runtime for A when task B is runnable? Just because one task decided
to yield to another task in a different cgroup doesn't mean other
waiting tasks on that hierarchy suffer.

> 
>> +                       break;
>> +       } while ((se = parent_entity(se)));
>>  }
>>
>>  static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
>> ---
> 
> Fixed one-slice penalties underperformed in our testing (dbench:
> +14.4%/+9.8%/+6.7% for 2/3/4 VMs). We found adaptive scaling (6.0×
> down to 1.0× based on queue size) necessary to balance effectiveness
> against starvation.

If all vCPUs of a VM are in the same cgroup - yield_to() should work
just fine. If this "target" task is not selected then either some
entity in the hierarchy, or the task is ineligible and EEVDF pick has
decided to go with something else.

It is not "starvation" but rather you've received you for fair share
of "proportional runtime" and now you wait. If you really want to
follow EEVDF maybe you compute the vlag and if it is behind the
avg_vruntime, you account it to the "target" task - that would be
in the spirit of the EEVDF algorithm.

> 
>>
>> With that, I'm pretty sure there is a good chance we'll not select the
>> hierarchy that did a yield_to() unless there is a large discrepancy in
>> their weights and just advancing se->vruntime to se->deadline once isn't
>> enough to make it ineligible and you'll have to do it multiple time (at
>> which point that cgroup hierarchy needs to be studied).
>>
>> As for the problem that NEXT_BUDDY hint is used only once, you can
>> perhaps reintroduce LAST_BUDDY which sets does a set_next_buddy() for
>> the "prev" task during schedule?
> 
> That's an interesting idea. However, LAST_BUDDY was removed from the
> scheduler due to concerns about fairness and latency regressions in
> general workloads. Reintroducing it globally might regress non-vCPU
> workloads.
> 
> Our approach is more targeted: apply vruntime penalties specifically
> in the yield_to() path (controlled by debugfs flag), avoiding impact
> on general scheduling. The debooster is inert unless explicitly
> enabled and rate-limited to prevent pathological overhead.

Yeah, I'm still not on board with the idea but maybe I don't see the
vision. Hope other scheduler folks can chime in.

> 
>>
>>>
>>>    This creates a ping-pong effect: the lock holder runs briefly, gets
>>>    preempted before completing critical sections, and the yielding vCPU
>>>    spins again, triggering another futile yield_to() cycle. The overhead
>>>    accumulates rapidly in workloads with high lock contention.
>>>
>>> 2. KVM-side limitation:
>>>
>>>    kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
>>>    directed yield candidate selection. However, it lacks awareness of IPI
>>>    communication patterns. When a vCPU sends an IPI and spins waiting for
>>>    a response (common in inter-processor synchronization), the current
>>>    heuristics often fail to identify the IPI receiver as the yield target.
>>
>> Can't that be solved on the KVM end?
> 
> Yes, the IPI tracking is entirely KVM-side (patches 6-10). The
> scheduler-side debooster (patches 1-5) and KVM-side IPI tracking are
> orthogonal mechanisms:
>    - Debooster: sustains yield_to() preference regardless of *who* is
> yielding to whom
>    - IPI tracking: improves *which* target is selected when a vCPU spins
> 
> Both showed independent gains in our testing, and combined effects
> were approximately additive.

I'll try to look at the KVM bits but I'm not familiar enough with
those bits enough to review it well :)

> 
>> Also shouldn't Patch 6 be on top with a "Fixes:" tag.
> 
> You're right. Patch 6 (last_boosted_vcpu bug fix) is a standalone
> bugfix and should be at the top with a Fixes tag. I'll reorder it in
> v2 with:
> Fixes: 7e513617da71 ("KVM: Rework core loop of kvm_vcpu_on_spin() to
> use a single for-loop")

Thank you.

> 
>>
>>>
>>>    Instead, the code may boost an unrelated vCPU based on coarse-grained
>>>    preemption state, missing opportunities to accelerate actual IPI
>>>    response handling. This is particularly problematic when the IPI receiver
>>>    is runnable but not scheduled, as lock-holder-detection logic doesn't
>>>    capture the IPI dependency relationship.
>>
>> Are you saying the yield_to() is called with an incorrect target vCPU?
> 
> Yes - more precisely, the issue is in kvm_vcpu_on_spin()'s target
> selection logic before yield_to() is called. Without IPI tracking, it
> relies on preemption state, which doesn't capture "vCPU waiting for
> IPI response from specific other vCPU."
> 
> The IPI tracking records sender→receiver relationships at interrupt
> delivery time (patch 8), enabling kvm_vcpu_on_spin() to directly boost
> the IPI receiver when the sender spins (patch 9). This addresses
> scenarios where the spinning vCPU is waiting for IPI acknowledgment
> rather than lock release.
> 
> Performance (16 pCPU host, 16 vCPUs/VM, PARSEC workloads):
>    - Dedup: +47.1%/+28.1%/+1.7% for 2/3/4 VMs
>    - VIPS: +26.2%/+12.7%/+6.0% for 2/3/4 VMs
> 
> Gains are most pronounced at moderate overcommit where the IPI
> receiver is often runnable but not scheduled.
> 
> Thanks again for the review and suggestions.
> 
> Best regards,
> Wanpeng

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-11-13  4:42     ` K Prateek Nayak
@ 2025-11-13  8:33       ` Wanpeng Li
  2025-11-13  9:48         ` K Prateek Nayak
  0 siblings, 1 reply; 36+ messages in thread
From: Wanpeng Li @ 2025-11-13  8:33 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hi Prateek,

On Thu, 13 Nov 2025 at 12:42, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> On 11/12/2025 10:24 AM, Wanpeng Li wrote:
> >>
> >> ( Only build and boot tested on top of
> >>     git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
> >>   at commit f82a0f91493f "sched/deadline: Minor cleanup in
> >>   select_task_rq_dl()" )
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b4617d631549..87560f5a18b3 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
> >>          * which yields immediately again; without the condition the vruntime
> >>          * ends up quickly running away.
> >>          */
> >> -       if (entity_eligible(cfs_rq, se)) {
> >> +       do {
> >> +               cfs_rq = cfs_rq_of(se);
> >> +
> >> +               /*
> >> +                * Another entity will be selected at next pick.
> >> +                * Single entity on cfs_rq can never be ineligible.
> >> +                */
> >> +               if (!entity_eligible(cfs_rq, se))
> >> +                       break;
> >> +
> >>                 se->vruntime = se->deadline;
> >
> > Setting vruntime = deadline zeros out lag. Does this cause fairness
> > drift with repeated yields? We explicitly recalculate vlag after
> > adjustment to preserve EEVDF invariants.
>
> We only push deadline when the entity is eligible. Ineligible entity
> will break out above. Also I don't get how adding a penalty to an
> entity in the cgroup hierarchy of the yielding task when there are
> other runnable tasks considered as "preserve(ing) EEVDF invariants".

Our penalty preserves EEVDF invariants by recalculating all scheduler state:
   se->vruntime = new_vruntime;
   se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
   se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
   update_min_vruntime(cfs_rq); // maintains cfs_rq consistency
This is the same update pattern used in update_curr(). The EEVDF
relationship lag = (V - v) * w remains valid—vlag becomes more
negative as vruntime increases. The presence of other runnable tasks
doesn't affect the mathematical correctness; each entity's lag is
computed independently relative to avg_vruntime.

>
> >
> >>                 se->deadline += calc_delta_fair(se->slice, se);
> >> -       }
> >> +
> >> +               /*
> >> +                * If we have more than one runnable task queued below
> >> +                * this cfs_rq, the next pick will likely go for a
> >> +                * different entity now that we have advanced the
> >> +                * vruntime and the deadline of the running entity.
> >> +                */
> >> +               if (cfs_rq->h_nr_runnable > 1)
> >
> > Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> > correctly. Shouldn't the penalty apply at the LCA of yielder and
> > target? Otherwise the vruntime adjustment might not affect the level
> > where they actually compete.
>
> So here is the case I'm going after - consider the following
> hierarchy:
>
>      root
>     /    \
>   CG0   CG1
>    |     |
>    A     B
>
>   CG* are cgroups and, [A-Z]* are tasks
>
> A decides to yield to B, and advances its deadline on CG0's timeline.
> Currently, if CG0 is eligible and CG1 isn't, pick will still select
> CG0 which will in-turn select task A and it'll yield again. This
> cycle repeates until vruntime of CG0 turns large enough to make itself
> ineligible and route the EEVDF pick to CG1.

Yes, natural convergence works, but requires multiple cycles. Your
h_nr_runnable > 1 stops propagation when another entity might be
picked, but "might" depends on vruntime ordering which needs time to
develop. Our penalty forces immediate ineligibility at the LCA. One
penalty application vs N natural yield cycles.

>
> Now consider:
>
>
>        root
>       /    \
>     CG0   CG1
>    /   \   |
>   A     C  B
>
> Same scenario: A yields to B. A advances its vruntime and deadline
> as a prt of yield. Now, why should CG0 sacrifice its fair share of
> runtime for A when task B is runnable? Just because one task decided
> to yield to another task in a different cgroup doesn't mean other
> waiting tasks on that hierarchy suffer.

You're right that C suffers unfairly if it's independent work. This is
a known tradeoff. The rationale: when A spins on B's lock, we apply
the penalty at the LCA (root in your example) because that's where A
and B compete. This ensures B gets scheduled. The side effect is C
loses CPU time even though it's not involved in the dependency. In
practice: VMs typically put all vCPUs in one cgroup—no independent C
exists. If C exists and is affected by the same lock, the penalty
helps overall progress. If C is truly independent, it loses one
scheduling slice worth of time.

>
> >
> >> +                       break;
> >> +       } while ((se = parent_entity(se)));
> >>  }
> >>
> >>  static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
> >> ---
> >
> > Fixed one-slice penalties underperformed in our testing (dbench:
> > +14.4%/+9.8%/+6.7% for 2/3/4 VMs). We found adaptive scaling (6.0×
> > down to 1.0× based on queue size) necessary to balance effectiveness
> > against starvation.
>
> If all vCPUs of a VM are in the same cgroup - yield_to() should work
> just fine. If this "target" task is not selected then either some
> entity in the hierarchy, or the task is ineligible and EEVDF pick has
> decided to go with something else.
>
> It is not "starvation" but rather you've received you for fair share
> of "proportional runtime" and now you wait. If you really want to
> follow EEVDF maybe you compute the vlag and if it is behind the
> avg_vruntime, you account it to the "target" task - that would be
> in the spirit of the EEVDF algorithm.

You're right about the terminology—it's priority inversion, not
starvation. On crediting the target: this is philosophically
interesting but has practical issues. 1) Only helps if the target's
vlag < 0 (already lagging). If the lock holder is ahead (vlag > 0), no
effect. 2) Doesn't prevent the yielder from being re-picked at the LCA
if it's still most eligible. Accounting-wise: the spinner consumes
real CPU cycles. Our penalty charges that consumption. Crediting the
target gives service it didn't receive—arguably less consistent with
proportional fairness.

Regards,
Wanpeng

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-11-13  8:33       ` Wanpeng Li
@ 2025-11-13  9:48         ` K Prateek Nayak
  2025-11-13 13:56           ` Wanpeng Li
  0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-13  9:48 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hello Wanpeng,

On 11/13/2025 2:03 PM, Wanpeng Li wrote:
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b4617d631549..87560f5a18b3 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
>>>>          * which yields immediately again; without the condition the vruntime
>>>>          * ends up quickly running away.
>>>>          */
>>>> -       if (entity_eligible(cfs_rq, se)) {
>>>> +       do {
>>>> +               cfs_rq = cfs_rq_of(se);
>>>> +
>>>> +               /*
>>>> +                * Another entity will be selected at next pick.
>>>> +                * Single entity on cfs_rq can never be ineligible.
>>>> +                */
>>>> +               if (!entity_eligible(cfs_rq, se))
>>>> +                       break;
>>>> +
>>>>                 se->vruntime = se->deadline;
>>>
>>> Setting vruntime = deadline zeros out lag. Does this cause fairness
>>> drift with repeated yields? We explicitly recalculate vlag after
>>> adjustment to preserve EEVDF invariants.
>>
>> We only push deadline when the entity is eligible. Ineligible entity
>> will break out above. Also I don't get how adding a penalty to an
>> entity in the cgroup hierarchy of the yielding task when there are
>> other runnable tasks considered as "preserve(ing) EEVDF invariants".
> 
> Our penalty preserves EEVDF invariants by recalculating all scheduler state:
>    se->vruntime = new_vruntime;
>    se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
>    se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
>    update_min_vruntime(cfs_rq); // maintains cfs_rq consistency

So your exact implementation in yield_deboost_apply_penalty() is:

> +	new_vruntime = se_y_lca->vruntime + penalty;
> +
> +	/* Validity check */
> +	if (new_vruntime <= se_y_lca->vruntime)
> +		return;
> +
> +	se_y_lca->vruntime = new_vruntime;

You've updated this vruntime to something that you've seen fit based on
your performance data - better performance is not necessarily fair.

update_curr() uses:

    /* Time elapsed. */
    delta_exec = now - se->exec_start;
    se->exec_start = now;

    curr->vruntime += calc_delta_fair(delta_exec, curr);


"delta_exec" is based on the amount of time entity has run as opposed
to the penalty calculation which simply advances the vruntime by half a
slice because someone in the hierarchy decided to yield.

Also assume the vCPU yielding and the target is on the same cgroup -
you'll advance the vruntime of task in yield_deboost_apply_penalty() and
then again in yield_task_fair()?


> +	se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
> +	se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;

There is no point in setting vlag for a running entity

> +	update_min_vruntime(cfs_rq_common);

> This is the same update pattern used in update_curr(). The EEVDF
> relationship lag = (V - v) * w remains valid—vlag becomes more
> negative as vruntime increases.

Sure "V" just moves to the new avg_vruntime() to give the 0-lag
point but modifying the vruntime arbitrarily doesn't seem fair to
me.

> The presence of other runnable tasks
> doesn't affect the mathematical correctness; each entity's lag is
> computed independently relative to avg_vruntime.
> 
>>
>>>
>>>>                 se->deadline += calc_delta_fair(se->slice, se);
>>>> -       }
>>>> +
>>>> +               /*
>>>> +                * If we have more than one runnable task queued below
>>>> +                * this cfs_rq, the next pick will likely go for a
>>>> +                * different entity now that we have advanced the
>>>> +                * vruntime and the deadline of the running entity.
>>>> +                */
>>>> +               if (cfs_rq->h_nr_runnable > 1)
>>>
>>> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
>>> correctly. Shouldn't the penalty apply at the LCA of yielder and
>>> target? Otherwise the vruntime adjustment might not affect the level
>>> where they actually compete.
>>
>> So here is the case I'm going after - consider the following
>> hierarchy:
>>
>>      root
>>     /    \
>>   CG0   CG1
>>    |     |
>>    A     B
>>
>>   CG* are cgroups and, [A-Z]* are tasks
>>
>> A decides to yield to B, and advances its deadline on CG0's timeline.
>> Currently, if CG0 is eligible and CG1 isn't, pick will still select
>> CG0 which will in-turn select task A and it'll yield again. This
>> cycle repeates until vruntime of CG0 turns large enough to make itself
>> ineligible and route the EEVDF pick to CG1.
> 
> Yes, natural convergence works, but requires multiple cycles. Your
> h_nr_runnable > 1 stops propagation when another entity might be
> picked, but "might" depends on vruntime ordering which needs time to
> develop. Our penalty forces immediate ineligibility at the LCA. One
> penalty application vs N natural yield cycles.
> 
>>
>> Now consider:
>>
>>
>>        root
>>       /    \
>>     CG0   CG1
>>    /   \   |
>>   A     C  B
>>
>> Same scenario: A yields to B. A advances its vruntime and deadline
>> as a prt of yield. Now, why should CG0 sacrifice its fair share of
>> runtime for A when task B is runnable? Just because one task decided
>> to yield to another task in a different cgroup doesn't mean other
>> waiting tasks on that hierarchy suffer.
> 
> You're right that C suffers unfairly if it's independent work. This is
> a known tradeoff.

So KVM is only one of the user of yield_to(). This whole debouncer
infrastructure seems to be over complicating all this. If anything
is yielding across cgroup boundary - that seems like bad
configuration and if necessary, the previous suggestion does stuff
fairly. I don't mind accounting the lost time in
yield_to_task_fair() and account it to target task but apart from
that, I don't think any of it is "fair".

Again, maybe it is only me and everyone else sees the vision having
dealt with virtualization.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
  2025-11-13  9:48         ` K Prateek Nayak
@ 2025-11-13 13:56           ` Wanpeng Li
  0 siblings, 0 replies; 36+ messages in thread
From: Wanpeng Li @ 2025-11-13 13:56 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, Steven Rostedt, Vincent Guittot, Juri Lelli,
	linux-kernel, kvm, Wanpeng Li

Hi Prateek,

On Thu, 13 Nov 2025 at 17:48, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> On 11/13/2025 2:03 PM, Wanpeng Li wrote:
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index b4617d631549..87560f5a18b3 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
> >>>>          * which yields immediately again; without the condition the vruntime
> >>>>          * ends up quickly running away.
> >>>>          */
> >>>> -       if (entity_eligible(cfs_rq, se)) {
> >>>> +       do {
> >>>> +               cfs_rq = cfs_rq_of(se);
> >>>> +
> >>>> +               /*
> >>>> +                * Another entity will be selected at next pick.
> >>>> +                * Single entity on cfs_rq can never be ineligible.
> >>>> +                */
> >>>> +               if (!entity_eligible(cfs_rq, se))
> >>>> +                       break;
> >>>> +
> >>>>                 se->vruntime = se->deadline;
> >>>
> >>> Setting vruntime = deadline zeros out lag. Does this cause fairness
> >>> drift with repeated yields? We explicitly recalculate vlag after
> >>> adjustment to preserve EEVDF invariants.
> >>
> >> We only push deadline when the entity is eligible. Ineligible entity
> >> will break out above. Also I don't get how adding a penalty to an
> >> entity in the cgroup hierarchy of the yielding task when there are
> >> other runnable tasks considered as "preserve(ing) EEVDF invariants".
> >
> > Our penalty preserves EEVDF invariants by recalculating all scheduler state:
> >    se->vruntime = new_vruntime;
> >    se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> >    se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
> >    update_min_vruntime(cfs_rq); // maintains cfs_rq consistency
>
> So your exact implementation in yield_deboost_apply_penalty() is:
>
> > +     new_vruntime = se_y_lca->vruntime + penalty;
> > +
> > +     /* Validity check */
> > +     if (new_vruntime <= se_y_lca->vruntime)
> > +             return;
> > +
> > +     se_y_lca->vruntime = new_vruntime;
>
> You've updated this vruntime to something that you've seen fit based on
> your performance data - better performance is not necessarily fair.
>
> update_curr() uses:
>
>     /* Time elapsed. */
>     delta_exec = now - se->exec_start;
>     se->exec_start = now;
>
>     curr->vruntime += calc_delta_fair(delta_exec, curr);
>
>
> "delta_exec" is based on the amount of time entity has run as opposed
> to the penalty calculation which simply advances the vruntime by half a
> slice because someone in the hierarchy decided to yield.

CFS already separates time accounting from policy enforcement.
place_entity() modifies vruntime based on lag without time
passage—it's placement policy, not time accounting. Similarly,
yield_task_fair() advances the deadline without consuming time—policy
to trigger reschedule. Our penalty follows this established pattern:
bounded vruntime adjustment to implement yield_to() semantics in
hierarchical scheduling. Time accounting ( update_curr ) and
scheduling policy (placement, yielding, penalties) are distinct
mechanisms in CFS.

>
> Also assume the vCPU yielding and the target is on the same cgroup -
> you'll advance the vruntime of task in yield_deboost_apply_penalty() and
> then again in yield_task_fair()?

This is deliberate. When tasks share the same cgroup, they need both
hierarchy-level and leaf-level adjustments.
yield_deboost_apply_penalty() positions the task in cgroup timeline
(affects picking at that level), while yield_task_fair() advances the
deadline (triggers immediate reschedule). Without both, same-cgroup
yield loses effectiveness—the task would be repicked despite yielding.
The double adjustment ensures yield works at both the task level and
across hierarchy levels. This matches CFS's multi-level scheduling
philosophy.

>
>
> > +     se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
> > +     se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;
>
> There is no point in setting vlag for a running entity

Maintaining invariants when modifying scheduler state is standard
practice throughout fair.c. reweight_entity() updates vlag for curr
when changing weights to preserve the lag relationship. We follow the
same principle—when artificially advancing vruntime, recalculate vlag
to maintain vlag = V - v . This prevents inconsistency when the entity
later dequeues. It's defensive correctness at negligible cost. The
alternative—leaving vlag stale—risks subtle bugs when scheduler state
assumptions are violated.

>
> > +     update_min_vruntime(cfs_rq_common);
>
> > This is the same update pattern used in update_curr(). The EEVDF
> > relationship lag = (V - v) * w remains valid—vlag becomes more
> > negative as vruntime increases.
>
> Sure "V" just moves to the new avg_vruntime() to give the 0-lag
> point but modifying the vruntime arbitrarily doesn't seem fair to
> me.

yield_to() API explicitly requests directed unfairness. CFS already
implements unfairness mechanisms: nice values, cgroup weights,
set_next_buddy() immediate preference. Without our mechanism,
yield_to() silently fails across cgroups—buddy hints vanish at
hierarchy boundaries where EEVDF makes independent decisions. We make
the documented API functional. The real question: should yield_to()
work in production environments (nested cgroups)? If yes, vruntime
adjustment is necessary. If not, deprecate the API.

>
> > The presence of other runnable tasks
> > doesn't affect the mathematical correctness; each entity's lag is
> > computed independently relative to avg_vruntime.
> >
> >>
> >>>
> >>>>                 se->deadline += calc_delta_fair(se->slice, se);
> >>>> -       }
> >>>> +
> >>>> +               /*
> >>>> +                * If we have more than one runnable task queued below
> >>>> +                * this cfs_rq, the next pick will likely go for a
> >>>> +                * different entity now that we have advanced the
> >>>> +                * vruntime and the deadline of the running entity.
> >>>> +                */
> >>>> +               if (cfs_rq->h_nr_runnable > 1)
> >>>
> >>> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> >>> correctly. Shouldn't the penalty apply at the LCA of yielder and
> >>> target? Otherwise the vruntime adjustment might not affect the level
> >>> where they actually compete.
> >>
> >> So here is the case I'm going after - consider the following
> >> hierarchy:
> >>
> >>      root
> >>     /    \
> >>   CG0   CG1
> >>    |     |
> >>    A     B
> >>
> >>   CG* are cgroups and, [A-Z]* are tasks
> >>
> >> A decides to yield to B, and advances its deadline on CG0's timeline.
> >> Currently, if CG0 is eligible and CG1 isn't, pick will still select
> >> CG0 which will in-turn select task A and it'll yield again. This
> >> cycle repeates until vruntime of CG0 turns large enough to make itself
> >> ineligible and route the EEVDF pick to CG1.
> >
> > Yes, natural convergence works, but requires multiple cycles. Your
> > h_nr_runnable > 1 stops propagation when another entity might be
> > picked, but "might" depends on vruntime ordering which needs time to
> > develop. Our penalty forces immediate ineligibility at the LCA. One
> > penalty application vs N natural yield cycles.
> >
> >>
> >> Now consider:
> >>
> >>
> >>        root
> >>       /    \
> >>     CG0   CG1
> >>    /   \   |
> >>   A     C  B
> >>
> >> Same scenario: A yields to B. A advances its vruntime and deadline
> >> as a prt of yield. Now, why should CG0 sacrifice its fair share of
> >> runtime for A when task B is runnable? Just because one task decided
> >> to yield to another task in a different cgroup doesn't mean other
> >> waiting tasks on that hierarchy suffer.
> >
> > You're right that C suffers unfairly if it's independent work. This is
> > a known tradeoff.
>
> So KVM is only one of the user of yield_to(). This whole debouncer
> infrastructure seems to be over complicating all this. If anything
> is yielding across cgroup boundary - that seems like bad
> configuration and if necessary, the previous suggestion does stuff
> fairly. I don't mind accounting the lost time in
> yield_to_task_fair() and account it to target task but apart from
> that, I don't think any of it is "fair".

Time-transfer fails fundamentally: lock holders often have higher
vruntime (ran more), so crediting them backwards doesn't change EEVDF
pick order. Our penalty pushes yielder back—effective regardless. The
infrastructure addresses real measured problems: rate limiting
prevents overhead, debounce stops ping-pong accumulation, LCA
targeting fixes hierarchy picking. Nested cgroups are production
standard (systemd, containers, cloud)—not misconfiguration.
Performance gains prove yield_to was broken. Open to simplifications,
but they must actually solve the hierarchical scheduling problem.

Regards,
Wanpeng

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2025-11-21 11:46 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
2025-11-10  3:32 ` [PATCH 01/10] sched: Add vCPU debooster infrastructure Wanpeng Li
2025-11-10  3:32 ` [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
2025-11-12  6:40   ` K Prateek Nayak
2025-11-12  6:44     ` K Prateek Nayak
2025-11-13 13:36       ` Wanpeng Li
2025-11-13 12:00     ` Wanpeng Li
2025-11-10  3:32 ` [PATCH 03/10] sched/fair: Add cgroup LCA finder for hierarchical yield Wanpeng Li
2025-11-12  6:50   ` K Prateek Nayak
2025-11-13  8:59     ` Wanpeng Li
2025-11-10  3:32 ` [PATCH 04/10] sched/fair: Add penalty calculation and application logic Wanpeng Li
2025-11-12  7:25   ` K Prateek Nayak
2025-11-13 13:25     ` Wanpeng Li
2025-11-10  3:32 ` [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
2025-11-10  5:16   ` kernel test robot
2025-11-10  5:16   ` kernel test robot
2025-11-10  3:32 ` [PATCH 06/10] KVM: Fix last_boosted_vcpu index assignment bug Wanpeng Li
2025-11-21  0:35   ` Sean Christopherson
2025-11-21  0:38     ` Sean Christopherson
2025-11-21 11:46     ` Wanpeng Li
2025-11-10  3:32 ` [PATCH 07/10] KVM: x86: Add IPI tracking infrastructure Wanpeng Li
2025-11-10  3:32 ` [PATCH 08/10] KVM: x86/lapic: Integrate IPI tracking with interrupt delivery Wanpeng Li
2025-11-10  3:32 ` [PATCH 09/10] KVM: Implement IPI-aware directed yield candidate selection Wanpeng Li
2025-11-10  3:39 ` [PATCH 10/10] KVM: Relaxed boost as safety net Wanpeng Li
2025-11-10 12:02 ` [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Christian Borntraeger
2025-11-12  5:01   ` Wanpeng Li
2025-11-18  8:11     ` Christian Borntraeger
2025-11-18 14:19       ` Wanpeng Li
2025-11-11  6:28 ` K Prateek Nayak
2025-11-12  4:54   ` Wanpeng Li
2025-11-12  6:07     ` K Prateek Nayak
2025-11-13  5:37       ` Wanpeng Li
2025-11-13  4:42     ` K Prateek Nayak
2025-11-13  8:33       ` Wanpeng Li
2025-11-13  9:48         ` K Prateek Nayak
2025-11-13 13:56           ` Wanpeng Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox