From: Wanpeng Li <kernellwp@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Sean Christopherson <seanjc@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Wanpeng Li <wanpengli@tencent.com>
Subject: [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair()
Date: Fri, 19 Dec 2025 11:53:29 +0800
Message-ID: <20251219035334.39790-6-kernellwp@gmail.com>
In-Reply-To: <20251219035334.39790-1-kernellwp@gmail.com>

From: Wanpeng Li <wanpengli@tencent.com>

Integrate yield_to_deboost() into yield_to_task_fair() to activate the
vCPU debooster mechanism.

The integration works in concert with the existing buddy mechanism:
set_next_buddy() provides immediate preference, yield_to_deboost()
applies a bounded vruntime penalty based on the fairness gap, and
yield_task_fair() completes the standard yield path, including the
EEVDF forfeit operation.

Note: yield_to_deboost() must be called BEFORE yield_task_fair()
because v6.19+ kernels perform the forfeit (se->vruntime = se->deadline)
in yield_task_fair(). If the deboost ran after the forfeit, the fairness
gap calculation would see the already-inflated vruntime, resulting in
need=0 and only the baseline penalty being applied.

Performance testing (16-pCPU host, 16 vCPUs per VM):

Dbench 16 clients per VM:
  2 VMs: +14.4% throughput
  3 VMs:  +9.8% throughput
  4 VMs:  +6.7% throughput

The gains stem from sustained lock-holder preference, which reduces
ping-pong between yielding vCPUs and lock holders. They are most
pronounced at moderate overcommit, where the reduction in contention
outweighs the context-switch cost.

v1 -> v2:
- Move sysctl_sched_vcpu_debooster_enabled check to yield_to_deboost()
  entry point for early exit before update_rq_clock()
- Restore conditional update_curr() check (se_y_lca != cfs_rq->curr)
  to avoid unnecessary accounting updates
- Keep yield_task_fair() unchanged (no for_each_sched_entity loop)
  to avoid double-penalizing the yielding task
- Move yield_to_deboost() BEFORE yield_task_fair() to preserve fairness
  gap calculation (v6.19+ forfeit would otherwise inflate vruntime
  before penalty calculation)
- Improve function documentation

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 kernel/sched/fair.c | 67 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 59 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8738cfc3109c..9e0991f0c618 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9066,23 +9066,19 @@ static bool yield_deboost_rate_limit(struct rq *rq)
  * Validate tasks for yield deboost operation.
  * Returns the yielding task on success, NULL on validation failure.
  *
- * Checks: feature enabled, valid target, same runqueue, target is fair class,
- * both on_rq. Called under rq->lock.
+ * Checks: valid target, same runqueue, target is fair class,
+ * both on_rq, rate limiting. Called under rq->lock.
  *
  * Note: p_yielding (rq->donor) is guaranteed to be fair class by the caller
  * (yield_to_task_fair is only called when curr->sched_class == p->sched_class).
+ * Note: sysctl_sched_vcpu_debooster_enabled is checked by caller before
+ * update_rq_clock() to avoid unnecessary clock updates.
  */
 static struct task_struct __maybe_unused *
 yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target)
 {
 	struct task_struct *p_yielding;
 
-	if (!sysctl_sched_vcpu_debooster_enabled)
-		return NULL;
-
-	if (!p_target)
-		return NULL;
-
 	if (yield_deboost_rate_limit(rq))
 		return NULL;
 
@@ -9287,6 +9283,57 @@ yield_deboost_apply_penalty(struct sched_entity *se_y_lca,
 	se_y_lca->deadline = new_vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
 }
 
+/*
+ * yield_to_deboost - Apply vruntime penalty to favor the target task
+ * @rq: runqueue containing both tasks (rq->lock must be held)
+ * @p_target: task to favor in scheduling
+ *
+ * Cooperates with yield_to_task_fair(): set_next_buddy() provides immediate
+ * preference; this routine applies a bounded vruntime penalty at the cgroup
+ * LCA so the target maintains scheduling advantage beyond the buddy effect.
+ *
+ * Only operates on tasks resident on the same rq. Penalty is bounded by
+ * granularity and queue-size caps to prevent starvation.
+ */
+static void yield_to_deboost(struct rq *rq, struct task_struct *p_target)
+{
+	struct task_struct *p_yielding;
+	struct sched_entity *se_y, *se_t, *se_y_lca, *se_t_lca;
+	struct cfs_rq *cfs_rq_common;
+	u64 penalty;
+
+	/* Quick validation before updating clock */
+	if (!sysctl_sched_vcpu_debooster_enabled)
+		return;
+
+	if (!p_target)
+		return;
+
+	/* Update clock - rate limiting and debounce use rq_clock() */
+	update_rq_clock(rq);
+
+	/* Full validation including rate limiting */
+	p_yielding = yield_deboost_validate_tasks(rq, p_target);
+	if (!p_yielding)
+		return;
+
+	se_y = &p_yielding->se;
+	se_t = &p_target->se;
+
+	/* Find LCA in cgroup hierarchy */
+	if (!yield_deboost_find_lca(se_y, se_t, &se_y_lca, &se_t_lca, &cfs_rq_common))
+		return;
+
+	/* Update current accounting before modifying vruntime */
+	if (se_y_lca != cfs_rq_common->curr)
+		update_curr(cfs_rq_common);
+
+	/* Calculate and apply penalty */
+	penalty = yield_deboost_calculate_penalty(rq, se_y_lca, se_t_lca,
+						  p_target, cfs_rq_common->h_nr_queued);
+	yield_deboost_apply_penalty(se_y_lca, cfs_rq_common, penalty);
+}
+
 /*
  * sched_yield() is very simple
  */
@@ -9341,6 +9388,10 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 	/* Tell the scheduler that we'd really like se to run next. */
 	set_next_buddy(se);
 
+	/* Apply deboost BEFORE forfeit to preserve fairness gap calculation */
+	yield_to_deboost(rq, p);
+
+	/* Complete the standard yield path (includes forfeit in v6.19+) */
 	yield_task_fair(rq);
 
 	return true;
-- 
2.43.0

