From: Wanpeng Li <kernellwp@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Paolo Bonzini <pbonzini@redhat.com>,
Sean Christopherson <seanjc@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
Vincent Guittot <vincent.guittot@linaro.org>,
Juri Lelli <juri.lelli@redhat.com>,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
Wanpeng Li <wanpengli@tencent.com>
Subject: [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair()
Date: Mon, 10 Nov 2025 11:32:26 +0800
Message-ID: <20251110033232.12538-6-kernellwp@gmail.com>
In-Reply-To: <20251110033232.12538-1-kernellwp@gmail.com>
From: Wanpeng Li <wanpengli@tencent.com>
Integrate the yield deboost mechanism into yield_to_task_fair() to
improve yield_to() effectiveness for virtualization workloads.

Add yield_to_deboost() as the main entry point. It validates the tasks,
finds the cgroup LCA (lowest common ancestor), updates the rq clock and
current accounting, calculates the penalty, and applies the EEVDF field
adjustments.

The call is placed after set_next_buddy() and before yield_task_fair()
so that it works in concert with the existing buddy mechanism:
set_next_buddy() provides immediate preference, yield_to_deboost()
applies a bounded vruntime penalty for sustained advantage, and
yield_task_fair() completes the standard yield path.

This is particularly beneficial for vCPU workloads, where:

  - lock holder detection triggers yield_to(),
  - the holder needs sustained preference to make progress,
  - vCPUs may be organized in nested cgroups,
  - high-frequency yields require rate limiting, and
  - ping-pong patterns need debouncing.

The operation runs under rq->lock and the penalties are bounded. The
feature can be disabled at runtime via
/sys/kernel/debug/sched/sched_vcpu_debooster_enabled.

A dbench workload in a virtualized environment (16-pCPU host, 16 vCPUs
per VM, each VM running the dbench-16 benchmark) shows consistent gains:

  2 VMs: +14.4% throughput
  3 VMs:  +9.8% throughput
  4 VMs:  +6.7% throughput

The gains stem from more effective yield_to() behavior, which lets lock
holders make faster progress and reduces contention overhead in
overcommitted scenarios.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
kernel/sched/fair.c | 58 +++++++++++++++++++++++++++++++++++++++++----
1 file changed, 54 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4bad324f3662..619af60b7ce6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9017,7 +9017,7 @@ static bool yield_deboost_rate_limit(struct rq *rq, u64 now_ns)
* Returns false with appropriate debug logging if any validation fails,
* ensuring only safe and meaningful yield operations proceed.
*/
-static bool __maybe_unused yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target,
+static bool yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target,
struct task_struct **p_yielding_out,
struct sched_entity **se_y_out,
struct sched_entity **se_t_out)
@@ -9066,7 +9066,7 @@ static bool __maybe_unused yield_deboost_validate_tasks(struct rq *rq, struct ta
* the appropriate level for vruntime adjustments and EEVDF field updates
* (deadline, vlag) to maintain scheduler consistency.
*/
-static bool __maybe_unused yield_deboost_find_lca(struct sched_entity *se_y, struct sched_entity *se_t,
+static bool yield_deboost_find_lca(struct sched_entity *se_y, struct sched_entity *se_t,
struct sched_entity **se_y_lca_out,
struct sched_entity **se_t_lca_out,
struct cfs_rq **cfs_rq_common_out)
@@ -9162,7 +9162,7 @@ static u64 yield_deboost_apply_debounce(struct rq *rq, struct sched_entity *se_t
* and implements reverse-pair debounce (~300us) to reduce ping-pong effects.
* Returns 0 if no penalty needed, otherwise returns clamped penalty value.
*/
-static u64 __maybe_unused yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
+static u64 yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
struct sched_entity *se_t_lca, struct sched_entity *se_t,
int nr_queued)
{
@@ -9250,7 +9250,7 @@ static u64 __maybe_unused yield_deboost_calculate_penalty(struct rq *rq, struct
* scheduler state consistency. Returns true on successful application,
* false if penalty cannot be safely applied.
*/
-static void __maybe_unused yield_deboost_apply_penalty(struct rq *rq, struct sched_entity *se_y_lca,
+static void yield_deboost_apply_penalty(struct rq *rq, struct sched_entity *se_y_lca,
struct cfs_rq *cfs_rq_common, u64 penalty)
{
u64 new_vruntime;
@@ -9303,6 +9303,52 @@ static void yield_task_fair(struct rq *rq)
se->deadline += calc_delta_fair(se->slice, se);
}
+/*
+ * yield_to_deboost - deboost the yielding task to favor the target on the same rq
+ * @rq: runqueue containing both tasks; rq->lock must be held
+ * @p_target: task to favor in scheduling
+ *
+ * Cooperates with yield_to_task_fair(): buddy provides immediate preference;
+ * this routine applies a bounded vruntime penalty at the cgroup LCA so the
+ * target keeps advantage beyond the buddy effect. EEVDF fields are updated
+ * to keep scheduler state consistent.
+ *
+ * Only operates on tasks resident on the same rq; throttled hierarchies are
+ * rejected early. Penalty is bounded by granularity and queue-size caps.
+ *
+ * Intended primarily for virtualization workloads where a yielding vCPU
+ * should defer to a target vCPU within the same runqueue.
+ * Does not change runnable order directly; complements buddy selection with
+ * a bounded fairness adjustment.
+ */
+static void yield_to_deboost(struct rq *rq, struct task_struct *p_target)
+{
+	struct task_struct *p_yielding;
+	struct sched_entity *se_y, *se_t, *se_y_lca, *se_t_lca;
+	struct cfs_rq *cfs_rq_common;
+	u64 penalty;
+
+	/* Step 1: validate tasks and inputs */
+	if (!yield_deboost_validate_tasks(rq, p_target, &p_yielding, &se_y, &se_t))
+		return;
+
+	/* Step 2: find LCA in cgroup hierarchy */
+	if (!yield_deboost_find_lca(se_y, se_t, &se_y_lca, &se_t_lca, &cfs_rq_common))
+		return;
+
+	/* Step 3: update clock and current accounting */
+	update_rq_clock(rq);
+	if (se_y_lca != cfs_rq_common->curr)
+		update_curr(cfs_rq_common);
+
+	/* Step 4: calculate penalty (caps + debounce) */
+	penalty = yield_deboost_calculate_penalty(rq, se_y_lca, se_t_lca, se_t,
+						  cfs_rq_common->nr_queued);
+
+	/* Step 5: apply penalty and update EEVDF fields */
+	yield_deboost_apply_penalty(rq, se_y_lca, cfs_rq_common, penalty);
+}
+
static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
@@ -9314,6 +9360,10 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 	/* Tell the scheduler that we'd really like se to run next. */
 	set_next_buddy(se);
 
+	/* Apply deboost under rq lock. */
+	yield_to_deboost(rq, p);
+
+	/* Complete the standard yield path. */
 	yield_task_fair(rq);
 
 	return true;
--
2.43.0
Thread overview: 36+ messages
2025-11-10 3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
2025-11-10 3:32 ` [PATCH 01/10] sched: Add vCPU debooster infrastructure Wanpeng Li
2025-11-10 3:32 ` [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
2025-11-12 6:40 ` K Prateek Nayak
2025-11-12 6:44 ` K Prateek Nayak
2025-11-13 13:36 ` Wanpeng Li
2025-11-13 12:00 ` Wanpeng Li
2025-11-10 3:32 ` [PATCH 03/10] sched/fair: Add cgroup LCA finder for hierarchical yield Wanpeng Li
2025-11-12 6:50 ` K Prateek Nayak
2025-11-13 8:59 ` Wanpeng Li
2025-11-10 3:32 ` [PATCH 04/10] sched/fair: Add penalty calculation and application logic Wanpeng Li
2025-11-12 7:25 ` K Prateek Nayak
2025-11-13 13:25 ` Wanpeng Li
2025-11-10 3:32 ` Wanpeng Li [this message]
2025-11-10 5:16 ` [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair() kernel test robot
2025-11-10 5:16 ` kernel test robot
2025-11-10 3:32 ` [PATCH 06/10] KVM: Fix last_boosted_vcpu index assignment bug Wanpeng Li
2025-11-21 0:35 ` Sean Christopherson
2025-11-21 0:38 ` Sean Christopherson
2025-11-21 11:46 ` Wanpeng Li
2025-11-10 3:32 ` [PATCH 07/10] KVM: x86: Add IPI tracking infrastructure Wanpeng Li
2025-11-10 3:32 ` [PATCH 08/10] KVM: x86/lapic: Integrate IPI tracking with interrupt delivery Wanpeng Li
2025-11-10 3:32 ` [PATCH 09/10] KVM: Implement IPI-aware directed yield candidate selection Wanpeng Li
2025-11-10 3:39 ` [PATCH 10/10] KVM: Relaxed boost as safety net Wanpeng Li
2025-11-10 12:02 ` [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Christian Borntraeger
2025-11-12 5:01 ` Wanpeng Li
2025-11-18 8:11 ` Christian Borntraeger
2025-11-18 14:19 ` Wanpeng Li
2025-11-11 6:28 ` K Prateek Nayak
2025-11-12 4:54 ` Wanpeng Li
2025-11-12 6:07 ` K Prateek Nayak
2025-11-13 5:37 ` Wanpeng Li
2025-11-13 4:42 ` K Prateek Nayak
2025-11-13 8:33 ` Wanpeng Li
2025-11-13 9:48 ` K Prateek Nayak
2025-11-13 13:56 ` Wanpeng Li