From: K Prateek Nayak <kprateek.nayak@amd.com>
To: Wanpeng Li <kernellwp@gmail.com>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
"Paolo Bonzini" <pbonzini@redhat.com>,
Sean Christopherson <seanjc@google.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Vincent Guittot <vincent.guittot@linaro.org>,
"Juri Lelli" <juri.lelli@redhat.com>,
<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
Wanpeng Li <wanpengli@tencent.com>
Subject: Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
Date: Mon, 5 Jan 2026 11:56:51 +0530 [thread overview]
Message-ID: <f76772c1-7ece-4bc2-a67f-1ba07256604a@amd.com> (raw)
In-Reply-To: <20251219035334.39790-1-kernellwp@gmail.com>
Hello Wanpeng,
On 12/19/2025 9:23 AM, Wanpeng Li wrote:
> Part 1: Scheduler vCPU Debooster (patches 1-5)
>
> Augment yield_to_task_fair() with bounded vruntime penalties to provide
> sustained preference beyond the buddy mechanism. When a vCPU yields to a
> target, apply a carefully tuned vruntime penalty to the yielding vCPU,
> ensuring the target maintains scheduling advantage for longer periods.
Do you still see the problem after the fixes in commits:
127b90315ca0 ("sched/proxy: Yield the donor task")
79104becf42b ("sched/fair: Forfeit vruntime on yield")
Starting with 79104becf42b, we push the vruntime on yield too, which should
prevent the yield loop between vCPUs of the same cgroup on the same CPU.
If you have the following cgroup hierarchy:
            root
           /    \
          /      \
         /        \
        A          B
       / \         |
      /   \        |
  vCPU0   vCPU1  vCPU0
and vCPU0(A) yields to vCPU1(A) in the same cgroup, vCPU1 should start
running after vCPU0 has pushed its vruntime enough to make itself
ineligible.
If you have vCPUs across different cgroups with CPU controllers enabled,
I hope you have a very good reason to have such a setup because
otherwise, this is just too much complexity for some theoretical,
insane deployment.
>
> The mechanism is EEVDF-aware and cgroup-hierarchy-aware:
>
> - Locate the lowest common ancestor (LCA) in the cgroup hierarchy where
> both the yielding and target tasks coexist. This ensures vruntime
> adjustments occur at the correct hierarchy level, maintaining fairness
> across cgroup boundaries.
>
> - Update EEVDF scheduler fields (vruntime, deadline) atomically to keep
> the scheduler state consistent. Note that vlag is intentionally not
> modified as it will be recalculated on dequeue/enqueue cycles. The
> penalty shifts the yielding task's virtual deadline forward, allowing
> the target to run.
>
> - Apply queue-size-adaptive penalties that scale from 6.0x scheduling
> granularity for 2-task scenarios (strong preference) down to 1.0x for
> large queues (>12 tasks), balancing preference against starvation risks.
>
> - Implement reverse-pair debouncing: when task A yields to B, then B yields
> to A within a short window (~600us), downscale the penalty to prevent
> ping-pong oscillation.
>
> - Rate-limit penalty application to 6ms intervals to prevent pathological
> overhead when yields occur at very high frequency.
I still don't like all this complexity. How much better is it than doing
something like the following? (Only build tested)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7377f9117501..fbb263ea7d5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9079,6 +9079,7 @@ static void yield_task_fair(struct rq *rq)
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
+	unsigned long weight;
 
 	/* !se->on_rq also covers throttled task */
 	if (!se->on_rq)
@@ -9089,6 +9090,32 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 
 	yield_task_fair(rq);
 
+	se = &rq->donor->se;
+	weight = se->load.weight;
+
+	/* Proportionally yield the hierarchy. */
+	while ((se = parent_entity(se))) {
+		unsigned long gcfs_rq_weight = group_cfs_rq(se)->load.weight;
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		WARN_ON_ONCE(se != cfs_rq->curr);
+		update_curr(cfs_rq);
+
+		/* Don't yield beyond the point of ineligibility. */
+		if (!entity_eligible(cfs_rq, se))
+			break;
+
+		/*
+		 * Proportionally increase the vruntime based on the slice
+		 * and the weight of the yielding subtree.
+		 */
+		se->vruntime += div_u64(calc_delta_fair(se->slice, se) * weight, gcfs_rq_weight);
+		update_deadline(cfs_rq, se);
+
+		/* Update the proportional weight of the task on the parent hierarchy. */
+		weight = (se->load.weight * weight) / gcfs_rq_weight;
+		if (!weight)
+			break;
+	}
+
 	return true;
 }
base-commit: 6ab7973f254071faf20fe5fcc502a3fe9ca14a47
---
Prepared on top of tip:sched/core. I don't like the above either and I'm
90% sure commit 79104becf42b ("sched/fair: Forfeit vruntime on yield")
will solve the problem you are seeing.
> Performance Results
> -------------------
>
> Test environment: Intel Xeon, 16 physical cores, 16 vCPUs per VM
>
> Dbench 16 clients per VM (filesystem metadata operations):
> 2 VMs: +14.4% throughput (lock contention reduction)
> 3 VMs: +9.8% throughput
> 4 VMs: +6.7% throughput
>
And what does the cgroup hierarchy look like for these tests?
--
Thanks and Regards,
Prateek
Thread overview: 23+ messages
2025-12-19 3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
2025-12-19 3:53 ` [PATCH v2 1/9] sched: Add vCPU debooster infrastructure Wanpeng Li
2025-12-19 3:53 ` [PATCH v2 2/9] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
2025-12-22 21:12 ` kernel test robot
2026-01-04 4:09 ` Hillf Danton
2025-12-19 3:53 ` [PATCH v2 3/9] sched/fair: Add cgroup LCA finder for hierarchical yield Wanpeng Li
2025-12-19 3:53 ` [PATCH v2 4/9] sched/fair: Add penalty calculation and application logic Wanpeng Li
2025-12-22 23:36 ` kernel test robot
2025-12-19 3:53 ` [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
2025-12-22 7:06 ` kernel test robot
2025-12-22 9:31 ` kernel test robot
2025-12-19 3:53 ` [PATCH v2 6/9] KVM: x86: Add IPI tracking infrastructure Wanpeng Li
2025-12-19 3:53 ` [PATCH v2 7/9] KVM: x86/lapic: Integrate IPI tracking with interrupt delivery Wanpeng Li
2025-12-19 3:53 ` [PATCH v2 8/9] KVM: Implement IPI-aware directed yield candidate selection Wanpeng Li
2025-12-19 3:53 ` [PATCH v2 9/9] KVM: Relaxed boost as safety net Wanpeng Li
2026-01-04 2:40 ` [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
2026-01-05 6:26 ` K Prateek Nayak [this message]
2026-03-13 1:13 ` Sean Christopherson
2026-04-01 9:48 ` Wanpeng Li
2026-04-02 23:43 ` Sean Christopherson
2026-03-26 14:41 ` Christian Borntraeger
2026-04-01 9:34 ` Wanpeng Li
2026-04-08 9:35 ` Richie Buturla