From: Richie Buturla <richie@linux.ibm.com>
To: Wanpeng Li <kernellwp@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Paolo Bonzini <pbonzini@redhat.com>,
Sean Christopherson <seanjc@google.com>,
K Prateek Nayak <kprateek.nayak@amd.com>,
Steven Rostedt <rostedt@goodmis.org>,
Vincent Guittot <vincent.guittot@linaro.org>,
Juri Lelli <juri.lelli@redhat.com>,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
Wanpeng Li <wanpengli@tencent.com>,
Christian Borntraeger <borntraeger@linux.ibm.com>
Subject: Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
Date: Fri, 17 Apr 2026 12:30:53 +0100 [thread overview]
Message-ID: <1d99d7ea-e8c0-4afd-a6cb-58d3a09a7dfa@linux.ibm.com> (raw)
In-Reply-To: <d5bb7e5d-94d3-4b6c-b1a6-e11d13db38f3@linux.ibm.com>
On 08/04/2026 10:35, Richie Buturla wrote:
>
> On 01/04/2026 10:34, Wanpeng Li wrote:
>> Hi Christian,
>> On Thu, 26 Mar 2026 at 22:42, Christian Borntraeger
>> <borntraeger@linux.ibm.com> wrote:
>>> Am 19.12.25 um 04:53 schrieb Wanpeng Li:
>>>> From: Wanpeng Li <wanpengli@tencent.com>
>>>>
>>>> This series addresses long-standing yield_to() inefficiencies in
>>>> virtualized environments through two complementary mechanisms: a vCPU
>>>> debooster in the scheduler and IPI-aware directed yield in KVM.
>>>>
>>>> Problem Statement
>>>> -----------------
>>>>
>>>> In overcommitted virtualization scenarios, vCPUs frequently spin on
>>>> locks held by other vCPUs that are not currently running. The
>>>> kernel's paravirtual spinlock support detects these situations and
>>>> calls yield_to() to boost the lock holder, allowing it to run and
>>>> release the lock.
>>>>
>>>> However, the current implementation has two critical limitations:
>>>>
>>>> 1. Scheduler-side limitation:
>>>>
>>>> yield_to_task_fair() relies solely on set_next_buddy() to provide
>>>> preference to the target vCPU. This buddy mechanism only offers
>>>> immediate, transient preference. Once the buddy hint expires
>>>> (typically after one scheduling decision), the yielding vCPU may
>>>> preempt the target again, especially in nested cgroup hierarchies
>>>> where vruntime domains differ.
>>>>
>>>> This creates a ping-pong effect: the lock holder runs briefly, gets
>>>> preempted before completing critical sections, and the yielding vCPU
>>>> spins again, triggering another futile yield_to() cycle. The
>>>> overhead accumulates rapidly in workloads with high lock contention.
>>> Wanpeng,
>>>
>>> late but not forgotten.
>>>
>>> So Richie Buturla gave this a try on s390 with some variations, but
>>> still without cgroup support (next step).
>>> The numbers look very promising (diag 9c is our yield_to hypercall).
>>> With super high overcommitment the benefit shrinks again, but the
>>> results are still positive. We are probably running into other limits.
>>>
>>> 2:1 Overcommit Ratio:
>>> diag9c calls: 225,804,073 → 213,913,266 (-5.3%)
>>> Dbench thrpt (per-run mean): +1.3%
>>> Dbench thrpt (per-run median): +0.8%
>>> Dbench thrpt (total across runs): +1.3%
>>> Dbench thrpt (avg/VM): +1.3%
>>>
>>> 4:1:
>>> diag9c calls: 833,455,152 → 556,597,627 (-33.2%)
>>> Dbench thrpt (per-run mean): +7.2%
>>> Dbench thrpt (per-run median): +8.5%
>>> Dbench thrpt (total across runs): +7.2%
>>> Dbench thrpt (avg/VM): +7.2%
>>>
>>> 6:1:
>>> diag9c calls: 967,501,378 → 737,178,419 (-23.8%)
>>> Dbench thrpt (per-run mean): +5.1%
>>> Dbench thrpt (per-run median): +4.8%
>>> Dbench thrpt (total across runs): +5.1%
>>> Dbench thrpt (avg/VM): +5.1%
>>>
>>> 8:1:
>>> diag9c calls: 872,165,596 → 653,481,530 (-25.1%)
>>> Dbench thrpt (per-run mean): +11.5%
>>> Dbench thrpt (per-run median): +11.4%
>>> Dbench thrpt (total across runs): +11.5%
>>> Dbench thrpt (avg/VM): +11.5%
>>>
>>> 9:1:
>>> diag9c calls: 809,384,976 → 587,597,163 (-27.4%)
>>> Dbench thrpt (per-run mean): +4.5%
>>> Dbench thrpt (per-run median): +4.0%
>>> Dbench thrpt (total across runs): +4.5%
>>> Dbench thrpt (avg/VM): +4.5%
>>>
>>> 10:1:
>>> diag9c calls: 711,772,971 → 477,448,374 (-32.9%)
>>> Dbench thrpt (per-run mean): +3.6%
>>> Dbench thrpt (per-run median): +1.6%
>>> Dbench thrpt (total across runs): +3.6%
>>> Dbench thrpt (avg/VM): +3.6%
>> Thanks Christian, and thanks to Richie for running this on s390. :)
>>
>> This is very valuable independent data. A few things stand out to me:
>>
>> - The consistent reduction in diag9c calls across all overcommit
>> ratios (up to -33.2% at 4:1) confirms that the directed yield
>> improvements are effective at reducing unnecessary yield-to
>> hypercalls, not just on x86 but across architectures.
>> - The fact that these results are without cgroup support is actually
>> informative: it tells us the core yield improvement carries its weight
>> on its own, which helps me scope the next revision more tightly.
>> - The diminishing-but-still-positive returns at very high overcommit
>> (9:1, 10:1) match what I see on x86 as well — other bottlenecks start
>> dominating but the mechanism does not regress.
>>
>> Btw, which kernel version were these results collected on?
>>
>> Regards,
>> Wanpeng
>>
> Hi Wanpeng,
>
> I collected these results on a 6.19 kernel, which should also include
> the existing fixes for yielding and forfeiting vruntime on yield that
> K Prateek mentioned.
>
Hi Wanpeng. I'm trying out cgroup runs with libvirt, but the results
vary when I reproduce them and I need to look into this again, so we
should not base any decisions on those numbers yet.
I'll also rerun on the kernel version you are using (6.19-rc1).
Thread overview: 24+ messages
2025-12-19 3:53 [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
2025-12-19 3:53 ` [PATCH v2 1/9] sched: Add vCPU debooster infrastructure Wanpeng Li
2025-12-19 3:53 ` [PATCH v2 2/9] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
2025-12-22 21:12 ` kernel test robot
2026-01-04 4:09 ` Hillf Danton
2025-12-19 3:53 ` [PATCH v2 3/9] sched/fair: Add cgroup LCA finder for hierarchical yield Wanpeng Li
2025-12-19 3:53 ` [PATCH v2 4/9] sched/fair: Add penalty calculation and application logic Wanpeng Li
2025-12-22 23:36 ` kernel test robot
2025-12-19 3:53 ` [PATCH v2 5/9] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
2025-12-22 7:06 ` kernel test robot
2025-12-22 9:31 ` kernel test robot
2025-12-19 3:53 ` [PATCH v2 6/9] KVM: x86: Add IPI tracking infrastructure Wanpeng Li
2025-12-19 3:53 ` [PATCH v2 7/9] KVM: x86/lapic: Integrate IPI tracking with interrupt delivery Wanpeng Li
2025-12-19 3:53 ` [PATCH v2 8/9] KVM: Implement IPI-aware directed yield candidate selection Wanpeng Li
2025-12-19 3:53 ` [PATCH v2 9/9] KVM: Relaxed boost as safety net Wanpeng Li
2026-01-04 2:40 ` [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
2026-01-05 6:26 ` K Prateek Nayak
2026-03-13 1:13 ` Sean Christopherson
2026-04-01 9:48 ` Wanpeng Li
2026-04-02 23:43 ` Sean Christopherson
2026-03-26 14:41 ` Christian Borntraeger
2026-04-01 9:34 ` Wanpeng Li
2026-04-08 9:35 ` Richie Buturla
2026-04-17 11:30 ` Richie Buturla [this message]