The Linux Kernel Mailing List
Subject: Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
From: Richie Buturla
Date: 2026-05-13 12:52 UTC
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
	Sean Christopherson, K Prateek Nayak, Steven Rostedt,
	Vincent Guittot, Juri Lelli, linux-kernel, kvm, Wanpeng Li,
	Christian Borntraeger


On 17/04/2026 12:30, Richie Buturla wrote:
>
> On 08/04/2026 10:35, Richie Buturla wrote:
>>
>> On 01/04/2026 10:34, Wanpeng Li wrote:
>>> Hi Christian,
>>> On Thu, 26 Mar 2026 at 22:42, Christian Borntraeger
>>> <borntraeger@linux.ibm.com> wrote:
>>>> On 19.12.25 at 04:53, Wanpeng Li wrote:
>>>>> From: Wanpeng Li <wanpengli@tencent.com>
>>>>>
>>>>> This series addresses long-standing yield_to() inefficiencies in
>>>>> virtualized environments through two complementary mechanisms: a vCPU
>>>>> debooster in the scheduler and IPI-aware directed yield in KVM.
>>>>>
>>>>> Problem Statement
>>>>> -----------------
>>>>>
>>>>> In overcommitted virtualization scenarios, vCPUs frequently spin
>>>>> on locks held by other vCPUs that are not currently running. The
>>>>> kernel's paravirtual spinlock support detects these situations
>>>>> and calls yield_to() to boost the lock holder, allowing it to run
>>>>> and release the lock.
>>>>>
>>>>> However, the current implementation has two critical limitations:
>>>>>
>>>>> 1. Scheduler-side limitation:
>>>>>
>>>>>      yield_to_task_fair() relies solely on set_next_buddy() to
>>>>>      provide preference to the target vCPU. This buddy mechanism
>>>>>      only offers immediate, transient preference. Once the buddy
>>>>>      hint expires (typically after one scheduling decision), the
>>>>>      yielding vCPU may preempt the target again, especially in
>>>>>      nested cgroup hierarchies where vruntime domains differ.
>>>>>
>>>>>      This creates a ping-pong effect: the lock holder runs
>>>>>      briefly, gets preempted before completing critical sections,
>>>>>      and the yielding vCPU spins again, triggering another futile
>>>>>      yield_to() cycle. The overhead accumulates rapidly in
>>>>>      workloads with high lock contention.
>>>> Wanpeng,
>>>>
>>>> late but not forgotten.
>>>>
>>>> So Richie Buturla gave this a try on s390 with some variations but
>>>> still without cgroup support (next step).
>>>> The numbers look very promising (diag 9c is our yield-to
>>>> hypercall). With super high overcommitment the benefit shrinks
>>>> again, but results are still positive. We are probably running
>>>> into other limits.
>>>>
>>>> 2:1 Overcommit Ratio:
>>>> diag9c calls:                       225,804,073 → 213,913,266  (-5.3%)
>>>> Dbench thrpt (per-run mean):        +1.3%
>>>> Dbench thrpt (per-run median):      +0.8%
>>>> Dbench thrpt (total across runs):   +1.3%
>>>> Dbench thrpt (avg/VM):              +1.3%
>>>>
>>>> 4:1:
>>>> diag9c calls:                       833,455,152 → 556,597,627 (-33.2%)
>>>> Dbench thrpt (per-run mean):        +7.2%
>>>> Dbench thrpt (per-run median):      +8.5%
>>>> Dbench thrpt (total across runs):   +7.2%
>>>> Dbench thrpt (avg/VM):              +7.2%
>>>>
>>>> 6:1:
>>>> diag9c calls:                       967,501,378 → 737,178,419 (-23.8%)
>>>> Dbench thrpt (per-run mean):        +5.1%
>>>> Dbench thrpt (per-run median):      +4.8%
>>>> Dbench thrpt (total across runs):   +5.1%
>>>> Dbench thrpt (avg/VM):              +5.1%
>>>>
>>>> 8:1:
>>>> diag9c calls:                       872,165,596 → 653,481,530 (-25.1%)
>>>> Dbench thrpt (per-run mean):        +11.5%
>>>> Dbench thrpt (per-run median):      +11.4%
>>>> Dbench thrpt (total across runs):   +11.5%
>>>> Dbench thrpt (avg/VM):              +11.5%
>>>>
>>>> 9:1:
>>>> diag9c calls:                       809,384,976 → 587,597,163 (-27.4%)
>>>> Dbench thrpt (per-run mean):        +4.5%
>>>> Dbench thrpt (per-run median):      +4.0%
>>>> Dbench thrpt (total across runs):   +4.5%
>>>> Dbench thrpt (avg/VM):              +4.5%
>>>>
>>>> 10:1:
>>>> diag9c calls:                       711,772,971 → 477,448,374 (-32.9%)
>>>> Dbench thrpt (per-run mean):        +3.6%
>>>> Dbench thrpt (per-run median):      +1.6%
>>>> Dbench thrpt (total across runs):   +3.6%
>>>> Dbench thrpt (avg/VM):              +3.6%
>>> Thanks Christian, and thanks to Richie for running this on s390. :)
>>>
>>> This is very valuable independent data. A few things stand out to me:
>>>
>>> - The consistent reduction in diag9c calls across all overcommit
>>> ratios (up to -33.2% at 4:1) confirms that the directed yield
>>> improvements are effective at reducing unnecessary yield-to
>>> hypercalls, not just on x86 but across architectures.
>>> - The fact that these results are without cgroup support is actually
>>> informative: it tells us the core yield improvement carries its weight
>>> on its own, which helps me scope the next revision more tightly.
>>> - The diminishing-but-still-positive returns at very high overcommit
>>> (9:1, 10:1) match what I see on x86 as well — other bottlenecks start
>>> dominating but the mechanism does not regress.
>>>
>>> Btw, which kernel version were these results collected on?
>>>
>>> Regards,
>>> Wanpeng
>>>
>> Hi Wanpeng,
>>
>> I collected these results on a 6.19 kernel, which should also
>> include the existing fixes for yielding and for forfeiting vruntime
>> on yield that K Prateek mentioned.
>>
> Hi Wanpeng. I'm trying out cgroup runs with libvirt, but the results
> vary when I reproduce them and I need to look into this again, so we
> should not base any decisions on those numbers yet.
>
> I'll also rerun on the kernel version you are using (6.19-rc1).
Hi Wanpeng,

I spent some more time benchmarking the scheduler-side changes on s390 
and I think I can now narrow down where the benefit shows up and where 
it does not.
For context, my test runs have libvirt VMs running dbench with the
number of clients equal to the number of vCPUs, and the workload runs
on tmpfs so that the runs primarily measure scheduler behavior rather
than disk I/O.
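
To make the placement variants below concrete: at the thread level,
all of the pooled configurations amount to confining each VM's vCPU
threads to a subset of pCPUs. A minimal standalone sketch of that
idea (plain pthread affinity; libvirt does the equivalent through
vcpupin, and the pool geometry here is a placeholder, not my exact
harness):

/* pool-pin.c: confine NR_VCPUS worker threads (standing in for one
 * VM's vCPU threads) to a POOL_SIZE-wide pCPU set, so that they have
 * to contend on the same runqueues. Build: gcc -pthread pool-pin.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define NR_VCPUS  16   /* vCPU threads per VM */
#define POOL_BASE  0   /* first pCPU of this VM's pool */
#define POOL_SIZE  8   /* pool width: 16 on 8 = 2:1 intra-VM overcommit */

static void *vcpu_thread(void *arg)
{
        /* Real vCPU work would run here. */
        return NULL;
}

int main(void)
{
        pthread_t tids[NR_VCPUS];
        pthread_attr_t attr;
        cpu_set_t pool;
        int i;

        CPU_ZERO(&pool);
        for (i = 0; i < POOL_SIZE; i++)
                CPU_SET(POOL_BASE + i, &pool);

        /* Equivalent of libvirt's <vcpupin>: every thread starts out
         * restricted to the pool. */
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(pool), &pool);

        for (i = 0; i < NR_VCPUS; i++)
                pthread_create(&tids[i], &attr, vcpu_thread, NULL);
        for (i = 0; i < NR_VCPUS; i++)
                pthread_join(tids[i], NULL);
        return 0;
}
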
As far as I can tell, the yield/deboost benefit is constrained to cases 
where the relevant vCPUs are competing on the same runqueue. That makes 
placement the key variable.

In particular:

1. With explicit 1:1 vCPU:pCPU pinning, I do not see a meaningful benefit.
    For 3 VMs with 16 vCPUs each pinned to 16 pCPUs, the results were:

      diag9c calls:               61,384,968 -> 62,994,594  (+2.6%)
      Dbench throughput mean:     -0.5%
      Dbench throughput median:   -0.3%

    That is basically noise from my point of view. This matches the
expectation that if the lock waiter and lock holder do not share an
rq, the scheduler-side boost/deboost path has little or nothing to
act on (see the toy model after this list).

2. When vCPUs are pooled onto a smaller pCPU set, I can reproduce a benefit.
    For 2 VMs with 16 vCPUs each placed on an 8 pCPU pool per VM, I saw:

      diag9c calls:               62,893,856 -> 20,033,920  (-68.1%)
      Dbench throughput mean:     +4.2%
      Dbench throughput median:   +4.0%

    For 3 VMs with 16 vCPUs each placed on a 5 pCPU pool per VM, I saw:

      diag9c calls:              107,915,379 -> 35,393,080  (-67.2%)
      Dbench throughput mean:     +4.4%
      Dbench throughput median:   +4.4%

    I also saw the same pattern with heavier pooling. For 5 VMs with 16 
vCPUs each placed on a 3 pCPU pool per VM, the results were:

      diag9c calls:              130,986,144 -> 58,153,006  (-55.6%)
      Dbench throughput mean:     +3.4%
      Dbench throughput median:   +3.6%

These are the configurations where I consistently see a substantial
reduction in diag9c calls (again, our yield-to hypercall) along with
some throughput improvement. This works because each VM is actually
overcommitted onto its allowed pCPU set, so multiple vCPUs from the
same VM can contend on the same rq and exercise the mechanism.

3. If there is no intra-VM overcommit, the effect disappears again.
    For 3 VMs with 5 vCPUs on a 5 pCPU pool per VM, the results were:

      diag9c calls:                  696,548 ->    718,219  (+3.1%)
      Dbench throughput mean:     -0.8%
      Dbench throughput median:   -0.7%

    Again, no meaningful benefit.
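
To spell out the same-rq point from item 1 above, here is a toy model
(userspace C with invented vruntime numbers and an invented deboost
penalty; it captures the shape of the argument, not the kernel code):

/* yield-model.c: each "rq" picks the entity with the smallest
 * vruntime, honouring a one-shot "next buddy" hint first.
 * Build: gcc yield-model.c
 */
#include <stdio.h>

struct entity {
        const char *name;
        unsigned long vruntime;
        int rq;                         /* runqueue this entity is on */
};

static struct entity *pick(struct entity *e, int n, int rq,
                           struct entity *buddy)
{
        struct entity *best = NULL;
        int i;

        if (buddy && buddy->rq == rq)   /* transient preference only */
                return buddy;
        for (i = 0; i < n; i++)
                if (e[i].rq == rq &&
                    (!best || e[i].vruntime < best->vruntime))
                        best = &e[i];
        return best;
}

int main(void)
{
        struct entity e[] = {
                { "waiter", 100, 0 },
                { "holder", 140, 0 },   /* same rq as the waiter */
        };

        /* yield_to(): the buddy hint prefers the holder exactly once. */
        printf("pick 1: %s\n", pick(e, 2, 0, &e[1])->name);
        /* Hint expired; 100 < 140, so the waiter preempts again: the
         * ping-pong described in the cover letter. */
        printf("pick 2: %s\n", pick(e, 2, 0, NULL)->name);
        /* A deboost penalty on the waiter outlives the hint and keeps
         * the holder running. */
        e[0].vruntime += 60;
        printf("pick 3: %s\n", pick(e, 2, 0, NULL)->name);
        /* Cross-rq case: rq 0 never considers the holder at all, so
         * no amount of local deboosting can hand it CPU time. */
        e[1].rq = 1;
        printf("pick 4: %s\n", pick(e, 2, 0, NULL)->name);
        return 0;
}

That last pick is essentially the pinned and the 5-on-5 situation:
the holder is simply never visible to the waiter's runqueue.
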

So my final takeaway is that on s390 I can only demonstrate a benefit 
when the test setup intentionally causes multiple vCPUs of a VM to share 
runqueues. Plain pinning does not show an effect, and a matched 
vCPU:pCPU configuration such as 5 vCPUs on 5 pCPUs does not either. The 
interesting case is specifically vCPU pooling / overcommit onto a 
smaller pCPU set, not just "more VMs on the host".

I suppose this mechanism does help once the waiter/holder pair can 
actually meet on the same rq. If something similar could somehow target 
useful cross-runqueue cases as well, that would seem like a natural way 
to stretch this benefit further.
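
In case it helps that discussion, the shape I have in mind is roughly
the following (purely hypothetical: the struct, the helper name and
the policy are all invented, so please read it as a question rather
than a proposal against the real KVM code):

/* yield-filter.c: hypothetical filter for when a directed yield is
 * worth attempting. Build: gcc yield-filter.c
 */
#include <stdbool.h>
#include <stdio.h>

struct vtask {
        int  cpu;       /* CPU/runqueue the task currently sits on */
        bool runnable;  /* wants to run but may be preempted */
};

static bool directed_yield_useful(const struct vtask *waiter,
                                  const struct vtask *holder)
{
        /* The case the deboost already covers: waiter and holder share
         * an rq, so penalizing the waiter hands time to the holder. */
        if (waiter->cpu == holder->cpu)
                return true;

        /* The open cross-rq case: the holder is runnable but starved
         * on its own CPU. Helping here would take a remote nudge
         * (preempting whatever runs on holder->cpu), which is the
         * part that does not exist today. */
        return false;
}

int main(void)
{
        struct vtask waiter = { .cpu = 0, .runnable = true };
        struct vtask holder = { .cpu = 1, .runnable = true };

        printf("directed yield useful: %d\n",
               directed_yield_useful(&waiter, &holder));
        return 0;
}
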

Thanks,
Richie
