Date: Wed, 13 May 2026 13:52:48 +0100
From: Richie Buturla
To: Wanpeng Li
Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paolo Bonzini,
 Sean Christopherson, K Prateek Nayak, Steven Rostedt, Vincent Guittot,
 Juri Lelli, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
 Wanpeng Li, Christian Borntraeger
Subject: Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
X-Mailing-List: kvm@vger.kernel.org
References: <20251219035334.39790-1-kernellwp@gmail.com>
 <9a8c1bd7-5d95-4d79-aae2-fc06c448b9a3@linux.ibm.com>
 <1d99d7ea-e8c0-4afd-a6cb-58d3a09a7dfa@linux.ibm.com>
In-Reply-To: <1d99d7ea-e8c0-4afd-a6cb-58d3a09a7dfa@linux.ibm.com>
On 17/04/2026 12:30, Richie Buturla wrote:
>
> On 08/04/2026 10:35, Richie Buturla wrote:
>>
>> On 01/04/2026 10:34, Wanpeng Li wrote:
>>> Hi Christian,
>>> On Thu, 26 Mar 2026 at 22:42, Christian Borntraeger wrote:
>>>> Am 19.12.25 um 04:53 schrieb Wanpeng Li:
>>>>> From: Wanpeng Li
>>>>>
>>>>> This series addresses long-standing yield_to() inefficiencies in
>>>>> virtualized environments through two complementary mechanisms: a vCPU
>>>>> debooster in the scheduler and IPI-aware directed yield in KVM.
>>>>>
>>>>> Problem Statement
>>>>> -----------------
>>>>>
>>>>> In overcommitted virtualization scenarios, vCPUs frequently spin on
>>>>> locks held by other vCPUs that are not currently running. The kernel's
>>>>> paravirtual spinlock support detects these situations and calls
>>>>> yield_to() to boost the lock holder, allowing it to run and release
>>>>> the lock.
>>>>>
>>>>> However, the current implementation has two critical limitations:
>>>>>
>>>>> 1. Scheduler-side limitation:
>>>>>
>>>>>    yield_to_task_fair() relies solely on set_next_buddy() to provide
>>>>>    preference to the target vCPU. This buddy mechanism only offers
>>>>>    immediate, transient preference. Once the buddy hint expires
>>>>>    (typically after one scheduling decision), the yielding vCPU may
>>>>>    preempt the target again, especially in nested cgroup hierarchies
>>>>>    where vruntime domains differ.
>>>>>
>>>>>    This creates a ping-pong effect: the lock holder runs briefly,
>>>>>    gets preempted before completing critical sections, and the
>>>>>    yielding vCPU spins again, triggering another futile yield_to()
>>>>>    cycle. The overhead accumulates rapidly in workloads with high
>>>>>    lock contention.
>>>>
>>>> Wanpeng,
>>>>
>>>> late but not forgotten.
>>>>
>>>> So Richie Buturla gave this a try on s390 with some variations but
>>>> still without cgroup support (next step).
>>>> The numbers look very promising (diag 9c is our yield-to hypercall).
>>>> With super high overcommitment the benefit shrinks again, but results
>>>> are still positive. We are probably running into other limits.
>>>>
>>>> 2:1 Overcommit Ratio:
>>>> diag9c calls:                       225,804,073 → 213,913,266  (-5.3%)
>>>> Dbench thrpt (per-run mean):        +1.3%
>>>> Dbench thrpt (per-run median):      +0.8%
>>>> Dbench thrpt (total across runs):   +1.3%
>>>> Dbench thrpt (avg/VM):              +1.3%
>>>>
>>>> 4:1:
>>>> diag9c calls:                       833,455,152 → 556,597,627 (-33.2%)
>>>> Dbench thrpt (per-run mean):        +7.2%
>>>> Dbench thrpt (per-run median):      +8.5%
>>>> Dbench thrpt (total across runs):   +7.2%
>>>> Dbench thrpt (avg/VM):              +7.2%
>>>>
>>>> 6:1:
>>>> diag9c calls:                       967,501,378 → 737,178,419 (-23.8%)
>>>> Dbench thrpt (per-run mean):        +5.1%
>>>> Dbench thrpt (per-run median):      +4.8%
>>>> Dbench thrpt (total across runs):   +5.1%
>>>> Dbench thrpt (avg/VM):              +5.1%
>>>>
>>>> 8:1:
>>>> diag9c calls:                       872,165,596 → 653,481,530 (-25.1%)
>>>> Dbench thrpt (per-run mean):        +11.5%
>>>> Dbench thrpt (per-run median):      +11.4%
>>>> Dbench thrpt (total across runs):   +11.5%
>>>> Dbench thrpt (avg/VM):              +11.5%
>>>>
>>>> 9:1:
>>>> diag9c calls:                       809,384,976 → 587,597,163 (-27.4%)
>>>> Dbench thrpt (per-run mean):        +4.5%
>>>> Dbench thrpt (per-run median):      +4.0%
>>>> Dbench thrpt (total across runs):   +4.5%
>>>> Dbench thrpt (avg/VM):              +4.5%
>>>>
>>>> 10:1:
>>>> diag9c calls:                       711,772,971 → 477,448,374 (-32.9%)
>>>> Dbench thrpt (per-run mean):        +3.6%
>>>> Dbench thrpt (per-run median):      +1.6%
>>>> Dbench thrpt (total across runs):   +3.6%
>>>> Dbench thrpt (avg/VM):              +3.6%
>>>
>>> Thanks Christian, and thanks to Richie for running this on s390. :)
>>>
>>> This is very valuable independent data. A few things stand out to me:
>>>
>>> - The consistent reduction in diag9c calls across all overcommit
>>>   ratios (up to -33.2% at 4:1) confirms that the directed yield
>>>   improvements are effective at reducing unnecessary yield-to
>>>   hypercalls, not just on x86 but across architectures.
>>> - The fact that these results are without cgroup support is actually
>>>   informative: it tells us the core yield improvement carries its
>>>   weight on its own, which helps me scope the next revision more
>>>   tightly.
>>> - The diminishing-but-still-positive returns at very high overcommit
>>>   (9:1, 10:1) match what I see on x86 as well — other bottlenecks
>>>   start dominating but the mechanism does not regress.
>>>
>>> Btw, which kernel version were these results collected on?
>>>
>>> Regards,
>>> Wanpeng
>>>
>> Hi Wanpeng,
>>
>> I collected these results on a 6.19 kernel - which should also include
>> the existing fixes for yielding and forfeiting vruntime on yield that
>> K Prateek mentioned.
>>
> Hi Wanpeng. I'm trying out cgroup runs with libvirt but the results
> seem to vary when I reproduce and need to look into this again so we
> should not try to base any decisions on the numbers.
>
> I'll also rerun on the kernel version you are using (the 6.19-rc1).
Hi Wanpeng,

I spent some more time benchmarking the scheduler-side changes on s390, and I
think I can now narrow down where the benefit shows up and where it does not.

For context, my test runs have libvirt VMs running dbench with the number of
clients equal to the number of vCPUs, and the workload runs on tmpfs so that
this primarily measures scheduler behavior.

As far as I can tell, the yield/deboost benefit is constrained to cases where
the relevant vCPUs are competing on the same runqueue. That makes placement
the key variable. In particular:

1. With explicit 1:1 vCPU:pCPU pinning, I do not see a meaningful benefit.
   For 3 VMs with 16 vCPUs each pinned to 16 pCPUs, the results were:

     diag9c calls:              61,384,968 -> 62,994,594  (+2.6%)
     Dbench throughput mean:    -0.5%
     Dbench throughput median:  -0.3%

   That is basically noise from my point of view. This matches the
   expectation that if the lock waiter and lock holder are not sharing an
   rq, the scheduler-side boost/deboost path has little or nothing to act
   on.

2. When vCPUs are pooled onto a smaller pCPU set, I can reproduce a benefit.

   For 2 VMs with 16 vCPUs each placed on an 8 pCPU pool per VM, I saw:

     diag9c calls:              62,893,856 -> 20,033,920  (-68.1%)
     Dbench throughput mean:    +4.2%
     Dbench throughput median:  +4.0%

   For 3 VMs with 16 vCPUs each placed on a 5 pCPU pool per VM, I saw:

     diag9c calls:             107,915,379 -> 35,393,080  (-67.2%)
     Dbench throughput mean:    +4.4%
     Dbench throughput median:  +4.4%

   I also saw the same pattern with heavier pooling.
   For 5 VMs with 16 vCPUs each placed on a 3 pCPU pool per VM, the results
   were:

     diag9c calls:             130,986,144 -> 58,153,006  (-55.6%)
     Dbench throughput mean:    +3.4%
     Dbench throughput median:  +3.6%

   These are the configurations where I consistently see a reduction in
   diag9c calls (again, our yield-to hypercall) and some throughput
   improvement. This works because the VM is actually overcommitted onto
   its allowed pCPU set, so multiple vCPUs from the same VM can contend on
   the same rq and exercise the mechanism.

3. If there is no intra-VM overcommit, the effect disappears again.

   For 3 VMs with 5 vCPUs on a 5 pCPU pool per VM, the results were:

     diag9c calls:                 696,548 ->    718,219  (+3.1%)
     Dbench throughput mean:    -0.8%
     Dbench throughput median:  -0.7%

   Again, no meaningful benefit.

So my final takeaway is that on s390 I can only demonstrate a benefit when
the test setup intentionally causes multiple vCPUs of a VM to share
runqueues. Plain pinning does not show an effect, and a matched vCPU:pCPU
configuration such as 5 vCPUs on 5 pCPUs does not either. The interesting
case is specifically vCPU pooling / overcommit onto a smaller pCPU set, not
just "more VMs on the host".

I suppose this mechanism does help once the waiter/holder pair can actually
meet on the same rq. If something similar could somehow target useful
cross-runqueue cases as well, that would seem like a natural way to stretch
this benefit further.

Thanks,
Richie
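P.S. In case it helps anyone reproduce the two placement modes compared above, this is roughly the shape of the libvirt pinning involved. The domain name and pCPU ranges are illustrative, not my exact setup, and the `virsh` stub just prints the commands so the sketch runs standalone:

```shell
# Dry-run sketch of the two placement modes. "vm1" and the pCPU
# ranges are illustrative; remove the stub to pin a real domain.
virsh() { echo "virsh $*"; }  # stub: print instead of executing

# Mode 1: explicit 1:1 pinning, vCPU i -> pCPU i. No two vCPUs of
# the VM share a runqueue, so the yield/deboost path has little to
# act on.
for i in $(seq 0 15); do
    virsh vcpupin vm1 "$i" "$i"
done

# Mode 2: pooling, all 16 vCPUs constrained to pCPUs 0-7, so vCPUs
# of the same VM contend on the same runqueues and can exercise the
# mechanism.
for i in $(seq 0 15); do
    virsh vcpupin vm1 "$i" 0-7
done
```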