From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 585922BB13 for ; Thu, 9 Apr 2026 10:27:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=148.163.156.1 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1775730458; cv=none; b=hgHfPgnegQpZa16UiZKlc0HFPgmNWDA7Ynv8TYfFfmZ4Mz/osIddcrmyxMS9gvYlAizX1lOVLC/SVhnN0NtG7oyWJIwa3knLGcKsIKlK4XAcUwf86xVTmndWfb0jn5+JLODq/sX0+MoTJgGB8kSl3zSQkW85zLVVLgG2w1Mgd34= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1775730458; c=relaxed/simple; bh=W2zYB8NgpUiM1J1JyUkmb82QbLZSbEUNZHD44OlcSTQ=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=Kml4OjelwVNysZimKni1k42QdUp860IzilAdTLQrctM95GrtAI2XqFu5QzRbI+LH/uHKsjhzYqUkLUq0bN/sSXmfgJdnQs6+wenYpWJ3lJYRNnH8xjpEI1kjR7y0qUm41pSuul8fMGkAx4NJ+62QpElaGwEq5+XOPD8pWJjNtmI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.ibm.com; spf=pass smtp.mailfrom=linux.ibm.com; dkim=pass (2048-bit key) header.d=ibm.com header.i=@ibm.com header.b=s8pmgGsV; arc=none smtp.client-ip=148.163.156.1 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.ibm.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.ibm.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=ibm.com header.i=@ibm.com header.b="s8pmgGsV" Received: from pps.filterd (m0356517.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.18.1.11/8.18.1.11) with ESMTP id 638ITR1T2302913; Thu, 9 Apr 2026 10:27:26 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=cc 
:content-transfer-encoding:content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to; s=pp1; bh=qATZLN rysBwvqjTF59rale1ggq4pH0ZyEkpEp7GNAdg=; b=s8pmgGsVX4pwyGhZ3YuNHz 6AGHBX/JdZ+CQQNfy4oRlz8aY7LzB+Xe/X2M6CwvSlSeGxObMAtkr2wGDLgkf/vy ggCwiuh/RuYxOuVzesyWp7E9VkR85OsXg8kCxHsbY70JpwE0rPEeJINSJ0wRhwfu Z9PUvrKrG4T4YpwKtI+6Tdb8T8GHadINDAvnOdII+kBVE11CepXDn+jTXUI8cD24 K1vrvOMscy8fqxJDgqQzFYGZwNf2hILCOy7Tku847ITQceZy9/NMiZ2ajylEy3/e Izh3DMFZ7a3paJpUUcmPucjCtk/vyP0EAGSOddwhSdug/hZQRQa5iQoU4mbv3nlA == Received: from ppma22.wdc07v.mail.ibm.com (5c.69.3da9.ip4.static.sl-reverse.com [169.61.105.92]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 4dcn2fmexy-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 09 Apr 2026 10:27:26 +0000 (GMT) Received: from pps.filterd (ppma22.wdc07v.mail.ibm.com [127.0.0.1]) by ppma22.wdc07v.mail.ibm.com (8.18.1.2/8.18.1.2) with ESMTP id 6397x9qC007881; Thu, 9 Apr 2026 10:27:25 GMT Received: from smtprelay07.fra02v.mail.ibm.com ([9.218.2.229]) by ppma22.wdc07v.mail.ibm.com (PPS) with ESMTPS id 4dcmg2k619-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 09 Apr 2026 10:27:24 +0000 Received: from smtpav02.fra02v.mail.ibm.com (smtpav02.fra02v.mail.ibm.com [10.20.54.101]) by smtprelay07.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 639ARN4Y39059756 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 9 Apr 2026 10:27:23 GMT Received: from smtpav02.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id F1B372004B; Thu, 9 Apr 2026 10:27:22 +0000 (GMT) Received: from smtpav02.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id A4DE320043; Thu, 9 Apr 2026 10:27:21 +0000 (GMT) Received: from [9.124.214.232] (unknown [9.124.214.232]) by smtpav02.fra02v.mail.ibm.com (Postfix) with ESMTP; Thu, 9 Apr 2026 10:27:21 +0000 (GMT) Message-ID: Date: Thu, 9 Apr 2026 15:57:20 +0530 Precedence: bulk 
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff
To: Hillf Danton
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, Sean Christopherson, vincent.guittot@linaro.org, yury.norov@gmail.com, kprateek.nayak@amd.com
References: <20260409051556.1637-1-hdanton@sina.com>
Content-Language: en-US
From: Shrikanth Hegde
In-Reply-To: <20260409051556.1637-1-hdanton@sina.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-TM-AS-GCONF: 00
X-Proofpoint-Reinject: loops=2 maxloops=12
X-Proofpoint-Spam-Details-Enc: AW1haW4tMjYwNDA5MDA5MiBTYWx0ZWRfX3yNJ+5Z6l1EP
 WpYwDpdn0XPo/DRCaupJgxch9IXKZyORZ8wHFxbjRtMs1XtZciBLfGP1aEMTvN8J0UBtfmNomDt
 BCGB0BpRzklANAo+SFsaCV0hxg7qYOpDQC2Xb3FyYfx+WtBzBlDyTQj+rSg65OFZhtDjIQ1hjNo
 eDhxb5M9FLz94bt5V6pOKpvaKIlaKvpvKvdhWJQFn9l9pha1eAzCf057oxPmnYkxj5adpluQmdr
 4HbMh+siIx1xD/glkoQDYQhXetZoZ2hnf6pO9rSJSKy+iTslhIRjWgto6jP3SxpHq/HhnEt94gv
 O9NVKE6cbguMOhXHOZ4KynQmIUz4VrmnD1N4GRf0PB9TG3lpy49+sUG7LOD+j2SxYTMEhOFE+Cr
 xGbYzA0VHfEgr2dNaw5n0wJzkelrbLyHMDbFs3sFPmzIOHj8/CAigxlx8pDxI6eT7d2JSn2RX12
 akZD1jNKDaKGKvPE8RQ==
X-Authority-Analysis: v=2.4 cv=FsY1OWrq c=1 sm=1 tr=0 ts=69d77f0e cx=c_pps
 a=5BHTudwdYE3Te8bg5FgnPg==:117 a=5BHTudwdYE3Te8bg5FgnPg==:17
 a=IkcTkHD0fZMA:10 a=A5OVakUREuEA:10 a=VkNPw1HP01LnGYTKEx00:22
 a=RnoormkPH1_aCDwRdu11:22 a=U7nrCbtTmkRpXpFmAIza:22 a=eR8Jdg37o02I9HhQGKEA:9
 a=3ZKOabzyN94A:10 a=QEXdDO2ut3YA:10
X-Proofpoint-ORIG-GUID: bZhXe7n5uXBYcyumtT7UFZgTnC6C6gJM
X-Proofpoint-GUID: 6ILQ8nCqhpSmDTq9tobhxWsIyYABDTdV
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.293,Aquarius:18.0.1143,Hydra:6.1.51,FMLib:17.12.100.49
 definitions=2026-04-09_02,2026-04-09_01,2025-10-01_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 bulkscore=0 priorityscore=1501 impostorscore=0 spamscore=0
 phishscore=0 lowpriorityscore=0 clxscore=1015 adultscore=0 malwarescore=0 suspectscore=0 classifier=typeunknown authscore=0 authtc= authcc= route=outbound adjust=0 reason=mlx scancount=1 engine=8.22.0-2604010000 definitions=main-2604090092

Hi Hillf.

On 4/9/26 10:45 AM, Hillf Danton wrote:
> On Wed, 8 Apr 2026 19:19:05 +0530 Shrikanth Hegde wrote:
>> On 4/8/26 3:44 PM, Hillf Danton wrote:
>>> On Wed, 8 Apr 2026 00:49:33 +0530 Shrikanth Hegde wrote:
>>>> Core idea is:
>>>> - Maintain set of CPUs which can be used by workload. It is denoted as
>>>>   cpu_preferred_mask
>>>> - Periodically compute the steal time. If steal time is high/low based
>>>>   on the thresholds, either reduce/increase the preferred CPUs.
>>>> - If a CPU is marked as non-preferred, push the task running on it if
>>>>   possible.
>>>> - Use this CPU state in wakeup and load balance to ensure tasks run
>>>>   within preferred CPUs.
>>>>
>>>> For the host kernel, there is no steal time, so no changes to its preferred
>>>> CPUs. So series would affect only the guest kernels.
>>>>
>>> Changes are added to guest in order to detect if pCPU is overloaded, and if
>>> that is true (I mean it is layer violation), why not ask the pCPU governor,
>>> hypervisor, to monitor the loads on pCPU and migrate vCPUs forth and back
>>> if necessary.
>>>
>>
>> AFAIK, there in no information in the host scheduler on what
>> each vCPU is running. It maybe holding a mutex, spinlock with irq disabled
>
> This is what layer means (particularly in the data center environment).
>

Host / hypervisor scheduler
- Schedules vCPU threads as opaque entities.
  Has no visibility into:
  - whether a vCPU is holding a spinlock
  - whether IRQs are disabled
  - whether a guest mutex is contended
  - guest scheduler state
  Can only ensure fairness between vCPUs

Guest scheduler
  Knows exact task-level semantics
  - lock ownership
  - preemption state
  - affinity constraints.
  But does not control pCPUs directly, unless there is vCPU pinning.

Steal time is precisely the contract boundary between those layers.

So, this is not a layer violation. The guest is acting on its own CPUs based on
the hint which the host already provides.

An actual layer violation would be:
- host peeking into guest scheduler data
- host deciding which guest vCPUs are “important”
- host understanding guest locks or IRQ state

Or I am not understanding what you mean by layer violation. If so, please
explain it to me.

Today, why is steal time being reported? So that the guest/host can make an
appropriate decision, right?

When you see high steal values, you have two choices: either increase the
underlying resource by re-partitioning the host with more cores, or reduce the
incoming requests from the guest so that the host can meet them. If the host is
already at max cores, then there is only one option.

One could say that with this series high steal values may not be seen, so how
will the system admin know to re-size the host? Just look at preferred vs
online. If they are not the same, then there was contention and preferred
became a subset of online. We might have to update the steal time section of
the documentation.

>> or maybe in interrupt context. Moving/migrating the vCPUs threads without
>> that knowledge will hurt the guest. And it has to ensure fairness.
>>
> We have to pay the cost for vCPU.
>
>> This has to work across different archs, some have linux as hypervisor, some
>> has non-linux hypervisor such as powerpc, s390.
>>
> Yeah, in the car cockpit product environment in Shenzhen Linux, Android and
> XYZ guests run on QNX, and your steal time approach looks half baked.
>

They likely don't have this problem. IIUC, they would prefer deterministic
behavior in automotive hypervisors. Having steal time brings unbounded latency.

If the guests are not linux, then yes, the same logic will have to be there
in each guest. But that problem exists in the other direction too.
The guest has to inform the host somehow which of its vCPU threads are important.
That is going to be way more complex IMHO.
Even in linux we don't have that interface today. And then the same would have
to be repeated in every non-linux guest. One could say that is even worse.

If the guests are all indeed linux, then the solution would work just fine.

Just to reiterate:
- Host kernel - No change, as it can't have the steal time construct. Minimal overhead.
- Guest without steal time - No functional change. Minimal overhead.
- Guest with steal time, NO_STEAL_MONITOR - No functional change. Minimal overhead.
- Guest with steal time, STEAL_MONITOR - Functional changes: steal-driven vCPU backoff.

>> Steal time in guest is common construct in all archs. I don't think such
>> commonality exists in host schedulers.
>>
>> If done in guest, guest actually knows what it is running and whats more important.
>> It can make better decisions IMHO.
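
For readers following the thread, here is a minimal userspace sketch of the
backoff loop from the quoted "Core idea" list, plus the preferred-vs-online
check mentioned above. It is a toy model and not code from the series: the
struct, the function names, the 20%/5% thresholds, the 64-CPU bitmask, and the
policy of dropping the highest-numbered preferred CPU one step at a time are
all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical thresholds, in percent of the sampling window. */
#define STEAL_HIGH_PCT 20
#define STEAL_LOW_PCT   5

/* Toy guest with up to 64 vCPUs; bit i set means CPU i is in the mask. */
struct guest_cpus {
	uint64_t online;
	uint64_t preferred;	/* kept as a subset of online */
};

/* Number of CPUs set in a mask. */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Steal time as a percentage of the elapsed sampling window. */
static unsigned int steal_pct(uint64_t steal_delta, uint64_t total_delta)
{
	assert(total_delta > 0);
	return (unsigned int)(steal_delta * 100 / total_delta);
}

/*
 * One periodic step (assumed policy): shrink preferred by one CPU when
 * steal is high, grow it by one CPU when steal is low, otherwise hold.
 * At least one CPU always stays preferred.
 */
static void steal_monitor_tick(struct guest_cpus *g, unsigned int pct)
{
	if (pct > STEAL_HIGH_PCT && mask_weight(g->preferred) > 1) {
		int cpu = 63;

		/* Find and drop the highest-numbered preferred CPU. */
		while (!(g->preferred & (1ULL << cpu)))
			cpu--;
		g->preferred &= ~(1ULL << cpu);
	} else if (pct < STEAL_LOW_PCT) {
		uint64_t candidates = g->online & ~g->preferred;

		/* Re-add the lowest-numbered online, non-preferred CPU. */
		g->preferred |= candidates & (~candidates + 1);
	}
}

/* Admin-side hint: preferred differing from online implies contention. */
static int saw_contention(const struct guest_cpus *g)
{
	return g->preferred != g->online;
}
```

With four online CPUs (mask 0xF), a tick at 30% steal leaves preferred at 0x7
and saw_contention() returns true; a later tick at 2% steal restores 0xF. Task
pushing, wakeup and load-balance filtering are deliberately left out of the
sketch.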