From: Shrikanth Hegde
To: linux-kernel@vger.kernel.org, mingo@kernel.org, peterz@infradead.org,
 juri.lelli@redhat.com, vincent.guittot@linaro.org, yury.norov@gmail.com,
 kprateek.nayak@amd.com, iii@linux.ibm.com
Cc: sshegde@linux.ibm.com, tglx@kernel.org, gregkh@linuxfoundation.org,
 pbonzini@redhat.com, seanjc@google.com, vschneid@redhat.com,
 huschle@linux.ibm.com, rostedt@goodmis.org, dietmar.eggemann@arm.com,
 mgorman@suse.de, bsegall@google.com, maddy@linux.ibm.com,
 srikar@linux.ibm.com, hdanton@sina.com, chleroy@kernel.org,
 vineeth@bitbyteword.org, frederic@kernel.org, arighi@nvidia.com,
 pauld@redhat.com, christian.loehle@arm.com, tj@kernel.org,
 tommaso.cucinotta@gmail.com, maz@kernel.org, rafael@kernel.org
Subject: [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff
Date: Thu, 14 May 2026 20:51:44 +0530
Message-ID: <20260514152204.481115-1-sshegde@linux.ibm.com>

This version follows the OSPM26 discussion[1]. There was a good discussion
around this problem, and there was feedback on some of the implementation
bits. Some of the suggestions have been tried/implemented and a few have
been deferred.

*** Review and feedback are much appreciated!! ***

[1]: https://youtu.be/adxUKFPlOp0

Briefly, the core idea is:
- Maintain a set of CPUs which can be used by the workload. It is denoted
  as cpu_preferred_mask.
- Periodically compute the steal time. If the steal time is high or low
  relative to the thresholds, reduce or increase the preferred CPUs
  respectively.
- If a CPU is marked as non-preferred, push the task running on it if
  possible.
- Use this CPU state in wakeup and load balance to ensure tasks run within
  preferred CPUs.

For more details on the idea, problem statement and performance numbers,
please refer to the cover letter of v2[2] and the OSPM talk[1].

==========================================================================
Note: This series expects the dependent series mentioned below to be
applied on the base (tip/master).

base: 4d034938b6b1 ("Merge branch into tip/master: 'x86/tdx'")
Dependent series: https://lore.kernel.org/all/20260513133934.380347-1-sshegde@linux.ibm.com/#t
==========================================================================

Changes since v2[2]:
- Introduce a new config CONFIG_PREFERRED_CPU and make the user select the
  config for this feature. This was suggested by Yury Norov.
  This removes the dependency on PARAVIRT, which would make s390 folks
  happy.
- With CONFIG_PREFERRED_CPU=n, the preferred state is the same as the
  online state.
- With CONFIG_PREFERRED_CPU=y, always maintain a design construct such
  that preferred is a subset of online.
- Create a debugfs folder called steal_monitor in sched. Move away from
  sched_feat since there is no easy way to call additional code when
  doing enable/disable. This is essential when one disables the feature,
  because preferred then has to be made the same as online to maintain
  that construct.
- With feature=off, the preferred state is the same as the online state.
  The feature is still based on a static key to avoid any runtime
  overhead.
- Prevent the ifdeffery from spreading to many files. The ifdeffery is now
  confined mainly to */sched.h, cpumask.h and debug.c. Some ifdeffery has
  been kept to avoid code bloat and to avoid introducing debug files when
  config=n.
- Using the active mask instead of a preferred mask (one of the ideas
  suggested). This was tried. When there is high steal time, a CPU marked
  as not-active isn't available for workloads which are pinned to it.
  That would break user affinities. Also, the active mask is heavily used
  and well known. So it was decided not to use it.
- Support the feature for CONFIG_SCHED_SMT=y. Note that some may have
  interpreted my earlier comment as being about whether SMT is supported
  or not. It was actually about CONFIG_SCHED_SMT=n (which is rare, btw).
  The limitation was due to the ifdeffery around cpu_smt_mask, which was
  not pretty. With the effort to remove that ifdeffery[3], this series
  supports CONFIG_SCHED_SMT=n too.
- Introduce arch-specific handling for inc/dec of preferred CPUs. This was
  an ask from s390, as it may have a good hint from HW on which specific
  CPUs to take out. I am hoping the current hooks work for s390. Please
  let me know if they work or not.
- Added comments around O(N^2) complexity in rare cases for
  select_fallback_rq. (Yury Norov)
- irqbalance=n was considered not important. It was quite hard to send
  interrupts on non-preferred CPUs as well.
  There was a patch sent[4] as a reply to the previous version which
  covers irqbalance=y.
- Performance numbers from v2 (x86, powerpc, s390) showed nice
  improvements in some cases without any major regression. Numbers are
  expected to be similar for this series.

==========================================================================
TODO/OPEN Questions:
- SCHED_EXT is still pending. I tried adding a few checks in
  scx_idle_test_and_clear_cpu and pick_idle_cpu_in_node, and pushing the
  sched_ext task in the tick. But it still hasn't worked with scx_simple.
  I will try to figure it out, but I may need help since I am yet to wade
  into the deeper waters of sched_ext.
- Use a PELT kind of signal to smoothen the steal time. This may help
  avoid oscillations. The current one works to a certain extent.
- NUMA splicing when dec/inc preferred CPUs. Left it as is for now, since
  the simple method works quite well. NUMA splicing is going to be heavy.
  Is it really necessary? Are there common topologies with weird CPU
  distributions across NUMA?
- Consider not changing the state of isolcpus, since one usually pins the
  workload on them anyway. Not a typical use case though.
- Corner cases when there are multiple VMs and each may have only one
  core. Are those cases worth taking a look at?
- Add cpumask_check at appropriate places.
- Currently it works if all the guests enable the feature. If not, one
  guest may take advantage of the others. Is that to be fixed? Since this
  has to be enabled by admins, is that still a valid concern?

[2]: v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@linux.ibm.com/#t
[3]: https://lore.kernel.org/all/20260506110052.9974-1-sshegde@linux.ibm.com/#t
[4]: https://lore.kernel.org/all/8beafb01-f891-4b13-8eae-c6f3face7001@linux.ibm.com/

PS: There were several suggestions in the OSPM discussion. Some have been
incorporated, those intentionally deferred (such as sched_ext) are
mentioned in the TODO list, and the rest might have been overlooked.
Please let me know if any specific suggestion should be prioritized or
reconsidered.
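
For readers new to the thread, the steal-driven part of the core idea can
be modelled in userspace roughly as below. This is only an illustrative
sketch and not code from the series: the thresholds STEAL_HIGH_PCT and
STEAL_LOW_PCT, the struct steal_model, the helper names, the 64-bit mask,
the choice of which CPU to drop or add back, and the rule of always
keeping one preferred CPU are all assumptions made for the example. It
only shows the shape of "high steal shrinks the preferred set, low steal
grows it, and preferred stays a subset of online".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical thresholds (percent of time stolen); not from the series. */
#define STEAL_HIGH_PCT 10
#define STEAL_LOW_PCT   2

struct steal_model {
	uint64_t online;     /* one bit per online CPU */
	uint64_t preferred;  /* must stay a subset of online */
};

/* Count the set bits in a mask. */
static int mask_weight(uint64_t m)
{
	int n = 0;

	for (; m; m &= m - 1)
		n++;
	return n;
}

/* Assumption: drop the highest-numbered preferred CPU, keep at least one. */
static void dec_preferred(struct steal_model *s)
{
	if (mask_weight(s->preferred) <= 1)
		return;
	for (int cpu = 63; cpu >= 0; cpu--) {
		if (s->preferred & (UINT64_C(1) << cpu)) {
			s->preferred &= ~(UINT64_C(1) << cpu);
			return;
		}
	}
}

/* Assumption: add back the lowest-numbered online CPU not yet preferred. */
static void inc_preferred(struct steal_model *s)
{
	uint64_t candidates = s->online & ~s->preferred;

	if (candidates)
		s->preferred |= candidates & (~candidates + 1);
}

/* One periodic evaluation of the (modelled) steal monitor. */
static void steal_monitor_step(struct steal_model *s, int steal_pct)
{
	if (steal_pct > STEAL_HIGH_PCT)
		dec_preferred(s);
	else if (steal_pct < STEAL_LOW_PCT)
		inc_preferred(s);
	/* Between the two thresholds the preferred set is left unchanged. */

	/* Invariant from the cover letter: preferred is a subset of online. */
	assert((s->preferred & ~s->online) == 0);
}
```

The gap between the two thresholds is what keeps this toy model from
flapping on every sample. The series itself handles oscillations
separately (see the "Mark the direction of steal values" patch), which
this sketch does not attempt to reproduce.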

Please review.

Shrikanth Hegde (20):
  sched/debug: Remove unused schedstats
  sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  kconfig: Provide PREFERRED_CPU option
  cpumask: Introduce cpu_preferred_mask
  sysfs: Add preferred CPU file
  sched/core: allow only preferred CPUs in is_cpu_allowed
  sched/fair: Select preferred CPU at wakeup when possible
  sched/fair: load balance only among preferred CPUs
  sched/rt: Select a preferred CPU for wakeup and pulling rt task
  sched/core: Keep tick on non-preferred CPUs until tasks are out
  sched/core: Push current task from non preferred CPU
  sched/debug: Add migration stats due to non preferred CPUs
  sched/debug: Create debugfs folder steal_monitor
  sched/debug: Provide debugfs to enable/disable steal monitor
  sched/core: Introduce a simple steal monitor
  sched/core: Compute steal values at regular intervals
  sched/core: Introduce default arch handling code for inc/dec preferred CPUs
  sched/core: Handle steal values and mark CPUs as preferred
  sched/core: Mark the direction of steal values to avoid oscillations
  sched/debug: Add debug knobs for steal monitor

 .../ABI/testing/sysfs-devices-system-cpu |  11 +
 Documentation/scheduler/sched-arch.rst   |  49 ++++
 Documentation/scheduler/sched-debug.rst  |  32 +++
 drivers/base/cpu.c                       |   8 +
 include/linux/cpumask.h                  |  21 +-
 include/linux/sched.h                    |  21 +-
 kernel/Kconfig.preempt                   |  13 +
 kernel/cpu.c                             |  16 ++
 kernel/sched/core.c                      | 255 +++++++++++++++++-
 kernel/sched/cpupri.c                    |   1 +
 kernel/sched/debug.c                     |  51 +++-
 kernel/sched/fair.c                      |   6 +-
 kernel/sched/rt.c                        |   4 +
 kernel/sched/sched.h                     |  27 ++
 14 files changed, 505 insertions(+), 10 deletions(-)

--
2.47.3