From mboxrd@z Thu Jan  1 00:00:00 1970
From: Shrikanth Hegde
To: linux-kernel@vger.kernel.org, mingo@kernel.org, peterz@infradead.org,
 juri.lelli@redhat.com, vincent.guittot@linaro.org, yury.norov@gmail.com,
 kprateek.nayak@amd.com, iii@linux.ibm.com
Cc: sshegde@linux.ibm.com, tglx@kernel.org, gregkh@linuxfoundation.org,
 pbonzini@redhat.com, seanjc@google.com, vschneid@redhat.com,
 huschle@linux.ibm.com, rostedt@goodmis.org, dietmar.eggemann@arm.com,
 mgorman@suse.de, bsegall@google.com, maddy@linux.ibm.com,
 srikar@linux.ibm.com, hdanton@sina.com, chleroy@kernel.org,
 vineeth@bitbyteword.org, frederic@kernel.org, arighi@nvidia.com,
 pauld@redhat.com, christian.loehle@arm.com, tj@kernel.org,
 tommaso.cucinotta@gmail.com, maz@kernel.org, rafael@kernel.org
Subject: [PATCH v3 02/20] sched/docs: Document cpu_preferred_mask and Preferred CPU concept
Date: Thu, 14 May 2026 20:51:46 +0530
Message-ID: <20260514152204.481115-3-sshegde@linux.ibm.com>
X-Mailer: git-send-email 2.51.0
In-Reply-To: <20260514152204.481115-1-sshegde@linux.ibm.com>
References: <20260514152204.481115-1-sshegde@linux.ibm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Add documentation for the new cpumask called cpu_preferred_mask. This should
help users understand what the mask is and the concept behind it. Document
how to enable the feature and the implementation aspects of it.

Signed-off-by: Shrikanth Hegde
---
 Documentation/scheduler/sched-arch.rst | 49 ++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..3f7de70dc97f 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,55 @@ Your cpu_idle routines need to obey the following rules:
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
 
+Preferred CPUs
+==============
+
+In virtualised environments it is possible to overcommit CPU resources, i.e.
+the sum of virtual CPUs (vCPUs) across all VMs is greater than the number of
+physical CPUs (pCPUs). Under such conditions, when all or many VMs have high
+utilization, the hypervisor cannot satisfy the CPU demand and has to context
+switch within or across VMs, i.e. it has to preempt one vCPU to run another.
+This is called vCPU preemption, and it is more expensive than a task context
+switch within a vCPU.
+
+In such cases it is better that VMs co-ordinate among themselves and ask for
+less CPU by not using some of their vCPUs. vCPUs on which a workload can be
+scheduled without increasing pCPU contention are called "Preferred CPUs".
+
+In most cases preferred CPUs are the same as online CPUs. When there is pCPU
+contention, the set of preferred CPUs shrinks based on the amount of steal
+time. When the contention goes away, as indicated by steal time, preferred
+CPUs become the same as online CPUs again. The feature has to be enabled by
+writing 1 to /sys/kernel/debug/sched/steal_monitor/enable.
+
+By design, preferred CPUs are always a subset of online CPUs. With
+CONFIG_PREFERRED_CPU=n, preferred CPUs are the same as online CPUs.
+
+Scheduling decisions such as wakeup and pushing a task need this CPU state
+information. It is maintained in cpu_preferred_mask.
+
+vCPUs which are not in cpu_preferred_mask should not be used at this moment,
+provided that doesn't break user affinity. This is achieved by:
+
+1. Selecting only a preferred CPU at wakeup.
+2. Pushing the task away from a non-preferred CPU at tick.
+3. Selecting only preferred CPUs for load balance.
+
+/sys/devices/system/cpu/preferred prints the current cpu_preferred_mask in
+cpulist format.
+
+Notes:
+
+1. This feature is available under CONFIG_PREFERRED_CPU.
+2. This feature works for the FAIR and RT classes.
+3. A pinned task which can't be moved to preferred CPUs will continue
+   to run based on its affinity, but no load balancing happens.
+4. If needed, steal time based governors or an arch dependent method
+   could be used to cater to different types of cpu numbers.
+   Arch can do so by implementing its own hooks.
+5. The decision to use or not use a CPU is driven by the kernel. Hence it
+   shouldn't break user affinities. This is one of the main reasons why CPU
+   hotplug or isolated cpuset partitions were not a solution.
 
 Possible arch/ problems
 =======================
-- 
2.47.3
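
For readers who want to inspect the mask from user space, here is a minimal
sketch (not part of the patch) of how the cpulist output and the affinity rule
described in the documentation could be interpreted. The helper names
parse_cpulist() and usable_cpus() are hypothetical, the parser only handles
the plain "N" and "N-M" forms, and usable_cpus() is only a model of the
documented rule, not the scheduler's actual logic.

```python
def parse_cpulist(text):
    """Parse a cpulist string such as "0-3,8,10-11" into a set of CPU ids.

    Only the plain "N" and "N-M" forms are handled (an assumption about
    what the sysfs file prints).
    """
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # an empty mask is printed as an empty line
        first, _, last = chunk.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def usable_cpus(affinity, preferred):
    """Model of the documented rule (hypothetical helper).

    Use only preferred CPUs when the task's affinity allows at least one;
    otherwise keep the full affinity so a pinned task continues to run.
    """
    allowed = affinity & preferred
    return allowed if allowed else set(affinity)


def read_mask(path):
    """Read a cpulist file such as /sys/devices/system/cpu/preferred."""
    with open(path) as f:
        return parse_cpulist(f.read())


# Sample strings stand in for the sysfs files, which exist only on a
# kernel carrying this series.
online = parse_cpulist("0-7\n")
preferred = parse_cpulist("0-3\n")
assert preferred <= online  # preferred CPUs are a subset of online CPUs

print(sorted(usable_cpus({2, 3, 6}, preferred)))  # [2, 3]
print(sorted(usable_cpus({6}, preferred)))        # [6]
```

On a kernel with the series applied, read_mask("/sys/devices/system/cpu/preferred")
and read_mask("/sys/devices/system/cpu/online") could replace the sample
strings.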