From: Shrikanth Hegde <sshegde@linux.ibm.com>
To: linux-kernel@vger.kernel.org, mingo@kernel.org,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, yury.norov@gmail.com,
kprateek.nayak@amd.com, iii@linux.ibm.com
Cc: sshegde@linux.ibm.com, tglx@kernel.org,
gregkh@linuxfoundation.org, pbonzini@redhat.com,
seanjc@google.com, vschneid@redhat.com, huschle@linux.ibm.com,
rostedt@goodmis.org, dietmar.eggemann@arm.com, mgorman@suse.de,
bsegall@google.com, maddy@linux.ibm.com, srikar@linux.ibm.com,
hdanton@sina.com, chleroy@kernel.org, vineeth@bitbyteword.org,
frederic@kernel.org, arighi@nvidia.com, pauld@redhat.com,
christian.loehle@arm.com, tj@kernel.org,
tommaso.cucinotta@gmail.com, maz@kernel.org, rafael@kernel.org
Subject: [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff
Date: Thu, 14 May 2026 20:51:44 +0530
Message-ID: <20260514152204.481115-1-sshegde@linux.ibm.com>
This version follows the OSPM26 discussion[1]. There was a good
discussion around this problem, and there was feedback on some of the
implementation bits. Some of it has been tried/implemented and some has
been deferred.
*** Review and feedback is much appreciated!! ***
[1]: https://youtu.be/adxUKFPlOp0
Briefly, the core idea is:
- Maintain a set of CPUs which can be used by the workload. It is
  denoted as cpu_preferred_mask.
- Periodically compute the steal time. If the steal time crosses the
  high/low thresholds, reduce/increase the number of preferred CPUs.
- If a CPU is marked as non-preferred, push the task running on it away
  if possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
  within the preferred CPUs.
For more details on the idea, problem statement and performance numbers,
please refer to the cover letter of v2[2] and the OSPM talk[1].
==========================================================================
Note: This series expects the dependent series mentioned below to be
applied on base (tip/master)
base: 4d034938b6b1 ("Merge branch into tip/master: 'x86/tdx'")
Dependent series: https://lore.kernel.org/all/20260513133934.380347-1-sshegde@linux.ibm.com/#t
==========================================================================
Changes since v2[2]:
- Introduce a new config CONFIG_PREFERRED_CPU and make the user select
  the config for this feature. This was suggested by Yury Norov.
  It removes the dependency on PARAVIRT, which should make the s390
  folks happy.
- With CONFIG_PREFERRED_CPU=n, the preferred state is the same as the
  online state.
- With CONFIG_PREFERRED_CPU=y, maintain the design invariant that
  preferred is always a subset of online.
- Create a debugfs folder called steal_monitor under sched. Moved away
  from sched_feat since there is no easy way to run additional code on
  enable/disable. This is essential when the feature is disabled and
  preferred has to be made the same as online to maintain that invariant.
- With feature=off, the preferred state is the same as the online state.
  The feature is still based on a static key to avoid any runtime
  overhead.
- Prevent the ifdeffery from spreading to many files. Now the ifdeffery
  is mainly confined to */sched.h, cpumask.h and debug.c. Some ifdeffery
  has been kept to avoid code bloat and to avoid introducing debug files
  when config=n.
- Using the active mask instead of the preferred mask (one of the ideas
  suggested) was tried. When there is high steal time, a CPU marked as
  not-active isn't available to a workload which pins to it. That would
  break user affinities. Also, the active mask is heavily used and well
  known, so it was decided not to use it.
- Support the feature for CONFIG_SCHED_SMT=n. Note that some may have
  interpreted my earlier comment as being about supporting SMT or not.
  It was actually about CONFIG_SCHED_SMT=n (which is rare, btw). The
  issue was the ifdeffery around cpu_smt_mask, which was not pretty.
  With the effort to remove that ifdeffery [3], this series supports
  CONFIG_SCHED_SMT=n too.
- Introduce arch-specific handling for inc/dec of preferred CPUs. This
  was a request from s390, as it may have good hints from HW about which
  specific CPUs to take out. I am hoping the current hooks will work for
  s390. Please let me know if they do.
- Added comments around the O(N^2) complexity in rare cases in
  select_fallback_rq. (Yury Norov)
- irqbalance=n was considered not important. It was quite hard to send
  interrupts to non-preferred CPUs as well. A patch was sent[4] as a
  reply to the previous version which covers irqbalance=y.
- Performance numbers from v2 (x86, powerpc, s390) showed nice
  improvements in some cases without any major regression. Numbers are
  expected to be similar for this series.
==========================================================================
TODO/OPEN Questions:
- SCHED_EXT support is still pending. I tried adding a few checks in
  scx_idle_test_and_clear_cpu and pick_idle_cpu_in_node, and pushing the
  sched_ext task in the tick, but it hasn't worked with scx_simple yet.
  I will try to figure it out, but I may need help since I am yet to
  wade into the deeper waters of sched_ext.
- Use a PELT-like signal to smooth the steal time. This may help avoid
  oscillations. The current approach works to a certain extent.
- NUMA splicing when dec/inc-ing preferred CPUs. Left out for now as the
  simple method works quite well, and NUMA splicing is going to be
  heavy. Is it really necessary? Are there common topologies with weird
  CPU distributions across NUMA nodes?
- Consider not changing the state of isolcpus, since one usually pins
  the workload on them anyway. Not a typical use case though.
- Corner cases when there are multiple VMs and each may have only one
  core. Are those cases worth a look?
- Add cpumask_check at appropriate places.
- Currently it works if all the guests enable the feature. If not, one
  guest may take advantage of another. Does that need to be fixed? Since
  this has to be enabled by admins, is that still a valid concern?
[2] v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@linux.ibm.com/#t
[3]: https://lore.kernel.org/all/20260506110052.9974-1-sshegde@linux.ibm.com/#t
[4]: https://lore.kernel.org/all/8beafb01-f891-4b13-8eae-c6f3face7001@linux.ibm.com/
PS: There were several suggestions in the OSPM discussion; some have
been incorporated, those intentionally deferred are mentioned above
(such as sched_ext), and the rest might have been overlooked.
Please let me know if any specific suggestion should be prioritized
or reconsidered. Please review.
Shrikanth Hegde (20):
sched/debug: Remove unused schedstats
sched/docs: Document cpu_preferred_mask and Preferred CPU concept
kconfig: Provide PREFERRED_CPU option
cpumask: Introduce cpu_preferred_mask
sysfs: Add preferred CPU file
sched/core: allow only preferred CPUs in is_cpu_allowed
sched/fair: Select preferred CPU at wakeup when possible
sched/fair: load balance only among preferred CPUs
sched/rt: Select a preferred CPU for wakeup and pulling rt task
sched/core: Keep tick on non-preferred CPUs until tasks are out
sched/core: Push current task from non preferred CPU
sched/debug: Add migration stats due to non preferred CPUs
sched/debug: Create debugfs folder steal_monitor
sched/debug: Provide debugfs to enable/disable steal monitor
sched/core: Introduce a simple steal monitor
sched/core: Compute steal values at regular intervals
sched/core: Introduce default arch handling code for inc/dec preferred
CPUs
sched/core: Handle steal values and mark CPUs as preferred
sched/core: Mark the direction of steal values to avoid oscillations
sched/debug: Add debug knobs for steal monitor
.../ABI/testing/sysfs-devices-system-cpu | 11 +
Documentation/scheduler/sched-arch.rst | 49 ++++
Documentation/scheduler/sched-debug.rst | 32 +++
drivers/base/cpu.c | 8 +
include/linux/cpumask.h | 21 +-
include/linux/sched.h | 21 +-
kernel/Kconfig.preempt | 13 +
kernel/cpu.c | 16 ++
kernel/sched/core.c | 255 +++++++++++++++++-
kernel/sched/cpupri.c | 1 +
kernel/sched/debug.c | 51 +++-
kernel/sched/fair.c | 6 +-
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 27 ++
14 files changed, 505 insertions(+), 10 deletions(-)
--
2.47.3