From: Shrikanth Hegde <sshegde@linux.ibm.com>
To: linux-kernel@vger.kernel.org, mingo@kernel.org,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, yury.norov@gmail.com,
kprateek.nayak@amd.com, iii@linux.ibm.com
Cc: sshegde@linux.ibm.com, tglx@kernel.org,
gregkh@linuxfoundation.org, pbonzini@redhat.com,
seanjc@google.com, vschneid@redhat.com, huschle@linux.ibm.com,
rostedt@goodmis.org, dietmar.eggemann@arm.com, mgorman@suse.de,
bsegall@google.com, maddy@linux.ibm.com, srikar@linux.ibm.com,
hdanton@sina.com, chleroy@kernel.org, vineeth@bitbyteword.org,
frederic@kernel.org, arighi@nvidia.com, pauld@redhat.com,
christian.loehle@arm.com, tj@kernel.org,
tommaso.cucinotta@gmail.com, maz@kernel.org, rafael@kernel.org
Subject: [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff
Date: Thu, 14 May 2026 20:51:44 +0530
Message-ID: <20260514152204.481115-1-sshegde@linux.ibm.com>
This version follows the OSPM26 discussion[1]. There was a good
discussion around this problem, and there was feedback on some of the
implementation bits. Some of it has been tried/implemented and some has
been deferred.
*** Review and feedback is much appreciated!! ***
[1]: https://youtu.be/adxUKFPlOp0
Briefly, the core idea is:
- Maintain a set of CPUs which can be used by the workload. It is
  denoted as cpu_preferred_mask.
- Periodically compute the steal time. If the steal time crosses the
  high/low thresholds, reduce/increase the number of preferred CPUs.
- If a CPU is marked as non-preferred, push the task running on it away
  if possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
  within the preferred CPUs.
For more details on the idea, problem statement and performance numbers,
please refer to the cover letter of v2[2] and the OSPM talk[1].
==========================================================================
Note: This series expects the dependent series mentioned below to be
applied on base (tip/master)
base: 4d034938b6b1 ("Merge branch into tip/master: 'x86/tdx'")
Dependent series: https://lore.kernel.org/all/20260513133934.380347-1-sshegde@linux.ibm.com/#t
==========================================================================
Changes since v2[2]:
- Introduce a new config CONFIG_PREFERRED_CPU and make the user select
  the config for this feature. This was suggested by Yury Norov.
  It removes the dependency on PARAVIRT, which should make the s390
  folks happy.
- With CONFIG_PREFERRED_CPU=n, the preferred state is the same as the
  online state.
- With CONFIG_PREFERRED_CPU=y, maintain the design invariant that
  preferred is always a subset of online.
- Create a debugfs folder called steal_monitor under sched. Moved away
  from sched_feat since there is no easy way to run additional code on
  enable/disable. This is essential when the feature is disabled and
  preferred has to be made the same as online to maintain that invariant.
- With feature=off, the preferred state is the same as the online state.
  The feature is still based on a static key to avoid any runtime
  overhead.
- Prevent the ifdeffery from spreading to many files. Now the ifdeffery
  is mainly confined to */sched.h, cpumask.h and debug.c. Some ifdeffery
  has been kept to avoid code bloat and to avoid introducing debug files
  when config=n.
- Using the active mask instead of the preferred mask (one of the ideas
  suggested) was tried. When there is high steal time, a CPU marked as
  not-active isn't available to a workload which pins to it. That would
  break user affinities. Also, the active mask is heavily used and well
  known, so it was decided not to use it.
- Support the feature for CONFIG_SCHED_SMT=n. Note that some may have
  interpreted my earlier comment as being about supporting SMT or not.
  It was actually about CONFIG_SCHED_SMT=n (which is rare, btw). The
  issue was the ifdeffery around cpu_smt_mask, which was not pretty.
  With the effort to remove that ifdeffery [3], this series supports
  CONFIG_SCHED_SMT=n too.
- Introduce arch-specific handling for inc/dec of preferred CPUs. This
  was a request from s390, as it may have good hints from HW about which
  specific CPUs to take out. I am hoping the current hooks will work for
  s390. Please let me know if they do.
- Added comments around the O(N^2) complexity in rare cases in
  select_fallback_rq. (Yury Norov)
- irqbalance=n was considered not important. It was quite hard to send
  interrupts to non-preferred CPUs as well. A patch was sent[4] as a
  reply to the previous version which covers irqbalance=y.
- Performance numbers from v2 (x86, powerpc, s390) showed nice
  improvements in some cases without any major regression. Numbers are
  expected to be similar for this series.
==========================================================================
TODO/OPEN Questions:
- SCHED_EXT support is still pending. I tried adding a few checks in
  scx_idle_test_and_clear_cpu and pick_idle_cpu_in_node, and pushing the
  sched_ext task in the tick, but it hasn't worked with scx_simple yet.
  I will try to figure it out, but I may need help since I am yet to
  wade into the deeper waters of sched_ext.
- Use a PELT-like signal to smooth the steal time. This may help avoid
  oscillations. The current approach works to a certain extent.
- NUMA splicing when dec/inc-ing preferred CPUs. Left out for now as the
  simple method works quite well, and NUMA splicing is going to be
  heavy. Is it really necessary? Are there common topologies with weird
  CPU distributions across NUMA nodes?
- Consider not changing the state of isolcpus, since one usually pins
  the workload on them anyway. Not a typical use case though.
- Corner cases when there are multiple VMs and each may have only one
  core. Are those cases worth a look?
- Add cpumask_check at appropriate places.
- Currently it works if all the guests enable the feature. If not, one
  guest may take advantage of another. Does that need to be fixed? Since
  this has to be enabled by admins, is that still a valid concern?
[2] v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@linux.ibm.com/#t
[3]: https://lore.kernel.org/all/20260506110052.9974-1-sshegde@linux.ibm.com/#t
[4]: https://lore.kernel.org/all/8beafb01-f891-4b13-8eae-c6f3face7001@linux.ibm.com/
PS: There were several suggestions in the OSPM discussion; some have
been incorporated, those intentionally deferred are mentioned above
(such as sched_ext), and the rest might have been overlooked.
Please let me know if any specific suggestion should be prioritized
or reconsidered. Please review.
Shrikanth Hegde (20):
sched/debug: Remove unused schedstats
sched/docs: Document cpu_preferred_mask and Preferred CPU concept
kconfig: Provide PREFERRED_CPU option
cpumask: Introduce cpu_preferred_mask
sysfs: Add preferred CPU file
sched/core: allow only preferred CPUs in is_cpu_allowed
sched/fair: Select preferred CPU at wakeup when possible
sched/fair: load balance only among preferred CPUs
sched/rt: Select a preferred CPU for wakeup and pulling rt task
sched/core: Keep tick on non-preferred CPUs until tasks are out
sched/core: Push current task from non preferred CPU
sched/debug: Add migration stats due to non preferred CPUs
sched/debug: Create debugfs folder steal_monitor
sched/debug: Provide debugfs to enable/disable steal monitor
sched/core: Introduce a simple steal monitor
sched/core: Compute steal values at regular intervals
sched/core: Introduce default arch handling code for inc/dec preferred
CPUs
sched/core: Handle steal values and mark CPUs as preferred
sched/core: Mark the direction of steal values to avoid oscillations
sched/debug: Add debug knobs for steal monitor
.../ABI/testing/sysfs-devices-system-cpu | 11 +
Documentation/scheduler/sched-arch.rst | 49 ++++
Documentation/scheduler/sched-debug.rst | 32 +++
drivers/base/cpu.c | 8 +
include/linux/cpumask.h | 21 +-
include/linux/sched.h | 21 +-
kernel/Kconfig.preempt | 13 +
kernel/cpu.c | 16 ++
kernel/sched/core.c | 255 +++++++++++++++++-
kernel/sched/cpupri.c | 1 +
kernel/sched/debug.c | 51 +++-
kernel/sched/fair.c | 6 +-
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 27 ++
14 files changed, 505 insertions(+), 10 deletions(-)
--
2.47.3