* [PATCH v3 01/20] sched/debug: Remove unused schedstats
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 02/20] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
` (18 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
nr_migrations_cold, nr_wakeups_passive and nr_wakeups_idle are not
updated anywhere, so remove them.
These are per-process stats, so bumping the schedstats version isn't
necessary.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
include/linux/sched.h | 3 ---
kernel/sched/debug.c | 3 ---
2 files changed, 6 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb5..2c3ad3e0edb5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,7 +550,6 @@ struct sched_statistics {
s64 exec_max;
u64 slice_max;
- u64 nr_migrations_cold;
u64 nr_failed_migrations_affine;
u64 nr_failed_migrations_running;
u64 nr_failed_migrations_hot;
@@ -563,8 +562,6 @@ struct sched_statistics {
u64 nr_wakeups_remote;
u64 nr_wakeups_affine;
u64 nr_wakeups_affine_attempts;
- u64 nr_wakeups_passive;
- u64 nr_wakeups_idle;
#ifdef CONFIG_SCHED_CORE
u64 core_forceidle_sum;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 74c1617cf652..f8a43fc13564 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1301,7 +1301,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P_SCHEDSTAT(wait_count);
PN_SCHEDSTAT(iowait_sum);
P_SCHEDSTAT(iowait_count);
- P_SCHEDSTAT(nr_migrations_cold);
P_SCHEDSTAT(nr_failed_migrations_affine);
P_SCHEDSTAT(nr_failed_migrations_running);
P_SCHEDSTAT(nr_failed_migrations_hot);
@@ -1313,8 +1312,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P_SCHEDSTAT(nr_wakeups_remote);
P_SCHEDSTAT(nr_wakeups_affine);
P_SCHEDSTAT(nr_wakeups_affine_attempts);
- P_SCHEDSTAT(nr_wakeups_passive);
- P_SCHEDSTAT(nr_wakeups_idle);
avg_atom = p->se.sum_exec_runtime;
if (nr_switches)
--
2.47.3
* [PATCH v3 02/20] sched/docs: Document cpu_preferred_mask and Preferred CPU concept
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 01/20] sched/debug: Remove unused schedstats Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 03/20] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
` (17 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Add documentation for the new cpumask called cpu_preferred_mask. This
helps users understand what this mask is and the concept behind it.
Document how to enable it and its implementation aspects.
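The central design invariant is easiest to state in code. A minimal
illustrative check (not part of this patch; it uses the mask accessors
introduced later in this series):

	/* preferred must always be a subset of online */
	WARN_ON_ONCE(!cpumask_subset(cpu_preferred_mask, cpu_online_mask));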
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
Documentation/scheduler/sched-arch.rst | 49 ++++++++++++++++++++++++++
1 file changed, 49 insertions(+)
diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..3f7de70dc97f 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,55 @@ Your cpu_idle routines need to obey the following rules:
arch/x86/kernel/process.c has examples of both polling and
sleeping idle functions.
+Preferred CPUs
+==============
+
+In virtualised environments it is possible to overcommit CPU resources,
+i.e. the sum of virtual CPUs (vCPUs) across all VMs is greater than the
+number of physical CPUs (pCPUs). Under such conditions, when all or many
+VMs have high utilization, the hypervisor cannot satisfy the CPU
+requirement and has to context switch within or across VMs, i.e. it needs
+to preempt one vCPU to run another. This is called vCPU preemption and is
+more expensive than a task context switch within a vCPU.
+
+In such cases it is better that VMs co-ordinate among themselves and ask
+for less CPU by not using some of the vCPUs. The vCPUs where a workload
+can be safely scheduled without increasing contention for pCPUs are
+called "Preferred CPUs".
+
+In most cases preferred CPUs will be the same as online CPUs. When there
+is pCPU contention, the preferred CPUs shrink based on the amount of
+steal time. When the pCPU contention goes away, as indicated by steal
+time, the preferred CPUs become the same as online CPUs again. The
+feature is enabled by writing 1 to /sys/kernel/debug/sched/steal_monitor/enable
+
+One of the design constructs is that preferred CPUs are always a subset
+of online CPUs. With CONFIG_PREFERRED_CPU=n, they are the same as online CPUs.
+
+Scheduling decisions such as wakeup, pushing tasks, etc. need this CPU
+state information. It is maintained in cpu_preferred_mask.
+
+vCPUs which are not in cpu_preferred_mask should be treated as vCPUs that
+should not be used at this moment, provided this doesn't break user
+affinity. This is achieved by:
+1. Selecting only a preferred CPU at wakeup.
+2. Pushing the task away from a non-preferred CPU at tick.
+3. Selecting only preferred CPUs for load balance.
+
+/sys/devices/system/cpu/preferred prints the current cpu_preferred_mask in
+cpulist format.
+
+Notes:
+1. This feature is available under CONFIG_PREFERRED_CPU.
+2. This feature works for the FAIR/RT classes.
+3. A pinned task which can't be moved to preferred CPUs will continue
+   to run based on its affinity, but no load balancing happens.
+4. If needed, steal time based governors or arch dependent methods
+   could be used to cater to different numbers of CPUs.
+   An arch can do so by implementing its own hooks.
+5. The decision to use or not use a CPU is driven by the kernel, hence
+   it shouldn't break user affinities. This is one of the main reasons
+   why CPU hotplug or isolated cpuset partitions were not a solution.
Possible arch/ problems
=======================
--
2.47.3
* [PATCH v3 03/20] kconfig: Provide PREFERRED_CPU option
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 01/20] sched/debug: Remove unused schedstats Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 02/20] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 04/20] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
` (16 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Introduce a new config named PREFERRED_CPU.
This helps to:
- Avoid code bloat when PREFERRED_CPU=n. In that case preferred is
the same as online.
- Avoid ifdeffery around PREFERRED_CPU in many files.
Since the paravirtual use case is the main driving force of this
feature, make it the default for kernels with PARAVIRT=y.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/Kconfig.preempt | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 88c594c6d7fc..495252fef768 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -192,3 +192,16 @@ config SCHED_CLASS_EXT
For more information:
Documentation/scheduler/sched-ext.rst
https://github.com/sched-ext/scx
+
+config PREFERRED_CPU
+ bool "Dynamic vCPU management based on steal time"
+ default y if PARAVIRT && SMP
+ help
+ This feature helps to reduce steal time in a paravirtualised
+ environment, thereby reducing vCPU preemption. Reducing vCPU
+ preemption reduces lock holder preemption and the cost of vCPU
+ preemption in the host.
+
+ By default preferred CPUs will be the same as online CPUs. Depending
+ on the steal time, when the steal monitor is enabled, preferred CPUs
+ could become a subset of online CPUs.
--
2.47.3
* [PATCH v3 04/20] cpumask: Introduce cpu_preferred_mask
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (2 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 03/20] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 05/20] sysfs: Add preferred CPU file Shrikanth Hegde
` (15 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
This patch does:
- Declare and define cpu_preferred_mask.
- Add get/set helpers for it.
Values are set/cleared by the scheduler based on the observed steal time
values. A CPU is set as preferred when it comes online. Later it may be
marked as non-preferred depending on steal time values when the steal
monitor is enabled.
Always maintain the design construct that preferred is a subset of online.
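A minimal usage sketch of the new helpers (illustrative only; the
function name and the pr_debug() are hypothetical, not part of this
patch):

	/* retire one CPU from the preferred set */
	static void example_retire_cpu(unsigned int cpu)
	{
		/* keep the invariant: preferred is a subset of online */
		if (!cpu_online(cpu))
			return;

		set_cpu_preferred(cpu, false);

		if (!cpu_preferred(cpu))
			pr_debug("CPU%u marked non-preferred\n", cpu);
	}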
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
include/linux/cpumask.h | 21 ++++++++++++++++++++-
kernel/cpu.c | 16 ++++++++++++++++
2 files changed, 36 insertions(+), 1 deletion(-)
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 80211900f373..025ad7778a6c 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -120,12 +120,20 @@ extern struct cpumask __cpu_enabled_mask;
extern struct cpumask __cpu_present_mask;
extern struct cpumask __cpu_active_mask;
extern struct cpumask __cpu_dying_mask;
+
+#ifdef CONFIG_PREFERRED_CPU
+extern struct cpumask __cpu_preferred_mask;
+#else
+#define __cpu_preferred_mask __cpu_online_mask
+#endif
+
#define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
#define cpu_online_mask ((const struct cpumask *)&__cpu_online_mask)
#define cpu_enabled_mask ((const struct cpumask *)&__cpu_enabled_mask)
#define cpu_present_mask ((const struct cpumask *)&__cpu_present_mask)
#define cpu_active_mask ((const struct cpumask *)&__cpu_active_mask)
#define cpu_dying_mask ((const struct cpumask *)&__cpu_dying_mask)
+#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
extern atomic_t __num_online_cpus;
extern unsigned int __num_possible_cpus;
@@ -1164,6 +1172,7 @@ void init_cpu_possible(const struct cpumask *src);
void set_cpu_online(unsigned int cpu, bool online);
void set_cpu_possible(unsigned int cpu, bool possible);
+void set_cpu_preferred(unsigned int cpu, bool preferred);
/**
* to_cpumask - convert a NR_CPUS bitmap to a struct cpumask *
@@ -1256,7 +1265,12 @@ static __always_inline bool cpu_dying(unsigned int cpu)
return cpumask_test_cpu(cpu, cpu_dying_mask);
}
-#else
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+ return cpumask_test_cpu(cpu, cpu_preferred_mask);
+}
+
+#else /* NR_CPUS <= 1 */
#define num_online_cpus() 1U
#define num_possible_cpus() 1U
@@ -1294,6 +1308,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
return false;
}
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+ return cpu == 0;
+}
+
#endif /* NR_CPUS > 1 */
#define cpu_is_offline(cpu) unlikely(!cpu_online(cpu))
diff --git a/kernel/cpu.c b/kernel/cpu.c
index bc4f7a9ba64e..819167cb8bed 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3107,6 +3107,11 @@ EXPORT_SYMBOL(__cpu_dying_mask);
atomic_t __num_online_cpus __read_mostly;
EXPORT_SYMBOL(__num_online_cpus);
+#ifdef CONFIG_PREFERRED_CPU
+struct cpumask __cpu_preferred_mask __read_mostly;
+EXPORT_SYMBOL(__cpu_preferred_mask);
+#endif
+
void init_cpu_present(const struct cpumask *src)
{
cpumask_copy(&__cpu_present_mask, src);
@@ -3137,6 +3142,9 @@ void set_cpu_online(unsigned int cpu, bool online)
if (cpumask_test_and_clear_cpu(cpu, &__cpu_online_mask))
atomic_dec(&__num_online_cpus);
}
+
+ /* preferred is always a subset of online */
+ set_cpu_preferred(cpu, online);
}
/*
@@ -3154,6 +3162,14 @@ void set_cpu_possible(unsigned int cpu, bool possible)
}
}
+void set_cpu_preferred(unsigned int cpu, bool preferred)
+{
+ if (!IS_ENABLED(CONFIG_PREFERRED_CPU))
+ return;
+
+ assign_cpu((cpu), &__cpu_preferred_mask, (preferred));
+}
+
/*
* Activate the first processor.
*/
--
2.47.3
* [PATCH v3 05/20] sysfs: Add preferred CPU file
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (3 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 04/20] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
` (14 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Add "preferred" file in /sys/devices/system/cpu
This offers
- User can quickly check which CPUs are marked as preferred at this
moment.
- Userspace algorithms irqbalance could use this mask to send irq into
preferred CPUs.
For example:
cat /sys/devices/system/cpu/online
0-719
cat /sys/devices/system/cpu/preferred
0-599 <<< Implies 0-599 are preferred for workloads and 600-719
should be avoided at this moment.
cat /sys/devices/system/cpu/preferred
0-719 <<< All CPUs are usable. There is no preferrence.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
Documentation/ABI/testing/sysfs-devices-system-cpu | 11 +++++++++++
drivers/base/cpu.c | 8 ++++++++
2 files changed, 19 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 82d10d556cc8..354058c07d65 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -806,3 +806,14 @@ Date: Nov 2022
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description:
(RO) the list of CPUs that can be brought online.
+
+What: /sys/devices/system/cpu/preferred
+Date: May 2026
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+ (RO) the list of preferred CPUs at this moment. These are
+ the only CPUs meant to be used right now. Using a CPU
+ outside of this list could lead to more contention on the
+ underlying physical CPU resources. The list changes
+ dynamically based on steal time. With CONFIG_PREFERRED_CPU=n
+ it is the same as online CPUs. See sched-arch.rst for details.
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 875abdc9942e..0c6647805805 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -391,6 +391,13 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
}
#endif
+static ssize_t preferred_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_preferred_mask));
+}
+static DEVICE_ATTR_RO(preferred);
+
const struct bus_type cpu_subsys = {
.name = "cpu",
.dev_name = "cpu",
@@ -532,6 +539,7 @@ static struct attribute *cpu_root_attrs[] = {
#ifdef CONFIG_GENERIC_CPU_AUTOPROBE
&dev_attr_modalias.attr,
#endif
+ &dev_attr_preferred.attr,
NULL
};
--
2.47.3
* [PATCH v3 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (4 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 05/20] sysfs: Add preferred CPU file Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 07/20] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
` (13 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
When possible, pick a preferred CPU.
The push task mechanism uses a stopper thread which calls
select_fallback_rq() and relies on this mechanism to pick only a
preferred CPU.
When a task is affined only to non-preferred CPUs it should continue to
run there. Detect that by checking whether cpus_ptr and
cpu_preferred_mask intersect. For example, a task affined to CPUs 4-7
while only CPUs 0-3 are preferred has no preferred CPU to move to, so it
keeps running where it is.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/core.c | 27 +++++++++++++++++++++++++--
kernel/sched/sched.h | 5 +++++
2 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ae5f19c1b7e..292d4e7db0fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2468,6 +2468,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
*/
static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
{
+ bool task_has_preferred_cpu = false;
+
/* When not in the task's cpumask, no point in looking further. */
if (!task_allowed_on_cpu(p, cpu))
return false;
@@ -2476,9 +2478,26 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
if (is_migration_disabled(p))
return cpu_online(cpu);
+ /*
+ * This is essential to maintain user affinities when preferred
+ * CPUs change. A task pinned to a non-preferred CPU should continue
+ * to run there, since this is not user triggered.
+ *
+ * For the majority of cases this still keeps select_fallback_rq
+ * O(N). task_has_preferred_cpus(), which is O(N), runs only if
+ * !cpu_preferred(). A task running there is expected to move out
+ * and subsequently run on a preferred CPU. This becomes O(N**2)
+ * only for tasks pinned solely to non-preferred CPUs, a rare case.
+ */
+ task_has_preferred_cpu = !cpu_preferred(cpu) && task_has_preferred_cpus(p);
+
/* Non kernel threads are not allowed during either online or offline. */
- if (!(p->flags & PF_KTHREAD))
- return cpu_active(cpu);
+ if (!(p->flags & PF_KTHREAD)) {
+ if (!cpu_active(cpu))
+ return false;
+ if (task_has_preferred_cpu)
+ return false;
+ }
/* KTHREAD_IS_PER_CPU is always allowed. */
if (kthread_is_per_cpu(p))
@@ -2488,6 +2507,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
if (cpu_dying(cpu))
return false;
+ /* Try a preferred CPU first if possible */
+ if (task_has_preferred_cpu)
+ return false;
+
/* But are allowed during online. */
return cpu_online(cpu);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ffe77b2b6296..faf36bc7bd12 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4130,4 +4130,9 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
#include "ext.h"
+static inline bool task_has_preferred_cpus(struct task_struct *p)
+{
+ return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
+}
+
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.47.3
* [PATCH v3 07/20] sched/fair: Select preferred CPU at wakeup when possible
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (5 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 08/20] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
` (12 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Update available_idle_cpu() to consider preferred CPUs. This takes care
of a lot of wakeup decisions to use only preferred CPUs, so there is no
need to put explicit checks everywhere.
The only other place where a non-preferred prev_cpu could possibly be
returned is sched_balance_find_dst_cpu(). Put the check there.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/fair.c | 3 ++-
kernel/sched/sched.h | 3 +++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 29fbb5287cfc..a704285ac55a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7743,7 +7743,8 @@ static inline int sched_balance_find_dst_cpu(struct sched_domain *sd, struct tas
{
int new_cpu = cpu;
- if (!cpumask_intersects(sched_domain_span(sd), p->cpus_ptr))
+ if (!cpumask_intersects(sched_domain_span(sd), p->cpus_ptr) &&
+ cpu_preferred(prev_cpu))
return prev_cpu;
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index faf36bc7bd12..90743b9e5add 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1418,6 +1418,9 @@ static inline bool available_idle_cpu(int cpu)
if (!idle_rq(cpu_rq(cpu)))
return 0;
+ if (!cpu_preferred(cpu))
+ return 0;
+
if (vcpu_is_preempted(cpu))
return 0;
--
2.47.3
* [PATCH v3 08/20] sched/fair: load balance only among preferred CPUs
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (6 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 07/20] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 09/20] sched/rt: Select a preferred CPU for wakeup and pulling rt task Shrikanth Hegde
` (11 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Consider only preferred CPUs for load balance.
With this, load balance ends up choosing a preferred CPU for the pull.
This avoids fighting the push task mechanism which happens at tick. It
also stops active balance from happening with a non-preferred CPU
pulling the load.
This means there is no load balancing if a task is pinned only to
non-preferred CPUs. Such tasks continue to run where they were running
before the CPUs were marked as non-preferred.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a704285ac55a..0a851d4b0d7e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12087,6 +12087,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+ /* Spread load among preferred CPUs */
+ cpumask_and(cpus, cpus, cpu_preferred_mask);
+
schedstat_inc(sd->lb_count[idle]);
redo:
--
2.47.3
* [PATCH v3 09/20] sched/rt: Select a preferred CPU for wakeup and pulling rt task
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (7 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 08/20] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 10/20] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
` (10 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
For the RT class:
- During wakeup, choose a preferred CPU.
- For the push_rt framework, limit pushing to preferred CPUs.
- Pull an RT task only if this CPU is preferred.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/cpupri.c | 1 +
kernel/sched/rt.c | 4 ++++
2 files changed, 5 insertions(+)
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 8f2237e8b484..24eb26ea9a91 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -103,6 +103,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
if (lowest_mask) {
cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
+ cpumask_and(lowest_mask, lowest_mask, cpu_preferred_mask);
/*
* We have to ensure that we have at least one bit
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4ee8faf01441..62c53f10de24 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2262,6 +2262,10 @@ static void pull_rt_task(struct rq *this_rq)
if (likely(!rt_overload_count))
return;
+ /* No point in pulling the load just to push it away again next tick */
+ if (!cpu_preferred(this_cpu))
+ return;
+
/*
* Match the barrier from rt_set_overloaded; this guarantees that if we
* see overloaded we must also see the rto_mask bit.
--
2.47.3
* [PATCH v3 10/20] sched/core: Keep tick on non-preferred CPUs until tasks are out
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (8 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 09/20] sched/rt: Select a preferred CPU for wakeup and pulling rt task Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 11/20] sched/core: Push current task from non preferred CPU Shrikanth Hegde
` (9 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Keep the tick running on a nohz_full CPU when it is marked as
non-preferred. If there is no CFS/RT task running there, disable the
tick to save power.
The steal time handling code calls tick_nohz_dep_set_cpu() with
TICK_DEP_BIT_SCHED to move tasks off a nohz_full CPU quickly.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 292d4e7db0fd..86fa4bfaead0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1420,6 +1420,10 @@ bool sched_can_stop_tick(struct rq *rq)
if (rq->dl.dl_nr_running)
return false;
+ /* Keep the tick running until both RT and CFS tasks are pushed out */
+ if (!cpu_preferred(rq->cpu) && (rq->rt.rt_nr_running || rq->cfs.h_nr_queued))
+ return false;
+
/*
* If there are more than one RR tasks, we need the tick to affect the
* actual RR behaviour.
--
2.47.3
* [PATCH v3 11/20] sched/core: Push current task from non preferred CPU
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (9 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 10/20] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 12/20] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
` (8 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Actively push out a task running on a non-preferred CPU. Since the task
is running on the CPU, we need to stop the CPU and push the task out.
However, if the task is pinned only to non-preferred CPUs, it continues
running there. This helps maintain userspace affinities, unlike CPU
hotplug or isolated cpusets.
Though the code is almost the same as __balance_push_cpu_stop() and
quite close to push_cpu_stop(), it is kept separate as that provides a
cleaner implementation w.r.t. CONFIG_HOTPLUG_CPU.
Add a push_task_work_done flag to protect the work buffer.
Works for all classes; best results today are with FAIR/RT.
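For reference, a sketch of the resulting call chain (all names as in
this patch):

	sched_tick()                               /* on a non-preferred CPU */
	  sched_push_current_non_preferred_cpu(rq)
	    stop_one_cpu_nowait()                  /* queue the stopper work */
	      sched_non_preferred_cpu_push_stop()  /* stopper context */
	        select_fallback_rq()               /* picks a preferred CPU */
	        __migrate_task()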
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/core.c | 87 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 7 ++++
2 files changed, 94 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 86fa4bfaead0..508773e71929 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5678,6 +5678,9 @@ void sched_tick(void)
unsigned long hw_pressure;
u64 resched_latency;
+ if (!cpu_preferred(cpu))
+ sched_push_current_non_preferred_cpu(rq);
+
if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
arch_scale_freq_tick();
@@ -11263,3 +11266,87 @@ void sched_change_end(struct sched_change_ctx *ctx)
p->sched_class->prio_changed(rq, p, ctx->prio);
}
}
+
+#ifdef CONFIG_PREFERRED_CPU
+/* npc - non preferred CPU */
+static DEFINE_PER_CPU(struct cpu_stop_work, npc_push_task_work);
+
+static int sched_non_preferred_cpu_push_stop(void *arg)
+{
+ struct task_struct *p = arg;
+ struct rq *rq = this_rq();
+ struct rq_flags rf;
+ int cpu;
+
+ raw_spin_lock_irq(&p->pi_lock);
+ rq_lock(rq, &rf);
+ rq->push_task_work_done = 0;
+
+ update_rq_clock(rq);
+
+ if (task_rq(p) == rq && task_on_rq_queued(p)) {
+ cpu = select_fallback_rq(rq->cpu, p);
+ rq = __migrate_task(rq, &rf, p, cpu);
+ }
+
+ rq_unlock(rq, &rf);
+ raw_spin_unlock_irq(&p->pi_lock);
+ put_task_struct(p);
+
+ return 0;
+}
+
+/*
+ * Push the current task running on a non-preferred CPU.
+ * Using this non-preferred CPU will lead to more vCPU preemptions
+ * in the host, so it is better not to use this CPU.
+ *
+ * Since the task is running, call a stopper to push the task out. This is
+ * similar to how tasks move during hotplug. In select_fallback_rq() a
+ * preferred CPU will be chosen and henceforth the task shouldn't come back
+ * to this CPU again.
+ *
+ * Works for the FAIR/RT classes only.
+ *
+ * If the task is affined only to non-preferred CPUs, it can't be moved out.
+ */
+void sched_push_current_non_preferred_cpu(struct rq *rq)
+{
+ struct task_struct *push_task = rq->curr;
+ unsigned long flags;
+ struct rq_flags rf;
+
+ /* sanity check */
+ if (cpu_preferred(rq->cpu))
+ return;
+
+ /* Push only if it is FAIR/RT class */
+ if (push_task->sched_class != &fair_sched_class &&
+ push_task->sched_class != &rt_sched_class)
+ return;
+
+ if (kthread_is_per_cpu(push_task) ||
+ is_migration_disabled(push_task))
+ return;
+
+ /* Is there any preferred CPU in the affinity list? */
+ if (!task_has_preferred_cpus(push_task))
+ return;
+
+ /* There is already a stopper thread for this. Don't race with it. */
+ if (rq->push_task_work_done == 1)
+ return;
+
+ local_irq_save(flags);
+
+ get_task_struct(push_task);
+
+ rq_lock(rq, &rf);
+ rq->push_task_work_done = 1;
+ rq_unlock(rq, &rf);
+
+ stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
+ push_task, this_cpu_ptr(&npc_push_task_work));
+ local_irq_restore(flags);
+}
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 90743b9e5add..96870021a842 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1244,6 +1244,7 @@ struct rq {
unsigned char nohz_idle_balance;
unsigned char idle_balance;
+ bool push_task_work_done;
unsigned long misfit_task_load;
@@ -4138,4 +4139,10 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
}
+#ifdef CONFIG_PREFERRED_CPU
+void sched_push_current_non_preferred_cpu(struct rq *rq);
+#else /* !CONFIG_PREFERRED_CPU */
+static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
+#endif
+
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.47.3
* [PATCH v3 12/20] sched/debug: Add migration stats due to non preferred CPUs
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (10 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 11/20] sched/core: Push current task from non preferred CPU Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 13/20] sched/debug: Create debugfs folder steal_monitor Shrikanth Hegde
` (7 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Add a new stat:
- nr_migrations_cpu_non_preferred: number of migrations that happened
because a CPU was marked as non-preferred due to high steal time.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 1 +
kernel/sched/debug.c | 1 +
3 files changed, 3 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c3ad3e0edb5..dcfb57c90850 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -554,6 +554,7 @@ struct sched_statistics {
u64 nr_failed_migrations_running;
u64 nr_failed_migrations_hot;
u64 nr_forced_migrations;
+ u64 nr_migrations_cpu_non_preferred;
u64 nr_wakeups;
u64 nr_wakeups_sync;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 508773e71929..0d1995c65ce6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11287,6 +11287,7 @@ static int sched_non_preferred_cpu_push_stop(void *arg)
if (task_rq(p) == rq && task_on_rq_queued(p)) {
cpu = select_fallback_rq(rq->cpu, p);
rq = __migrate_task(rq, &rf, p, cpu);
+ schedstat_inc(p->stats.nr_migrations_cpu_non_preferred);
}
rq_unlock(rq, &rf);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f8a43fc13564..482c86a0ff80 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1305,6 +1305,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P_SCHEDSTAT(nr_failed_migrations_running);
P_SCHEDSTAT(nr_failed_migrations_hot);
P_SCHEDSTAT(nr_forced_migrations);
+ P_SCHEDSTAT(nr_migrations_cpu_non_preferred);
P_SCHEDSTAT(nr_wakeups);
P_SCHEDSTAT(nr_wakeups_sync);
P_SCHEDSTAT(nr_wakeups_migrate);
--
2.47.3
* [PATCH v3 13/20] sched/debug: Create debugfs folder steal_monitor
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (11 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 12/20] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 14/20] sched/debug: Provide debugfs to enable/disable steal monitor Shrikanth Hegde
` (6 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Create a debugfs folder called steal_monitor in /sys/kernel/debug/sched.
It will host the debugfs knobs needed for the generic steal monitor
introduced in subsequent patches.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/debug.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 482c86a0ff80..b1abfdc168bf 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -591,6 +591,17 @@ static void debugfs_ext_server_init(void)
}
#endif /* CONFIG_SCHED_CLASS_EXT */
+#ifdef CONFIG_PREFERRED_CPU
+static void sched_steal_monitor_debugfs_init(void)
+{
+ struct dentry __maybe_unused *sm;
+
+ sm = debugfs_create_dir("steal_monitor", debugfs_sched);
+ if (!sm)
+ return;
+}
+#endif
+
static __init int sched_init_debug(void)
{
struct dentry __maybe_unused *numa;
@@ -632,6 +643,9 @@ static __init int sched_init_debug(void)
#ifdef CONFIG_SCHED_CLASS_EXT
debugfs_ext_server_init();
#endif
+#ifdef CONFIG_PREFERRED_CPU
+ sched_steal_monitor_debugfs_init();
+#endif
return 0;
}
--
2.47.3
* [PATCH v3 14/20] sched/debug: Provide debugfs to enable/disable steal monitor
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (12 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 13/20] sched/debug: Create debugfs folder steal_monitor Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:21 ` [PATCH v3 15/20] sched/core: Introduce a simple " Shrikanth Hegde
` (5 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Add a debugfs "enable" file to enable the steal time monitor.
Computing steal time and acting on it periodically must be opted into by
the user. This avoids any overhead when the feature is disabled.
It is disabled by default.
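For example (assuming debugfs is mounted at /sys/kernel/debug):

	echo 1 > /sys/kernel/debug/sched/steal_monitor/enable
	cat /sys/kernel/debug/sched/steal_monitor/enable
	1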
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/core.c | 1 +
kernel/sched/debug.c | 29 +++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 32 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d1995c65ce6..1533a44d1a6b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11270,6 +11270,7 @@ void sched_change_end(struct sched_change_ctx *ctx)
#ifdef CONFIG_PREFERRED_CPU
/* npc - non preferred CPU */
static DEFINE_PER_CPU(struct cpu_stop_work, npc_push_task_work);
+DEFINE_STATIC_KEY_FALSE(__sched_sm_enable);
static int sched_non_preferred_cpu_push_stop(void *arg)
{
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index b1abfdc168bf..be8d223b43fd 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -592,6 +592,33 @@ static void debugfs_ext_server_init(void)
#endif /* CONFIG_SCHED_CLASS_EXT */
#ifdef CONFIG_PREFERRED_CPU
+__read_mostly bool sched_sm_wr_enable;
+
+static ssize_t sched_sm_en_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ bool orig = sched_sm_wr_enable;
+ ssize_t result;
+
+ result = debugfs_write_file_bool(filp, ubuf, cnt, ppos);
+
+ if (sched_sm_wr_enable && !orig) {
+ static_branch_enable(&__sched_sm_enable);
+ } else if (!sched_sm_wr_enable && orig) {
+ static_branch_disable(&__sched_sm_enable);
+ cpumask_copy(&__cpu_preferred_mask, cpu_online_mask);
+ }
+
+ return result;
+}
+
+static const struct file_operations sched_sm_en_fops = {
+ .read = debugfs_read_file_bool,
+ .write = sched_sm_en_write,
+ .open = simple_open,
+ .llseek = default_llseek,
+};
+
static void sched_steal_monitor_debugfs_init(void)
{
struct dentry __maybe_unused *sm;
@@ -599,6 +626,8 @@ static void sched_steal_monitor_debugfs_init(void)
sm = debugfs_create_dir("steal_monitor", debugfs_sched);
if (!sm)
return;
+
+ debugfs_create_file("enable", 0644, sm, &sched_sm_wr_enable, &sched_sm_en_fops);
}
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 96870021a842..bcc65c8b4ac6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4140,6 +4140,8 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
}
#ifdef CONFIG_PREFERRED_CPU
+DECLARE_STATIC_KEY_FALSE(__sched_sm_enable);
+
void sched_push_current_non_preferred_cpu(struct rq *rq);
#else /* !CONFIG_PREFERRED_CPU */
static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
--
2.47.3
* [PATCH v3 15/20] sched/core: Introduce a simple steal monitor
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (13 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 14/20] sched/debug: Provide debugfs to enable/disable steal monitor Shrikanth Hegde
@ 2026-05-14 15:21 ` Shrikanth Hegde
2026-05-14 15:22 ` [PATCH v3 16/20] sched/core: Compute steal values at regular intervals Shrikanth Hegde
` (4 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:21 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Start with a simple steal monitor.
It is meant to look at steal time and decide whether to reduce or
increase the preferred CPUs.
It has:
- a work function to execute the steal time calculations and decision
making periodically.
- a temporary cpumask used in the work function; this avoids cpumask
allocation in the periodic work function.
- low and high thresholds for steal time.
- a sampling period to control the frequency of steal time calculations.
- a cache of the previous decision to avoid oscillations.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
include/linux/sched.h | 13 +++++++++++++
kernel/sched/core.c | 24 ++++++++++++++++++++++++
kernel/sched/sched.h | 3 +++
3 files changed, 40 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dcfb57c90850..ee5f19a96118 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2516,4 +2516,17 @@ extern void migrate_enable(void);
DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
+#ifdef CONFIG_PREFERRED_CPU
+struct steal_monitor_t {
+ struct work_struct work;
+ cpumask_var_t tmp_mask;
+ ktime_t prev_time;
+ u64 prev_steal;
+ int previous_decision;
+ unsigned int low_threshold;
+ unsigned int high_threshold;
+ unsigned int sampling_period_ms;
+};
+#endif
+
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1533a44d1a6b..907c6b38460b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9102,6 +9102,8 @@ void __init sched_init(void)
preempt_dynamic_init();
+ sched_init_steal_monitor();
+
scheduler_running = 1;
}
@@ -11351,4 +11353,26 @@ void sched_push_current_non_preferred_cpu(struct rq *rq)
push_task, this_cpu_ptr(&npc_push_task_work));
local_irq_restore(flags);
}
+
+struct steal_monitor_t steal_mon;
+
+void sched_init_steal_monitor(void)
+{
+ INIT_WORK(&steal_mon.work, sched_steal_detection_work);
+ zalloc_cpumask_var(&steal_mon.tmp_mask, GFP_KERNEL);
+ steal_mon.low_threshold = 200; /* 2% steal time */
+ steal_mon.high_threshold = 500; /* 5% steal time */
+ steal_mon.sampling_period_ms = 1000; /* once per second */
+}
+
+/* This is only a skeleton. Subsequent patches introduce more of it */
+void sched_steal_detection_work(struct work_struct *work)
+{
+ struct steal_monitor_t *sm = container_of(work, struct steal_monitor_t, work);
+ ktime_t now;
+
+ /* Update prev_time for the next iteration */
+ now = ktime_get();
+ sm->prev_time = now;
+}
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bcc65c8b4ac6..d674f8e8e854 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4143,8 +4143,11 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
DECLARE_STATIC_KEY_FALSE(__sched_sm_enable);
void sched_push_current_non_preferred_cpu(struct rq *rq);
+void sched_init_steal_monitor(void);
+void sched_steal_detection_work(struct work_struct *work);
#else /* !CONFIG_PREFERRED_CPU */
static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
+static inline void sched_init_steal_monitor(void) { }
#endif
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.47.3
* [PATCH v3 16/20] sched/core: Compute steal values at regular intervals
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (14 preceding siblings ...)
2026-05-14 15:21 ` [PATCH v3 15/20] sched/core: Introduce a simple " Shrikanth Hegde
@ 2026-05-14 15:22 ` Shrikanth Hegde
2026-05-14 15:22 ` [PATCH v3 17/20] sched/core: Introduce default arch handling code for inc/dec preferred CPUs Shrikanth Hegde
` (3 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:22 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Kick off the work to compute the steal time at regular intervals.
It is gated by the steal monitor static key check to avoid any overhead
when the monitor is disabled.
The sampling period can be changed at runtime using
steal_monitor/sampling_period. The default is 1000 milliseconds, i.e. 1
second.
This work is done by the first online housekeeping CPU only, hence it
doesn't need any complicated synchronization. A sketch of the flow is
below.
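The periodic flow added here, sketched with the names from this patch:

	sched_tick()                                /* every CPU, each tick */
	  sched_trigger_steal_computation(cpu)
	    /* only the first online housekeeping CPU proceeds,
	       at most once per sampling_period_ms */
	    schedule_work_on(first_hk_cpu, &steal_mon.work)
	      sched_steal_detection_work()          /* workqueue context */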
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 26 ++++++++++++++++++++++++++
kernel/sched/debug.c | 1 +
kernel/sched/sched.h | 7 +++++++
4 files changed, 36 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee5f19a96118..738f17d63943 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2527,6 +2527,8 @@ struct steal_monitor_t {
unsigned int high_threshold;
unsigned int sampling_period_ms;
};
+
+extern struct steal_monitor_t steal_mon;
#endif
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 907c6b38460b..a3f65e9c7d30 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5719,6 +5719,9 @@ void sched_tick(void)
rq->idle_balance = idle_cpu(cpu);
sched_balance_trigger(rq);
}
+
+ if (sched_steal_mon_enabled())
+ sched_trigger_steal_computation(cpu);
}
#ifdef CONFIG_NO_HZ_FULL
@@ -11375,4 +11378,27 @@ void sched_steal_detection_work(struct work_struct *work)
now = ktime_get();
sm->prev_time = now;
}
+
+void sched_trigger_steal_computation(int cpu)
+{
+ int first_hk_cpu = cpumask_first_and(housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
+ cpu_online_mask);
+ ktime_t now;
+
+ /* Done by first online housekeeping CPU only */
+ if (likely(cpu != first_hk_cpu))
+ return;
+
+ /*
+ * Since everything is updated by the first housekeeping CPU,
+ * there is no need for complex synchronization.
+ */
+ now = ktime_get();
+
+ /* Default is once per second */
+ if (likely(ktime_ms_delta(now, steal_mon.prev_time) < steal_mon.sampling_period_ms))
+ return;
+
+ schedule_work_on(first_hk_cpu, &steal_mon.work);
+}
#endif
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index be8d223b43fd..f00c08581253 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -606,6 +606,7 @@ static ssize_t sched_sm_en_write(struct file *filp, const char __user *ubuf,
static_branch_enable(&__sched_sm_enable);
} else if (!sched_sm_wr_enable && orig) {
static_branch_disable(&__sched_sm_enable);
+ cancel_work_sync(&steal_mon.work);
cpumask_copy(&__cpu_preferred_mask, cpu_online_mask);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d674f8e8e854..cc90012a85fc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4145,9 +4145,16 @@ DECLARE_STATIC_KEY_FALSE(__sched_sm_enable);
void sched_push_current_non_preferred_cpu(struct rq *rq);
void sched_init_steal_monitor(void);
void sched_steal_detection_work(struct work_struct *work);
+void sched_trigger_steal_computation(int cpu);
+static inline bool sched_steal_mon_enabled(void)
+{
+ return static_branch_unlikely(&__sched_sm_enable);
+}
#else /* !CONFIG_PREFERRED_CPU */
static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
static inline void sched_init_steal_monitor(void) { }
+static inline void sched_trigger_steal_computation(int cpu) { }
+static inline bool sched_steal_mon_enabled(void) { return false; }
#endif
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.47.3
* [PATCH v3 17/20] sched/core: Introduce default arch handling code for inc/dec preferred CPUs
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (15 preceding siblings ...)
2026-05-14 15:22 ` [PATCH v3 16/20] sched/core: Compute steal values at regular intervals Shrikanth Hegde
@ 2026-05-14 15:22 ` Shrikanth Hegde
2026-05-14 15:22 ` [PATCH v3 18/20] sched/core: Handle steal values and mark CPUs as preferred Shrikanth Hegde
` (2 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:22 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Define default handlers for high/low steal time. If an arch has better
decision logic, it may override the default implementation (a sketch of
an override follows below).
- If the steal time is higher than the threshold, reduce the number of
preferred CPUs by 1 core. The last core in the intersection of online
and preferred CPUs is marked as non-preferred.
Ensure at least one core is always left as preferred.
- If the steal time is lower than the threshold, increase the number of
preferred CPUs by 1 core. The first online core which is not in
cpu_preferred_mask is marked as preferred.
If all cores are already preferred, bail out.
Increasing/decreasing may need to account for placement across NUMA
nodes. It is kept simple for now.
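A hypothetical sketch of an arch override, assuming the usual convention
where an arch header visible to the scheduler defines the macro so the
#ifndef default below compiles out (the two-core policy shown is purely
illustrative, not part of this patch):

	/* in an arch header: */
	void arch_dec_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio);
	#define arch_dec_preferred_cpus arch_dec_preferred_cpus

	/* in arch code: retire two cores at a time under very high steal */
	void arch_dec_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio)
	{
		int ncores = steal_ratio > 2 * sm->high_threshold ? 2 : 1;

		/* ... mark ncores worth of CPUs as non-preferred ... */
	}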
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 58 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 60 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 738f17d63943..2afbcd70f0ac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2529,6 +2529,8 @@ struct steal_monitor_t {
};
extern struct steal_monitor_t steal_mon;
+void arch_dec_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio);
+void arch_inc_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio);
#endif
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a3f65e9c7d30..195e3648b1b5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11368,6 +11368,64 @@ void sched_init_steal_monitor(void)
steal_mon.sampling_period_ms = 1000; /* once per second */
}
+/*
+ * Default implementation of decrementing the preferred CPUs based on steal
+ * time. This is simple logic and decreases the preferred CPUs by 1 core.
+ * It takes out the last core in online & preferred.
+ *
+ * Ensure at least one housekeeping core is always kept as preferred.
+ *
+ * Can be overridden by arch specific handling.
+ */
+#ifndef arch_dec_preferred_cpus
+void arch_dec_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio)
+{
+ int last_cpu, tmp_cpu;
+ int this_cpu = raw_smp_processor_id();
+
+ cpumask_and(sm->tmp_mask, cpu_online_mask, cpu_preferred_mask);
+ last_cpu = cpumask_last(sm->tmp_mask);
+
+ /*
+ * If the core belongs to the housekeeping CPUs, no action is
+ * taken. This always leaves at least one core preferred and
+ * ensures at least some CPUs are available to run tasks.
+ */
+ if (cpumask_equal(cpu_smt_mask(last_cpu), cpu_smt_mask(this_cpu)))
+ return;
+
+ for_each_cpu_and(tmp_cpu, cpu_smt_mask(last_cpu), cpu_online_mask) {
+ set_cpu_preferred(tmp_cpu, false);
+ if (tick_nohz_full_cpu(tmp_cpu))
+ tick_nohz_dep_set_cpu(tmp_cpu, TICK_DEP_BIT_SCHED);
+ }
+}
+#endif
+
+/*
+ * Default implementation of incrementing preferred CPUs based on steal
+ * time. This is simple logic and increases the preferred CPUs by 1 core.
+ * It adds the first core in online & !preferred.
+ *
+ * Nothing to do if online == preferred.
+ *
+ * Can be overridden by arch specific handling.
+ */
+#ifndef arch_inc_preferred_cpus
+void arch_inc_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio)
+{
+ int first_cpu, tmp_cpu;
+
+ first_cpu = cpumask_first_andnot(cpu_online_mask, cpu_preferred_mask);
+ /* All CPUs are preferred. Nothing to increase further */
+ if (first_cpu >= nr_cpu_ids)
+ return;
+
+ for_each_cpu_and(tmp_cpu, cpu_smt_mask(first_cpu), cpu_online_mask)
+ set_cpu_preferred(tmp_cpu, true);
+}
+#endif
+
/* This is only a skeleton. Subsequent patches introduce more of it */
void sched_steal_detection_work(struct work_struct *work)
{
--
2.47.3
* [PATCH v3 18/20] sched/core: Handle steal values and mark CPUs as preferred
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (16 preceding siblings ...)
2026-05-14 15:22 ` [PATCH v3 17/20] sched/core: Introduce default arch handling code for inc/dec preferred CPUs Shrikanth Hegde
@ 2026-05-14 15:22 ` Shrikanth Hegde
2026-05-14 15:22 ` [PATCH v3 19/20] sched/core: Mark the direction of steal values to avoid oscillations Shrikanth Hegde
2026-05-14 15:22 ` [PATCH v3 20/20] sched/debug: Add debug knobs for steal monitor Shrikanth Hegde
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:22 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
This is the main periodic work which handles the steal time values.
- Compute the steal time by summing CPUTIME_STEAL across all online CPUs.
- Compute the steal ratio as a percentage scaled by 100, so that
fractional percentages are retained (see the worked example below).
- Invoke the callbacks for inc/dec of preferred CPUs on low/high steal
time.
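As a worked example of the arithmetic (sample values are illustrative):
with 8 online CPUs, delta_ns = 1,000,000,000 ns and delta_steal =
50,000,000 ns of steal accumulated across those CPUs,

	steal_ratio = (50000000 * 100 * 100) / (1000000000 * 8) = 62

i.e. 0.62% steal, in the same percentage * 100 fixed-point encoding the
thresholds use.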
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/core.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 195e3648b1b5..955e74a41627 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11426,15 +11426,34 @@ void arch_inc_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio)
}
#endif
-/* This is only a skeleton. Subsequent patches introduce more of it */
void sched_steal_detection_work(struct work_struct *work)
{
struct steal_monitor_t *sm = container_of(work, struct steal_monitor_t, work);
+ u64 steal_ratio, delta_steal, delta_ns, steal = 0;
ktime_t now;
+ int tmp_cpu;
+
+ for_each_cpu(tmp_cpu, cpu_online_mask)
+ steal += kcpustat_cpu(tmp_cpu).cpustat[CPUTIME_STEAL];
/* Update the prev_time for next iteration */
now = ktime_get();
+ delta_steal = steal > sm->prev_steal ? steal - sm->prev_steal : 0;
+ delta_ns = max_t(u64, ktime_to_ns(ktime_sub(now, sm->prev_time)), 1);
+
sm->prev_time = now;
+ sm->prev_steal = steal;
+
+ /* Percentage scaled by 100: one factor of 100 converts to percent, the other keeps the fractional part */
+ steal_ratio = (delta_steal * 100 * 100) / (delta_ns * num_online_cpus());
+
+ /* If the steal time values are high, reduce one core from preferred CPUs */
+ if (steal_ratio > sm->high_threshold)
+ arch_dec_preferred_cpus(sm, steal_ratio);
+
+ /* If the steal time values are low, increase one core as preferred CPUs */
+ if (steal_ratio < sm->low_threshold)
+ arch_inc_preferred_cpus(sm, steal_ratio);
}
void sched_trigger_steal_computation(int cpu)
--
2.47.3
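As an aside for anyone tuning this from inside a guest: the same quantity
can be observed from user space, since the aggregate "cpu" line of
/proc/stat carries steal time in its 8th field. The sketch below is an
illustrative aid for watching steal while experimenting, not part of the
patch; the file name and all identifiers are invented for the example.

/* stealwatch.c: print aggregate steal ratio in percentage * 100 units */
#include <stdio.h>
#include <unistd.h>

static int read_stat(unsigned long long *total, unsigned long long *steal)
{
	unsigned long long v[10] = { 0 };
	FILE *f = fopen("/proc/stat", "r");
	int i, n;

	if (!f)
		return -1;
	/* aggregate line: cpu user nice system idle iowait irq softirq steal ... */
	n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &v[0], &v[1], &v[2], &v[3], &v[4],
		   &v[5], &v[6], &v[7], &v[8], &v[9]);
	fclose(f);
	if (n < 8)
		return -1;	/* steal field not present */
	*steal = v[7];
	for (*total = 0, i = 0; i < 10; i++)
		*total += v[i];
	return 0;
}

int main(void)
{
	unsigned long long t0, s0, t1, s1, dt;

	if (read_stat(&t0, &s0))
		return 1;
	sleep(1);
	if (read_stat(&t1, &s1))
		return 1;
	/*
	 * The aggregate line already sums over all CPUs, so dividing the
	 * steal delta by the total-time delta matches the patch's
	 * normalisation by num_online_cpus().
	 */
	dt = (t1 - t0) ? (t1 - t0) : 1;
	printf("steal ratio: %llu (percentage * 100)\n",
	       (s1 - s0) * 100 * 100 / dt);
	return 0;
}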
^ permalink raw reply related [flat|nested] 21+ messages in thread

* [PATCH v3 19/20] sched/core: Mark the direction of steal values to avoid oscillations
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (17 preceding siblings ...)
2026-05-14 15:22 ` [PATCH v3 18/20] sched/core: Handle steal values and mark CPUs as preferred Shrikanth Hegde
@ 2026-05-14 15:22 ` Shrikanth Hegde
2026-05-14 15:22 ` [PATCH v3 20/20] sched/debug: Add debug knobs for steal monitor Shrikanth Hegde
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:22 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Cache the previous decision made on the steal time, so that the
preferred CPUs are decreased/increased only when two consecutive
samples are high or low (see the trace below).
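With the default thresholds documented in the next patch (low 200, high
500) and illustrative sample values, the resulting behaviour looks like:

  sample 1: steal_ratio = 700 -> above high, but previous_decision was 0,
            so no action; previous_decision becomes 1
  sample 2: steal_ratio = 650 -> above high with previous_decision == 1,
            so one core is removed from preferred
  sample 3: steal_ratio = 300 -> between the thresholds, no action;
            previous_decision resets to 0

A single outlier sample therefore never changes the preferred set.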
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/core.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 955e74a41627..a9e8beb5108e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11448,12 +11448,20 @@ void sched_steal_detection_work(struct work_struct *work)
steal_ratio = (delta_steal * 100 * 100) / (delta_ns * num_online_cpus());
/* If the steal time values are high, reduce one core from preferred CPUs */
- if (steal_ratio > sm->high_threshold)
+ if (sm->previous_decision == 1 && steal_ratio > sm->high_threshold)
arch_dec_preferred_cpus(sm, steal_ratio);
/* If the steal time values are low, increase one core as preferred CPUs */
- if (steal_ratio < sm->low_threshold)
+ if (sm->previous_decision == -1 && steal_ratio < sm->low_threshold)
arch_inc_preferred_cpus(sm, steal_ratio);
+
+ /* Mark the direction. This helps avoid ping-ponging between decisions */
+ if (steal_ratio > sm->high_threshold)
+ sm->previous_decision = 1;
+ else if (steal_ratio < sm->low_threshold)
+ sm->previous_decision = -1;
+ else
+ sm->previous_decision = 0;
}
void sched_trigger_steal_computation(int cpu)
--
2.47.3
^ permalink raw reply related [flat|nested] 21+ messages in thread

* [PATCH v3 20/20] sched/debug: Add debug knobs for steal monitor
2026-05-14 15:21 [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
` (18 preceding siblings ...)
2026-05-14 15:22 ` [PATCH v3 19/20] sched/core: Mark the direction of steal values to avoid oscillations Shrikanth Hegde
@ 2026-05-14 15:22 ` Shrikanth Hegde
19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2026-05-14 15:22 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
hdanton, chleroy, vineeth, frederic, arighi, pauld,
christian.loehle, tj, tommaso.cucinotta, maz, rafael
Add three debug knobs in steal_monitor:
sampling_period - sampling period in milliseconds.
low_threshold - lower steal threshold value (specify percentage * 100)
high_threshold - higher steal threshold value (specify percentage * 100)
Refer to Documentation/scheduler/sched-debug.rst for detailed info.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
Documentation/scheduler/sched-debug.rst | 32 +++++++++++++++++++++++++
kernel/sched/debug.c | 3 +++
2 files changed, 35 insertions(+)
diff --git a/Documentation/scheduler/sched-debug.rst b/Documentation/scheduler/sched-debug.rst
index b5a92a39eccd..a1fddfca2a52 100644
--- a/Documentation/scheduler/sched-debug.rst
+++ b/Documentation/scheduler/sched-debug.rst
@@ -52,3 +52,35 @@ rate for each task.
``scan_size_mb`` is how many megabytes worth of pages are scanned for
a given scan.
+
+==================================
+Tunables for generic steal monitor
+==================================
+These tunables control the preferred CPUs logic, available under
+CONFIG_PREFERRED_CPU. The files live in /sys/kernel/debug/sched/steal_monitor/
+
+enable - enables/disables the steal_monitor feature.
+Disabling needs more than flipping a static branch, in order to maintain
+the design invariant that preferred equals online while the feature is off.
+Once enabled, steal time is sampled at the interval specified in
+sampling_period and action is taken based on the high/low thresholds.
+
+sampling_period - sampling period in milliseconds.
+How often the steal values are sampled. This controls how fast the
+scheduler reacts to changes in the steal time values.
+Default value is 1000 milliseconds.
+
+low_threshold - lower threshold value, as percentage * 100.
+This determines which values are treated as negligible steal.
+When the scheduler sees steal time below this value, it tries to increase
+the preferred CPUs by one core. A value of zero causes too many oscillations.
+Default value is 200, i.e. 2% steal is the low threshold.
+
+high_threshold - higher threshold value, as percentage * 100.
+This determines which values are treated as high steal.
+When the scheduler sees steal time above this value, it reduces
+the preferred CPUs by one core.
+Default value is 500, i.e. 5% steal is the high threshold.
+
+Note: When the steal value is between the low and high thresholds, no action
+is taken by the scheduler. This avoids oscillations.
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f00c08581253..57ba35f7cf95 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -629,6 +629,9 @@ static void sched_steal_monitor_debugfs_init(void)
return;
debugfs_create_file("enable", 0644, sm, &sched_sm_wr_enable, &sched_sm_en_fops);
+ debugfs_create_u32("low_threshold", 0644, sm, &steal_mon.low_threshold);
+ debugfs_create_u32("high_threshold", 0644, sm, &steal_mon.high_threshold);
+ debugfs_create_u32("sampling_period", 0644, sm, &steal_mon.sampling_period_ms);
}
#endif
--
2.47.3
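A consequence worth noting, combining these defaults with the direction
caching of the previous patch: with sampling_period = 1000 ms and two
consecutive out-of-threshold samples required before acting, the preferred
set shrinks or grows by at most one core per roughly two seconds of
sustained high or low steal.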
^ permalink raw reply related [flat|nested] 21+ messages in thread