Linux Documentation
 help / color / mirror / Atom feed
* [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff
@ 2026-06-25 12:46 Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 01/24] sched/debug: Remove unused schedstats Shrikanth Hegde
                   ` (23 more replies)
  0 siblings, 24 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Very briefly,
- Maintain set of CPUs which can be used by workload. It is denoted as
  cpu_preferred_mask
- Periodically compute the steal time. If steal time is high/low based
  on the thresholds, either reduce/increase the preferred CPUs. This is
  handled in a new driver called steal_monitor
- If a CPU is marked as non-preferred, push the task running on it if
  possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
  within preferred CPUs.

For more details on idea, problem statement and performance numbers,
please refer to cover-letter of v2[2] and OSPM talk[1].

*** Please review and provide your feedback!! ***

[1]:https://youtu.be/adxUKFPlOp0
[2] v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@linux.ibm.com/#t
[3] v4: https://lore.kernel.org/all/20260617174139.155540-1-sshegde@linux.ibm.com/#t

Thank you very much for feedback so far. This has helped the code to
evolve towards a clear abstraction layers and get simplified.
(Hopefully). Apologies in advance if I have missed any comment.

base commit:
tip/sched/core at c095741713d1 ("sched/fair: Fix newidle vs core-sched")

v4->v5:
- Move the computation of steal time and decide on preferred CPU state
  to a driver. Drop those changes in core scheduler. (Yury Norov, K Prateek Nayak)
- A new driver called steal_monitor is added in drivers/virt/ (K Prateek Nayak)
  (Please let me know if there is a better place for it. I can move it
  there)
- New driver does periodic computation of steal time and
  increments/decrements the preferred CPUs.
- Debug knobs can be changed via module parameters. (Yury Norov)
- Default implementation are weak symbols. Archs may override by
  providing strong symbols in new respective arch specific file.
- Everything is centered around CONFIG_PREFERRED_CPU. No new config
  for new driver. Driver gets added to kernel, but not loaded by
  default.
- Load the driver to enable steal_monitor functionality. Unload to
  remove the same.
- Make CONFIG_PREFERRED_CPU depend on PARAVIRT && SMP (Yury Norov)
- move set_cpu_preferred to a macro. (Yury Norov)
  on CONFIG_PREFERRED_CPU=n it will just act on active CPUs in that case.
  It shouldn't alter any functionality.
- Do a simple encoding for has_preferred_cpu_state, which aims to avoid
  repeated cpumask_interest in is_cpu_allowed. 
  (Please let me know if new variable based approach to is_cpu_allowed
  should be done instead).
- Move select_fallback_rq above the rq_lock. (sashiko)
- Few documentation nitpicks (Randy Dunlap, sashiko)
- Avoid any decision for is_cpu_allowed for other classes (sashiko)
- Don't pull the load towards a non-preferred CPUs in idle and new
  idle balanced. (Inferred when seeing sashiko comments)
- Fix leaking of task_struct in push_work_done (K Prateek Nayak)
- Module parameters aren't checked for sane values. One should know
  what they are writing to it. If one writes 0 for interval_ms,
  then it gets set to default value again to avoid workqueue lockup.
- Added a few design construct related checks in the periodic work
  to ensure any future arch specific implementations follow it.
  1. preferred is subset of active.
  2. preferred cannot be empty.
- Added Documentation of steal_monitor in Documentation/driver-api/
  (Let me know if there is better place for it)

performance numbers are expected to be same or slightly better than v2.
With driver, one major overhead in sched_tick has been removed. i.e
finding the first housekeeping CPU which was O(N). 

Apologies in advance if there is any critical information is missing
regarding new driver such as policy, documentation or missing
implementation. Please let me know, and I can make those changes.
I have ensured checkpatch --strict is happy.

Also, I think there should be a MAINTAINERS file entry for new
driver. I don't see a drivers/virt/* entry.
Either as a new entry for driver or a few file in SCHEDULER entry.
Let me know if/what I should add it. I am bit cautious about such
change. I am willing to maintain this driver, other than that
I don't know what else i going to be necessary for it. I don't have
any maintainer experience either :)

PS: Sorry for the long CC list. Please unicast it to me if you want to
be dropped for the CC list.

Shrikanth Hegde (24):
  sched/debug: Remove unused schedstats
  sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  kconfig: Provide PREFERRED_CPU option
  cpumask: Introduce cpu_preferred_mask
  sysfs: Add preferred CPU file
  sched/core: allow only preferred CPUs in is_cpu_allowed
  sched/fair: Select preferred CPU at wakeup when possible
  sched/fair: load balance only among preferred CPUs
  sched/fair: Pull the load on preferred CPU
  sched/core: Keep tick on non-preferred CPUs until tasks are out
  sched/core: Push current task from non preferred CPU
  sched/debug: Add migration stats due to non preferred CPUs
  virt/steal_monitor: Add documentation
  virt: Introduce steal monitor driver
  virt/steal_monitor: Restore to active on module disable
  virt/steal_monitor: Define steal_monitor structure
  virt/steal_monitor: Add control knobs for handling steal values
  virt/steal_monitor: Compute work at regular intervals
  virt/steal_monitor: Provide default method to get systemwide steal
    time
  virt/steal_monitor: Provide default method to inc/dec preferred CPUs
  virt/steal_monitor: Provide default method to get num of CPUs for
    steal ratio
  virt/steal_monitor: Act on steal values at regular intervals
  virt/steal_monitor: Add direction control
  virt/steal_monitor: Add design check of preferred subset of active

 .../ABI/testing/sysfs-devices-system-cpu      |  11 ++
 Documentation/driver-api/index.rst            |   1 +
 Documentation/driver-api/steal-monitor.rst    |  93 ++++++++++++
 Documentation/scheduler/sched-arch.rst        |  50 +++++++
 drivers/base/cpu.c                            |   8 ++
 drivers/virt/Makefile                         |   1 +
 drivers/virt/steal_monitor/Makefile           |  14 ++
 drivers/virt/steal_monitor/defaults.c         | 105 ++++++++++++++
 drivers/virt/steal_monitor/sm_core.c          | 124 ++++++++++++++++
 drivers/virt/steal_monitor/sm_core.h          |  32 +++++
 include/linux/cpumask.h                       |  21 ++-
 include/linux/sched.h                         |   5 +-
 kernel/Kconfig.preempt                        |  14 ++
 kernel/cpu.c                                  |   6 +
 kernel/sched/core.c                           | 133 +++++++++++++++++-
 kernel/sched/debug.c                          |   4 +-
 kernel/sched/fair.c                           |  11 +-
 kernel/sched/sched.h                          |  36 +++++
 18 files changed, 659 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/driver-api/steal-monitor.rst
 create mode 100644 drivers/virt/steal_monitor/Makefile
 create mode 100644 drivers/virt/steal_monitor/defaults.c
 create mode 100644 drivers/virt/steal_monitor/sm_core.c
 create mode 100644 drivers/virt/steal_monitor/sm_core.h

-- 
2.47.3


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v5 01/24] sched/debug: Remove unused schedstats
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 02/24] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

nr_migrations_cold, nr_wakeups_passive and nr_wakeups_idle are not
being updated anywhere. So remove them.

These are per process stats. So updating sched stats version isn't
necessary.

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/sched.h | 3 ---
 kernel/sched/debug.c  | 3 ---
 2 files changed, 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 35e6183ef615..fc6ecb3869dd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,7 +550,6 @@ struct sched_statistics {
 	s64				exec_max;
 	u64				slice_max;
 
-	u64				nr_migrations_cold;
 	u64				nr_failed_migrations_affine;
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
@@ -563,8 +562,6 @@ struct sched_statistics {
 	u64				nr_wakeups_remote;
 	u64				nr_wakeups_affine;
 	u64				nr_wakeups_affine_attempts;
-	u64				nr_wakeups_passive;
-	u64				nr_wakeups_idle;
 
 #ifdef CONFIG_SCHED_CORE
 	u64				core_forceidle_sum;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 40584b27ea0c..f3a033b34ba0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1359,7 +1359,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(wait_count);
 		PN_SCHEDSTAT(iowait_sum);
 		P_SCHEDSTAT(iowait_count);
-		P_SCHEDSTAT(nr_migrations_cold);
 		P_SCHEDSTAT(nr_failed_migrations_affine);
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
@@ -1371,8 +1370,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_wakeups_remote);
 		P_SCHEDSTAT(nr_wakeups_affine);
 		P_SCHEDSTAT(nr_wakeups_affine_attempts);
-		P_SCHEDSTAT(nr_wakeups_passive);
-		P_SCHEDSTAT(nr_wakeups_idle);
 
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 02/24] sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 01/24] sched/debug: Remove unused schedstats Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 03/24] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	kernel test robot

Add documentation for new cpumask called cpu_preferred_mask. This could
help users in understanding what this mask is and the concept behind it.

Document how to enable it and implementation aspects of it.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202606180717.yNM0yb41-lkp@intel.com/
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Change text to reflect new driver info.
- Changes suggested by Randy Dunlap.
- Sashiko nitpicks

 Documentation/scheduler/sched-arch.rst | 50 ++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..8fc56edd8e03 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,56 @@ Your cpu_idle routines need to obey the following rules:
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
 
+Preferred CPUs
+==============
+
+In virtualised environments it is possible to overcommit CPU resources.
+i.e sum of virtual CPU(vCPU) of all VMs is greater than number of physical
+CPUs(pCPU). Under such conditions when all or many VMs have high utilization,
+hypervisor won't be able to satisfy the CPU requirement and has to context
+switch within or across VMs. i.e hypervisor needs to preempt one vCPU to run
+another. This is called vCPU preemption. This is more expensive compared to
+task context switch within a vCPU.
+
+In such cases it is better that combined vCPU ask from all VMs is reduced
+by not using some of the vCPUs in each VM. vCPUs where workload can be safely
+scheduled which won't increase any contention for pCPU are called as
+"Preferred CPUs".
+
+Main design construct is preferred CPUs is always subset of active CPUs.
+In most cases preferred CPUs will be same as active CPUs, when there is pCPU
+contention, Preferred CPUs will reduce based on the amount of steal time.
+When the pCPU contention goes away as indicated by steal time, Preferred CPUs
+will become same as active CPUs again. This is done by loading the
+steal_monitor driver available at drivers/virt/steal_monitor.
+
+For scheduling decisions such as wakeup, pushing the task etc, needs this
+CPU state info. This is maintained in cpu_preferred_mask.
+vCPUs which are not in cpu_preferred_mask should be treated as vCPUs which
+should not be used at this moment provided it doesn't break user affinity.
+
+This is achieved by
+1. Selecting a preferred CPU at wakeup.
+2. Push the task away from non-preferred CPU at tick.
+3. Only select preferred CPUs for load balance.
+
+/sys/devices/system/cpu/preferred prints the current cpu_preferred_mask in
+cpulist format.
+
+Notes:
+1. This feature is available under CONFIG_PREFERRED_CPU. This enables
+   steal_monitor driver. On enabling the driver, CPU preferred state
+   can change based on steal time. With CONFIG_PREFERRED_CPU=n,
+   preferred CPUs is same as active CPUs.
+
+2. This feature works for FAIR class only.
+
+3. A task pinned, which can't be moved to preferred CPUs will continue
+   to run based on its affinity. But no load balancing happens.
+
+4. Decision to use/not use is driven by kernel. Hence it shouldn't
+   break user affinities. One of the main reasons why CPU hotplug
+   or Isolated cpuset partitions was not a solution.
 
 Possible arch/ problems
 =======================
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 03/24] kconfig: Provide PREFERRED_CPU option
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 01/24] sched/debug: Remove unused schedstats Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 02/24] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 04/24] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Introduce a new config named PREFERRED_CPU.

This helps to:
- Avoid the code bloat when PREFERRED_CPU=n. In that cases preferred
  is same as active.
- Avoid the ifdeffery around PREFERRED_CPU in many files.

Since paravirtualized use case is the main driving force of this
feature, make it default for kernels with PARAVIRT=y

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Make it depend on instead. (Yury Norov)
- Fix helper indentation (sashiko)

 kernel/Kconfig.preempt | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 88c594c6d7fc..b3a543cb44cd 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -192,3 +192,17 @@ config SCHED_CLASS_EXT
 	  For more information:
 	    Documentation/scheduler/sched-ext.rst
 	    https://github.com/sched-ext/scx
+
+config PREFERRED_CPU
+	bool "Dynamic vCPU management based on steal time"
+	depends on PARAVIRT && SMP
+	default y
+	help
+	  This feature helps to reduce the steal time in paravirtualised
+	  environment, there by reducing vCPU preemption. Reducing vCPU
+	  preemption provides improved lock holder preemption and reduces
+	  cost of vCPU preemption in the host.
+
+	  By default preferred CPUs will be same as active CPUs. Depending
+	  on the steal time when steal_monitor driver is enabled,
+	  preferred CPUs could become subset of active CPUs.
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 04/24] cpumask: Introduce cpu_preferred_mask
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (2 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 03/24] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 05/24] sysfs: Add preferred CPU file Shrikanth Hegde
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

This patch does
- Declare and Define cpu_preferred_mask.
- Get/Set helpers for it.

Values are set/clear by the scheduler by detecting the steal time values.

A CPU is set to preferred when it becomes active. Later it may be
marked as non-preferred depending on steal time values with
steal monitor being enabled.

Always maintain design construct of preferred is subset of active.
i.e. preferred ⊆ active ⊆ online ⊆ present ⊆ possible

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Make it macro instead (Yury Norov)

 include/linux/cpumask.h | 21 ++++++++++++++++++++-
 kernel/cpu.c            |  6 ++++++
 kernel/sched/core.c     |  5 +++++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 80211900f373..5a643d608ea6 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -120,12 +120,20 @@ extern struct cpumask __cpu_enabled_mask;
 extern struct cpumask __cpu_present_mask;
 extern struct cpumask __cpu_active_mask;
 extern struct cpumask __cpu_dying_mask;
+
+#ifdef CONFIG_PREFERRED_CPU
+extern struct cpumask __cpu_preferred_mask;
+#else
+#define __cpu_preferred_mask __cpu_active_mask
+#endif
+
 #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
 #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
 #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
 #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
 #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
 #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
+#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
 
 extern atomic_t __num_online_cpus;
 extern unsigned int __num_possible_cpus;
@@ -1161,6 +1169,7 @@ void init_cpu_possible(const struct cpumask *src);
 #define set_cpu_present(cpu, present)	assign_cpu((cpu), &__cpu_present_mask, (present))
 #define set_cpu_active(cpu, active)	assign_cpu((cpu), &__cpu_active_mask, (active))
 #define set_cpu_dying(cpu, dying)	assign_cpu((cpu), &__cpu_dying_mask, (dying))
+#define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
 
 void set_cpu_online(unsigned int cpu, bool online);
 void set_cpu_possible(unsigned int cpu, bool possible);
@@ -1256,7 +1265,12 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return cpumask_test_cpu(cpu, cpu_dying_mask);
 }
 
-#else
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+	return cpumask_test_cpu(cpu, cpu_preferred_mask);
+}
+
+#else	/* NR_CPUS <= 1 */
 
 #define num_online_cpus()	1U
 #define num_possible_cpus()	1U
@@ -1294,6 +1308,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return false;
 }
 
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+	return cpu == 0;
+}
+
 #endif /* NR_CPUS > 1 */
 
 #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
diff --git a/kernel/cpu.c b/kernel/cpu.c
index bc4f7a9ba64e..d623a9c5554a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3107,6 +3107,11 @@ EXPORT_SYMBOL(__cpu_dying_mask);
 atomic_t __num_online_cpus __read_mostly;
 EXPORT_SYMBOL(__num_online_cpus);
 
+#ifdef CONFIG_PREFERRED_CPU
+struct cpumask __cpu_preferred_mask __read_mostly;
+EXPORT_SYMBOL(__cpu_preferred_mask);
+#endif
+
 void init_cpu_present(const struct cpumask *src)
 {
 	cpumask_copy(&__cpu_present_mask, src);
@@ -3164,6 +3169,7 @@ void __init boot_cpu_init(void)
 	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
 	set_cpu_online(cpu, true);
 	set_cpu_active(cpu, true);
+	set_cpu_preferred(cpu, true);
 	set_cpu_present(cpu, true);
 	set_cpu_possible(cpu, true);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f4530eb543f..9e16946c9d62 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8685,6 +8685,9 @@ int sched_cpu_activate(unsigned int cpu)
 	 */
 	sched_set_rq_online(rq, cpu);
 
+	/* preferred is subset of active and follows its state */
+	set_cpu_preferred(cpu, true);
+
 	return 0;
 }
 
@@ -8698,6 +8701,8 @@ int sched_cpu_deactivate(unsigned int cpu)
 	if (ret)
 		return ret;
 
+	set_cpu_preferred(cpu, false);
+
 	/*
 	 * Remove CPU from nohz.idle_cpus_mask to prevent participating in
 	 * load balancing when not active
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 05/24] sysfs: Add preferred CPU file
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (3 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 04/24] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 06/24] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Add "preferred" file in /sys/devices/system/cpu

This offers
- User can quickly check which CPUs are marked as preferred at this
  moment.
- Userspace algorithms irqbalance could use this mask to send irq into
  preferred CPUs.

For example:
cat /sys/devices/system/cpu/online
0-719
cat /sys/devices/system/cpu/preferred
0-599        <<< Implies 0-599 are preferred for workloads and 600-719
                 should be avoided at this moment.

cat /sys/devices/system/cpu/preferred
0-719        <<< All CPUs are usable. There is no preferrence.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 11 +++++++++++
 drivers/base/cpu.c                                 |  8 ++++++++
 2 files changed, 19 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 82d10d556cc8..5fb973d53287 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -806,3 +806,14 @@ Date:		Nov 2022
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:
 		(RO) the list of CPUs that can be brought online.
+
+What:		/sys/devices/system/cpu/preferred
+Date:		Jun 2026
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		(RO) the list of preferred CPUs at this moment.
+		These are the only CPUs meant to be used at the moment.
+		Using CPU outside of the list could lead to more
+		contention of underlying physical CPU resource. Dynamically
+		changes based on steal time. With CONFIG_PREFERRED_CPU=n it
+		is same as active CPUs. See sched-arch.rst for more details.
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 875abdc9942e..0c6647805805 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -391,6 +391,13 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
 }
 #endif
 
+static ssize_t preferred_show(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_preferred_mask));
+}
+static DEVICE_ATTR_RO(preferred);
+
 const struct bus_type cpu_subsys = {
 	.name = "cpu",
 	.dev_name = "cpu",
@@ -532,6 +539,7 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
+	&dev_attr_preferred.attr,
 	NULL
 };
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 06/24] sched/core: allow only preferred CPUs in is_cpu_allowed
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (4 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 05/24] sysfs: Add preferred CPU file Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 07/24] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

When possible, choose a preferred CPUs to pick.

Push task mechanism uses stopper thread which going to call
select_fallback_rq and use this mechanism to pick only a preferred CPU.

When task is affined only to non-preferred CPUs it should continue to
run there. Detect that by checking if cpus_ptr and cpu_preferred_mask
intersect or not.

Since is_cpu_allowed can be called directly or repeatedly in
select_fallback_rq, encode the info in task_struct->has_preferred_cpu_state
if the path is via select_fallback_rq or not.
This helps to avoid N**2 complexity for the rare cases.

Additional overhead of O(N) comes to is_cpu_allowed only when cpu is not
preferred. So in normal scenarios overhead is only a bit check.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Do simple encoding of -1,0,1 instead (K Prateek Nayak)
- Make it s8 (K Prateek Nayak)
- Update changelog to address sashiko concerns of overhead.

 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 35 +++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  | 25 +++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc6ecb3869dd..27dbf676113e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1657,6 +1657,7 @@ struct task_struct {
 #ifdef CONFIG_UNWIND_USER
 	struct unwind_task_info		unwind_info;
 #endif
+	s8				has_preferred_cpu_state;
 
 	/* CPU-specific state of this task: */
 	struct thread_struct		thread;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e16946c9d62..281715a6e88f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2500,6 +2500,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
  */
 static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 {
+	bool task_check_preferred_cpu;
+
 	/* When not in the task's cpumask, no point in looking further. */
 	if (!task_allowed_on_cpu(p, cpu))
 		return false;
@@ -2508,9 +2510,23 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (is_migration_disabled(p))
 		return cpu_online(cpu);
 
+	/*
+	 * This is essential to maintain user affinities when preferred
+	 * CPUs change. A task pinned on non-preferred CPU should continue
+	 * to run there, since this is non-user triggered.
+	 *
+	 * If CPU is non-preferred and task can run on other CPUs which are
+	 * currently preferred, then choose those other CPUs instead.
+	 * Overhead is minimal when CPU is preferred.
+	 */
+	task_check_preferred_cpu = !cpu_preferred(cpu) && task_has_preferred_cpus(p);
+
 	/* Non kernel threads are not allowed during either online or offline. */
-	if (!(p->flags & PF_KTHREAD))
+	if (!(p->flags & PF_KTHREAD)) {
+		if (task_check_preferred_cpu)
+			return false;
 		return cpu_active(cpu);
+	}
 
 	/* KTHREAD_IS_PER_CPU is always allowed. */
 	if (kthread_is_per_cpu(p))
@@ -2520,6 +2536,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (cpu_dying(cpu))
 		return false;
 
+	/* Try on preferred CPU first if possible*/
+	if (task_check_preferred_cpu)
+		return false;
+
 	/* But are allowed during online. */
 	return cpu_online(cpu);
 }
@@ -3549,6 +3569,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 	enum { cpuset, possible, fail } state = cpuset;
 	int dest_cpu;
 
+	/*
+	 * Cache the value whether task's affinity spans preferred CPUs.
+	 * This helps to avoid repeating the same for each CPU
+	 * later in the loop. Encode call to is_cpu_allowed coming
+	 * via select_fallback_rq.
+	 */
+	p->has_preferred_cpu_state = task_has_preferred_cpus(p) ? 1 : -1;
+
 	/*
 	 * If the node that the CPU is on has been offlined, cpu_to_node()
 	 * will return -1. There is no CPU on the node, and we should
@@ -3560,7 +3588,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 		/* Look for allowed, online CPU in same node. */
 		for_each_cpu(dest_cpu, nodemask) {
 			if (is_cpu_allowed(p, dest_cpu))
-				return dest_cpu;
+				goto clear_and_return;
 		}
 	}
 
@@ -3604,6 +3632,8 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 		}
 	}
 
+clear_and_return:
+	p->has_preferred_cpu_state = 0;
 	return dest_cpu;
 }
 
@@ -4612,6 +4642,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
 	init_numa_balancing(clone_flags, p);
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
+	p->has_preferred_cpu_state = 0;
 	init_sched_mm(p);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7c2dea65edd..5d009c2529b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4213,4 +4213,29 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
 
 #include "ext.h"
 
+/*
+ * has_preferred_cpu_state could have the value cached from
+ * select_fallback_rq. It is set/cleared while holding pi_lock
+ * and irq disabled.
+ *
+ *  1: Cached and preferred CPUs exists in task's affinity.
+ *  0: Not cached and need to evaluate.
+ * -1: Cached and preferred CPU doesn't exits task's affinity
+ *
+ * Only affects FAIR task.
+ */
+static inline bool task_has_preferred_cpus(struct task_struct *p)
+{
+	int cached;
+
+	/* Only FAIR tasks honor preferred CPU state */
+	if (unlikely(p->sched_class != &fair_sched_class))
+		return false;
+
+	cached = READ_ONCE(p->has_preferred_cpu_state);
+	if (cached)
+		return cached > 0;
+	else
+		return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
+}
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 07/24] sched/fair: Select preferred CPU at wakeup when possible
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (5 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 06/24] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 08/24] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Update available_idle_cpu to consider preferred CPUs. This takes care of
lot of decisions at wakeup to use only preferred CPUs. There is no need to
put those explicit checks everywhere.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/sched.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5d009c2529b2..148fe6145f1a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1434,6 +1434,9 @@ static inline bool available_idle_cpu(int cpu)
 	if (!idle_rq(cpu_rq(cpu)))
 		return 0;
 
+	if (!cpu_preferred(cpu))
+		return 0;
+
 	if (vcpu_is_preempted(cpu))
 		return 0;
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 08/24] sched/fair: load balance only among preferred CPUs
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (6 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 07/24] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 09/24] sched/fair: Pull the load on preferred CPU Shrikanth Hegde
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Consider only preferred CPUs for load balance.

With this, load balance will end up choosing a preferred CPUs for pull.
This makes it not fight against the push task mechanism which happens
at tick. Also, this stops active balance to happen on non-preferred CPU
pulling the load.

This means there is no load balancing if the task is pinned only to
non-preferred CPUs. They will continue to run where they were previously
running before the CPUs was marked as non-preferred.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Remove previous cpumask_and (K Prateek Nayak)

 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d78467ec6ee1..44a0d9736b67 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13289,7 +13289,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	};
 	bool need_unlock = false;
 
-	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+	/* Spread load among preferred CPUs */
+	cpumask_and(cpus, sched_domain_span(sd), cpu_preferred_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 09/24] sched/fair: Pull the load on preferred CPU
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (7 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 08/24] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 10/24] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

When cpu is marked as non preferred, any load pulled towards it is
pointless since in the next tick task will be pushed out again.

Since load balancing only happens among preferred CPUs, should_we_balance
will bail out. But for NEWIDLE and IDLE balance, this bailout can
happen even earlier.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- new patch

 kernel/sched/fair.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44a0d9736b67..fda8966d9d87 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -14196,6 +14196,10 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 		if (!idle_cpu(balance_cpu))
 			continue;
 
+		/* There is no point in pulling the load, just to push it out next */
+		if (!cpu_preferred(balance_cpu))
+			continue;
+
 		/*
 		 * If this CPU gets work to do, stop the load balancing
 		 * work being done for other CPUs. Next load
@@ -14375,6 +14379,10 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 	if (!cpu_active(this_cpu))
 		return 0;
 
+	/* Do not pull to a !preferred CPU just to push it out next */
+	if (!cpu_preferred(this_cpu))
+		return 0;
+
 	/*
 	 * This is OK, because current is on_cpu, which avoids it being picked
 	 * for load-balance and preemption/IRQs are still disabled avoiding
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 10/24] sched/core: Keep tick on non-preferred CPUs until tasks are out
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (8 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 09/24] sched/fair: Pull the load on preferred CPU Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 11/24] sched/core: Push current task from non preferred CPU Shrikanth Hegde
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Enable tick on nohz full CPU when it is marked as non-preferred.
If there in no CFS running there, disable the tick to save the power.

Steal time handling code will call tick_nohz_dep_set_cpu with
TICK_DEP_BIT_SCHED for moving the task out of nohz_full CPU fast.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Move it below rt checks. (Sashiko)

 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 281715a6e88f..c0391e7897f5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1473,6 +1473,10 @@ bool sched_can_stop_tick(struct rq *rq)
 			return false;
 	}
 
+	/* Keep the tick running until CFS tasks are pushed out*/
+	if (!cpu_preferred(rq->cpu) && rq->cfs.h_nr_queued)
+		return false;
+
 	return true;
 }
 #endif /* CONFIG_NO_HZ_FULL */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 11/24] sched/core: Push current task from non preferred CPU
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (9 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 10/24] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 12/24] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Actively push out task running on a non-preferred CPU. Since the task is
running on the CPU, need to stop the cpu and push the task out.
However, if the task in pinned only to non-preferred CPUs, it will continue
running there. This will help in maintaining the userspace affinities
unlike CPU hotplug or isolated cpusets.

Though code is almost same as __balance_push_cpu_stop and quite close to
push_cpu_stop, it is being kept separate as it provides a cleaner
implementation w.r.t CONFIG_HOTPLUG_CPU.

Add push_task_work_done flag to protect work buffer.
Works only with FAIR class.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Move select_fallback_rq outside of rq_lock (Sashiko)
- Add context_unsafe_alias (K Prateek Nayak)
- Cleanup properly on early exit.

 kernel/sched/core.c  | 87 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  8 ++++
 2 files changed, 95 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c0391e7897f5..1e42078251d5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5794,6 +5794,9 @@ void sched_tick(void)
 	unsigned long hw_pressure;
 	u64 resched_latency;
 
+	if (!cpu_preferred(cpu))
+		sched_push_current_non_preferred_cpu(rq);
+
 	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
 		arch_scale_freq_tick();
 
@@ -11303,3 +11306,87 @@ void sched_change_end(struct sched_change_ctx *ctx)
 		p->sched_class->prio_changed(rq, p, ctx->prio);
 	}
 }
+
+#ifdef CONFIG_PREFERRED_CPU
+/* npc - non preferred CPU */
+static DEFINE_PER_CPU(struct cpu_stop_work, npc_push_task_work);
+
+static int sched_non_preferred_cpu_push_stop(void *arg)
+{
+	struct task_struct *p = arg;
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+	int cpu;
+
+	/* sanity check and clear */
+	if (cpu_preferred(rq->cpu)) {
+		scoped_guard (rq_lock, rq)
+			rq->push_task_work_done = 0;
+		put_task_struct(p);
+		return 0;
+	}
+
+	raw_spin_lock_irq(&p->pi_lock);
+
+	/* This could take rq lock. So call it before rq lock is taken */
+	cpu = select_fallback_rq(rq->cpu, p);
+	rq_lock(rq, &rf);
+	rq->push_task_work_done = 0;
+	update_rq_clock(rq);
+
+	context_unsafe_alias(rq);
+
+	if (task_rq(p) == rq && task_on_rq_queued(p))
+		rq = __migrate_task(rq, &rf, p, cpu);
+
+	rq_unlock(rq, &rf);
+	raw_spin_unlock_irq(&p->pi_lock);
+	put_task_struct(p);
+
+	return 0;
+}
+
+/*
+ * Push the current task running on non-preferred CPU.
+ * Using this non preferred CPU will lead to more vCPU preemptions
+ * in the host. So it is better not to use this CPU.
+ *
+ * Since task is running, call a stopper to push the task out. This is
+ * similar to how task moves during hotplug. In select_fallback_rq a
+ * preferred CPU will be chosen and henceforth task shouldn't come back to
+ * this CPU again.
+ *
+ * Works for FAIR class only
+ *
+ * If task is affined only non-preferred CPUs, it can't be moved out
+ */
+void sched_push_current_non_preferred_cpu(struct rq *rq)
+{
+	struct task_struct *push_task = rq->curr;
+
+	/* Push only if it is FAIR class */
+	if (push_task->sched_class != &fair_sched_class)
+		return;
+
+	if (kthread_is_per_cpu(push_task) ||
+	    is_migration_disabled(push_task))
+		return;
+
+	/* Is there any preferred CPU in the affinity list */
+	if (!task_has_preferred_cpus(push_task))
+		return;
+
+	/* There is already a stopper thread for this. Dont race with it */
+	if (rq->push_task_work_done == 1)
+		return;
+
+	/* sched_tick runs with interrupts disabled. Don't disable again */
+	get_task_struct(push_task);
+
+	scoped_guard (rq_lock, rq)
+		rq->push_task_work_done = 1;
+
+	stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
+			    push_task, this_cpu_ptr(&npc_push_task_work));
+}
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 148fe6145f1a..316d3ccefc48 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1274,6 +1274,8 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+	bool			push_task_work_done;
+
 	struct sched_avg	avg_rt;
 	struct sched_avg	avg_dl;
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
@@ -4241,4 +4243,10 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
 	else
 		return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
 }
+
+#ifdef CONFIG_PREFERRED_CPU
+void sched_push_current_non_preferred_cpu(struct rq *rq);
+#else	/* !CONFIG_PREFERRED_CPU */
+static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
+#endif
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 12/24] sched/debug: Add migration stats due to non preferred CPUs
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (10 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 11/24] sched/core: Push current task from non preferred CPU Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 13/24] virt/steal_monitor: Add documentation Shrikanth Hegde
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Add a new stat.
- nr_migrations_cpu_non_preferred: number of migrations happened since
  a CPU was marked as non preferred due to high steal time.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 4 +++-
 kernel/sched/debug.c  | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 27dbf676113e..32f4743326f8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -554,6 +554,7 @@ struct sched_statistics {
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;
+	u64				nr_migrations_cpu_non_preferred;
 
 	u64				nr_wakeups;
 	u64				nr_wakeups_sync;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1e42078251d5..fdfefdca74dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11336,8 +11336,10 @@ static int sched_non_preferred_cpu_push_stop(void *arg)
 
 	context_unsafe_alias(rq);
 
-	if (task_rq(p) == rq && task_on_rq_queued(p))
+	if (task_rq(p) == rq && task_on_rq_queued(p)) {
 		rq = __migrate_task(rq, &rf, p, cpu);
+		schedstat_inc(p->stats.nr_migrations_cpu_non_preferred);
+	}
 
 	rq_unlock(rq, &rf);
 	raw_spin_unlock_irq(&p->pi_lock);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f3a033b34ba0..106b448cafb6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1363,6 +1363,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
 		P_SCHEDSTAT(nr_forced_migrations);
+		P_SCHEDSTAT(nr_migrations_cpu_non_preferred);
 		P_SCHEDSTAT(nr_wakeups);
 		P_SCHEDSTAT(nr_wakeups_sync);
 		P_SCHEDSTAT(nr_wakeups_migrate);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 13/24] virt/steal_monitor: Add documentation
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (11 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 12/24] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 14/24] virt: Introduce steal monitor driver Shrikanth Hegde
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Document this module named steal_monitor and its parameters.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4-v5:
- new patch

Please let me know if the placing is not right.

 Documentation/driver-api/index.rst         |  1 +
 Documentation/driver-api/steal-monitor.rst | 93 ++++++++++++++++++++++
 2 files changed, 94 insertions(+)
 create mode 100644 Documentation/driver-api/steal-monitor.rst

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index eaf7161ff957..ec12f396a5e6 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -138,6 +138,7 @@ Subsystem-specific APIs
    sm501
    soundwire/index
    spi
+   steal-monitor
    surface_aggregator/index
    switchtec
    sync_file
diff --git a/Documentation/driver-api/steal-monitor.rst b/Documentation/driver-api/steal-monitor.rst
new file mode 100644
index 000000000000..997a22d0812c
--- /dev/null
+++ b/Documentation/driver-api/steal-monitor.rst
@@ -0,0 +1,93 @@
+.. SPDX-License-Identifier: GPL-2.0
+=============
+Steal Monitor
+=============
+
+:Author: Shrikanth Hegde
+
+Introduction:
+=============
+
+Steal monitor is a driver aimed at solving the Noisy Neighbour problem
+in virtualized environments. I.e performance of workload
+running in one VM gets affected significantly due to other VMs and
+combined they make slower forward progress.
+
+When there is overcommit of CPU resources, i.e sum of virtual CPUs(vCPU)
+of all VMs is greater than number of physical CPUs(pCPU) and
+when all or many VMs have high utilization, hypervisor won't be able
+to satisfy the CPU requirement and has to context switch within or
+across VM. I.e hypervisor needs to preempt one vCPU to run
+another. This is called vCPU preemption.
+This is more expensive compared to task context switch within a vCPU.
+
+In such cases it is better that combined vCPU ask from all VM is reduced
+by not using some of the vCPUs. vCPUs where workload can be safely
+scheduled which won't increase any contention for pCPU are called as
+"Preferred CPUs".
+
+See more on "Preferred CPUs" in Documentation/scheduler/sched-arch.rst.
+
+This driver helps in setting/clearing the CPUs in the "Preferred CPUs" list.
+This list is obtained using cpu_preferred_mask.
+
+Core idea:
+==========
+steal time is an indication available today in Guest which shows contention
+for underlying physical CPU. Use it as a hint in the guest to fold the
+workload to a reduced set of vCPUs. When there is contention, steal time
+will show up in all the guests. When each guest honors the hint and folds
+the workload to a smaller set of vCPUs(Preferred CPUs), it reduces the
+contention and thereby reduces vCPU preemption.
+This is achieved without any cross-guest communication.
+
+Steal monitor driver effectively does:
+
+1. Periodically computes steal time across the system.
+
+2. If steal time is greater than high threshold, reduce the number of
+   preferred CPUs by 1 core. Ensure at least one core is left always.
+   This avoids running into extreme cases.
+
+3. If steal time is lower or equal to low threshold, increase the
+   number of preferred CPUs by 1 core. If preferred is same as active,
+   nothing to be done.
+
+4. Ensure preferred CPUs is always subset of active CPUs.
+   On feature disable it is same as active CPUs.
+
+Module Parameters:
+==================
+interval_ms
+-----------
+How often steal monitor checks for steal time.
+(Default: 1000 i.e 1 second)
+
+This controls how fast steal monitor driver reacts to changes to
+the contention of physical CPUs. Since it does fair amount of
+work, setting too low will have overheads. If set to 0, on next
+work it will be set to default.
+
+low_threshold
+-------------
+lower threshold value in percentage * 100.
+(Default: 200, i.e 2% steal is considered as low threshold)
+
+This determines what values should be considered as nil/no steal values.
+When steal monitor see steal time is below or equal to this value, it
+will increase the preferred CPUs by 1 core. Having value as zero
+might cause too much oscillations.
+
+high_threshold
+--------------
+higher threshold value in percentage * 100
+(Default: 500, i.e 5% steal is considered as high threshold)
+
+This determines what values should be considered as high steal values.
+When steal monitor sees steal time is higher than this value, it will
+reduce the preferred CPUs by 1 core.
+
+Notes:
+======
+This is available under CONFIG_PREFERRED_CPU. Selecting that includes
+this module. Module is not loaded by default.
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 14/24] virt: Introduce steal monitor driver
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (12 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 13/24] virt/steal_monitor: Add documentation Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 15/24] virt/steal_monitor: Restore to active on module disable Shrikanth Hegde
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Introduce a new driver in virt named steal_monitor. This driver
will compute the steal time and drive the policy decisions of preferred
CPU state.

More on it can be found in the Documentation/driver-api/steal-monitor.rst
This patch introduces the skeleton code.

There is no new kconfig. It depends on CONFIG_PREFERRED_CPU.
- If CONFIG_PREFERRED_CPU=y, it gets compiled as a module. It is not
  loaded by default.
- If CONFIG_PREFERRED_CPU=n, module isn't compiled.

File layout of the driver is designed with having arch specific
files in the future.

- sm_core.c - contains main driver code. This includes the periodic
  work function and take action on steal time.
- defaults.c - contains the default implementation defined with __weak
  symbols.
- sm_core.h - header file which includes data structure.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- new patch

Please let me know if the placing is not right.

 drivers/virt/Makefile                |  1 +
 drivers/virt/steal_monitor/Makefile  | 14 ++++++++++++
 drivers/virt/steal_monitor/sm_core.c | 33 ++++++++++++++++++++++++++++
 drivers/virt/steal_monitor/sm_core.h | 11 ++++++++++
 4 files changed, 59 insertions(+)
 create mode 100644 drivers/virt/steal_monitor/Makefile
 create mode 100644 drivers/virt/steal_monitor/sm_core.c
 create mode 100644 drivers/virt/steal_monitor/sm_core.h

diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
index f29901bd7820..aff715cea42d 100644
--- a/drivers/virt/Makefile
+++ b/drivers/virt/Makefile
@@ -9,4 +9,5 @@ obj-y				+= vboxguest/
 
 obj-$(CONFIG_NITRO_ENCLAVES)	+= nitro_enclaves/
 obj-$(CONFIG_ACRN_HSM)		+= acrn/
+obj-$(CONFIG_PREFERRED_CPU)	+= steal_monitor/
 obj-y				+= coco/
diff --git a/drivers/virt/steal_monitor/Makefile b/drivers/virt/steal_monitor/Makefile
new file mode 100644
index 000000000000..24cee55342ce
--- /dev/null
+++ b/drivers/virt/steal_monitor/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Steal time monitor to alter preferred CPU state.
+#
+# Arch can implement strong function definitions and override the
+# default by adding them in arch specific file. It must ensure
+# that preferred is always subset of active.
+#
+# It is always compiled as module if CONFIG_PREFERRED_CPU=y
+# One has to enable the module.
+#
+obj-$(subst y,m,$(CONFIG_PREFERRED_CPU)) += steal_monitor.o
+
+steal_monitor-y := sm_core.o
diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
new file mode 100644
index 000000000000..e320559c6576
--- /dev/null
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Steal time Monitor.
+ *
+ * Periodically compute steal time. Based on the thresholds either
+ * reduce/increase the preferred CPUs which can be made use
+ * by the workload to avoid vCPU preemption to an extent possible.
+ *
+ * Available as module with CONFIG_PREFERRED_CPU=y
+ *
+ * Copyright (C) 2026 IBM
+ * Author: Shrikanth Hegde <sshegde@linux.ibm.com>
+ */
+
+#include "sm_core.h"
+
+static int __init steal_monitor_init(void)
+{
+	pr_info("steal_monitor is enabled\n");
+	return 0;
+}
+
+static void __exit steal_monitor_exit(void)
+{
+	pr_info("steal_monitor is disabled\n");
+}
+
+module_init(steal_monitor_init);
+module_exit(steal_monitor_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("IBM Corporation");
+MODULE_DESCRIPTION("Virtualization Steal Time Monitor");
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
new file mode 100644
index 000000000000..684a258526e1
--- /dev/null
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __VIRT_STEAL_CORE_H
+#define __VIRT_STEAL_CORE_H
+
+#include <linux/types.h>
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+
+#endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 15/24] virt/steal_monitor: Restore to active on module disable
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (13 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 14/24] virt: Introduce steal monitor driver Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 16/24] virt/steal_monitor: Define steal_monitor structure Shrikanth Hegde
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

When the module is not in use, preferred CPUs must be same
as active CPUs.

Even if one disables the module during high steal time, it
still restores the preferred CPUs to be same as active CPUs
to keep disable path simple.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Modified for steal_monitor

 drivers/virt/steal_monitor/sm_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index e320559c6576..b1865fcdff93 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -23,6 +23,7 @@ static int __init steal_monitor_init(void)
 static void __exit steal_monitor_exit(void)
 {
 	pr_info("steal_monitor is disabled\n");
+	cpumask_copy(&__cpu_preferred_mask, cpu_active_mask);
 }
 
 module_init(steal_monitor_init);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 16/24] virt/steal_monitor: Define steal_monitor structure
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (14 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 15/24] virt/steal_monitor: Restore to active on module disable Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 17/24] virt/steal_monitor: Add control knobs for handling steal values Shrikanth Hegde
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Main structure of steal monitor. It has
- work: deferred periodic work function
- prev_steal, prev_time - To calculate the delta in periodic work.
- interval_ms, high_threshold, low_threshold - debug knobs of
  steal_monitor.

sm_core_ctx - instance used by the core code.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Modified for steal_monitor

 drivers/virt/steal_monitor/sm_core.c |  2 ++
 drivers/virt/steal_monitor/sm_core.h | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index b1865fcdff93..3feb686dd3c4 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -14,6 +14,8 @@
 
 #include "sm_core.h"
 
+struct steal_monitor sm_core_ctx;
+
 static int __init steal_monitor_init(void)
 {
 	pr_info("steal_monitor is enabled\n");
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index 684a258526e1..a4e813319680 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -8,4 +8,16 @@
 #include <linux/kernel.h>
 #include <linux/init.h>
 
+struct steal_monitor {
+	struct delayed_work	work;
+	u64			prev_steal;
+	int			prev_direction;
+	unsigned int		interval_ms;
+	unsigned int		high_threshold;
+	unsigned int		low_threshold;
+	ktime_t			prev_time;
+};
+
+extern struct steal_monitor sm_core_ctx;
+
 #endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 17/24] virt/steal_monitor: Add control knobs for handling steal values
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (15 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 16/24] virt/steal_monitor: Define steal_monitor structure Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 18/24] virt/steal_monitor: Compute work at regular intervals Shrikanth Hegde
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

These are the knobs to control the steal_monitor.

interval_ms:
How often steal monitor checks for steal time.
(Default: 1000 i.e 1 second)

This controls how fast steal monitor driver reacts to changes to
the contention of physical CPUs. Since it does fair amount of
work, setting too low will have overheads. If set to 0, on next
work it will be set to default.

low_threshold:
lower threshold value in percentage * 100.
(Default: 200, i.e 2% steal is considered as low threshold)

This determines what values should be considered as nil/no steal values.
When steal monitor see steal time is below or equal to this value, it
will increase the preferred CPUs by 1 core. Having value as zero
might cause too much oscillations.

high_threshold:
higher threshold value in percentage * 100
(Default: 500, i.e 5% steal is considered as high threshold)

This determines what values should be considered as high steal values.
When steal monitor sees steal time is higher than this value, it will
reduce the preferred CPUs by 1 core.

Also available at: Documentation/driver-api/steal-monitor.rst

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Modified for steal_monitor

 drivers/virt/steal_monitor/sm_core.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index 3feb686dd3c4..b95b37e37a16 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -14,7 +14,23 @@
 
 #include "sm_core.h"
 
-struct steal_monitor sm_core_ctx;
+struct steal_monitor sm_core_ctx = {
+	.interval_ms = 1000,	/* 1 second */
+	.high_threshold = 500,	/* 5% */
+	.low_threshold = 200,	/* 2% */
+};
+
+module_param_named(interval_ms, sm_core_ctx.interval_ms, uint, 0644);
+MODULE_PARM_DESC(interval_ms,
+		 "Sampling frequency for steal values in milliseconds (default: 1000)");
+
+module_param_named(high_threshold, sm_core_ctx.high_threshold, uint, 0644);
+MODULE_PARM_DESC(high_threshold,
+		 "High steal threshold (default: 500 i.e 5%)");
+
+module_param_named(low_threshold, sm_core_ctx.low_threshold, uint, 0644);
+MODULE_PARM_DESC(low_threshold,
+		 "Low steal threshold (default: 200 i.e 2%)");
 
 static int __init steal_monitor_init(void)
 {
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 18/24] virt/steal_monitor: Compute work at regular intervals
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (16 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 17/24] virt/steal_monitor: Add control knobs for handling steal values Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 19/24] virt/steal_monitor: Provide default method to get systemwide steal time Shrikanth Hegde
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Trigger periodic work at intervals specified in interval_ms.
schedule_delayed_work is chosen since this work need not happen at this
instant and this variant is safe w.r.t to CPU hotplug.

Reset the interval_ms to default if one sets it to 0 to avoid workqueue
stalls.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Modified for steal_monitor

 drivers/virt/steal_monitor/sm_core.c | 26 +++++++++++++++++++++++++-
 drivers/virt/steal_monitor/sm_core.h |  2 ++
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index b95b37e37a16..fac8f4d5dac7 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -32,15 +32,39 @@ module_param_named(low_threshold, sm_core_ctx.low_threshold, uint, 0644);
 MODULE_PARM_DESC(low_threshold,
 		 "Low steal threshold (default: 200 i.e 2%)");
 
+static void compute_preferred_cpus_work(struct work_struct *work)
+{
+	/* At least one core is kept as preferred */
+	WARN_ON(cpumask_empty(cpu_preferred_mask));
+
+	/* Warn if interval_ms is set to 0, that might cause lockup. */
+	if (unlikely(sm_core_ctx.interval_ms == 0)) {
+		WARN_ON(1);
+		sm_core_ctx.interval_ms = 1000; /* Fallback to default */
+	}
+
+	/* Trigger for next sampling */
+	schedule_delayed_work(&sm_core_ctx.work,
+			      msecs_to_jiffies(sm_core_ctx.interval_ms));
+}
+
 static int __init steal_monitor_init(void)
 {
-	pr_info("steal_monitor is enabled\n");
+	pr_info("steal_monitor is enabled. interval: %ums, high_threshold: %u, low_threshold: %u\n",
+		sm_core_ctx.interval_ms, sm_core_ctx.high_threshold, sm_core_ctx.low_threshold);
+
+	INIT_DELAYED_WORK(&sm_core_ctx.work, compute_preferred_cpus_work);
+
+	schedule_delayed_work(&sm_core_ctx.work,
+			      msecs_to_jiffies(sm_core_ctx.interval_ms));
+
 	return 0;
 }
 
 static void __exit steal_monitor_exit(void)
 {
 	pr_info("steal_monitor is disabled\n");
+	cancel_delayed_work_sync(&sm_core_ctx.work);
 	cpumask_copy(&__cpu_preferred_mask, cpu_active_mask);
 }
 
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index a4e813319680..d50138ad8c42 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -7,6 +7,8 @@
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/init.h>
+#include <linux/cpumask.h>
+#include <linux/workqueue.h>
 
 struct steal_monitor {
 	struct delayed_work	work;
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 19/24] virt/steal_monitor: Provide default method to get systemwide steal time
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (17 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 18/24] virt/steal_monitor: Compute work at regular intervals Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 20/24] virt/steal_monitor: Provide default method to inc/dec preferred CPUs Shrikanth Hegde
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

steal monitor takes global view of steal time instead of individual
vCPU. For this collect overall steal values across all the vCPUs or
vCPUs of interest.

Default implementation chooses steal time across all active CPUs.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- new patch

 drivers/virt/steal_monitor/Makefile   |  2 +-
 drivers/virt/steal_monitor/defaults.c | 27 +++++++++++++++++++++++++++
 drivers/virt/steal_monitor/sm_core.h  |  2 ++
 3 files changed, 30 insertions(+), 1 deletion(-)
 create mode 100644 drivers/virt/steal_monitor/defaults.c

diff --git a/drivers/virt/steal_monitor/Makefile b/drivers/virt/steal_monitor/Makefile
index 24cee55342ce..7c16f8cf9583 100644
--- a/drivers/virt/steal_monitor/Makefile
+++ b/drivers/virt/steal_monitor/Makefile
@@ -11,4 +11,4 @@
 #
 obj-$(subst y,m,$(CONFIG_PREFERRED_CPU)) += steal_monitor.o
 
-steal_monitor-y := sm_core.o
+steal_monitor-y := sm_core.o defaults.o
diff --git a/drivers/virt/steal_monitor/defaults.c b/drivers/virt/steal_monitor/defaults.c
new file mode 100644
index 000000000000..17f57afacbe6
--- /dev/null
+++ b/drivers/virt/steal_monitor/defaults.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Base file contains the default implementations.
+ * These are defined as __weak so that arch may define
+ * strong symbols to override.
+ *
+ * Copyright (C) 2026 IBM
+ * Author: Shrikanth Hegde <sshegde@linux.ibm.com>
+ */
+#include "sm_core.h"
+
+/*
+ * Compute steal time of the full system.
+ *
+ * Default implementation returns steal time across all active CPUs
+ */
+
+u64 __weak get_system_steal_time(void)
+{
+	int tmp_cpu;
+	u64 total_steal = 0;
+
+	for_each_cpu(tmp_cpu, cpu_active_mask)
+		total_steal += kcpustat_cpu(tmp_cpu).cpustat[CPUTIME_STEAL];
+
+	return total_steal;
+}
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index d50138ad8c42..e09745a2b813 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -9,6 +9,7 @@
 #include <linux/init.h>
 #include <linux/cpumask.h>
 #include <linux/workqueue.h>
+#include <linux/kernel_stat.h>
 
 struct steal_monitor {
 	struct delayed_work	work;
@@ -22,4 +23,5 @@ struct steal_monitor {
 
 extern struct steal_monitor sm_core_ctx;
 
+u64 get_system_steal_time(void);
 #endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 20/24] virt/steal_monitor: Provide default method to inc/dec preferred CPUs
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (18 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 19/24] virt/steal_monitor: Provide default method to get systemwide steal time Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 21/24] virt/steal_monitor: Provide default method to get num of CPUs for steal ratio Shrikanth Hegde
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

These methods will be used by the steal_monitor core in subsequent
patches. Default implementation are likely good enough for most archs.

decrease_preferred_cpus() - Called when there is high steal time. It needs
to decide which CPUs to mark as non-preferred and set that state.
increase_preferred_cpus() - Called when there is low steal time. It needs
to decide which CPUs to mark as preferred and set that state.

Default Implementations:
decrease_preferred_cpus()
- Get the last CPU in cpu_preferred_mask.
- Check if that last CPU belong to first housekeeping core. If so there
  is nothing to do. This helps to keep at least one core as preferred.
  This is to be safe under non-normal cases.
- If it is not first housekeeping core, get its sibling and mark them as
  non-preferred. If they are nohz_full, enable the tick. push mechanism
  relies on sched_tick.

increase_preferred_cpus()
- Get the first active non-preferred CPUs. This likely is the last
  set of CPUs being marked as non-preferred.
- If there is no such CPU, i.e preferred is same as active. Nothing
  todo further.
- If not, get the siblings of that core and mark them as preferred.
  Note that clearing the tick isn't needed as that would be handled via
  sched_can_stop_tick.

Using core instead of individual CPUs give better numbers as SMT is
quite common and some hypervisor such as powerVM does core scheduling.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Modified for steal_monitor

 drivers/virt/steal_monitor/defaults.c | 68 +++++++++++++++++++++++++++
 drivers/virt/steal_monitor/sm_core.h  |  4 ++
 2 files changed, 72 insertions(+)

diff --git a/drivers/virt/steal_monitor/defaults.c b/drivers/virt/steal_monitor/defaults.c
index 17f57afacbe6..90ede838491f 100644
--- a/drivers/virt/steal_monitor/defaults.c
+++ b/drivers/virt/steal_monitor/defaults.c
@@ -25,3 +25,71 @@ u64 __weak get_system_steal_time(void)
 
 	return total_steal;
 }
+
+/*
+ * Default implementation of decrementing the preferred CPUs based on steal
+ * time. This is simple logic and decrease the preferred CPUs by 1 core.
+ * It takes out the last core in the active & preferred.
+ *
+ * Ensure at least one housekeeping core is always kept as preferred
+ *
+ * Could be overwritten by arch specific handling. Arch must ensure
+ * preferred is always subset of active.
+ */
+
+#define get_core_mask(cpu) topology_sibling_cpumask(cpu)
+
+void __weak decrease_preferred_cpus(struct steal_monitor *ctx)
+{
+	int last_cpu, tmp_cpu;
+	int first_hk_cpu;
+
+	guard(cpus_read_lock)();
+
+	last_cpu = cpumask_last(cpu_preferred_mask);
+	first_hk_cpu = cpumask_first_and(housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
+					 cpu_active_mask);
+	/*
+	 * If the core belongs to the first housekeeping CPUs, no action is
+	 * taken. This leaves at least one core preferred always.
+	 * This ensures at least some CPUs are available to run.
+	 */
+	if (cpumask_equal(get_core_mask(last_cpu), get_core_mask(first_hk_cpu)))
+		return;
+
+	/*
+	 * set tick bit for nohz_full CPU to push the task out. Once the tasks
+	 * are pushed out, bit will be cleared if there are no tasks.
+	 */
+
+	for_each_cpu_and(tmp_cpu, get_core_mask(last_cpu), cpu_active_mask) {
+		set_cpu_preferred(tmp_cpu, false);
+		if (tick_nohz_full_cpu(tmp_cpu))
+			tick_nohz_dep_set_cpu(tmp_cpu, TICK_DEP_BIT_SCHED);
+	}
+}
+
+/*
+ * Default implementation of incrementing preferred CPUs based on steal
+ * time. This is simple logic and increases the preferred CPUs by 1 core.
+ * It adds the first core in active & !preferred
+ *
+ * Nothing to do if active == preferred
+ *
+ * Could be overwritten by arch specific handling. Arch must ensure
+ * preferred is subset of active.
+ */
+void __weak increase_preferred_cpus(struct steal_monitor *ctx)
+{
+	int first_cpu, tmp_cpu;
+
+	guard(cpus_read_lock)();
+
+	first_cpu = cpumask_first_andnot(cpu_active_mask, cpu_preferred_mask);
+	/* All CPUs are preferred. Nothing to increase further */
+	if (first_cpu >= nr_cpu_ids)
+		return;
+
+	for_each_cpu_and(tmp_cpu, get_core_mask(first_cpu), cpu_active_mask)
+		set_cpu_preferred(tmp_cpu, true);
+}
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index e09745a2b813..1857d6a9a295 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -10,6 +10,8 @@
 #include <linux/cpumask.h>
 #include <linux/workqueue.h>
 #include <linux/kernel_stat.h>
+#include <linux/tick.h>
+#include <linux/sched/isolation.h>
 
 struct steal_monitor {
 	struct delayed_work	work;
@@ -24,4 +26,6 @@ struct steal_monitor {
 extern struct steal_monitor sm_core_ctx;
 
 u64 get_system_steal_time(void);
+void increase_preferred_cpus(struct steal_monitor *ctx);
+void decrease_preferred_cpus(struct steal_monitor *ctx);
 #endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 21/24] virt/steal_monitor: Provide default method to get num of CPUs for steal ratio
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (19 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 20/24] virt/steal_monitor: Provide default method to inc/dec preferred CPUs Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 22/24] virt/steal_monitor: Act on steal values at regular intervals Shrikanth Hegde
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

This method informs the steal_monitor core, how many CPUs it needs to
consider for steal ratio calculations.
steal_ratio = (delta_steal * 100 * 100) / (delta_ns * number_of_cpus);

Default method returns number of Active CPUs since it calculates steal
time across active CPUs.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- new patch

 drivers/virt/steal_monitor/defaults.c | 10 ++++++++++
 drivers/virt/steal_monitor/sm_core.h  |  1 +
 2 files changed, 11 insertions(+)

diff --git a/drivers/virt/steal_monitor/defaults.c b/drivers/virt/steal_monitor/defaults.c
index 90ede838491f..3232fa8f4032 100644
--- a/drivers/virt/steal_monitor/defaults.c
+++ b/drivers/virt/steal_monitor/defaults.c
@@ -26,6 +26,16 @@ u64 __weak get_system_steal_time(void)
 	return total_steal;
 }
 
+/*
+ * Return number of CPUs to consider for steal ratio calculation
+ *
+ * Default returns number of active CPUs.
+ */
+unsigned int __weak get_num_cpus_steal_ratio(void)
+{
+	return num_active_cpus();
+}
+
 /*
  * Default implementation of decrementing the preferred CPUs based on steal
  * time. This is simple logic and decrease the preferred CPUs by 1 core.
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index 1857d6a9a295..17732c8bc136 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -26,6 +26,7 @@ struct steal_monitor {
 extern struct steal_monitor sm_core_ctx;
 
 u64 get_system_steal_time(void);
+unsigned int get_num_cpus_steal_ratio(void);
 void increase_preferred_cpus(struct steal_monitor *ctx);
 void decrease_preferred_cpus(struct steal_monitor *ctx);
 #endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 22/24] virt/steal_monitor: Act on steal values at regular intervals
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (20 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 21/24] virt/steal_monitor: Provide default method to get num of CPUs for steal ratio Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 23/24] virt/steal_monitor: Add direction control Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 24/24] virt/steal_monitor: Add design check of preferred subset of active Shrikanth Hegde
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

This is the steal_monitor core functionality done in periodic work

- Calculate the steal_ratio. It is multiplied by 100 to consider the
  fractional values of steal time. I.e 10 means 0.1% steal time.
- If steal value is higher than high threshold, call the method to reduce
  the preferred CPUs.
- If steal value is lower or equal to low threshold, call the method to
  increase the preferred CPUs.
- If the steal value is in between, no action is taken.
- Save the values for next delta calculations.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Modified for steal_monitor

 drivers/virt/steal_monitor/sm_core.c | 29 ++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index fac8f4d5dac7..641488a5a3b5 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -34,6 +34,33 @@ MODULE_PARM_DESC(low_threshold,
 
 static void compute_preferred_cpus_work(struct work_struct *work)
 {
+	u64 curr_steal, delta_steal, delta_ns, steal_ratio;
+	ktime_t now;
+
+	curr_steal = get_system_steal_time();
+	now = ktime_get();
+
+	/* get the deltas */
+	delta_steal = curr_steal > sm_core_ctx.prev_steal ?
+		      curr_steal - sm_core_ctx.prev_steal : 0;
+	delta_ns = max_t(u64, ktime_to_ns(ktime_sub(now, sm_core_ctx.prev_time)), 1);
+
+	/* Update for next calculation */
+	sm_core_ctx.prev_steal = curr_steal;
+	sm_core_ctx.prev_time = now;
+
+	/* Multiply by 100 to consider the fractional values of steal time */
+	steal_ratio = (delta_steal * 100 * 100) /
+		      (delta_ns * get_num_cpus_steal_ratio());
+
+	/* If the steal time values are high, reduce preferred CPUs */
+	if (steal_ratio > sm_core_ctx.high_threshold)
+		decrease_preferred_cpus(&sm_core_ctx);
+
+	/* If the steal time values are low, increase preferred CPUs */
+	if (steal_ratio <= sm_core_ctx.low_threshold)
+		increase_preferred_cpus(&sm_core_ctx);
+
 	/* At least one core is kept as preferred */
 	WARN_ON(cpumask_empty(cpu_preferred_mask));
 
@@ -54,6 +81,8 @@ static int __init steal_monitor_init(void)
 		sm_core_ctx.interval_ms, sm_core_ctx.high_threshold, sm_core_ctx.low_threshold);
 
 	INIT_DELAYED_WORK(&sm_core_ctx.work, compute_preferred_cpus_work);
+	sm_core_ctx.prev_steal = get_system_steal_time();
+	sm_core_ctx.prev_time = ktime_get();
 
 	schedule_delayed_work(&sm_core_ctx.work,
 			      msecs_to_jiffies(sm_core_ctx.interval_ms));
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 23/24] virt/steal_monitor: Add direction control
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (21 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 22/24] virt/steal_monitor: Act on steal values at regular intervals Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  2026-06-25 12:46 ` [PATCH v5 24/24] virt/steal_monitor: Add design check of preferred subset of active Shrikanth Hegde
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Cache the previous direction on steal time. So two consecutive values of
high values or low values are taken for decrease/increase of preferred
CPUs. This helps to avoid oscillations.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Modified for steal_monitor

 drivers/virt/steal_monitor/sm_core.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index 641488a5a3b5..f5b0e568eb32 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -20,6 +20,12 @@ struct steal_monitor sm_core_ctx = {
 	.low_threshold = 200,	/* 2% */
 };
 
+enum sm_direction {
+	SM_DIR_INCREASE = -1,
+	SM_DIR_NONE	=  0,
+	SM_DIR_DECREASE	=  1,
+};
+
 module_param_named(interval_ms, sm_core_ctx.interval_ms, uint, 0644);
 MODULE_PARM_DESC(interval_ms,
 		 "Sampling frequency for steal values in milliseconds (default: 1000)");
@@ -54,13 +60,23 @@ static void compute_preferred_cpus_work(struct work_struct *work)
 		      (delta_ns * get_num_cpus_steal_ratio());
 
 	/* If the steal time values are high, reduce preferred CPUs */
-	if (steal_ratio > sm_core_ctx.high_threshold)
+	if (sm_core_ctx.prev_direction == SM_DIR_DECREASE &&
+	    steal_ratio > sm_core_ctx.high_threshold)
 		decrease_preferred_cpus(&sm_core_ctx);
 
 	/* If the steal time values are low, increase preferred CPUs */
-	if (steal_ratio <= sm_core_ctx.low_threshold)
+	if (sm_core_ctx.prev_direction == SM_DIR_INCREASE &&
+	    steal_ratio <= sm_core_ctx.low_threshold)
 		increase_preferred_cpus(&sm_core_ctx);
 
+	/* mark the direction. This helps to avoid ping-pongs */
+	if (steal_ratio > sm_core_ctx.high_threshold)
+		sm_core_ctx.prev_direction = SM_DIR_DECREASE;
+	else if (steal_ratio <= sm_core_ctx.low_threshold)
+		sm_core_ctx.prev_direction = SM_DIR_INCREASE;
+	else
+		sm_core_ctx.prev_direction = SM_DIR_NONE;
+
 	/* At least one core is kept as preferred */
 	WARN_ON(cpumask_empty(cpu_preferred_mask));
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 24/24] virt/steal_monitor: Add design check of preferred subset of active
  2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (22 preceding siblings ...)
  2026-06-25 12:46 ` [PATCH v5 23/24] virt/steal_monitor: Add direction control Shrikanth Hegde
@ 2026-06-25 12:46 ` Shrikanth Hegde
  23 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

One of the main design construct that CONFIG_PREFERRED_CPU maintains is
that preferred is always subset of active. Force that in any future arch
specific implementations.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- new patch

 drivers/virt/steal_monitor/sm_core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index f5b0e568eb32..82beb2b94083 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -80,6 +80,9 @@ static void compute_preferred_cpus_work(struct work_struct *work)
 	/* At least one core is kept as preferred */
 	WARN_ON(cpumask_empty(cpu_preferred_mask));
 
+	/* Maintain design construct */
+	WARN_ON(!cpumask_subset(cpu_preferred_mask, cpu_active_mask));
+
 	/* Warn if interval_ms is set to 0, that might cause lockup. */
 	if (unlikely(sm_core_ctx.interval_ms == 0)) {
 		WARN_ON(1);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2026-06-25 12:50 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-25 12:46 [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 01/24] sched/debug: Remove unused schedstats Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 02/24] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 03/24] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 04/24] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 05/24] sysfs: Add preferred CPU file Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 06/24] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 07/24] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 08/24] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 09/24] sched/fair: Pull the load on preferred CPU Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 10/24] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 11/24] sched/core: Push current task from non preferred CPU Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 12/24] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 13/24] virt/steal_monitor: Add documentation Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 14/24] virt: Introduce steal monitor driver Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 15/24] virt/steal_monitor: Restore to active on module disable Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 16/24] virt/steal_monitor: Define steal_monitor structure Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 17/24] virt/steal_monitor: Add control knobs for handling steal values Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 18/24] virt/steal_monitor: Compute work at regular intervals Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 19/24] virt/steal_monitor: Provide default method to get systemwide steal time Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 20/24] virt/steal_monitor: Provide default method to inc/dec preferred CPUs Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 21/24] virt/steal_monitor: Provide default method to get num of CPUs for steal ratio Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 22/24] virt/steal_monitor: Act on steal values at regular intervals Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 23/24] virt/steal_monitor: Add direction control Shrikanth Hegde
2026-06-25 12:46 ` [PATCH v5 24/24] virt/steal_monitor: Add design check of preferred subset of active Shrikanth Hegde

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox