* [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff
@ 2026-07-01 14:16 Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 01/23] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
                   ` (22 more replies)
  0 siblings, 23 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Very briefly:
- Maintain a set of CPUs which can be used by the workload. It is denoted
  as cpu_preferred_mask.
- Periodically compute the steal time. If steal time is above/below the
  thresholds, reduce/increase the preferred CPUs respectively. This is
  handled in a new driver called steal_monitor.
- If a CPU is marked as non-preferred, push the task running on it if
  possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
  within preferred CPUs.
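
To make the threshold step above concrete, here is a minimal userspace
sketch of one possible grow/shrink policy. It is not the steal_monitor
code: struct steal_policy, its fields and next_preferred() are
hypothetical names chosen for illustration, and the fixed step size is an
assumption.

```c
#include <assert.h>

/* Hypothetical policy knobs; the names and units are illustrative only. */
struct steal_policy {
	unsigned int high_pct;	/* shrink the preferred set above this steal % */
	unsigned int low_pct;	/* grow the preferred set below this steal % */
	unsigned int step;	/* CPUs added or removed per sampling period */
};

/*
 * Return the next preferred-CPU count given the current count, the number
 * of active CPUs and the steal percentage seen in the last period.
 * The result always stays within [1, nr_active].
 */
static unsigned int next_preferred(const struct steal_policy *pol,
				   unsigned int nr_preferred,
				   unsigned int nr_active,
				   unsigned int steal_pct)
{
	assert(nr_active >= 1 && nr_preferred >= 1 && nr_preferred <= nr_active);
	assert(pol->low_pct <= pol->high_pct && pol->step >= 1);

	if (steal_pct > pol->high_pct) {
		/* Contention: back off, but never below one CPU. */
		if (nr_preferred > pol->step)
			return nr_preferred - pol->step;
		return 1;
	}
	if (steal_pct < pol->low_pct) {
		/* Contention gone: grow back towards the active set. */
		if (nr_active - nr_preferred > pol->step)
			return nr_preferred + pol->step;
		return nr_active;
	}
	return nr_preferred;	/* within the band: hold steady */
}
```

The gap between the two thresholds acts as a hysteresis band, so a steal
value hovering around a single threshold does not make the preferred set
flap every period.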

For more details on the idea, problem statement and performance numbers,
please refer to the cover letter of v2 [2] and the OSPM talk [1].

*** Please review and provide your feedback!! ***

[1] OSPM talk: https://youtu.be/adxUKFPlOp0
[2] v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@linux.ibm.com/#t
[3] v5: https://lore.kernel.org/all/20260625124648.802832-1-sshegde@linux.ibm.com/

Thank you very much for the feedback so far. It has helped the code
evolve towards clear abstraction layers and get simplified
(hopefully). Apologies in advance if I have missed addressing any
comments. If so, it would be purely accidental, not in any way intentional.

base commit:
tip/sched/core at 'commit b2463ebf2674 ("sched/debug: Remove unused schedstats")'

v5->v6:
- Drop 1st patch. It is already in sched/core. Thanks Peter for picking
  it up.
- Make cpu_preferred_mask EXPORT_SYMBOL_GPL (Peter Zijlstra)
- Make set_cpu_preferred a NOP when CONFIG_PREFERRED_CPU=n, and still
  keep the assign_cpu macro for the =y case. (Peter Zijlstra and Yury Norov)
- Drop the optimization of caching the preferred state in
  select_fallback_rq. Initially I had thought of splitting v5's patch into
  two. Later I found that the cached value exposes a race where task
  affinity could get reset if the mask changed after the value was
  cached.
- Drop wakeup patch (Peter Zijlstra).
  No performance degradation seen.
  If the CPU is non-preferred, select_fallback_rq gets called in the wakeup
  path. Additional checks in available_idle_cpu are not necessary. Hence
  the drop.
- Address CPU hotplug related concerns of accessing active_mask in
  steal monitor driver code (sashiko)
- Address concerns over u64 overflow (sashiko)
- Make decrease_preferred_cpus work correctly if
  nohz_full=<last_set_of_CPUs>. Don't assume the
  housekeeping core is always at the beginning (sashiko)
- Add an optimization for the common case where nohz_full=<empty>
- Fixed a few documentation nits (Randy Dunlap)
- Fixed "this patch" reference in changelogs (Peter Zijlstra)

Let me know if any critical information is missing regarding the
new driver, such as policy, documentation or missing
implementation. I have ensured checkpatch --strict is happy.

As mentioned in the cover letter of v5 [3], I am looking for guidance
on the concern below.
I think there should be a MAINTAINERS file entry for the new
driver. I don't see a drivers/virt/* entry.
It could be either a new entry for the driver or a few files added to the
SCHEDULER entry. Let me know whether and what I should add.

Shrikanth Hegde (23):
  sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  kconfig: Provide PREFERRED_CPU option
  cpumask: Introduce cpu_preferred_mask
  sysfs: Add preferred CPU file
  sched/core: Try to use a preferred CPU in is_cpu_allowed
  sched/fair: Load balance only among preferred CPUs
  sched/fair: Pull the load on preferred CPU
  sched/core: Keep tick on non-preferred CPUs until tasks are out
  sched/core: Push current task from non preferred CPU
  sched/debug: Add migration stats due to non preferred CPUs
  virt/steal_monitor: Add documentation
  virt: Introduce steal monitor driver
  virt/steal_monitor: Restore to active on module disable
  virt/steal_monitor: Define steal_monitor structure
  virt/steal_monitor: Add control knobs for handling steal values
  virt/steal_monitor: Compute work at regular intervals
  virt/steal_monitor: Provide default method to get systemwide steal
    time
  virt/steal_monitor: Provide default method to inc/dec preferred CPUs
  virt/steal_monitor: Provide default method to get num of CPUs for
    steal ratio
  virt/steal_monitor: Act on steal values at regular intervals
  virt/steal_monitor: Add direction control
  virt/steal_monitor: Add design check of preferred subset of active
  virt/steal_monitor: Optimise decrease_preferred_cpus when all CPUs are
    housekeeping

 .../ABI/testing/sysfs-devices-system-cpu      |  11 ++
 Documentation/driver-api/index.rst            |   1 +
 Documentation/driver-api/steal-monitor.rst    |  99 +++++++++++++
 Documentation/scheduler/sched-arch.rst        |  56 ++++++++
 drivers/base/cpu.c                            |   8 ++
 drivers/virt/Makefile                         |   1 +
 drivers/virt/steal_monitor/Makefile           |  14 ++
 drivers/virt/steal_monitor/defaults.c         | 129 +++++++++++++++++
 drivers/virt/steal_monitor/sm_core.c          | 130 ++++++++++++++++++
 drivers/virt/steal_monitor/sm_core.h          |  33 +++++
 include/linux/cpumask.h                       |  27 +++-
 include/linux/sched.h                         |   1 +
 kernel/Kconfig.preempt                        |  14 ++
 kernel/cpu.c                                  |   6 +
 kernel/sched/core.c                           | 127 ++++++++++++++++-
 kernel/sched/debug.c                          |   1 +
 kernel/sched/fair.c                           |  12 +-
 kernel/sched/sched.h                          |  17 +++
 18 files changed, 682 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/driver-api/steal-monitor.rst
 create mode 100644 drivers/virt/steal_monitor/Makefile
 create mode 100644 drivers/virt/steal_monitor/defaults.c
 create mode 100644 drivers/virt/steal_monitor/sm_core.c
 create mode 100644 drivers/virt/steal_monitor/sm_core.h

-- 
2.47.3



* [PATCH v6 01/23] sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 02/23] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	kernel test robot

Add documentation for the new cpumask called cpu_preferred_mask. This
should help users understand what this mask is and the concept behind it.

Document how to enable it and its implementation aspects.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202606180717.yNM0yb41-lkp@intel.com/
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v5->v6:
- Add block on co-operative scheme

 Documentation/scheduler/sched-arch.rst | 56 ++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..a8612d55b9fa 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,62 @@ Your cpu_idle routines need to obey the following rules:
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
 
+Preferred CPUs
+==============
+
+In virtualised environments it is possible to overcommit CPU resources,
+i.e. the sum of virtual CPUs (vCPUs) of all VMs is greater than the number of
+physical CPUs (pCPUs). Under such conditions, when all or many VMs have high
+utilization, the hypervisor cannot satisfy the CPU requirement and has to
+context switch within or across VMs, i.e. the hypervisor needs to preempt one
+vCPU to run another. This is called vCPU preemption. It is more expensive than
+a task context switch within a vCPU.
+
+In such cases it is better to reduce the combined vCPU demand from all VMs
+by not using some of the vCPUs in each VM. vCPUs on which the workload can be
+safely scheduled without increasing contention for pCPUs are called
+"Preferred CPUs".
+
+The main design construct is that preferred CPUs are always a subset of active
+CPUs. In most cases preferred CPUs are the same as active CPUs. When there is
+pCPU contention, preferred CPUs are reduced based on the amount of steal time.
+When the pCPU contention goes away, as indicated by steal time, preferred CPUs
+become the same as active CPUs again. This is done by loading the
+steal_monitor driver available at drivers/virt/steal_monitor.
+
+Scheduling decisions such as wakeup, pushing the task etc. need this
+CPU state info. It is maintained in cpu_preferred_mask.
+vCPUs which are not in cpu_preferred_mask should be treated as vCPUs which
+should not be used at this moment, provided it doesn't break user affinity.
+
+This is achieved by:
+1. Selecting a preferred CPU at wakeup.
+2. Pushing the task away from a non-preferred CPU at tick.
+3. Selecting only preferred CPUs for load balance.
+
+/sys/devices/system/cpu/preferred prints the current cpu_preferred_mask in
+cpulist format.
+
+Notes:
+1. This feature is available under CONFIG_PREFERRED_CPU, which also builds
+   the steal_monitor driver. Once the driver is enabled, CPU preferred state
+   can change based on steal time. With CONFIG_PREFERRED_CPU=n,
+   preferred CPUs are the same as active CPUs.
+
+2. This feature works for FAIR class only.
+
+3. A pinned task which can't be moved to preferred CPUs will continue
+   to run based on its affinity, but no load balancing happens for it.
+
+4. The decision to use or not use a CPU is driven by the kernel, hence it
+   shouldn't break user affinities. This is one of the main reasons why CPU
+   hotplug or isolated cpuset partitions were not a solution.
+
+5. This feature works best only when all the VMs enable it, as it is a
+   co-operative scheme. If a specific VM doesn't enable this feature,
+   it may end up with more CPUs than others; this should still lead to better
+   performance when seen from a system view.
+   Whoever enables this driver has to ensure it is enabled in all VMs.
 
 Possible arch/ problems
 =======================
-- 
2.47.3



* [PATCH v6 02/23] kconfig: Provide PREFERRED_CPU option
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 01/23] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 03/23] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Introduce a new config named PREFERRED_CPU.

This helps to:
- Avoid the code bloat when PREFERRED_CPU=n. In that case preferred
  is the same as active.
- Avoid the ifdeffery around PREFERRED_CPU in many files.

Since the paravirtualized use case is the main driving force of this
feature, make it default for kernels with PARAVIRT=y.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/Kconfig.preempt | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 88c594c6d7fc..b3a543cb44cd 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -192,3 +192,17 @@ config SCHED_CLASS_EXT
 	  For more information:
 	    Documentation/scheduler/sched-ext.rst
 	    https://github.com/sched-ext/scx
+
+config PREFERRED_CPU
+	bool "Dynamic vCPU management based on steal time"
+	depends on PARAVIRT && SMP
+	default y
+	help
+	  This feature helps to reduce the steal time in paravirtualised
+	  environments, thereby reducing vCPU preemption. Reducing vCPU
+	  preemption alleviates lock holder preemption and reduces the
+	  cost of vCPU preemption in the host.
+
+	  By default preferred CPUs will be same as active CPUs. Depending
+	  on the steal time when steal_monitor driver is enabled,
+	  preferred CPUs could become subset of active CPUs.
-- 
2.47.3



* [PATCH v6 03/23] cpumask: Introduce cpu_preferred_mask
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 01/23] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 02/23] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 15:35   ` Yury Norov
  2026-07-01 14:16 ` [PATCH v6 04/23] sysfs: Add preferred CPU file Shrikanth Hegde
                   ` (19 subsequent siblings)
  22 siblings, 1 reply; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Provide cpu_preferred_mask infrastructure. Define get/set macros
which can be used to get/set the preferred state of a CPU.

Values are set/cleared by the new driver called steal_monitor.
It periodically samples the steal time and decides the preferred CPU state.

A CPU is set to preferred when it becomes active. Later it may be
marked as non-preferred depending on steal time values, if
steal_monitor is enabled.

Always maintain the design construct that preferred is a subset of active,
i.e. preferred ⊆ active ⊆ online ⊆ present ⊆ possible

With PREFERRED_CPU=n, set is a no-op and get returns the active state.
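
The subset rule can be illustrated with a small userspace model. This is
only a sketch under assumptions: the masks are plain 64-bit words, all
model_* names are hypothetical, and refusing to mark an inactive CPU as
preferred is an illustrative guard rather than something the kernel
macros do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one bit per CPU, at most 64 CPUs. Not kernel code. */
struct cpu_state {
	uint64_t active;
	uint64_t preferred;
};

static uint64_t cpu_bit(unsigned int cpu)
{
	assert(cpu < 64);
	return UINT64_C(1) << cpu;
}

/* Loosely follows sched_cpu_activate(): an active CPU starts as preferred. */
static void model_activate(struct cpu_state *st, unsigned int cpu)
{
	st->active |= cpu_bit(cpu);
	st->preferred |= cpu_bit(cpu);
}

/* Loosely follows sched_cpu_deactivate(): preferred is dropped as well. */
static void model_deactivate(struct cpu_state *st, unsigned int cpu)
{
	st->preferred &= ~cpu_bit(cpu);
	st->active &= ~cpu_bit(cpu);
}

/*
 * Hypothetical steal-monitor request. Clearing is always safe; setting is
 * refused for a CPU that is not active, so the invariant cannot be broken.
 */
static bool model_set_preferred(struct cpu_state *st, unsigned int cpu, bool on)
{
	uint64_t bit = cpu_bit(cpu);

	if (on && !(st->active & bit))
		return false;
	if (on)
		st->preferred |= bit;
	else
		st->preferred &= ~bit;
	return true;
}

/* preferred is a subset of active iff no preferred bit lies outside active. */
static bool model_invariant_holds(const struct cpu_state *st)
{
	return (st->preferred & ~st->active) == 0;
}
```

Checking model_invariant_holds() after every state change is the
userspace analogue of a design check that preferred stays within active.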

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v5->v6:
- Make it nop for PREFERRED_CPU=n
- Make it EXPORT_SYMBOL_GPL

 include/linux/cpumask.h | 27 ++++++++++++++++++++++++++-
 kernel/cpu.c            |  6 ++++++
 kernel/sched/core.c     |  5 +++++
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index d3cda0544954..c97271c063ce 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -122,12 +122,20 @@ extern struct cpumask __cpu_enabled_mask;
 extern struct cpumask __cpu_present_mask;
 extern struct cpumask __cpu_active_mask;
 extern struct cpumask __cpu_dying_mask;
+
+#ifdef CONFIG_PREFERRED_CPU
+extern struct cpumask __cpu_preferred_mask;
+#else
+#define __cpu_preferred_mask __cpu_active_mask
+#endif
+
 #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
 #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
 #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
 #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
 #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
 #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
+#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
 
 extern atomic_t __num_online_cpus;
 extern unsigned int __num_possible_cpus;
@@ -1164,6 +1172,13 @@ void init_cpu_possible(const struct cpumask *src);
 #define set_cpu_active(cpu, active)	assign_cpu((cpu), &__cpu_active_mask, (active))
 #define set_cpu_dying(cpu, dying)	assign_cpu((cpu), &__cpu_dying_mask, (dying))
 
+#ifdef CONFIG_PREFERRED_CPU
+#define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
+#else
+/* Don't edit active state when the feature is off */
+#define set_cpu_preferred(cpu, preferred) do { } while (0)
+#endif
+
 void set_cpu_online(unsigned int cpu, bool online);
 void set_cpu_possible(unsigned int cpu, bool possible);
 
@@ -1258,7 +1273,12 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return cpumask_test_cpu(cpu, cpu_dying_mask);
 }
 
-#else
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+	return cpumask_test_cpu(cpu, cpu_preferred_mask);
+}
+
+#else	/* NR_CPUS <= 1 */
 
 #define num_online_cpus()	1U
 #define num_possible_cpus()	1U
@@ -1296,6 +1316,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return false;
 }
 
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+	return cpu == 0;
+}
+
 #endif /* NR_CPUS > 1 */
 
 #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
diff --git a/kernel/cpu.c b/kernel/cpu.c
index b3c8553d7bd6..376d297a6292 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3103,6 +3103,11 @@ EXPORT_SYMBOL(__cpu_dying_mask);
 atomic_t __num_online_cpus __read_mostly;
 EXPORT_SYMBOL(__num_online_cpus);
 
+#ifdef CONFIG_PREFERRED_CPU
+struct cpumask __cpu_preferred_mask __read_mostly;
+EXPORT_SYMBOL_GPL(__cpu_preferred_mask);
+#endif
+
 void init_cpu_present(const struct cpumask *src)
 {
 	cpumask_copy(&__cpu_present_mask, src);
@@ -3160,6 +3165,7 @@ void __init boot_cpu_init(void)
 	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
 	set_cpu_online(cpu, true);
 	set_cpu_active(cpu, true);
+	set_cpu_preferred(cpu, true);
 	set_cpu_present(cpu, true);
 	set_cpu_possible(cpu, true);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e7cde033a31..a45f7c308329 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8690,6 +8690,9 @@ int sched_cpu_activate(unsigned int cpu)
 	 */
 	sched_set_rq_online(rq, cpu);
 
+	/* preferred is subset of active and follows its state */
+	set_cpu_preferred(cpu, true);
+
 	return 0;
 }
 
@@ -8703,6 +8706,8 @@ int sched_cpu_deactivate(unsigned int cpu)
 	if (ret)
 		return ret;
 
+	set_cpu_preferred(cpu, false);
+
 	/*
 	 * Remove CPU from nohz.idle_cpus_mask to prevent participating in
 	 * load balancing when not active
-- 
2.47.3



* [PATCH v6 04/23] sysfs: Add preferred CPU file
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (2 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 03/23] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 05/23] sched/core: Try to use a preferred CPU in is_cpu_allowed Shrikanth Hegde
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Add a "preferred" file in /sys/devices/system/cpu.

This offers:
- Users can quickly check which CPUs are marked as preferred at this
  moment.
- Userspace tools such as irqbalance could use this mask to steer IRQs to
  preferred CPUs.

For example:
cat /sys/devices/system/cpu/online
0-719
cat /sys/devices/system/cpu/preferred
0-599        <<< Implies 0-599 are preferred for workloads and 600-719
                 should be avoided at this moment.

cat /sys/devices/system/cpu/preferred
0-719        <<< All CPUs are usable. There is no preference.
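
As a rough illustration of how a userspace tool might consume this file,
the sketch below counts the CPUs in a cpulist string such as "0-599".
cpulist_count() and cpulist_file_count() are hypothetical helper names,
and the 4096-byte buffer is an assumption that may be too small for very
fragmented masks.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Count CPUs in a cpulist string such as "0-3,8-11". Returns -1 on error. */
static int cpulist_count(const char *s)
{
	int total = 0;

	assert(s != NULL);
	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s || lo < 0)
			return -1;
		if (*end == '-') {
			const char *p = end + 1;

			hi = strtol(p, &end, 10);
			if (end == p || hi < lo)
				return -1;
		}
		total += (int)(hi - lo + 1);
		if (*end == ',')
			end++;		/* move on to the next range */
		else if (*end != '\0' && *end != '\n')
			return -1;	/* unexpected character */
		s = end;
	}
	return total;
}

/* Read a sysfs cpulist file and return the number of CPUs in it, or -1. */
static int cpulist_file_count(const char *path)
{
	char buf[4096];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = cpulist_count(buf);
	fclose(f);
	return ret;
}
```

Comparing cpulist_file_count("/sys/devices/system/cpu/preferred") with
the same call on ".../online" tells a tool how many CPUs are currently
being avoided.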

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 11 +++++++++++
 drivers/base/cpu.c                                 |  8 ++++++++
 2 files changed, 19 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 82d10d556cc8..ac1dbb209cc7 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -806,3 +806,14 @@ Date:		Nov 2022
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:
 		(RO) the list of CPUs that can be brought online.
+
+What:		/sys/devices/system/cpu/preferred
+Date:		July 2026
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		(RO) the list of preferred CPUs at this moment.
+		These are the only CPUs meant to be used at the moment.
+		Using a CPU outside of the list could lead to more
+		contention of underlying physical CPU resource. Dynamically
+		changes based on steal time. With CONFIG_PREFERRED_CPU=n it
+		is same as active CPUs. See sched-arch.rst for more details.
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 19d288a3c80c..4ac990efee7c 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -391,6 +391,13 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
 }
 #endif
 
+static ssize_t preferred_show(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_preferred_mask));
+}
+static DEVICE_ATTR_RO(preferred);
+
 const struct bus_type cpu_subsys = {
 	.name = "cpu",
 	.dev_name = "cpu",
@@ -532,6 +539,7 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
+	&dev_attr_preferred.attr,
 	NULL
 };
 
-- 
2.47.3



* [PATCH v6 05/23] sched/core: Try to use a preferred CPU in is_cpu_allowed
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (3 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 04/23] sysfs: Add preferred CPU file Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 16:09   ` Yury Norov
  2026-07-01 14:16 ` [PATCH v6 06/23] sched/fair: Load balance only among preferred CPUs Shrikanth Hegde
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

When possible, pick a preferred CPU.

The push task mechanism uses the stopper thread, which is going to call
select_fallback_rq and use this mechanism to pick only a preferred CPU.

When a task is affined only to non-preferred CPUs it should continue to
run there. Detect that by checking whether cpus_ptr and cpu_preferred_mask
intersect.

This takes care of wakeup path optimization for FAIR tasks.
is_cpu_allowed is called to ensure wakeup happens on preferred CPUs.
With that, additional checks in available_idle_cpu are not necessary.

Add a comment on the rare case of O(N**2) in select_fallback_rq.
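
The decision described above can be modelled in userspace as follows.
This is a simplified sketch, not the kernel function: masks are 64-bit
words, the model_* names and struct fields are hypothetical, and the
migrate-disable, per-CPU kthread and dying-CPU checks are deliberately
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for kernel state; one bit per CPU, at most 64 CPUs. */
struct model_masks {
	uint64_t online;
	uint64_t active;
	uint64_t preferred;
};

struct model_task {
	uint64_t cpus_allowed;	/* stands in for p->cpus_ptr */
	bool is_fair;		/* stands in for the FAIR class check */
	bool is_kthread;	/* stands in for p->flags & PF_KTHREAD */
};

static bool test_cpu(uint64_t mask, unsigned int cpu)
{
	assert(cpu < 64);
	return (mask >> cpu) & 1;
}

/* Only FAIR tasks honor preferred state; others are never redirected. */
static bool model_task_has_preferred_cpus(const struct model_masks *m,
					  const struct model_task *p)
{
	if (!p->is_fair)
		return false;
	return (p->cpus_allowed & m->preferred) != 0;
}

/* Simplified is_cpu_allowed(); see the lead-in for what is omitted. */
static bool model_is_cpu_allowed(const struct model_masks *m,
				 const struct model_task *p, unsigned int cpu)
{
	bool has_better;

	if (!test_cpu(p->cpus_allowed, cpu))
		return false;

	/* Reject a non-preferred CPU only if the task has a preferred one. */
	has_better = !test_cpu(m->preferred, cpu) &&
		     model_task_has_preferred_cpus(m, p);
	if (has_better)
		return false;

	/* User tasks need an active CPU, kernel threads an online one. */
	return p->is_kthread ? test_cpu(m->online, cpu) :
			       test_cpu(m->active, cpu);
}
```

With preferred = {0,1} and active = {0-3}, a FAIR task allowed on all
four CPUs is steered away from CPU 2, while a task pinned to {2,3} is
still allowed there because it has no preferred CPU to fall back to.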

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v5->v6:
- Drop optimization for select_fallback_rq
- Keep comment on N**2

 kernel/sched/core.c  | 29 ++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  9 +++++++++
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a45f7c308329..1fb1c17e8387 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2500,6 +2500,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
  */
 static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 {
+	bool task_has_preferred_cpu;
+
 	/* When not in the task's cpumask, no point in looking further. */
 	if (!task_allowed_on_cpu(p, cpu))
 		return false;
@@ -2508,9 +2510,30 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (is_migration_disabled(p))
 		return cpu_online(cpu);
 
+	/*
+	 * This is essential to maintain user affinities when preferred
+	 * CPUs change. A task pinned on non-preferred CPU should continue
+	 * to run there, since this is non-user triggered.
+	 *
+	 * If CPU is non-preferred and task can run on other CPUs which are
+	 * currently preferred, then choose those other CPUs instead.
+	 * Overhead is minimal when CPU is preferred.
+	 *
+	 * For majority of the cases this would still keep select_fallback_rq
+	 * as O(N). task_has_preferred_cpus, which is O(N), is called only if
+	 * !cpu_preferred. The task running there is then expected to move out,
+	 * so subsequently it should run on a preferred CPU. This becomes O(N**2)
+	 * only for tasks pinned only to non-preferred CPUs. That is a rare case.
+	 */
+	task_has_preferred_cpu = !cpu_preferred(cpu) &&
+				 task_has_preferred_cpus(p);
+
 	/* Non kernel threads are not allowed during either online or offline. */
-	if (!(p->flags & PF_KTHREAD))
+	if (!(p->flags & PF_KTHREAD)) {
+		if (task_has_preferred_cpu)
+			return false;
 		return cpu_active(cpu);
+	}
 
 	/* KTHREAD_IS_PER_CPU is always allowed. */
 	if (kthread_is_per_cpu(p))
@@ -2520,6 +2543,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (cpu_dying(cpu))
 		return false;
 
+	/* Try on preferred CPU first if possible */
+	if (task_has_preferred_cpu)
+		return false;
+
 	/* But are allowed during online. */
 	return cpu_online(cpu);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 26ae13c86b69..36ae20310891 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4230,4 +4230,13 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
 
 #include "ext/ext.h"
 
+static inline bool task_has_preferred_cpus(struct task_struct *p)
+{
+	/* Only FAIR tasks honor preferred CPU state */
+	if (unlikely(p->sched_class != &fair_sched_class))
+		return false;
+
+	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
+}
+
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3



* [PATCH v6 06/23] sched/fair: Load balance only among preferred CPUs
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (4 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 05/23] sched/core: Try to use a preferred CPU in is_cpu_allowed Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 16:19   ` Yury Norov
  2026-07-01 14:16 ` [PATCH v6 07/23] sched/fair: Pull the load on preferred CPU Shrikanth Hegde
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Consider only preferred CPUs for load balance.

With this, load balance will end up choosing a preferred CPU for pull.
This keeps it from fighting against the push task mechanism which happens
at tick. Also, this stops active balance from happening on a non-preferred
CPU pulling the load.

This means there is no load balancing if a task is pinned only to
non-preferred CPUs. Such tasks will continue to run where they were
running before the CPUs were marked as non-preferred.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce05acf52d35..9b2931b559d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13391,7 +13391,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	};
 	bool need_unlock = false;
 
-	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+	/* Spread load among preferred CPUs */
+	cpumask_and(cpus, sched_domain_span(sd), cpu_preferred_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
-- 
2.47.3



* [PATCH v6 07/23] sched/fair: Pull the load on preferred CPU
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (5 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 06/23] sched/fair: Load balance only among preferred CPUs Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 08/23] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

When a CPU is marked as non-preferred, any load pulled towards it is
pointless since the task will be pushed out again at the next tick.

Since load balancing only happens among preferred CPUs, should_we_balance
will bail out. But for NEWIDLE and IDLE balance, this bailout can
happen even earlier.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
- Replace active check by preferred instead of adding one.

 kernel/sched/fair.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9b2931b559d6..f2bcf5ff6058 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -14301,6 +14301,10 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 		if (!idle_cpu(balance_cpu))
 			continue;
 
+		/* There is no point in pulling the load, just to push it out next */
+		if (!cpu_preferred(balance_cpu))
+			continue;
+
 		/*
 		 * If this CPU gets work to do, stop the load balancing
 		 * work being done for other CPUs. Next load
@@ -14475,9 +14479,10 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 	this_rq->idle_stamp = rq_clock(this_rq);
 
 	/*
-	 * Do not pull tasks towards !active CPUs...
+	 * Do not pull tasks towards !preferred CPUs...
+	 * preferred is always a subset of active.
 	 */
-	if (!cpu_active(this_cpu))
+	if (!cpu_preferred(this_cpu))
 		return 0;
 
 	/*
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 08/23] sched/core: Keep tick on non-preferred CPUs until tasks are out
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (6 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 07/23] sched/fair: Pull the load on preferred CPU Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 09/23] sched/core: Push current task from non preferred CPU Shrikanth Hegde
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Enable the tick on a nohz_full CPU when it is marked as non-preferred.
Once there is no FAIR task running there, the tick can be stopped again
to save power.

The steal time handling code will call tick_nohz_dep_set_cpu() with
TICK_DEP_BIT_SCHED to move tasks out of a nohz_full CPU quickly.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1fb1c17e8387..aa4201bb8082 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1473,6 +1473,10 @@ bool sched_can_stop_tick(struct rq *rq)
 			return false;
 	}
 
+	/* Keep the tick running until CFS tasks are pushed out */
+	if (!cpu_preferred(rq->cpu) && rq->cfs.h_nr_queued)
+		return false;
+
 	return true;
 }
 #endif /* CONFIG_NO_HZ_FULL */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 09/23] sched/core: Push current task from non preferred CPU
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (7 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 08/23] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 16:50   ` Yury Norov
  2026-07-01 14:16 ` [PATCH v6 10/23] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Actively push out the task running on a non-preferred CPU. Since the task
is currently running on that CPU, the CPU has to be stopped (via a stopper)
to push the task out. However, if the task is pinned only to non-preferred
CPUs, it will continue running there. This preserves userspace affinities,
unlike CPU hotplug or isolated cpusets.
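
The affinity rule above can be modelled in user space as a mask
intersection test. This is only an illustrative sketch, not kernel code:
cpu_set64 is a hypothetical 64-bit stand-in for a cpumask, and
has_preferred_cpus() and should_push() are made-up helpers that mirror the
cpumask_intersects() check done by task_has_preferred_cpus(). The sketch
ignores the other checks in the patch (sched class, per-CPU kthreads,
migration disabled) and assumes CPU numbers below 64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel cpumask: bit N set means CPU N. */
typedef uint64_t cpu_set64;

/* Model of task_has_preferred_cpus(): does the affinity overlap preferred? */
static bool has_preferred_cpus(cpu_set64 affinity, cpu_set64 preferred)
{
	return (affinity & preferred) != 0;
}

/*
 * A task running on this_cpu is pushed only if this_cpu is non-preferred
 * and the task is allowed to run on at least one preferred CPU.
 */
static bool should_push(int this_cpu, cpu_set64 affinity, cpu_set64 preferred)
{
	bool cpu_is_preferred = (preferred >> this_cpu) & 1;

	return !cpu_is_preferred && has_preferred_cpus(affinity, preferred);
}
```

For example, with CPUs 0-1 preferred (mask 0x03), a task on CPU 3 that is
affined to CPUs 0-3 (0x0F) would be pushed, while a task pinned to CPUs 2-3
(0x0C) would stay where it is.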

Though the code is similar to __balance_push_cpu_stop and quite close to
push_cpu_stop, it is kept separate as that provides a cleaner
implementation with CONFIG_PREFERRED_CPU.

Add a push_task_work_done flag to protect the work buffer.
This works only with the FAIR class.

For now, only the currently running task is pushed out. This keeps the code
simpler. A future optimization may move all the queued tasks on the rq.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c  | 87 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  8 ++++
 2 files changed, 95 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aa4201bb8082..56905bac9525 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5797,6 +5797,9 @@ void sched_tick(void)
 	unsigned long hw_pressure;
 	u64 resched_latency;
 
+	if (!cpu_preferred(cpu))
+		sched_push_current_non_preferred_cpu(rq);
+
 	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
 		arch_scale_freq_tick();
 
@@ -11315,3 +11318,87 @@ void sched_change_end(struct sched_change_ctx *ctx)
 		p->sched_class->prio_changed(rq, p, ctx->prio);
 	}
 }
+
+#ifdef CONFIG_PREFERRED_CPU
+/* npc - non preferred CPU */
+static DEFINE_PER_CPU(struct cpu_stop_work, npc_push_task_work);
+
+static int sched_non_preferred_cpu_push_stop(void *arg)
+{
+	struct task_struct *p = arg;
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+	int cpu;
+
+	/* sanity check and clear */
+	if (cpu_preferred(rq->cpu)) {
+		scoped_guard (rq_lock, rq)
+			rq->push_task_work_done = 0;
+		put_task_struct(p);
+		return 0;
+	}
+
+	raw_spin_lock_irq(&p->pi_lock);
+
+	/* This could take rq lock. So call it before rq lock is taken */
+	cpu = select_fallback_rq(rq->cpu, p);
+	rq_lock(rq, &rf);
+	rq->push_task_work_done = 0;
+	update_rq_clock(rq);
+
+	context_unsafe_alias(rq);
+
+	if (task_rq(p) == rq && task_on_rq_queued(p))
+		rq = __migrate_task(rq, &rf, p, cpu);
+
+	rq_unlock(rq, &rf);
+	raw_spin_unlock_irq(&p->pi_lock);
+	put_task_struct(p);
+
+	return 0;
+}
+
+/*
+ * Push the current task running on non-preferred CPU.
+ * Using this non preferred CPU will lead to more vCPU preemptions
+ * in the host. So it is better not to use this CPU.
+ *
+ * Since task is running, call a stopper to push the task out. This is
+ * similar to how task moves during hotplug. In select_fallback_rq a
+ * preferred CPU will be chosen and henceforth task shouldn't come back to
+ * this CPU again.
+ *
+ * Works for FAIR class only
+ *
+ * If task is affined only to non-preferred CPUs, it can't be moved out
+ */
+void sched_push_current_non_preferred_cpu(struct rq *rq)
+{
+	struct task_struct *push_task = rq->curr;
+
+	/* Preferred feature works only for FAIR class */
+	if (push_task->sched_class != &fair_sched_class)
+		return;
+
+	if (kthread_is_per_cpu(push_task) ||
+	    is_migration_disabled(push_task))
+		return;
+
+	/* Don't push the task if it is affined only on non preferred CPUs */
+	if (!task_has_preferred_cpus(push_task))
+		return;
+
+	/* There is already a stopper thread for this. Don't race with it. */
+	if (rq->push_task_work_done == 1)
+		return;
+
+	/* sched_tick runs with interrupts disabled. */
+	get_task_struct(push_task);
+
+	scoped_guard (rq_lock, rq)
+		rq->push_task_work_done = 1;
+
+	stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
+			    push_task, this_cpu_ptr(&npc_push_task_work));
+}
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 36ae20310891..711fc8bd7ebc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1277,6 +1277,8 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+	bool			push_task_work_done;
+
 	struct sched_avg	avg_rt;
 	struct sched_avg	avg_dl;
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
@@ -4239,4 +4241,10 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
 	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
 }
 
+#ifdef CONFIG_PREFERRED_CPU
+void sched_push_current_non_preferred_cpu(struct rq *rq);
+#else	/* !CONFIG_PREFERRED_CPU */
+static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
+#endif
+
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 10/23] sched/debug: Add migration stats due to non preferred CPUs
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (8 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 09/23] sched/core: Push current task from non preferred CPU Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 11/23] virt/steal_monitor: Add documentation Shrikanth Hegde
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Add a new stat:
- nr_migrations_cpu_non_preferred: number of migrations that happened
  because a CPU was marked as non-preferred due to high steal time.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 4 +++-
 kernel/sched/debug.c  | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 968b18a7f470..37849d2f1dbd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -554,6 +554,7 @@ struct sched_statistics {
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;
+	u64				nr_migrations_cpu_non_preferred;
 
 	u64				nr_wakeups;
 	u64				nr_wakeups_sync;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 56905bac9525..0dffa3a722f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11348,8 +11348,10 @@ static int sched_non_preferred_cpu_push_stop(void *arg)
 
 	context_unsafe_alias(rq);
 
-	if (task_rq(p) == rq && task_on_rq_queued(p))
+	if (task_rq(p) == rq && task_on_rq_queued(p)) {
 		rq = __migrate_task(rq, &rf, p, cpu);
+		schedstat_inc(p->stats.nr_migrations_cpu_non_preferred);
+	}
 
 	rq_unlock(rq, &rf);
 	raw_spin_unlock_irq(&p->pi_lock);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 72236db67983..5ebb2055e6d5 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1446,6 +1446,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
 		P_SCHEDSTAT(nr_forced_migrations);
+		P_SCHEDSTAT(nr_migrations_cpu_non_preferred);
 		P_SCHEDSTAT(nr_wakeups);
 		P_SCHEDSTAT(nr_wakeups_sync);
 		P_SCHEDSTAT(nr_wakeups_migrate);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 11/23] virt/steal_monitor: Add documentation
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (9 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 10/23] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 12/23] virt: Introduce steal monitor driver Shrikanth Hegde
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Document this module named steal_monitor and its parameters.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v5->v6:
- Fix documentation nits.
- Add block on co-operative scheme

 Documentation/driver-api/index.rst         |  1 +
 Documentation/driver-api/steal-monitor.rst | 99 ++++++++++++++++++++++
 2 files changed, 100 insertions(+)
 create mode 100644 Documentation/driver-api/steal-monitor.rst

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index eaf7161ff957..ec12f396a5e6 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -138,6 +138,7 @@ Subsystem-specific APIs
    sm501
    soundwire/index
    spi
+   steal-monitor
    surface_aggregator/index
    switchtec
    sync_file
diff --git a/Documentation/driver-api/steal-monitor.rst b/Documentation/driver-api/steal-monitor.rst
new file mode 100644
index 000000000000..f71534322b14
--- /dev/null
+++ b/Documentation/driver-api/steal-monitor.rst
@@ -0,0 +1,99 @@
+.. SPDX-License-Identifier: GPL-2.0
+=============
+Steal Monitor
+=============
+
+:Author: Shrikanth Hegde
+
+Introduction
+============
+
+Steal monitor is a driver aimed at solving the Noisy Neighbour problem
+in virtualized environments, i.e. the performance of a workload
+running in one VM is affected significantly by other VMs and,
+combined, they all make slower forward progress.
+
+When CPU resources are overcommitted, i.e. the sum of virtual CPUs (vCPUs)
+of all VMs is greater than the number of physical CPUs (pCPUs), and
+when all or many VMs have high utilization, the hypervisor cannot
+satisfy the CPU requirement and has to context switch within or
+across VMs, i.e. the hypervisor needs to preempt one vCPU to run
+another. This is called vCPU preemption.
+It is more expensive than a task context switch within a vCPU.
+
+In such cases it is better to reduce the combined vCPU demand of all VMs
+by not using some of the vCPUs. vCPUs on which the workload can be safely
+scheduled without increasing contention for pCPUs are called
+"Preferred CPUs".
+
+See more on "Preferred CPUs" in Documentation/scheduler/sched-arch.rst.
+
+This driver helps in setting/clearing the CPUs in the "Preferred CPUs" list.
+This list is obtained using cpu_preferred_mask.
+
+Core idea
+=========
+Steal time is an indication, available today in the guest, of contention
+for the underlying physical CPUs. Use it as a hint in the guest to fold the
+workload onto a reduced set of vCPUs. When there is contention, steal time
+will show up in all the guests. When each guest honors the hint and folds
+the workload onto a smaller set of vCPUs (Preferred CPUs), it reduces the
+contention and thereby reduces vCPU preemption.
+This is achieved without any cross-guest communication.
+
+The steal monitor driver effectively does the following:
+
+1. Periodically compute steal time across the system.
+
+2. If steal time is greater than the high threshold, reduce the number of
+   preferred CPUs by 1 core. Always ensure at least one core is left.
+   This avoids running into extreme cases.
+
+3. If steal time is lower than or equal to the low threshold, increase the
+   number of preferred CPUs by 1 core. If preferred is the same as active,
+   nothing needs to be done.
+
+4. Ensure preferred CPUs is always a subset of active CPUs.
+   When the feature is disabled, it is the same as active CPUs.
+
+This feature works best only when all the VMs enable it, as
+it is a co-operative scheme. If a specific VM doesn't enable this feature,
+it may end up with more CPUs than others, but this should still lead to
+better performance when seen from the system view.
+Those who enable this driver have to ensure it is enabled in all VMs.
+
+Module Parameters
+=================
+interval_ms
+-----------
+How often steal monitor checks for steal time.
+(Default: 1000 i.e 1 second)
+
+This controls how fast the steal monitor driver reacts to changes in
+the contention for physical CPUs. Since it does a fair amount of
+work, setting it too low adds overhead. If set to 0, it is reset to
+the default on the next work run.
+
+low_threshold
+-------------
+lower threshold value in percentage * 100.
+(Default: 200, i.e 2% steal is considered as low threshold)
+
+This determines what values should be considered as nil/no steal values.
+When steal monitor sees that steal time is below or equal to this value, it
+will increase the preferred CPUs by 1 core. Setting the value to zero
+might cause too many oscillations.
+
+high_threshold
+--------------
+higher threshold value in percentage * 100
+(Default: 500, i.e 5% steal is considered as high threshold)
+
+This determines what values should be considered as high steal values.
+When steal monitor sees steal time is higher than this value, it will
+reduce the preferred CPUs by 1 core.
+
+Notes
+=====
+This is available under CONFIG_PREFERRED_CPU. Selecting that includes
+this module. Module is not loaded by default.
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 12/23] virt: Introduce steal monitor driver
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (10 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 11/23] virt/steal_monitor: Add documentation Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 13/23] virt/steal_monitor: Restore to active on module disable Shrikanth Hegde
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Introduce a new driver in virt named steal_monitor. This driver
will compute the steal time and drive the policy decisions of preferred
CPU state.

More on it can be found in Documentation/driver-api/steal-monitor.rst.
Introduce the skeleton code first.

There is no new kconfig. It depends on CONFIG_PREFERRED_CPU.
- If CONFIG_PREFERRED_CPU=y, it gets compiled as a module. It is not
  loaded by default.
- If CONFIG_PREFERRED_CPU=n, the module isn't compiled.

The file layout of the driver is designed to accommodate arch-specific
files in the future.

- sm_core.c - contains the main driver code. This includes the periodic
  work function, which takes action based on steal time.
- defaults.c - contains the default implementation defined with __weak
  symbols.
- sm_core.h - header file which includes the data structure.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 drivers/virt/Makefile                |  1 +
 drivers/virt/steal_monitor/Makefile  | 14 ++++++++++++
 drivers/virt/steal_monitor/sm_core.c | 33 ++++++++++++++++++++++++++++
 drivers/virt/steal_monitor/sm_core.h | 11 ++++++++++
 4 files changed, 59 insertions(+)
 create mode 100644 drivers/virt/steal_monitor/Makefile
 create mode 100644 drivers/virt/steal_monitor/sm_core.c
 create mode 100644 drivers/virt/steal_monitor/sm_core.h

diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
index f29901bd7820..aff715cea42d 100644
--- a/drivers/virt/Makefile
+++ b/drivers/virt/Makefile
@@ -9,4 +9,5 @@ obj-y				+= vboxguest/
 
 obj-$(CONFIG_NITRO_ENCLAVES)	+= nitro_enclaves/
 obj-$(CONFIG_ACRN_HSM)		+= acrn/
+obj-$(CONFIG_PREFERRED_CPU)	+= steal_monitor/
 obj-y				+= coco/
diff --git a/drivers/virt/steal_monitor/Makefile b/drivers/virt/steal_monitor/Makefile
new file mode 100644
index 000000000000..24cee55342ce
--- /dev/null
+++ b/drivers/virt/steal_monitor/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Steal time monitor to alter preferred CPU state.
+#
+# Arch can implement strong function definitions and override the
+# default by adding them in arch specific file. It must ensure
+# that preferred is always subset of active.
+#
+# It is always compiled as module if CONFIG_PREFERRED_CPU=y
+# One has to enable the module.
+#
+obj-$(subst y,m,$(CONFIG_PREFERRED_CPU)) += steal_monitor.o
+
+steal_monitor-y := sm_core.o
diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
new file mode 100644
index 000000000000..e320559c6576
--- /dev/null
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Steal time Monitor.
+ *
+ * Periodically compute steal time. Based on the thresholds either
+ * reduce/increase the preferred CPUs which can be made use
+ * by the workload to avoid vCPU preemption to an extent possible.
+ *
+ * Available as module with CONFIG_PREFERRED_CPU=y
+ *
+ * Copyright (C) 2026 IBM
+ * Author: Shrikanth Hegde <sshegde@linux.ibm.com>
+ */
+
+#include "sm_core.h"
+
+static int __init steal_monitor_init(void)
+{
+	pr_info("steal_monitor is enabled\n");
+	return 0;
+}
+
+static void __exit steal_monitor_exit(void)
+{
+	pr_info("steal_monitor is disabled\n");
+}
+
+module_init(steal_monitor_init);
+module_exit(steal_monitor_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("IBM Corporation");
+MODULE_DESCRIPTION("Virtualization Steal Time Monitor");
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
new file mode 100644
index 000000000000..684a258526e1
--- /dev/null
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __VIRT_STEAL_CORE_H
+#define __VIRT_STEAL_CORE_H
+
+#include <linux/types.h>
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+
+#endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 13/23] virt/steal_monitor: Restore to active on module disable
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (11 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 12/23] virt: Introduce steal monitor driver Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 14/23] virt/steal_monitor: Define steal_monitor structure Shrikanth Hegde
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

When the module is not in use, preferred CPUs must be the same
as active CPUs.

Even if one disables the module during high steal time, it
still restores the preferred CPUs to be the same as active CPUs
to keep the disable path simple.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v5->v6:
- Add cpus_read_lock() for hotplug safety

 drivers/virt/steal_monitor/sm_core.c | 3 +++
 drivers/virt/steal_monitor/sm_core.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index e320559c6576..222c23286043 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -23,6 +23,9 @@ static int __init steal_monitor_init(void)
 static void __exit steal_monitor_exit(void)
 {
 	pr_info("steal_monitor is disabled\n");
+
+	guard(cpus_read_lock)();
+	cpumask_copy(&__cpu_preferred_mask, cpu_active_mask);
 }
 
 module_init(steal_monitor_init);
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index 684a258526e1..40913aeccf16 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -7,5 +7,6 @@
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/init.h>
+#include <linux/cpuhplock.h>
 
 #endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 14/23] virt/steal_monitor: Define steal_monitor structure
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (12 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 13/23] virt/steal_monitor: Restore to active on module disable Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 15/23] virt/steal_monitor: Add control knobs for handling steal values Shrikanth Hegde
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Define the main structure of steal monitor. It has:
- work: deferred periodic work function.
- prev_steal, prev_time: used to calculate the delta in periodic work.
- interval_ms, high_threshold, low_threshold: debug knobs of
  steal_monitor.

sm_core_ctx is the instance used by the core code.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 drivers/virt/steal_monitor/sm_core.c |  2 ++
 drivers/virt/steal_monitor/sm_core.h | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index 222c23286043..92d5a0e3d8bf 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -14,6 +14,8 @@
 
 #include "sm_core.h"
 
+struct steal_monitor sm_core_ctx;
+
 static int __init steal_monitor_init(void)
 {
 	pr_info("steal_monitor is enabled\n");
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index 40913aeccf16..e5c3ea0a63c9 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -9,4 +9,16 @@
 #include <linux/init.h>
 #include <linux/cpuhplock.h>
 
+struct steal_monitor {
+	struct delayed_work	work;
+	u64			prev_steal;
+	int			prev_direction;
+	unsigned int		interval_ms;
+	unsigned int		high_threshold;
+	unsigned int		low_threshold;
+	ktime_t			prev_time;
+};
+
+extern struct steal_monitor sm_core_ctx;
+
 #endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 15/23] virt/steal_monitor: Add control knobs for handling steal values
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (13 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 14/23] virt/steal_monitor: Define steal_monitor structure Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 16/23] virt/steal_monitor: Compute work at regular intervals Shrikanth Hegde
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

These are the knobs to control the steal_monitor.

interval_ms:
How often steal monitor checks for steal time.
(Default: 1000 i.e 1 second)

This controls how fast the steal monitor driver reacts to changes in
the contention for physical CPUs. Since it does a fair amount of
work, setting it too low adds overhead. If set to 0, it is reset to
the default on the next work run.

low_threshold:
lower threshold value in percentage * 100.
(Default: 200, i.e 2% steal is considered as low threshold)

This determines what values should be considered as nil/no steal values.
When steal monitor sees that steal time is below or equal to this value, it
will increase the preferred CPUs by 1 core. Setting the value to zero
might cause too many oscillations.

high_threshold:
higher threshold value in percentage * 100
(Default: 500, i.e 5% steal is considered as high threshold)

This determines what values should be considered as high steal values.
When steal monitor sees steal time is higher than this value, it will
reduce the preferred CPUs by 1 core.

Also available at: Documentation/driver-api/steal-monitor.rst
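
How the two thresholds interact can be sketched as a small user-space
model. Everything here is hypothetical and for illustration only: struct
sm_knobs and sm_step() are not part of this series, and core counts stand
in for the real cpumask updates. The clamps (keep at least one core, never
exceed active) follow the documented behaviour.

```c
/* Hypothetical knob container; values are in percentage * 100. */
struct sm_knobs {
	unsigned int high_threshold;	/* e.g. 500 == 5% */
	unsigned int low_threshold;	/* e.g. 200 == 2% */
};

/*
 * Return the new number of preferred cores after one sample.
 * steal is in percentage * 100; preferred and active are core counts.
 */
static unsigned int sm_step(const struct sm_knobs *k, unsigned int steal,
			    unsigned int preferred, unsigned int active)
{
	if (steal > k->high_threshold && preferred > 1)
		return preferred - 1;	/* back off, but keep one core */
	if (steal <= k->low_threshold && preferred < active)
		return preferred + 1;	/* grow back towards active */
	return preferred;		/* in between, or clamped: no change */
}
```

With the defaults (500 and 200), a sample of 800 (8% steal) shrinks the
preferred set by one core, a sample of 100 (1%) grows it by one, and
anything in the band between the thresholds leaves it unchanged, which is
what damps oscillation.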

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 drivers/virt/steal_monitor/sm_core.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index 92d5a0e3d8bf..1ba638224abb 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -14,7 +14,23 @@
 
 #include "sm_core.h"
 
-struct steal_monitor sm_core_ctx;
+struct steal_monitor sm_core_ctx = {
+	.interval_ms = 1000,	/* 1 second */
+	.high_threshold = 500,	/* 5% */
+	.low_threshold = 200,	/* 2% */
+};
+
+module_param_named(interval_ms, sm_core_ctx.interval_ms, uint, 0644);
+MODULE_PARM_DESC(interval_ms,
+		 "Sampling frequency for steal values in milliseconds (default: 1000)");
+
+module_param_named(high_threshold, sm_core_ctx.high_threshold, uint, 0644);
+MODULE_PARM_DESC(high_threshold,
+		 "High steal threshold (default: 500 i.e 5%)");
+
+module_param_named(low_threshold, sm_core_ctx.low_threshold, uint, 0644);
+MODULE_PARM_DESC(low_threshold,
+		 "Low steal threshold (default: 200 i.e 2%)");
 
 static int __init steal_monitor_init(void)
 {
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 16/23] virt/steal_monitor: Compute work at regular intervals
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (14 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 15/23] virt/steal_monitor: Add control knobs for handling steal values Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 17/23] virt/steal_monitor: Provide default method to get systemwide steal time Shrikanth Hegde
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

This is the steal_monitor core functionality, done in periodic work:

- Calculate the steal_ratio. It is multiplied by 100 to account for
  fractional values of steal time, i.e. 10 means 0.1% steal time.
- If the steal value is higher than the high threshold, call the method to
  reduce the preferred CPUs.
- If the steal value is lower than or equal to the low threshold, call the
  method to increase the preferred CPUs.
- If the steal value is in between, no action is taken.
- Save the values for the next delta calculation.
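
The steps above can be sketched in user space as follows. This is a
minimal model under stated assumptions, not the driver code: the exact
formula is not shown in this patch, so steal_ratio() assumes that steal is
summed over all sampled CPUs and normalized by window length times CPU
count. The names steal_ratio(), sm_decide() and enum sm_action are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum sm_action { SM_REDUCE = -1, SM_NONE = 0, SM_INCREASE = 1 };

/*
 * Steal as percentage * 100 over one sampling window (assumed formula).
 * delta_steal_ns: steal time accumulated across all CPUs in the window.
 * delta_time_ns: length of the window; nr_cpus: number of CPUs sampled.
 */
static unsigned int steal_ratio(uint64_t delta_steal_ns,
				uint64_t delta_time_ns, unsigned int nr_cpus)
{
	assert(delta_time_ns > 0 && nr_cpus > 0);

	/* 100 for percent, another 100 to keep two fractional digits */
	return (unsigned int)((delta_steal_ns * 100 * 100) /
			      (delta_time_ns * nr_cpus));
}

/* Map a ratio to an action using the high/low thresholds. */
static enum sm_action sm_decide(unsigned int ratio, unsigned int high,
				unsigned int low)
{
	if (ratio > high)
		return SM_REDUCE;	/* too much steal: shrink preferred */
	if (ratio <= low)
		return SM_INCREASE;	/* little steal: grow preferred */
	return SM_NONE;			/* in between: leave as is */
}
```

For instance, 100ms of total steal over a 1 second window on 4 CPUs gives
a ratio of 250 (2.5%), which falls between the default thresholds of 200
and 500, so no action is taken.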

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 drivers/virt/steal_monitor/sm_core.c | 26 +++++++++++++++++++++++++-
 drivers/virt/steal_monitor/sm_core.h |  3 +++
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index 1ba638224abb..b499faa61010 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -32,9 +32,32 @@ module_param_named(low_threshold, sm_core_ctx.low_threshold, uint, 0644);
 MODULE_PARM_DESC(low_threshold,
 		 "Low steal threshold (default: 200 i.e 2%)");
 
+static void compute_preferred_cpus_work(struct work_struct *work)
+{
+	/* At least one core is kept as preferred */
+	WARN_ON(cpumask_empty(cpu_preferred_mask));
+
+	/* Warn if interval_ms is set to 0, that might cause lockup. */
+	if (unlikely(sm_core_ctx.interval_ms == 0)) {
+		WARN_ON(1);
+		sm_core_ctx.interval_ms = 1000; /* Fallback to default */
+	}
+
+	/* Trigger for next sampling */
+	schedule_delayed_work(&sm_core_ctx.work,
+			      msecs_to_jiffies(sm_core_ctx.interval_ms));
+}
+
 static int __init steal_monitor_init(void)
 {
-	pr_info("steal_monitor is enabled\n");
+	pr_info("steal_monitor is enabled. interval: %ums, high_threshold: %u, low_threshold: %u\n",
+		sm_core_ctx.interval_ms, sm_core_ctx.high_threshold, sm_core_ctx.low_threshold);
+
+	INIT_DELAYED_WORK(&sm_core_ctx.work, compute_preferred_cpus_work);
+
+	schedule_delayed_work(&sm_core_ctx.work,
+			      msecs_to_jiffies(sm_core_ctx.interval_ms));
+
 	return 0;
 }
 
@@ -42,6 +65,7 @@ static void __exit steal_monitor_exit(void)
 {
 	pr_info("steal_monitor is disabled\n");
 
+	cancel_delayed_work_sync(&sm_core_ctx.work);
 	guard(cpus_read_lock)();
 	cpumask_copy(&__cpu_preferred_mask, cpu_active_mask);
 }
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index e5c3ea0a63c9..ea06e83c228c 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -8,6 +8,9 @@
 #include <linux/kernel.h>
 #include <linux/init.h>
 #include <linux/cpuhplock.h>
+#include <linux/cpumask.h>
+#include <linux/workqueue.h>
+#include <linux/sched/isolation.h>
 
 struct steal_monitor {
 	struct delayed_work	work;
-- 
2.47.3


* [PATCH v6 17/23] virt/steal_monitor: Provide default method to get systemwide steal time
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (15 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 16/23] virt/steal_monitor: Compute work at regular intervals Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 18/23] virt/steal_monitor: Provide default method to inc/dec preferred CPUs Shrikanth Hegde
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

steal_monitor takes a global view of steal time rather than looking at
individual vCPUs. To do so, it collects the total steal time across all
vCPUs, or across the vCPUs of interest.

The default implementation sums steal time across all active CPUs.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v5->v6:
- Add cpus_read_lock() for hotplug safety

 drivers/virt/steal_monitor/Makefile   |  2 +-
 drivers/virt/steal_monitor/defaults.c | 28 +++++++++++++++++++++++++++
 drivers/virt/steal_monitor/sm_core.h  |  3 +++
 3 files changed, 32 insertions(+), 1 deletion(-)
 create mode 100644 drivers/virt/steal_monitor/defaults.c

diff --git a/drivers/virt/steal_monitor/Makefile b/drivers/virt/steal_monitor/Makefile
index 24cee55342ce..7c16f8cf9583 100644
--- a/drivers/virt/steal_monitor/Makefile
+++ b/drivers/virt/steal_monitor/Makefile
@@ -11,4 +11,4 @@
 #
 obj-$(subst y,m,$(CONFIG_PREFERRED_CPU)) += steal_monitor.o
 
-steal_monitor-y := sm_core.o
+steal_monitor-y := sm_core.o defaults.o
diff --git a/drivers/virt/steal_monitor/defaults.c b/drivers/virt/steal_monitor/defaults.c
new file mode 100644
index 000000000000..6681f9938f6a
--- /dev/null
+++ b/drivers/virt/steal_monitor/defaults.c
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Base file contains the default implementations.
+ * These are defined as __weak so that arch may define
+ * strong symbols to override.
+ *
+ * Copyright (C) 2026 IBM
+ * Author: Shrikanth Hegde <sshegde@linux.ibm.com>
+ */
+#include "sm_core.h"
+
+/*
+ * Compute steal time of the full system.
+ *
+ * Default implementation returns steal time across all active CPUs
+ */
+
+u64 __weak get_system_steal_time(void)
+{
+	int tmp_cpu;
+	u64 total_steal = 0;
+
+	guard(cpus_read_lock)();
+	for_each_cpu(tmp_cpu, cpu_active_mask)
+		total_steal += kcpustat_cpu(tmp_cpu).cpustat[CPUTIME_STEAL];
+
+	return total_steal;
+}
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index ea06e83c228c..634c9f5a2610 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -11,6 +11,7 @@
 #include <linux/cpumask.h>
 #include <linux/workqueue.h>
 #include <linux/sched/isolation.h>
+#include <linux/kernel_stat.h>
 
 struct steal_monitor {
 	struct delayed_work	work;
@@ -24,4 +25,6 @@ struct steal_monitor {
 
 extern struct steal_monitor sm_core_ctx;
 
+u64 get_system_steal_time(void);
+
 #endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3


* [PATCH v6 18/23] virt/steal_monitor: Provide default method to inc/dec preferred CPUs
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (16 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 17/23] virt/steal_monitor: Provide default method to get systemwide steal time Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 19/23] virt/steal_monitor: Provide default method to get num of CPUs for steal ratio Shrikanth Hegde
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

These methods will be used by the steal_monitor core in subsequent
patches. The default implementations are likely good enough for most
archs.

decrease_preferred_cpus() - Called when steal time is high. It needs
to decide which CPUs to mark as non-preferred and set that state.
increase_preferred_cpus() - Called when steal time is low. It needs
to decide which CPUs to mark as preferred and set that state.

Default implementations:
decrease_preferred_cpus()
- Get the first housekeeping CPU and its core mask. Mark it as the
  protected core. This keeps at least one core preferred, which is
  the safe choice in unusual configurations.
- Find the last preferred CPU outside of this protected core mask (the
  target CPU). This also works when nohz_full= is specified for the
  last set of CPUs.
- If no such CPU exists, only the housekeeping core remains. Bail out.
- Get the siblings of the target CPU and mark them all as
  non-preferred. If they are nohz_full, enable the tick, since the
  push mechanism relies on sched_tick.

increase_preferred_cpus()
- Get the first active non-preferred CPU. This is likely part of the
  core most recently marked as non-preferred.
- If there is no such CPU, i.e. preferred is the same as active, there
  is nothing further to do.
- Otherwise, get the siblings of that CPU and mark them as preferred.
  Note that clearing the tick dependency isn't needed, as that is
  handled via sched_can_stop_tick.

Operating on whole cores instead of individual CPUs gives better numbers,
as SMT is quite common and some hypervisors, such as PowerVM, do core
scheduling.

Note: This doesn't do any NUMA splicing, to keep the code simple and
the overhead minimal. The current code expects CPUs to be spread
uniformly across NUMA nodes.
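
For readers who want to experiment with the selection logic outside the
kernel, here is a rough userspace sketch of the two default methods. It
is not the driver code: it assumes at most 64 CPUs held in a uint64_t,
cores that are contiguous aligned groups of smt CPUs, and it leaves out
the nohz_full tick handling. All model_* names and MODEL_NR_CPUS are
made up for illustration.

```c
#include <stdint.h>

#define MODEL_NR_CPUS 64	/* hypothetical limit: one bit per CPU */

/* Lowest set bit index, or MODEL_NR_CPUS if the mask is empty. */
static int model_first_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (mask & (1ULL << cpu))
			return cpu;
	return MODEL_NR_CPUS;
}

/* Highest set bit index, or MODEL_NR_CPUS if the mask is empty. */
static int model_last_cpu(uint64_t mask)
{
	for (int cpu = MODEL_NR_CPUS - 1; cpu >= 0; cpu--)
		if (mask & (1ULL << cpu))
			return cpu;
	return MODEL_NR_CPUS;
}

/* Assumption: a core is a contiguous, aligned group of smt CPUs (smt >= 1). */
static uint64_t model_core_mask(int cpu, int smt)
{
	int first = (cpu / smt) * smt;
	uint64_t mask = 0;

	for (int i = first; i < first + smt && i < MODEL_NR_CPUS; i++)
		mask |= 1ULL << i;
	return mask;
}

/* Drop the last preferred core that is not the first housekeeping core. */
static uint64_t model_decrease(uint64_t preferred, uint64_t active,
			       uint64_t housekeeping, int smt)
{
	int first_hk = model_first_cpu(housekeeping & active);
	int target;

	if (first_hk >= MODEL_NR_CPUS)
		return preferred;

	target = model_last_cpu(preferred & ~model_core_mask(first_hk, smt));
	if (target >= MODEL_NR_CPUS)
		return preferred;	/* only the protected core remains */

	return preferred & ~(model_core_mask(target, smt) & active);
}

/* Add back the core of the first active, non-preferred CPU. */
static uint64_t model_increase(uint64_t preferred, uint64_t active, int smt)
{
	int first = model_first_cpu(active & ~preferred);

	if (first >= MODEL_NR_CPUS)
		return preferred;	/* preferred is the same as active */

	return preferred | (model_core_mask(first, smt) & active);
}
```

With 8 active CPUs and SMT2, repeated model_decrease() calls shrink the
preferred mask one core at a time from the top until only the protected
core is left, and model_increase() hands back the lowest-numbered
non-preferred core first.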

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v5->v6:
- Make it work for all cases when nohz_full= may be specified.

 drivers/virt/steal_monitor/defaults.c | 74 +++++++++++++++++++++++++++
 drivers/virt/steal_monitor/sm_core.h  |  2 +
 2 files changed, 76 insertions(+)

diff --git a/drivers/virt/steal_monitor/defaults.c b/drivers/virt/steal_monitor/defaults.c
index 6681f9938f6a..4e2e5b233948 100644
--- a/drivers/virt/steal_monitor/defaults.c
+++ b/drivers/virt/steal_monitor/defaults.c
@@ -26,3 +26,77 @@ u64 __weak get_system_steal_time(void)
 
 	return total_steal;
 }
+
+/*
+ * Default implementation of decrementing the preferred CPUs based on steal
+ * time. This is simple logic and decrease the preferred CPUs by 1 core.
+ * It takes out the last core in the active & preferred.
+ *
+ * Ensure at least one housekeeping core is always kept as preferred
+ *
+ * Could be overwritten by arch specific handling. Arch must ensure
+ * preferred is always subset of active.
+ */
+
+#define get_core_mask(cpu) topology_sibling_cpumask(cpu)
+
+void __weak decrease_preferred_cpus(struct steal_monitor *ctx)
+{
+	int tmp_cpu, first_hk_cpu;
+	const struct cpumask *first_hk_core;
+	int target_cpu = nr_cpu_ids;
+
+	guard(cpus_read_lock)();
+
+	first_hk_cpu = cpumask_first_and(housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
+					 cpu_active_mask);
+
+	if (first_hk_cpu >= nr_cpu_ids)
+		return;
+
+	first_hk_core = get_core_mask(first_hk_cpu);
+
+	/* Always leave first housekeeping core as preferred. */
+	for_each_cpu_andnot(tmp_cpu, cpu_preferred_mask, first_hk_core)
+		target_cpu = tmp_cpu;
+
+	/* Only the first housekeeping core remains */
+	if (target_cpu >= nr_cpu_ids)
+		return;
+
+	/*
+	 * set tick bit for nohz_full CPU to push the task out. Once the tasks
+	 * are pushed out, bit will be cleared if there are no tasks.
+	 */
+
+	for_each_cpu_and(tmp_cpu, get_core_mask(target_cpu), cpu_active_mask) {
+		set_cpu_preferred(tmp_cpu, false);
+		if (tick_nohz_full_cpu(tmp_cpu))
+			tick_nohz_dep_set_cpu(tmp_cpu, TICK_DEP_BIT_SCHED);
+	}
+}
+
+/*
+ * Default implementation of incrementing preferred CPUs based on steal
+ * time. This is simple logic and increases the preferred CPUs by 1 core.
+ * It adds the first core in active & !preferred
+ *
+ * Nothing to do if active == preferred
+ *
+ * Could be overwritten by arch specific handling. Arch must ensure
+ * preferred is subset of active.
+ */
+void __weak increase_preferred_cpus(struct steal_monitor *ctx)
+{
+	int first_cpu, tmp_cpu;
+
+	guard(cpus_read_lock)();
+
+	first_cpu = cpumask_first_andnot(cpu_active_mask, cpu_preferred_mask);
+	/* All CPUs are preferred. Nothing to increase further */
+	if (first_cpu >= nr_cpu_ids)
+		return;
+
+	for_each_cpu_and(tmp_cpu, get_core_mask(first_cpu), cpu_active_mask)
+		set_cpu_preferred(tmp_cpu, true);
+}
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index 634c9f5a2610..030f6236c38e 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -26,5 +26,7 @@ struct steal_monitor {
 extern struct steal_monitor sm_core_ctx;
 
 u64 get_system_steal_time(void);
+void increase_preferred_cpus(struct steal_monitor *ctx);
+void decrease_preferred_cpus(struct steal_monitor *ctx);
 
 #endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3


* [PATCH v6 19/23] virt/steal_monitor: Provide default method to get num of CPUs for steal ratio
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (17 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 18/23] virt/steal_monitor: Provide default method to inc/dec preferred CPUs Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 20/23] virt/steal_monitor: Act on steal values at regular intervals Shrikanth Hegde
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

This method tells the steal_monitor core how many CPUs it needs to
consider for steal ratio calculations:
steal_ratio = (delta_steal * 100 * 100) / (delta_ns * number_of_cpus);

The default method returns the number of active CPUs, since the default
steal time is summed across active CPUs.
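
As a worked example of the formula above, here is a small userspace
sketch. It is only an illustration, not the driver code: it assumes
both deltas are in nanoseconds, and model_steal_ratio() is a made-up
name. The result is in hundredths of a percent, so 500 means 5%.

```c
#include <stdint.h>

/*
 * Userspace model of the steal ratio formula.
 * Returns steal in hundredths of a percent (10 == 0.1%),
 * or 0 when the denominator would be zero.
 */
static uint64_t model_steal_ratio(uint64_t delta_steal_ns, uint64_t delta_ns,
				  unsigned int nr_cpus)
{
	uint64_t denom = delta_ns * nr_cpus;

	if (!denom)
		return 0;

	/* Note: the multiplication can overflow for very large deltas. */
	return (delta_steal_ns * 100 * 100) / denom;
}
```

For instance, 400ms of total steal over a 1s interval on 8 CPUs gives
(400000000 * 10000) / (1000000000 * 8) = 500, i.e. 5% steal.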

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v5->v6:
- Add cpus_read_lock() for hotplug safety.

 drivers/virt/steal_monitor/defaults.c | 11 +++++++++++
 drivers/virt/steal_monitor/sm_core.h  |  1 +
 2 files changed, 12 insertions(+)

diff --git a/drivers/virt/steal_monitor/defaults.c b/drivers/virt/steal_monitor/defaults.c
index 4e2e5b233948..70dcfb1ce4cb 100644
--- a/drivers/virt/steal_monitor/defaults.c
+++ b/drivers/virt/steal_monitor/defaults.c
@@ -27,6 +27,17 @@ u64 __weak get_system_steal_time(void)
 	return total_steal;
 }
 
+/*
+ * Return number of CPUs to consider for steal ratio calculation
+ *
+ * Default returns number of active CPUs.
+ */
+unsigned int __weak get_num_cpus_steal_ratio(void)
+{
+	guard(cpus_read_lock)();
+	return num_active_cpus();
+}
+
 /*
  * Default implementation of decrementing the preferred CPUs based on steal
  * time. This is simple logic and decrease the preferred CPUs by 1 core.
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index 030f6236c38e..794d3be04248 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -26,6 +26,7 @@ struct steal_monitor {
 extern struct steal_monitor sm_core_ctx;
 
 u64 get_system_steal_time(void);
+unsigned int get_num_cpus_steal_ratio(void);
 void increase_preferred_cpus(struct steal_monitor *ctx);
 void decrease_preferred_cpus(struct steal_monitor *ctx);
 
-- 
2.47.3


* [PATCH v6 20/23] virt/steal_monitor: Act on steal values at regular intervals
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (18 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 19/23] virt/steal_monitor: Provide default method to get num of CPUs for steal ratio Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 21/23] virt/steal_monitor: Add direction control Shrikanth Hegde
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

This is the core steal_monitor functionality, done in periodic work:

- Calculate the steal_ratio. It is multiplied by 100 to account for
  fractional values of steal time, i.e. 10 means 0.1% steal time.
- If the steal value is higher than the high threshold, call the method
  to reduce the preferred CPUs.
- If the steal value is lower than or equal to the low threshold, call
  the method to increase the preferred CPUs.
- If the steal value is in between, no action is taken.
- Save the values for the next delta calculation.
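
The steps above can be tried out in userspace with a sketch like the
one below. It mirrors the arithmetic of the patch (clamped deltas, and
the denominator scaled down by 100 * 100 instead of scaling the
numerator up) but is only a model: model_sample_step() and enum
sm_action are made-up names, and the model returns the chosen action
instead of calling the increase/decrease methods. It assumes the low
threshold is below the high threshold.

```c
#include <stdint.h>

enum sm_action { SM_ACT_NONE, SM_ACT_DECREASE, SM_ACT_INCREASE };

/*
 * One sampling step. Steal and time values are cumulative nanosecond
 * counters; thresholds are in hundredths of a percent.
 */
static enum sm_action model_sample_step(uint64_t prev_steal, uint64_t curr_steal,
					uint64_t prev_time, uint64_t curr_time,
					unsigned int nr_cpus,
					unsigned int high, unsigned int low)
{
	uint64_t delta_steal, delta_ns, denom, ratio;

	/* Clamp both deltas so a counter going backwards cannot wrap. */
	delta_steal = curr_steal > prev_steal ? curr_steal - prev_steal : 0;
	delta_ns = curr_time > prev_time ? curr_time - prev_time : 1;

	/* Scale the denominator down rather than the numerator up. */
	denom = (delta_ns * nr_cpus) / (100 * 100);
	if (!denom)
		return SM_ACT_NONE;	/* interval too short to judge */

	ratio = delta_steal / denom;
	if (ratio > high)
		return SM_ACT_DECREASE;
	if (ratio <= low)
		return SM_ACT_INCREASE;
	return SM_ACT_NONE;
}
```

For example, with 8 CPUs, a 1s interval, the default low threshold of
200 and a hypothetical high threshold of 500, a steal delta of 800ms
maps to a ratio of 1000 (10%) and selects the decrease path, while
80ms maps to 100 (1%) and selects the increase path.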

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v5->v6:
- Address u64 overflow concerns.

 drivers/virt/steal_monitor/sm_core.c | 33 ++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index b499faa61010..7b7435f79b85 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -34,6 +34,37 @@ MODULE_PARM_DESC(low_threshold,
 
 static void compute_preferred_cpus_work(struct work_struct *work)
 {
+	u64 curr_steal, delta_steal, delta_ns, steal_ratio;
+	ktime_t now;
+
+	curr_steal = get_system_steal_time();
+	now = ktime_get();
+
+	/* get the deltas */
+	delta_steal = curr_steal > sm_core_ctx.prev_steal ?
+		      curr_steal - sm_core_ctx.prev_steal : 0;
+	delta_ns = max_t(u64, ktime_to_ns(ktime_sub(now, sm_core_ctx.prev_time)), 1);
+
+	/* Update for next calculation */
+	sm_core_ctx.prev_steal = curr_steal;
+	sm_core_ctx.prev_time = now;
+
+	/*
+	 * Multiply by 100 to consider the fractional values of steal time.
+	 * steal_ratio = (delta_steal * 100 * 100)/(delta_ns * num_cpus())
+	 */
+	delta_ns = div_u64(delta_ns * get_num_cpus_steal_ratio(), 100 * 100);
+	if (unlikely(!delta_ns))
+		return;
+
+	steal_ratio = div64_u64(delta_steal, delta_ns);
+	/* If the steal time values are high, reduce preferred CPUs */
+	if (steal_ratio > sm_core_ctx.high_threshold)
+		decrease_preferred_cpus(&sm_core_ctx);
+	/* If the steal time values are low, increase preferred CPUs */
+	if (steal_ratio <= sm_core_ctx.low_threshold)
+		increase_preferred_cpus(&sm_core_ctx);
+
 	/* At least one core is kept as preferred */
 	WARN_ON(cpumask_empty(cpu_preferred_mask));
 
@@ -54,6 +85,8 @@ static int __init steal_monitor_init(void)
 		sm_core_ctx.interval_ms, sm_core_ctx.high_threshold, sm_core_ctx.low_threshold);
 
 	INIT_DELAYED_WORK(&sm_core_ctx.work, compute_preferred_cpus_work);
+	sm_core_ctx.prev_steal = get_system_steal_time();
+	sm_core_ctx.prev_time = ktime_get();
 
 	schedule_delayed_work(&sm_core_ctx.work,
 			      msecs_to_jiffies(sm_core_ctx.interval_ms));
-- 
2.47.3


* [PATCH v6 21/23] virt/steal_monitor: Add direction control
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (19 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 20/23] virt/steal_monitor: Act on steal values at regular intervals Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 22/23] virt/steal_monitor: Add design check of preferred subset of active Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 23/23] virt/steal_monitor: Optimise decrease_preferred_cpus when all CPUs are housekeeping Shrikanth Hegde
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Cache the direction indicated by the previous steal time sample, so that
preferred CPUs are decreased or increased only after two consecutive
high or two consecutive low samples. This helps to avoid oscillations.
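
To illustrate the two-consecutive-samples rule, here is a small
userspace sketch of the direction tracking. It is a model only: the
enum values match the patch, but model_classify() and
model_direction_step() are made-up names, and the function returns the
action to take rather than calling the increase/decrease methods.

```c
#include <stdint.h>

enum sm_direction {
	SM_DIR_INCREASE = -1,
	SM_DIR_NONE	=  0,
	SM_DIR_DECREASE	=  1,
};

/* Map one steal ratio sample to the direction it suggests. */
static enum sm_direction model_classify(uint64_t ratio, unsigned int high,
					unsigned int low)
{
	if (ratio > high)
		return SM_DIR_DECREASE;
	if (ratio <= low)
		return SM_DIR_INCREASE;
	return SM_DIR_NONE;
}

/*
 * Act only when this sample agrees with the previous one, then
 * remember the current direction in *prev for the next sample.
 */
static enum sm_direction model_direction_step(enum sm_direction *prev,
					      uint64_t ratio,
					      unsigned int high,
					      unsigned int low)
{
	enum sm_direction now = model_classify(ratio, high, low);
	enum sm_direction act = SM_DIR_NONE;

	if (now != SM_DIR_NONE && now == *prev)
		act = now;

	*prev = now;
	return act;
}
```

A single high sample between two in-range samples therefore never
changes the preferred mask, and a high sample directly followed by a
low one resets the streak instead of acting.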

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 drivers/virt/steal_monitor/sm_core.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index 7b7435f79b85..4810bad96818 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -20,6 +20,12 @@ struct steal_monitor sm_core_ctx = {
 	.low_threshold = 200,	/* 2% */
 };
 
+enum sm_direction {
+	SM_DIR_INCREASE = -1,
+	SM_DIR_NONE	=  0,
+	SM_DIR_DECREASE	=  1,
+};
+
 module_param_named(interval_ms, sm_core_ctx.interval_ms, uint, 0644);
 MODULE_PARM_DESC(interval_ms,
 		 "Sampling frequency for steal values in milliseconds (default: 1000)");
@@ -59,12 +65,22 @@ static void compute_preferred_cpus_work(struct work_struct *work)
 
 	steal_ratio = div64_u64(delta_steal, delta_ns);
 	/* If the steal time values are high, reduce preferred CPUs */
-	if (steal_ratio > sm_core_ctx.high_threshold)
+	if (sm_core_ctx.prev_direction == SM_DIR_DECREASE &&
+	    steal_ratio > sm_core_ctx.high_threshold)
 		decrease_preferred_cpus(&sm_core_ctx);
 	/* If the steal time values are low, increase preferred CPUs */
-	if (steal_ratio <= sm_core_ctx.low_threshold)
+	if (sm_core_ctx.prev_direction == SM_DIR_INCREASE &&
+	    steal_ratio <= sm_core_ctx.low_threshold)
 		increase_preferred_cpus(&sm_core_ctx);
 
+	/* mark the direction. This helps to avoid ping-pongs */
+	if (steal_ratio > sm_core_ctx.high_threshold)
+		sm_core_ctx.prev_direction = SM_DIR_DECREASE;
+	else if (steal_ratio <= sm_core_ctx.low_threshold)
+		sm_core_ctx.prev_direction = SM_DIR_INCREASE;
+	else
+		sm_core_ctx.prev_direction = SM_DIR_NONE;
+
 	/* At least one core is kept as preferred */
 	WARN_ON(cpumask_empty(cpu_preferred_mask));
 
-- 
2.47.3


* [PATCH v6 22/23] virt/steal_monitor: Add design check of preferred subset of active
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (20 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 21/23] virt/steal_monitor: Add direction control Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  2026-07-01 14:16 ` [PATCH v6 23/23] virt/steal_monitor: Optimise decrease_preferred_cpus when all CPUs are housekeeping Shrikanth Hegde
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

One of the main design constructs that CONFIG_PREFERRED_CPU maintains is
that preferred is always a subset of active. Enforce that for any future
arch-specific implementations.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
- I intend to keep WARN_ON since it enforces the policy.
  That's the reason I haven't used WARN_ON_ONCE.

 drivers/virt/steal_monitor/sm_core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index 4810bad96818..27054bd0bbf1 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -84,6 +84,9 @@ static void compute_preferred_cpus_work(struct work_struct *work)
 	/* At least one core is kept as preferred */
 	WARN_ON(cpumask_empty(cpu_preferred_mask));
 
+	/* Maintain design construct */
+	WARN_ON(!cpumask_subset(cpu_preferred_mask, cpu_active_mask));
+
 	/* Warn if interval_ms is set to 0, that might cause lockup. */
 	if (unlikely(sm_core_ctx.interval_ms == 0)) {
 		WARN_ON(1);
-- 
2.47.3


* [PATCH v6 23/23] virt/steal_monitor: Optimise decrease_preferred_cpus when all CPUs are housekeeping
  2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (21 preceding siblings ...)
  2026-07-01 14:16 ` [PATCH v6 22/23] virt/steal_monitor: Add design check of preferred subset of active Shrikanth Hegde
@ 2026-07-01 14:16 ` Shrikanth Hegde
  22 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 14:16 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

In most configurations one wouldn't specify the nohz_full= or
isolcpus= option. In that case all CPUs are part of the housekeeping
domain.

Since decrease_preferred_cpus aims to find the last CPU which
doesn't belong to the first housekeeping core, it can be optimized
when all CPUs are housekeeping CPUs. Simply look at the last CPU in
the preferred mask and check whether it belongs to the first
housekeeping core.

Time taken for decrease_preferred_cpus on 480 CPUs (60 cores, SMT8):
Without optimization:  around 6us.
With optimization:  around 3us.
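
The fast path can be sketched in userspace as below. This is a model
under stated assumptions, not the driver code: CPUs live in a 64-bit
mask, protected_core stands for the mask of the first housekeeping
core, and the model_* names and MODEL_NR_CPUS are made up. The point
is that both paths pick the same target, and the scan is only needed
when the last preferred CPU happens to sit in the protected core.

```c
#include <stdint.h>

#define MODEL_NR_CPUS 64	/* hypothetical limit: one bit per CPU */

/* Highest set bit index, or MODEL_NR_CPUS if the mask is empty. */
static int model_last_cpu(uint64_t mask)
{
	for (int cpu = MODEL_NR_CPUS - 1; cpu >= 0; cpu--)
		if (mask & (1ULL << cpu))
			return cpu;
	return MODEL_NR_CPUS;
}

/* Slow path: visit every candidate and remember the last one seen. */
static int model_target_scan(uint64_t preferred, uint64_t protected_core)
{
	uint64_t candidates = preferred & ~protected_core;
	int target = MODEL_NR_CPUS;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (candidates & (1ULL << cpu))
			target = cpu;
	return target;
}

/* Fast path: use the last preferred CPU unless it is protected. */
static int model_target_fast(uint64_t preferred, uint64_t protected_core)
{
	int last = model_last_cpu(preferred);

	if (last >= MODEL_NR_CPUS)
		return MODEL_NR_CPUS;	/* empty preferred mask */

	if (!(protected_core & (1ULL << last)))
		return last;

	return model_target_scan(preferred, protected_core);
}
```

A return value of MODEL_NR_CPUS plays the role of nr_cpu_ids here,
meaning that no target was found.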

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v5->v6:
- Optimize for the common case where nohz_full=<empty>

 drivers/virt/steal_monitor/defaults.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/virt/steal_monitor/defaults.c b/drivers/virt/steal_monitor/defaults.c
index 70dcfb1ce4cb..5206a5f7af48 100644
--- a/drivers/virt/steal_monitor/defaults.c
+++ b/drivers/virt/steal_monitor/defaults.c
@@ -53,7 +53,7 @@ unsigned int __weak get_num_cpus_steal_ratio(void)
 
 void __weak decrease_preferred_cpus(struct steal_monitor *ctx)
 {
-	int tmp_cpu, first_hk_cpu;
+	int tmp_cpu, first_hk_cpu, last_cpu;
 	const struct cpumask *first_hk_core;
 	int target_cpu = nr_cpu_ids;
 
@@ -62,14 +62,30 @@ void __weak decrease_preferred_cpus(struct steal_monitor *ctx)
 	first_hk_cpu = cpumask_first_and(housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
 					 cpu_active_mask);
 
-	if (first_hk_cpu >= nr_cpu_ids)
+	last_cpu = cpumask_last(cpu_preferred_mask);
+
+	if (first_hk_cpu >= nr_cpu_ids || last_cpu >= nr_cpu_ids)
 		return;
 
 	first_hk_core = get_core_mask(first_hk_cpu);
 
-	/* Always leave first housekeeping core as preferred. */
-	for_each_cpu_andnot(tmp_cpu, cpu_preferred_mask, first_hk_core)
-		target_cpu = tmp_cpu;
+	/*
+	 * Always leave first housekeeping core as preferred.
+	 * In most configurations housekeeping core will be at beginning,
+	 * i.e there is no nohz_full= or isolcpus=.
+	 * Skip the loop in that case.
+	 */
+	if (!cpumask_test_cpu(last_cpu, first_hk_core)) {
+		target_cpu = last_cpu;
+	} else {
+		/*
+		 * Since one may specify nohz_full= for any set of CPUs,
+		 * Find the last CPU which doesn't belong to the
+		 * protected first housekeeping core.
+		 */
+		for_each_cpu_andnot(tmp_cpu, cpu_preferred_mask, first_hk_core)
+			target_cpu = tmp_cpu;
+	}
 
 	/* Only the first housekeeping core remains */
 	if (target_cpu >= nr_cpu_ids)
-- 
2.47.3


* Re: [PATCH v6 03/23] cpumask: Introduce cpu_preferred_mask
  2026-07-01 14:16 ` [PATCH v6 03/23] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
@ 2026-07-01 15:35   ` Yury Norov
  2026-07-01 16:40     ` Shrikanth Hegde
  0 siblings, 1 reply; 32+ messages in thread
From: Yury Norov @ 2026-07-01 15:35 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini,
	seanjc, vschneid, huschle, rostedt, dietmar.eggemann, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc

On Wed, Jul 01, 2026 at 07:46:34PM +0530, Shrikanth Hegde wrote:
> Provide cpu_preferred_mask infrastructure. Define get/set macros
> which could be used to get/set CPU state as preferred.
> 
> Values are set/clear by the new driver called steal_monitor.
> It periodically samples the steal time and decides preferred CPU state.
> 
> A CPU is set to preferred when it becomes active. Later it may be
> marked as non-preferred depending on steal time values with
> steal_monitor being enabled.
> 
> Always maintain design construct of preferred is subset of active.
> i.e. preferred ⊆ active ⊆ online ⊆ present ⊆ possible
> 
> With PREFERRED_CPU=n, set is nop and get returns active state.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> v5->v6:
> - Make it nop for PREFERRED_CPU=n
> - Make it EXPORT_SYMBOL_GPL
> 
>  include/linux/cpumask.h | 27 ++++++++++++++++++++++++++-
>  kernel/cpu.c            |  6 ++++++
>  kernel/sched/core.c     |  5 +++++
>  3 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index d3cda0544954..c97271c063ce 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -122,12 +122,20 @@ extern struct cpumask __cpu_enabled_mask;
>  extern struct cpumask __cpu_present_mask;
>  extern struct cpumask __cpu_active_mask;
>  extern struct cpumask __cpu_dying_mask;
> +
> +#ifdef CONFIG_PREFERRED_CPU
> +extern struct cpumask __cpu_preferred_mask;
> +#else
> +#define __cpu_preferred_mask __cpu_active_mask
> +#endif
> +
>  #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
>  #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
>  #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
>  #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
>  #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
>  #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
> +#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
>  
>  extern atomic_t __num_online_cpus;
>  extern unsigned int __num_possible_cpus;
> @@ -1164,6 +1172,13 @@ void init_cpu_possible(const struct cpumask *src);
>  #define set_cpu_active(cpu, active)	assign_cpu((cpu), &__cpu_active_mask, (active))
>  #define set_cpu_dying(cpu, dying)	assign_cpu((cpu), &__cpu_dying_mask, (dying))
>  
> +#ifdef CONFIG_PREFERRED_CPU
> +#define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
> +#else
> +/* Don't edit active state when the feature is off */

And that makes a random reader wonder why in the world the active state
is mentioned here.

Can you move this in commit message, or drop entirely?

> +#define set_cpu_preferred(cpu, preferred) {}
> +#endif
> +
>  void set_cpu_online(unsigned int cpu, bool online);
>  void set_cpu_possible(unsigned int cpu, bool possible);
>  
> @@ -1258,7 +1273,12 @@ static __always_inline bool cpu_dying(unsigned int cpu)
>  	return cpumask_test_cpu(cpu, cpu_dying_mask);
>  }
>  
> -#else
> +static __always_inline bool cpu_preferred(unsigned int cpu)
> +{
> +	return cpumask_test_cpu(cpu, cpu_preferred_mask);
> +}
> +
> +#else	/* NR_CPUS <= 1 */

NR_CPUS can't be less than 1, I guess.

>  
>  #define num_online_cpus()	1U
>  #define num_possible_cpus()	1U
> @@ -1296,6 +1316,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
>  	return false;
>  }
>  
> +static __always_inline bool cpu_preferred(unsigned int cpu)
> +{
> +	return cpu == 0;
> +}
> +
>  #endif /* NR_CPUS > 1 */
>  
>  #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index b3c8553d7bd6..376d297a6292 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -3103,6 +3103,11 @@ EXPORT_SYMBOL(__cpu_dying_mask);
>  atomic_t __num_online_cpus __read_mostly;
>  EXPORT_SYMBOL(__num_online_cpus);
>  
> +#ifdef CONFIG_PREFERRED_CPU
> +struct cpumask __cpu_preferred_mask __read_mostly;
> +EXPORT_SYMBOL_GPL(__cpu_preferred_mask);
> +#endif
> +
>  void init_cpu_present(const struct cpumask *src)
>  {
>  	cpumask_copy(&__cpu_present_mask, src);
> @@ -3160,6 +3165,7 @@ void __init boot_cpu_init(void)
>  	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
>  	set_cpu_online(cpu, true);
>  	set_cpu_active(cpu, true);
> +	set_cpu_preferred(cpu, true);
>  	set_cpu_present(cpu, true);
>  	set_cpu_possible(cpu, true);
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2e7cde033a31..a45f7c308329 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8690,6 +8690,9 @@ int sched_cpu_activate(unsigned int cpu)
>  	 */
>  	sched_set_rq_online(rq, cpu);
>  
> +	/* preferred is subset of active and follows its state */
> +	set_cpu_preferred(cpu, true);
> +
>  	return 0;
>  }
>  
> @@ -8703,6 +8706,8 @@ int sched_cpu_deactivate(unsigned int cpu)
>  	if (ret)
>  		return ret;
>  
> +	set_cpu_preferred(cpu, false);
> +
>  	/*
>  	 * Remove CPU from nohz.idle_cpus_mask to prevent participating in
>  	 * load balancing when not active
> -- 
> 2.47.3

* Re: [PATCH v6 05/23] sched/core: Try to use a preferred CPU in is_cpu_allowed
  2026-07-01 14:16 ` [PATCH v6 05/23] sched/core: Try to use a preferred CPU in is_cpu_allowed Shrikanth Hegde
@ 2026-07-01 16:09   ` Yury Norov
  2026-07-01 16:49     ` Shrikanth Hegde
  0 siblings, 1 reply; 32+ messages in thread
From: Yury Norov @ 2026-07-01 16:09 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini,
	seanjc, vschneid, huschle, rostedt, dietmar.eggemann, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc

On Wed, Jul 01, 2026 at 07:46:36PM +0530, Shrikanth Hegde wrote:
> When possible, pick a preferred CPU.
> 
> The push task mechanism uses a stopper thread, which calls
> select_fallback_rq and uses this mechanism to pick only a preferred CPU.
> 
> When a task is affined only to non-preferred CPUs, it should continue to
> run there. Detect that by checking whether cpus_ptr and cpu_preferred_mask
> intersect.
> 
> This takes care of the wakeup path optimization for FAIR tasks.
> is_cpu_allowed is called to ensure the wakeup happens on preferred CPUs.
> With that, additional checks in available_idle_cpu are not necessary.
> 
> Add a comment on the rare O(N**2) case in select_fallback_rq.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> v5->v6:
> - Drop optimization for select_fallback_rq
> - Keep comment on N**2
> 
>  kernel/sched/core.c  | 29 ++++++++++++++++++++++++++++-
>  kernel/sched/sched.h |  9 +++++++++
>  2 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a45f7c308329..1fb1c17e8387 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2500,6 +2500,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
>   */
>  static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  {
> +	bool task_has_preferred_cpu;
> +
>  	/* When not in the task's cpumask, no point in looking further. */
>  	if (!task_allowed_on_cpu(p, cpu))
>  		return false;
> @@ -2508,9 +2510,30 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  	if (is_migration_disabled(p))
>  		return cpu_online(cpu);
>  
> +	/*
> +	 * This is essential to maintain user affinities when preferred
> +	 * CPUs change. A task pinned on non-preferred CPU should continue
> +	 * to run there, since this is non-user triggered.
> +	 *
> +	 * If CPU is non-preferred and task can run on other CPUs which are
> +	 * currently preferred, then choose those other CPUs instead.
> +	 * Overhead is minimal when CPU is preferred.
> +	 *
> +	 * For majority of the cases this would still keep select_fallback_rq
> +	 * as O(N). task_has_preferred_cpus which is O(N) is called only if
> +	 * !cpu_preferred. Then task running there is expected to move out.
> +	 * So subsequent it should run on preferred CPU. This becomes O(N**2)
> +	 * only for tasks pinned only non preferred CPUs. That is rare case.
> +	 */

is_cpu_allowed() is ~20 lines now, and your patch doubles that count.
Can you keep this kind of reasoning in the commit message? 90% of setups
will disable preferred CPUs, and I guess 99% of developers don't care.

This is the code, not a scientific paper, after all.

> +	task_has_preferred_cpu = !cpu_preferred(cpu) &&
> +				 task_has_preferred_cpus(p);

Maybe it's just me, but the name looks illogical. Because if
'cpu' is preferred, the task indeed has some preferred CPUs.

Maybe 'can_sched_on_preferred' or something like that?

> +
>  	/* Non kernel threads are not allowed during either online or offline. */
> -	if (!(p->flags & PF_KTHREAD))
> +	if (!(p->flags & PF_KTHREAD)) {
> +		if (task_has_preferred_cpu)
> +			return false;
>  		return cpu_active(cpu);
> +	}

The comment on top of the block seems to be applicable to the 2nd
return only, right?

>  
>  	/* KTHREAD_IS_PER_CPU is always allowed. */
>  	if (kthread_is_per_cpu(p))
> @@ -2520,6 +2543,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  	if (cpu_dying(cpu))
>  		return false;
>  
> +	/* Try on preferred CPU first if possible*/
> +	if (task_has_preferred_cpu)
> +		return false;

Would it look better if you drop the comment and:
        
        if (need_sched_on_preferred)
                return false;

> +
>  	/* But are allowed during online. */

This comment is a continuation of the cpu_dying() case. With your
change it no longer is, so it needs to be reworded.

>  	return cpu_online(cpu);
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 26ae13c86b69..36ae20310891 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -4230,4 +4230,13 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
>  
>  #include "ext/ext.h"
>  
> +static inline bool task_has_preferred_cpus(struct task_struct *p)
> +{
> +	/* Only FAIR tasks honor preferred CPU state */
> +	if (unlikely(p->sched_class != &fair_sched_class))
> +		return false;
> +
> +	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
> +}
> +
>  #endif /* _KERNEL_SCHED_SCHED_H */
> -- 
> 2.47.3
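For readers following the naming discussion, here is a minimal userspace
sketch of the decision this patch adds to is_cpu_allowed(). It is not the
kernel code: it assumes CPU masks fit in one 64-bit word, and the names
model_cpumask, struct model_task, model_test_cpu() and
model_is_cpu_allowed() are hypothetical. Only the user-task path is
modelled; migration-disabled tasks, per-CPU kthreads and dying CPUs are
left out. The local variable uses the can_sched_on_preferred name
suggested in the review.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: bit N set means CPU N is in the mask (N < 64). */
typedef uint64_t model_cpumask;

struct model_task {
	model_cpumask cpus_allowed;	/* stands in for p->cpus_ptr */
	bool is_fair;			/* stands in for the FAIR class check */
};

static bool model_test_cpu(int cpu, model_cpumask mask)
{
	return (mask >> cpu) & 1;
}

/* Mirrors task_has_preferred_cpus(): only FAIR tasks honor the state. */
static bool model_task_has_preferred_cpus(const struct model_task *p,
					   model_cpumask preferred)
{
	if (!p->is_fair)
		return false;
	return (p->cpus_allowed & preferred) != 0;
}

/*
 * Simplified user-task path: reject a non-preferred CPU when the task
 * could run on a preferred one, otherwise fall back to the active check.
 */
static bool model_is_cpu_allowed(const struct model_task *p, int cpu,
				 model_cpumask active,
				 model_cpumask preferred)
{
	bool can_sched_on_preferred;

	if (!model_test_cpu(cpu, p->cpus_allowed))
		return false;

	can_sched_on_preferred = !model_test_cpu(cpu, preferred) &&
				 model_task_has_preferred_cpus(p, preferred);
	if (can_sched_on_preferred)
		return false;

	return model_test_cpu(cpu, active);
}
```

With active = CPUs 0-3 and preferred = CPUs 0-1, a FAIR task allowed on
all four CPUs is rejected on CPU 3, while a task pinned to CPU 3 alone is
still accepted there, which is the affinity-preserving behaviour the
commit message describes.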

* Re: [PATCH v6 06/23] sched/fair: Load balance only among preferred CPUs
  2026-07-01 14:16 ` [PATCH v6 06/23] sched/fair: Load balance only among preferred CPUs Shrikanth Hegde
@ 2026-07-01 16:19   ` Yury Norov
  2026-07-01 16:41     ` Shrikanth Hegde
  0 siblings, 1 reply; 32+ messages in thread
From: Yury Norov @ 2026-07-01 16:19 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini,
	seanjc, vschneid, huschle, rostedt, dietmar.eggemann, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc

On Wed, Jul 01, 2026 at 07:46:37PM +0530, Shrikanth Hegde wrote:
> Consider only preferred CPUs for load balance.
> 
> With this, load balance will end up choosing a preferred CPU for the pull.
> This keeps it from fighting the push task mechanism, which runs at tick.
> It also stops active balance from happening on a non-preferred CPU that
> is pulling the load.
> 
> This means there is no load balancing for tasks pinned only to
> non-preferred CPUs. They will continue to run where they were running
> before the CPUs were marked as non-preferred.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ce05acf52d35..9b2931b559d6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13391,7 +13391,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>  	};
>  	bool need_unlock = false;
>  
> -	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
> +	/* Spread load among preferred CPUs */

We don't have a "Spread load among active CPUs" comment. I don't think
the code becomes harder to understand when you replace one mask
with another.

> +	cpumask_and(cpus, sched_domain_span(sd), cpu_preferred_mask);
>  
>  	schedstat_inc(sd->lb_count[idle]);
>  
> -- 
> 2.47.3
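The effect of swapping the mask can be sketched in userspace C. This is
an illustration under stated assumptions, not the scheduler code: masks
are modelled as 64-bit words, and model_balance_candidates() and
model_task_can_be_pulled() are hypothetical helpers standing in for the
cpumask_and() call and for the later affinity filtering.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: bit N set means CPU N is in the mask (N < 64). */
typedef uint64_t model_cpumask;

/*
 * Stand-in for cpumask_and(cpus, sched_domain_span(sd), cpu_preferred_mask):
 * the candidate set is the domain span restricted to preferred CPUs.
 */
static model_cpumask model_balance_candidates(model_cpumask domain_span,
					      model_cpumask preferred)
{
	return domain_span & preferred;
}

/*
 * A task can only be pulled to a CPU that is both a candidate and in its
 * affinity mask. A task pinned only to non-preferred CPUs has no such
 * destination, so it stays where it is.
 */
static bool model_task_can_be_pulled(model_cpumask task_allowed,
				     model_cpumask domain_span,
				     model_cpumask preferred)
{
	model_cpumask candidates;

	candidates = model_balance_candidates(domain_span, preferred);
	return (task_allowed & candidates) != 0;
}
```

For a domain spanning CPUs 4-7 with CPUs 0-5 preferred, the candidates are
CPUs 4-5, so a task pinned to CPUs 6-7 is never a pull target.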

* Re: [PATCH v6 03/23] cpumask: Introduce cpu_preferred_mask
  2026-07-01 15:35   ` Yury Norov
@ 2026-07-01 16:40     ` Shrikanth Hegde
  0 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 16:40 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc

Hi Yury, Thanks for taking a look at the patches.

On 7/1/26 9:05 PM, Yury Norov wrote:
> On Wed, Jul 01, 2026 at 07:46:34PM +0530, Shrikanth Hegde wrote:
>> Provide cpu_preferred_mask infrastructure. Define get/set macros
>> which could be used to get/set CPU state as preferred.
>>
>> Values are set/clear by the new driver called steal_monitor.
>> It periodically samples the steal time and decides preferred CPU state.
>>
>> A CPU is set to preferred when it becomes active. Later it may be
>> marked as non-preferred depending on steal time values with
>> steal_monitor being enabled.
>>
>> Always maintain design construct of preferred is subset of active.
>> i.e. preferred ⊆ active ⊆ online ⊆ present ⊆ possible
>>
>> With PREFERRED_CPU=n, set is nop and get returns active state.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>> v5->v6:
>> - Make it nop for PREFERRED_CPU=n
>> - Make it EXPORT_SYMBOL_GPL
>>
>>   include/linux/cpumask.h | 27 ++++++++++++++++++++++++++-
>>   kernel/cpu.c            |  6 ++++++
>>   kernel/sched/core.c     |  5 +++++
>>   3 files changed, 37 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
>> index d3cda0544954..c97271c063ce 100644
>> --- a/include/linux/cpumask.h
>> +++ b/include/linux/cpumask.h
>> @@ -122,12 +122,20 @@ extern struct cpumask __cpu_enabled_mask;
>>   extern struct cpumask __cpu_present_mask;
>>   extern struct cpumask __cpu_active_mask;
>>   extern struct cpumask __cpu_dying_mask;
>> +
>> +#ifdef CONFIG_PREFERRED_CPU
>> +extern struct cpumask __cpu_preferred_mask;
>> +#else
>> +#define __cpu_preferred_mask __cpu_active_mask
>> +#endif
>> +
>>   #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
>>   #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
>>   #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
>>   #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
>>   #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
>>   #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
>> +#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
>>   
>>   extern atomic_t __num_online_cpus;
>>   extern unsigned int __num_possible_cpus;
>> @@ -1164,6 +1172,13 @@ void init_cpu_possible(const struct cpumask *src);
>>   #define set_cpu_active(cpu, active)	assign_cpu((cpu), &__cpu_active_mask, (active))
>>   #define set_cpu_dying(cpu, dying)	assign_cpu((cpu), &__cpu_dying_mask, (dying))
>>   
>> +#ifdef CONFIG_PREFERRED_CPU
>> +#define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
>> +#else
>> +/* Don't edit active state when the feature is off */
> 
> And that makes a random reader thinking like why in the world he mentions
> active state here?
> 
> Can you move this in commit message, or drop entirely?

Ok. I will drop it.

> 
>> +#define set_cpu_preferred(cpu, preferred) {}
>> +#endif
>> +
>>   void set_cpu_online(unsigned int cpu, bool online);
>>   void set_cpu_possible(unsigned int cpu, bool possible);
>>   
>> @@ -1258,7 +1273,12 @@ static __always_inline bool cpu_dying(unsigned int cpu)
>>   	return cpumask_test_cpu(cpu, cpu_dying_mask);
>>   }
>>   
>> -#else
>> +static __always_inline bool cpu_preferred(unsigned int cpu)
>> +{
>> +	return cpumask_test_cpu(cpu, cpu_preferred_mask);
>> +}
>> +
>> +#else	/* NR_CPUS <= 1 */
> 
> NR_CPUS can't be less than 1, I guess.

Ah yes. It can't be 0 :-)
This can only be /* NR_CPUS == 1 */.

Since that is self-explanatory, I guess it is better to drop the
comment addition. I think I added it to follow the #else comment style
I saw elsewhere.

> 
>>   
>>   #define num_online_cpus()	1U
>>   #define num_possible_cpus()	1U
>> @@ -1296,6 +1316,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
>>   	return false;
>>   }
>>   
>> +static __always_inline bool cpu_preferred(unsigned int cpu)
>> +{
>> +	return cpu == 0;
>> +}
>> +
>>   #endif /* NR_CPUS > 1 */
>>   
>>   #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index b3c8553d7bd6..376d297a6292 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -3103,6 +3103,11 @@ EXPORT_SYMBOL(__cpu_dying_mask);
>>   atomic_t __num_online_cpus __read_mostly;
>>   EXPORT_SYMBOL(__num_online_cpus);
>>   
>> +#ifdef CONFIG_PREFERRED_CPU
>> +struct cpumask __cpu_preferred_mask __read_mostly;
>> +EXPORT_SYMBOL_GPL(__cpu_preferred_mask);
>> +#endif
>> +
>>   void init_cpu_present(const struct cpumask *src)
>>   {
>>   	cpumask_copy(&__cpu_present_mask, src);
>> @@ -3160,6 +3165,7 @@ void __init boot_cpu_init(void)
>>   	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
>>   	set_cpu_online(cpu, true);
>>   	set_cpu_active(cpu, true);
>> +	set_cpu_preferred(cpu, true);
>>   	set_cpu_present(cpu, true);
>>   	set_cpu_possible(cpu, true);
>>   
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2e7cde033a31..a45f7c308329 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8690,6 +8690,9 @@ int sched_cpu_activate(unsigned int cpu)
>>   	 */
>>   	sched_set_rq_online(rq, cpu);
>>   
>> +	/* preferred is subset of active and follows its state */
>> +	set_cpu_preferred(cpu, true);
>> +
>>   	return 0;
>>   }
>>   
>> @@ -8703,6 +8706,8 @@ int sched_cpu_deactivate(unsigned int cpu)
>>   	if (ret)
>>   		return ret;
>>   
>> +	set_cpu_preferred(cpu, false);
>> +
>>   	/*
>>   	 * Remove CPU from nohz.idle_cpus_mask to prevent participating in
>>   	 * load balancing when not active
>> -- 
>> 2.47.3
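The "preferred is a subset of active" construct from the commit message
can be illustrated with a small userspace model. Everything here is a
sketch under assumptions: masks are 64-bit words, struct model_cpu_state
and the model_*() helpers are hypothetical, and the choice to refuse a
preferred update for an inactive CPU is this sketch's own way of keeping
the invariant, not something the patch states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: bit N set means CPU N is in the mask (N < 64). */
typedef uint64_t model_cpumask;

struct model_cpu_state {
	model_cpumask active;
	model_cpumask preferred;
};

static bool model_is_subset(model_cpumask sub, model_cpumask super)
{
	return (sub & ~super) == 0;
}

/* Stand-in for sched_cpu_activate(): a newly active CPU starts preferred. */
static void model_cpu_activate(struct model_cpu_state *s, int cpu)
{
	s->active |= UINT64_C(1) << cpu;
	s->preferred |= UINT64_C(1) << cpu;
	assert(model_is_subset(s->preferred, s->active));
}

/* Stand-in for sched_cpu_deactivate(): the CPU leaves both masks. */
static void model_cpu_deactivate(struct model_cpu_state *s, int cpu)
{
	s->preferred &= ~(UINT64_C(1) << cpu);
	s->active &= ~(UINT64_C(1) << cpu);
	assert(model_is_subset(s->preferred, s->active));
}

/* Stand-in for a steal_monitor update; only active CPUs may change. */
static bool model_set_preferred(struct model_cpu_state *s, int cpu, bool on)
{
	model_cpumask bit = UINT64_C(1) << cpu;

	if (!(s->active & bit))
		return false;	/* would break "preferred subset of active" */
	if (on)
		s->preferred |= bit;
	else
		s->preferred &= ~bit;
	return true;
}
```

Activating CPUs 0 and 1, marking CPU 1 non-preferred and then deactivating
CPU 0 leaves active = {1} and preferred = {}, with the subset relation
holding after every step.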


* Re: [PATCH v6 06/23] sched/fair: Load balance only among preferred CPUs
  2026-07-01 16:19   ` Yury Norov
@ 2026-07-01 16:41     ` Shrikanth Hegde
  0 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 16:41 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc



On 7/1/26 9:49 PM, Yury Norov wrote:
> On Wed, Jul 01, 2026 at 07:46:37PM +0530, Shrikanth Hegde wrote:
>> Consider only preferred CPUs for load balance.
>>
>> With this, load balance will end up choosing a preferred CPU for the pull.
>> This keeps it from fighting the push task mechanism, which runs at tick.
>> It also stops active balance from happening on a non-preferred CPU that
>> is pulling the load.
>>
>> This means there is no load balancing for tasks pinned only to
>> non-preferred CPUs. They will continue to run where they were running
>> before the CPUs were marked as non-preferred.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   kernel/sched/fair.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ce05acf52d35..9b2931b559d6 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -13391,7 +13391,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>   	};
>>   	bool need_unlock = false;
>>   
>> -	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>> +	/* Spread load among preferred CPUs */
> 
> We don't have a "Spread load among active CPUs" comment. I don't think
> the code becomes harder to understand when you replace one mask
> with another.
> 

Alright, I will drop it.

>> +	cpumask_and(cpus, sched_domain_span(sd), cpu_preferred_mask);
>>   
>>   	schedstat_inc(sd->lb_count[idle]);
>>   
>> -- 
>> 2.47.3


* Re: [PATCH v6 05/23] sched/core: Try to use a preferred CPU in is_cpu_allowed
  2026-07-01 16:09   ` Yury Norov
@ 2026-07-01 16:49     ` Shrikanth Hegde
  0 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 16:49 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc

Hi Yury,

On 7/1/26 9:39 PM, Yury Norov wrote:
> On Wed, Jul 01, 2026 at 07:46:36PM +0530, Shrikanth Hegde wrote:
>> When possible, pick a preferred CPU.
>>
>> The push task mechanism uses a stopper thread, which calls
>> select_fallback_rq and uses this mechanism to pick only a preferred CPU.
>>
>> When a task is affined only to non-preferred CPUs, it should continue to
>> run there. Detect that by checking whether cpus_ptr and cpu_preferred_mask
>> intersect.
>>
>> This takes care of the wakeup path optimization for FAIR tasks.
>> is_cpu_allowed is called to ensure the wakeup happens on preferred CPUs.
>> With that, additional checks in available_idle_cpu are not necessary.
>>
>> Add a comment on the rare O(N**2) case in select_fallback_rq.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>> v5->v6:
>> - Drop optimization for select_fallback_rq
>> - Keep comment on N**2
>>
>>   kernel/sched/core.c  | 29 ++++++++++++++++++++++++++++-
>>   kernel/sched/sched.h |  9 +++++++++
>>   2 files changed, 37 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index a45f7c308329..1fb1c17e8387 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2500,6 +2500,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
>>    */
>>   static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>>   {
>> +	bool task_has_preferred_cpu;
>> +
>>   	/* When not in the task's cpumask, no point in looking further. */
>>   	if (!task_allowed_on_cpu(p, cpu))
>>   		return false;
>> @@ -2508,9 +2510,30 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>>   	if (is_migration_disabled(p))
>>   		return cpu_online(cpu);
>>   
>> +	/*
>> +	 * This is essential to maintain user affinities when preferred
>> +	 * CPUs change. A task pinned on non-preferred CPU should continue
>> +	 * to run there, since this is non-user triggered.
>> +	 *
>> +	 * If CPU is non-preferred and task can run on other CPUs which are
>> +	 * currently preferred, then choose those other CPUs instead.
>> +	 * Overhead is minimal when CPU is preferred.
>> +	 *
>> +	 * For majority of the cases this would still keep select_fallback_rq
>> +	 * as O(N). task_has_preferred_cpus which is O(N) is called only if
>> +	 * !cpu_preferred. Then task running there is expected to move out.
>> +	 * So subsequent it should run on preferred CPU. This becomes O(N**2)
>> +	 * only for tasks pinned only non preferred CPUs. That is rare case.
>> +	 */
> 
> is_cpu_allowed() is ~20 lines now, and your patch doubles that count.
> Can you keep this kind of reasoning in the commit message? 90% of setups
> will disable preferred CPUs, and I guess 99% of developers don't care.
> 
> This is the code, not a scientific paper, after all.
> 

OK. I will update the comments and share the updated version soon
as a reply to this.

>> +	task_has_preferred_cpu = !cpu_preferred(cpu) &&
>> +				 task_has_preferred_cpus(p);
> 
> Maybe it's just me, but the name looks illogical. Because if
> 'cpu' is preferred, the task indeed has some preferred CPUs.
> 
> Maybe 'can_sched_on_preferred' or something like that?
> 

ok.

>> +
>>   	/* Non kernel threads are not allowed during either online or offline. */
>> -	if (!(p->flags & PF_KTHREAD))
>> +	if (!(p->flags & PF_KTHREAD)) {
>> +		if (task_has_preferred_cpu)
>> +			return false;
>>   		return cpu_active(cpu);
>> +	}
> 
> The comment on top of the block seems to be applicable to the 2nd
> return only, right?
> 

The first return is for !kthread and the second return is for kthread.
It is applicable to both (since kthreads are FAIR class).

But on that thought, do unbound kthreads run often, or
are they usually bound to a CPU? If it is the latter, we can even drop
that second return.

>>   
>>   	/* KTHREAD_IS_PER_CPU is always allowed. */
>>   	if (kthread_is_per_cpu(p))
>> @@ -2520,6 +2543,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>>   	if (cpu_dying(cpu))
>>   		return false;
>>   
>> +	/* Try on preferred CPU first if possible*/
>> +	if (task_has_preferred_cpu)
>> +		return false;
> 
> Would it look better if you drop the comment and:
>          
>          if (need_sched_on_preferred)
>                  return false;
> 
>> +
>>   	/* But are allowed during online. */
> 
> This comment is a continuation of the cpu_dying() case. With your
> change it no longer is, so it needs to be reworded.
> 
>>   	return cpu_online(cpu);
>>   }
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 26ae13c86b69..36ae20310891 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -4230,4 +4230,13 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
>>   
>>   #include "ext/ext.h"
>>   
>> +static inline bool task_has_preferred_cpus(struct task_struct *p)
>> +{
>> +	/* Only FAIR tasks honor preferred CPU state */
>> +	if (unlikely(p->sched_class != &fair_sched_class))
>> +		return false;
>> +
>> +	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
>> +}
>> +
>>   #endif /* _KERNEL_SCHED_SCHED_H */
>> -- 
>> 2.47.3


* Re: [PATCH v6 09/23] sched/core: Push current task from non preferred CPU
  2026-07-01 14:16 ` [PATCH v6 09/23] sched/core: Push current task from non preferred CPU Shrikanth Hegde
@ 2026-07-01 16:50   ` Yury Norov
  2026-07-01 17:03     ` Shrikanth Hegde
  0 siblings, 1 reply; 32+ messages in thread
From: Yury Norov @ 2026-07-01 16:50 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini,
	seanjc, vschneid, huschle, rostedt, dietmar.eggemann, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc

On Wed, Jul 01, 2026 at 07:46:40PM +0530, Shrikanth Hegde wrote:
> Actively push out the task running on a non-preferred CPU. Since the task
> is running on the CPU, the CPU needs to be stopped to push the task out.
> However, if the task is pinned only to non-preferred CPUs, it will continue
> running there. This helps maintain userspace affinities, unlike CPU
> hotplug or isolated cpusets.
> 
> Though the code is similar to __balance_push_cpu_stop and quite close to
> push_cpu_stop, it is kept separate as it provides a cleaner
> implementation with CONFIG_PREFERRED_CPU.
> 
> Add a push_task_work_done flag to protect the work buffer.
> Works only with the FAIR class.
> 
> For now, only the currently running task is pushed out. This keeps the
> code simpler. A future optimization may move all the queued tasks on
> the rq.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  kernel/sched/core.c  | 87 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h |  8 ++++
>  2 files changed, 95 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index aa4201bb8082..56905bac9525 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5797,6 +5797,9 @@ void sched_tick(void)
>  	unsigned long hw_pressure;
>  	u64 resched_latency;
>  
> +	if (!cpu_preferred(cpu))
> +		sched_push_current_non_preferred_cpu(rq);
> +
>  	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
>  		arch_scale_freq_tick();
>  
> @@ -11315,3 +11318,87 @@ void sched_change_end(struct sched_change_ctx *ctx)
>  		p->sched_class->prio_changed(rq, p, ctx->prio);
>  	}
>  }
> +
> +#ifdef CONFIG_PREFERRED_CPU
> +/* npc - non preferred CPU */
> +static DEFINE_PER_CPU(struct cpu_stop_work, npc_push_task_work);
> +
> +static int sched_non_preferred_cpu_push_stop(void *arg)
> +{
> +	struct task_struct *p = arg;
> +	struct rq *rq = this_rq();
> +	struct rq_flags rf;
> +	int cpu;
> +
> +	/* sanity check and clear */
> +	if (cpu_preferred(rq->cpu)) {
> +		scoped_guard (rq_lock, rq)

No whitespace please:

$ git grep "scoped_guard" | wc -l
2153
$ git grep "scoped_guard (" | wc -l
84

> +			rq->push_task_work_done = 0;
> +		put_task_struct(p);
> +		return 0;
> +	}
> +
> +	raw_spin_lock_irq(&p->pi_lock);
> +
> +	/* This could take rq lock. So call it before rq lock is taken */
> +	cpu = select_fallback_rq(rq->cpu, p);
> +	rq_lock(rq, &rf);
> +	rq->push_task_work_done = 0;
> +	update_rq_clock(rq);
> +
> +	context_unsafe_alias(rq);
> +
> +	if (task_rq(p) == rq && task_on_rq_queued(p))
> +		rq = __migrate_task(rq, &rf, p, cpu);
> +
> +	rq_unlock(rq, &rf);
> +	raw_spin_unlock_irq(&p->pi_lock);
> +	put_task_struct(p);
> +
> +	return 0;
> +}
> +
> +/*
> + * Push the current task running on non-preferred CPU.
> + * Using this non preferred CPU will lead to more vCPU preemptions
> + * in the host. So it is better not to use this CPU.
> + *
> + * Since task is running, call a stopper to push the task out. This is
> + * similar to how task moves during hotplug. In select_fallback_rq a
> + * preferred CPU will be chosen and henceforth task shouldn't come back to
> + * this CPU again.
> + *
> + * Works for FAIR class only
> + *
> + * If task is affined only non-preferred CPUs, it can't be moved out
> + */
> +void sched_push_current_non_preferred_cpu(struct rq *rq)
> +{
> +	struct task_struct *push_task = rq->curr;
> +
> +	/* Preferred feature works only for FAIR class */
> +	if (push_task->sched_class != &fair_sched_class)
> +		return;

This is useless - task_has_preferred_cpus() checks that already.

> +
> +	if (kthread_is_per_cpu(push_task) ||
> +	    is_migration_disabled(push_task))
> +		return;
> +
> +	/* Don't push the task if it is affined only on non preferred CPUs */
> +	if (!task_has_preferred_cpus(push_task))
> +		return;
> +
> +	/* There is already a stopper thread for this. Dont race with it. */
> +	if (rq->push_task_work_done == 1)
> +		return;
> +
> +	/* sched_tick runs with interrupts disabled. */
> +	get_task_struct(push_task);
> +
> +	scoped_guard (rq_lock, rq)
> +		rq->push_task_work_done = 1;
> +
> +	stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
> +			    push_task, this_cpu_ptr(&npc_push_task_work));
> +}
> +#endif
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 36ae20310891..711fc8bd7ebc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1277,6 +1277,8 @@ struct rq {
>  
>  	struct list_head cfs_tasks;
>  
> +	bool			push_task_work_done;
> +
>  	struct sched_avg	avg_rt;
>  	struct sched_avg	avg_dl;
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> @@ -4239,4 +4241,10 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
>  	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
>  }
>  
> +#ifdef CONFIG_PREFERRED_CPU
> +void sched_push_current_non_preferred_cpu(struct rq *rq);
> +#else	/* !CONFIG_PREFERRED_CPU */
> +static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
> +#endif
> +
>  #endif /* _KERNEL_SCHED_SCHED_H */
> -- 
> 2.47.3
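The role of the push_task_work_done flag, protecting a single per-CPU
work buffer from being queued twice, can be sketched as follows. This is
a single-threaded userspace model with hypothetical names (struct
model_rq, model_tick_try_push(), model_stopper_done()); the patch updates
the flag under the rq lock, which is deliberately not modelled here.

```c
#include <stdbool.h>

struct model_rq {
	bool push_work_pending;	/* plays the role of push_task_work_done */
	int queued;		/* number of stopper work items queued */
};

/*
 * Tick side: queue at most one work item per runqueue. While the flag
 * is set, the single work buffer is still owned by the stopper.
 */
static bool model_tick_try_push(struct model_rq *rq)
{
	if (rq->push_work_pending)
		return false;

	rq->push_work_pending = true;
	rq->queued++;		/* stands in for stop_one_cpu_nowait() */
	return true;
}

/* Stopper side: once the callback has run, the buffer may be reused. */
static void model_stopper_done(struct model_rq *rq)
{
	rq->push_work_pending = false;
}
```

Two ticks in a row queue only one work item; a third tick after the
stopper callback has finished queues the next one.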

* Re: [PATCH v6 09/23] sched/core: Push current task from non preferred CPU
  2026-07-01 16:50   ` Yury Norov
@ 2026-07-01 17:03     ` Shrikanth Hegde
  0 siblings, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2026-07-01 17:03 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc

Hi Yury,

On 7/1/26 10:20 PM, Yury Norov wrote:
> On Wed, Jul 01, 2026 at 07:46:40PM +0530, Shrikanth Hegde wrote:
>> Actively push out the task running on a non-preferred CPU. Since the task
>> is running on the CPU, the CPU needs to be stopped to push the task out.
>> However, if the task is pinned only to non-preferred CPUs, it will continue
>> running there. This helps maintain userspace affinities, unlike CPU
>> hotplug or isolated cpusets.
>>
>> Though the code is similar to __balance_push_cpu_stop and quite close to
>> push_cpu_stop, it is kept separate as it provides a cleaner
>> implementation with CONFIG_PREFERRED_CPU.
>>
>> Add a push_task_work_done flag to protect the work buffer.
>> Works only with the FAIR class.
>>
>> For now, only the currently running task is pushed out. This keeps the
>> code simpler. A future optimization may move all the queued tasks on
>> the rq.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   kernel/sched/core.c  | 87 ++++++++++++++++++++++++++++++++++++++++++++
>>   kernel/sched/sched.h |  8 ++++
>>   2 files changed, 95 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index aa4201bb8082..56905bac9525 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5797,6 +5797,9 @@ void sched_tick(void)
>>   	unsigned long hw_pressure;
>>   	u64 resched_latency;
>>   
>> +	if (!cpu_preferred(cpu))
>> +		sched_push_current_non_preferred_cpu(rq);
>> +
>>   	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
>>   		arch_scale_freq_tick();
>>   
>> @@ -11315,3 +11318,87 @@ void sched_change_end(struct sched_change_ctx *ctx)
>>   		p->sched_class->prio_changed(rq, p, ctx->prio);
>>   	}
>>   }
>> +
>> +#ifdef CONFIG_PREFERRED_CPU
>> +/* npc - non preferred CPU */
>> +static DEFINE_PER_CPU(struct cpu_stop_work, npc_push_task_work);
>> +
>> +static int sched_non_preferred_cpu_push_stop(void *arg)
>> +{
>> +	struct task_struct *p = arg;
>> +	struct rq *rq = this_rq();
>> +	struct rq_flags rf;
>> +	int cpu;
>> +
>> +	/* sanity check and clear */
>> +	if (cpu_preferred(rq->cpu)) {
>> +		scoped_guard (rq_lock, rq)
> 
> No whitespace please:
> 
> $ git grep "scoped_guard" | wc -l
> 2153
> $ git grep "scoped_guard (" | wc -l
> 84
> 
>> +			rq->push_task_work_done = 0;
>> +		put_task_struct(p);
>> +		return 0;
>> +	}
>> +
>> +	raw_spin_lock_irq(&p->pi_lock);
>> +
>> +	/* This could take rq lock. So call it before rq lock is taken */
>> +	cpu = select_fallback_rq(rq->cpu, p);
>> +	rq_lock(rq, &rf);
>> +	rq->push_task_work_done = 0;
>> +	update_rq_clock(rq);
>> +
>> +	context_unsafe_alias(rq);
>> +
>> +	if (task_rq(p) == rq && task_on_rq_queued(p))
>> +		rq = __migrate_task(rq, &rf, p, cpu);
>> +
>> +	rq_unlock(rq, &rf);
>> +	raw_spin_unlock_irq(&p->pi_lock);
>> +	put_task_struct(p);
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Push the current task running on non-preferred CPU.
>> + * Using this non preferred CPU will lead to more vCPU preemptions
>> + * in the host. So it is better not to use this CPU.
>> + *
>> + * Since task is running, call a stopper to push the task out. This is
>> + * similar to how task moves during hotplug. In select_fallback_rq a
>> + * preferred CPU will be chosen and henceforth task shouldn't come back to
>> + * this CPU again.
>> + *
>> + * Works for FAIR class only
>> + *
>> + * If task is affined only non-preferred CPUs, it can't be moved out
>> + */
>> +void sched_push_current_non_preferred_cpu(struct rq *rq)
>> +{
>> +	struct task_struct *push_task = rq->curr;
>> +
>> +	/* Preferred feature works only for FAIR class */
>> +	if (push_task->sched_class != &fair_sched_class)
>> +		return;
> 
> This is useless - task_has_preferred_cpus() checks that already.
> 

Yes, thanks for catching that. Earlier versions of task_has_preferred_cpus()
didn't have that check, which is why it was still here.

Even the kthread_is_per_cpu() check is not necessary, since the cpumask
check in task_has_preferred_cpus() already covers that case.

Let me re-order this.
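
To illustrate why those two checks look redundant, here is a small
user-space model. It is only a sketch, not the kernel code: every model_*
name is hypothetical, CPU masks are plain 64-bit words, and it assumes
(per the review comment above) that task_has_preferred_cpus() rejects
non-FAIR tasks before testing the mask intersection.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins; none of these are kernel types. */
enum model_class { MODEL_FAIR, MODEL_RT };

struct model_task {
	enum model_class class;
	uint64_t cpus_allowed;		/* bit n set => task may run on CPU n */
	bool migration_disabled;
};

/* Stand-in for cpu_preferred_mask: bit n set => CPU n is preferred. */
static uint64_t model_preferred_mask;

/*
 * Assumed shape of task_has_preferred_cpus(): reject non-FAIR tasks,
 * then test whether the affinity mask intersects the preferred mask.
 */
static bool model_task_has_preferred_cpus(const struct model_task *p)
{
	if (p->class != MODEL_FAIR)
		return false;

	return (p->cpus_allowed & model_preferred_mask) != 0;
}

/*
 * Re-ordered push decision without separate class or per-CPU kthread
 * checks. A per-CPU kthread bound to a non-preferred CPU has an affinity
 * mask containing only that CPU, so the intersection test rejects it.
 */
static bool model_should_push(const struct model_task *p, bool push_pending)
{
	if (p->migration_disabled)
		return false;

	if (!model_task_has_preferred_cpus(p))
		return false;

	/* A push is already queued for this CPU; don't queue another. */
	if (push_pending)
		return false;

	return true;
}
```

With CPUs 0-1 preferred (model_preferred_mask = 0x3), a FAIR task bound
only to CPU 2 is not pushed, an RT task is not pushed regardless of its
mask, and a FAIR task allowed on CPUs 0-3 is pushed unless a push is
already pending.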


>> +
>> +	if (kthread_is_per_cpu(push_task) ||
>> +	    is_migration_disabled(push_task))
>> +		return;
>> +
>> +	/* Don't push the task if it is affined only on non preferred CPUs */
>> +	if (!task_has_preferred_cpus(push_task))
>> +		return;
>> +
>> +	/* There is already a stopper thread for this. Dont race with it. */
>> +	if (rq->push_task_work_done == 1)
>> +		return;
>> +
>> +	/* sched_tick runs with interrupts disabled. */
>> +	get_task_struct(push_task);
>> +
>> +	scoped_guard (rq_lock, rq)
>> +		rq->push_task_work_done = 1;
>> +
>> +	stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
>> +			    push_task, this_cpu_ptr(&npc_push_task_work));
>> +}
>> +#endif
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 36ae20310891..711fc8bd7ebc 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1277,6 +1277,8 @@ struct rq {
>>   
>>   	struct list_head cfs_tasks;
>>   
>> +	bool			push_task_work_done;
>> +
>>   	struct sched_avg	avg_rt;
>>   	struct sched_avg	avg_dl;
>>   #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>> @@ -4239,4 +4241,10 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
>>   	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
>>   }
>>   
>> +#ifdef CONFIG_PREFERRED_CPU
>> +void sched_push_current_non_preferred_cpu(struct rq *rq);
>> +#else	/* !CONFIG_PREFERRED_CPU */
>> +static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
>> +#endif
>> +
>>   #endif /* _KERNEL_SCHED_SCHED_H */
>> -- 
>> 2.47.3



end of thread, other threads:[~2026-07-01 17:03 UTC | newest]

Thread overview: 32+ messages
2026-07-01 14:16 [PATCH v6 00/23] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 01/23] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 02/23] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 03/23] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
2026-07-01 15:35   ` Yury Norov
2026-07-01 16:40     ` Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 04/23] sysfs: Add preferred CPU file Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 05/23] sched/core: Try to use a preferred CPU in is_cpu_allowed Shrikanth Hegde
2026-07-01 16:09   ` Yury Norov
2026-07-01 16:49     ` Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 06/23] sched/fair: Load balance only among preferred CPUs Shrikanth Hegde
2026-07-01 16:19   ` Yury Norov
2026-07-01 16:41     ` Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 07/23] sched/fair: Pull the load on preferred CPU Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 08/23] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 09/23] sched/core: Push current task from non preferred CPU Shrikanth Hegde
2026-07-01 16:50   ` Yury Norov
2026-07-01 17:03     ` Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 10/23] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 11/23] virt/steal_monitor: Add documentation Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 12/23] virt: Introduce steal monitor driver Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 13/23] virt/steal_monitor: Restore to active on module disable Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 14/23] virt/steal_monitor: Define steal_monitor structure Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 15/23] virt/steal_monitor: Add control knobs for handling steal values Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 16/23] virt/steal_monitor: Compute work at regular intervals Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 17/23] virt/steal_monitor: Provide default method to get systemwide steal time Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 18/23] virt/steal_monitor: Provide default method to inc/dec preferred CPUs Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 19/23] virt/steal_monitor: Provide default method to get num of CPUs for steal ratio Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 20/23] virt/steal_monitor: Act on steal values at regular intervals Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 21/23] virt/steal_monitor: Add direction control Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 22/23] virt/steal_monitor: Add design check of preferred subset of active Shrikanth Hegde
2026-07-01 14:16 ` [PATCH v6 23/23] virt/steal_monitor: Optimise decrease_preferred_cpus when all CPUs are housekeeping Shrikanth Hegde
