[PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff

* [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff
@ 2026-06-17 17:41 Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 01/20] sched/debug: Remove unused schedstats Shrikanth Hegde
                   ` (19 more replies)
  0 siblings, 20 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Very briefly:
- Maintain a set of CPUs that the workload may use, denoted by
  cpu_preferred_mask.
- Periodically compute the steal time. If it crosses the high or low
  threshold, shrink or grow the set of preferred CPUs respectively.
- If a CPU is marked as non-preferred, push the task running on it
  away if possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
  within preferred CPUs.
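
As a rough illustration of the second bullet, one monitor step could look
like the sketch below. The threshold values, the one-CPU-at-a-time step and
the function name are assumptions made for this example only; they are not
taken from the series.

```c
/* Hypothetical thresholds in percent of wall time; not from the series. */
#define STEAL_HIGH_PCT	10
#define STEAL_LOW_PCT	2

/*
 * One monitor step: given the steal percentage observed over the last
 * interval, return the new number of preferred CPUs.
 */
static unsigned int steal_step(unsigned int nr_preferred,
			       unsigned int nr_active,
			       unsigned int steal_pct)
{
	if (steal_pct >= STEAL_HIGH_PCT && nr_preferred > 1)
		return nr_preferred - 1;	/* contention: back off */
	if (steal_pct <= STEAL_LOW_PCT && nr_preferred < nr_active)
		return nr_preferred + 1;	/* contention gone: grow back */
	return nr_preferred;			/* in between: hold steady */
}
```

The gap between the two thresholds is what keeps the count from flapping
when steal time hovers around a single value.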

For more details on the idea, problem statement and performance numbers,
please refer to the cover letter of v2 [2] and the OSPM talk [1].

*** Please review and provide your feedback!! ***

[1] OSPM talk: https://youtu.be/adxUKFPlOp0
[2] v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@linux.ibm.com/#t
[3] v3: https://lore.kernel.org/all/20260514152204.481115-1-sshegde@linux.ibm.com/

v3->v4:
- Made preferred a subset of active instead of online. (K Prateek Nayak,
  Peter Zijlstra)
- Dropped the RT patch.
- Decided that a generic sched_ext change doesn't make sense. Hence it
  has to be a custom sched_ext scheduler with its own select_cpu,
  enqueue/dequeue etc. This will be done later.
- Changed is_cpu_allowed/select_fallback_rq to avoid N**2 complexity
  (K Prateek Nayak). Two pieces of information are encoded in one field
  there. Let me know if this needs to be split up into two.
- Added concurrency protection for enabling/disabling the steal monitor
  (Ilya Leoshkevich)
- Dropped tmp_mask and reset steal_monitor state (Ilya Leoshkevich)
- Added a few cpumask_check calls (Yury Norov)
- Picked up tags for patch 1. (Thanks to K Prateek Nayak)
- Decided not to add too much complexity for NUMA splicing.
There is no major TODO item at this point. There are a few minor additions
which may be good to do, provided the numbers show they are worth it.
Performance numbers are expected to be the same as v2.

base: tip/sched/core at 
c095741713d1 ("sched/fair: Fix newidle vs core-sched")


Shrikanth Hegde (20):
  sched/debug: Remove unused schedstats
  sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  kconfig: Provide PREFERRED_CPU option
  cpumask: Introduce cpu_preferred_mask
  sysfs: Add preferred CPU file
  sched/core: allow only preferred CPUs in is_cpu_allowed
  sched/fair: Select preferred CPU at wakeup when possible
  sched/fair: load balance only among preferred CPUs
  sched/core: Keep tick on non-preferred CPUs until tasks are out
  sched/core: Push current task from non preferred CPU
  sched/debug: Add migration stats due to non preferred CPUs
  sched/debug: Create debugfs folder steal monitor
  sched/debug: Provide debugfs to enable/disable steal monitor
  sched/core: Introduce a simple steal monitor
  sched/core: Compute steal values at regular intervals
  sched/core: Introduce default arch handling code for inc/dec preferred
    CPUs
  sched/core: Handle steal values and mark CPUs as preferred
  sched/core: Mark the direction of steal values to avoid oscillations
  sched/debug: Add debug knobs for steal monitor
  sched/core: Add a few check for valid CPU in inc/dec of preferred CPUs

 .../ABI/testing/sysfs-devices-system-cpu      |  11 +
 Documentation/scheduler/sched-arch.rst        |  49 ++++
 Documentation/scheduler/sched-debug.rst       |  34 +++
 drivers/base/cpu.c                            |   8 +
 include/linux/cpumask.h                       |  21 +-
 include/linux/sched.h                         |  20 +-
 kernel/Kconfig.preempt                        |  13 +
 kernel/cpu.c                                  |  14 +
 kernel/sched/core.c                           | 269 +++++++++++++++++-
 kernel/sched/debug.c                          |  55 +++-
 kernel/sched/fair.c                           |   4 +
 kernel/sched/sched.h                          |  42 +++
 12 files changed, 531 insertions(+), 9 deletions(-)

-- 
2.47.3



* [PATCH v4 01/20] sched/debug: Remove unused schedstats
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 02/20] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

nr_migrations_cold, nr_wakeups_passive and nr_wakeups_idle are not
updated anywhere, so remove them.

These are per-process stats, so updating the schedstats version isn't
necessary.

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/sched.h | 3 ---
 kernel/sched/debug.c  | 3 ---
 2 files changed, 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 35e6183ef615..fc6ecb3869dd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,7 +550,6 @@ struct sched_statistics {
 	s64				exec_max;
 	u64				slice_max;
 
-	u64				nr_migrations_cold;
 	u64				nr_failed_migrations_affine;
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
@@ -563,8 +562,6 @@ struct sched_statistics {
 	u64				nr_wakeups_remote;
 	u64				nr_wakeups_affine;
 	u64				nr_wakeups_affine_attempts;
-	u64				nr_wakeups_passive;
-	u64				nr_wakeups_idle;
 
 #ifdef CONFIG_SCHED_CORE
 	u64				core_forceidle_sum;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 40584b27ea0c..f3a033b34ba0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1359,7 +1359,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(wait_count);
 		PN_SCHEDSTAT(iowait_sum);
 		P_SCHEDSTAT(iowait_count);
-		P_SCHEDSTAT(nr_migrations_cold);
 		P_SCHEDSTAT(nr_failed_migrations_affine);
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
@@ -1371,8 +1370,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_wakeups_remote);
 		P_SCHEDSTAT(nr_wakeups_affine);
 		P_SCHEDSTAT(nr_wakeups_affine_attempts);
-		P_SCHEDSTAT(nr_wakeups_passive);
-		P_SCHEDSTAT(nr_wakeups_idle);
 
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
-- 
2.47.3



* [PATCH v4 02/20] sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 01/20] sched/debug: Remove unused schedstats Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 03/20] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Add documentation for the new cpumask called cpu_preferred_mask. This
should help users understand what this mask is and the concept behind it.

Document how to enable it and its implementation aspects.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v3->v4:
- Update docs to reflect that preferred is a subset of active.

 Documentation/scheduler/sched-arch.rst | 49 ++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..f17c54f44421 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,55 @@ Your cpu_idle routines need to obey the following rules:
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
 
+Preferred CPUs
+==============
+
+In virtualised environments it is possible to overcommit CPU resources,
+i.e. the sum of virtual CPUs (vCPUs) of all VMs is greater than the number of
+physical CPUs (pCPUs). Under such conditions, when all or many VMs have high
+utilization, the hypervisor cannot satisfy the CPU requirement and has to
+context switch within or across VMs, i.e. it needs to preempt one vCPU to run
+another. This is called vCPU preemption. It is more expensive than a
+task context switch within a vCPU.
+
+In such cases it is better to reduce the combined vCPU demand of all VMs
+by not using some of the vCPUs. vCPUs on which the workload can be
+scheduled without increasing contention for pCPUs are called
+"Preferred CPUs".
+
+In most cases preferred CPUs will be the same as active CPUs. When there is
+pCPU contention, preferred CPUs will shrink based on the amount of steal time.
+When the pCPU contention goes away, as indicated by steal time, preferred CPUs
+will become the same as active CPUs again. The feature has to be enabled by
+writing 1 to /sys/kernel/debug/sched/steal_monitor/enable
+
+By design, preferred CPUs are always a subset of active CPUs.
+With CONFIG_PREFERRED_CPU=n, they are the same as active CPUs.
+
+Scheduling decisions such as wakeup, pushing the task etc. need this
+CPU state info. It is maintained in cpu_preferred_mask.
+
+vCPUs which are not in cpu_preferred_mask should be treated as vCPUs which
+should not be used at this moment, provided that doesn't break user affinity.
+This is achieved by:
+1. Selecting a preferred CPU at wakeup.
+2. Pushing the task away from a non-preferred CPU at tick.
+3. Selecting only preferred CPUs for load balance.
+
+/sys/devices/system/cpu/preferred prints the current cpu_preferred_mask in
+cpulist format.
+
+Notes:
+1. This feature is available under CONFIG_PREFERRED_CPU.
+2. This feature works for the FAIR class only.
+3. A pinned task which can't be moved to preferred CPUs will continue
+   to run based on its affinity, but no load balancing happens for it.
+4. If needed, steal time based governors/arch dependent methods
+   could be used to cater to different types of cpu numbers.
+   An arch can do so by implementing its own hooks.
+5. The decision to use/not use a CPU is driven by the kernel. Hence it
+   shouldn't break user affinities. This is one of the main reasons why CPU
+   hotplug or isolated cpuset partitions were not a solution.
 
 Possible arch/ problems
 =======================
-- 
2.47.3



* [PATCH v4 03/20] kconfig: Provide PREFERRED_CPU option
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 01/20] sched/debug: Remove unused schedstats Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 02/20] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-18  0:51   ` Yury Norov
  2026-06-17 17:41 ` [PATCH v4 04/20] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Introduce a new config named PREFERRED_CPU.

This helps to:
- Avoid code bloat when PREFERRED_CPU=n. In that case preferred
  is the same as active.
- Avoid ifdeffery around PREFERRED_CPU in many files.

Since the paravirtualized use case is the main driving force of this
feature, make it default for kernels with PARAVIRT=y and SMP=y.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/Kconfig.preempt | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 88c594c6d7fc..0995f5ba66eb 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -192,3 +192,16 @@ config SCHED_CLASS_EXT
 	  For more information:
 	    Documentation/scheduler/sched-ext.rst
 	    https://github.com/sched-ext/scx
+
+config PREFERRED_CPU
+	bool "Dynamic vCPU management based on steal time"
+	default y if PARAVIRT && SMP
+	help
+	This feature helps to reduce the steal time in paravirtualised
+	environments, thereby reducing vCPU preemption. Reducing vCPU
+	preemption mitigates lock holder preemption and reduces the
+	cost of vCPU preemption in the host.
+
+	By default preferred CPUs will be same as active CPUs. Depending
+	on the steal time when steal monitor is enabled, preferred CPUs
+	could become subset of active CPUs.
-- 
2.47.3



* [PATCH v4 04/20] cpumask: Introduce cpu_preferred_mask
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (2 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 03/20] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-18  1:29   ` Yury Norov
  2026-06-17 17:41 ` [PATCH v4 05/20] sysfs: Add preferred CPU file Shrikanth Hegde
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

This patch:
- Declares and defines cpu_preferred_mask.
- Adds get/set helpers for it.

Values are set/cleared by the scheduler based on the detected steal time.

A CPU is set to preferred when it becomes active. Later it may be
marked as non-preferred depending on steal time values, provided the
steal monitor is enabled.

Always maintain the design construct that preferred is a subset of active,
i.e. preferred ⊆ active ⊆ online ⊆ present ⊆ possible
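
The subset rule can be illustrated with a small userspace model. This is
only a sketch: the toy_* names and the 64-bit masks are stand-ins invented
for the example, not the kernel's struct cpumask API, and CPU numbers are
assumed to be below 64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy 64-CPU masks standing in for the kernel's struct cpumask. */
static uint64_t active_mask, preferred_mask;

static uint64_t toy_bit(unsigned int cpu)
{
	return UINT64_C(1) << cpu;	/* caller guarantees cpu < 64 */
}

/* True when preferred is a subset of active. */
static bool toy_subset_ok(void)
{
	return (preferred_mask & ~active_mask) == 0;
}

/* A CPU becomes preferred as soon as it becomes active. */
static void toy_cpu_activate(unsigned int cpu)
{
	active_mask |= toy_bit(cpu);
	preferred_mask |= toy_bit(cpu);
}

/* Clear preferred first so the subset rule holds at every step. */
static void toy_cpu_deactivate(unsigned int cpu)
{
	preferred_mask &= ~toy_bit(cpu);
	active_mask &= ~toy_bit(cpu);
}

/* Steal-driven update: refuses to mark an inactive CPU as preferred. */
static bool toy_set_preferred(unsigned int cpu, bool preferred)
{
	if (preferred && !(active_mask & toy_bit(cpu)))
		return false;
	if (preferred)
		preferred_mask |= toy_bit(cpu);
	else
		preferred_mask &= ~toy_bit(cpu);
	return true;
}
```

Checking toy_subset_ok() after every state change is a cheap way to
exercise the invariant in such a model.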

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v3->v4:
- Make preferred a subset of active instead of online.

 include/linux/cpumask.h | 21 ++++++++++++++++++++-
 kernel/cpu.c            | 14 ++++++++++++++
 kernel/sched/core.c     |  5 +++++
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 80211900f373..30ea64cc1656 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -120,12 +120,20 @@ extern struct cpumask __cpu_enabled_mask;
 extern struct cpumask __cpu_present_mask;
 extern struct cpumask __cpu_active_mask;
 extern struct cpumask __cpu_dying_mask;
+
+#ifdef CONFIG_PREFERRED_CPU
+extern struct cpumask __cpu_preferred_mask;
+#else
+#define __cpu_preferred_mask __cpu_active_mask
+#endif
+
 #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
 #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
 #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
 #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
 #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
 #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
+#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
 
 extern atomic_t __num_online_cpus;
 extern unsigned int __num_possible_cpus;
@@ -1164,6 +1172,7 @@ void init_cpu_possible(const struct cpumask *src);
 
 void set_cpu_online(unsigned int cpu, bool online);
 void set_cpu_possible(unsigned int cpu, bool possible);
+void set_cpu_preferred(unsigned int cpu, bool preferred);
 
 /**
  * to_cpumask - convert a NR_CPUS bitmap to a struct cpumask *
@@ -1256,7 +1265,12 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return cpumask_test_cpu(cpu, cpu_dying_mask);
 }
 
-#else
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+	return cpumask_test_cpu(cpu, cpu_preferred_mask);
+}
+
+#else	/* NR_CPUS <= 1 */
 
 #define num_online_cpus()	1U
 #define num_possible_cpus()	1U
@@ -1294,6 +1308,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return false;
 }
 
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+	return cpu == 0;
+}
+
 #endif /* NR_CPUS > 1 */
 
 #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
diff --git a/kernel/cpu.c b/kernel/cpu.c
index bc4f7a9ba64e..c196ba5d8b2a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3107,6 +3107,11 @@ EXPORT_SYMBOL(__cpu_dying_mask);
 atomic_t __num_online_cpus __read_mostly;
 EXPORT_SYMBOL(__num_online_cpus);
 
+#ifdef CONFIG_PREFERRED_CPU
+struct cpumask __cpu_preferred_mask __read_mostly;
+EXPORT_SYMBOL(__cpu_preferred_mask);
+#endif
+
 void init_cpu_present(const struct cpumask *src)
 {
 	cpumask_copy(&__cpu_present_mask, src);
@@ -3154,6 +3159,14 @@ void set_cpu_possible(unsigned int cpu, bool possible)
 	}
 }
 
+void set_cpu_preferred(unsigned int cpu, bool preferred)
+{
+	if (!IS_ENABLED(CONFIG_PREFERRED_CPU))
+		return;
+
+	assign_cpu((cpu), &__cpu_preferred_mask, (preferred));
+}
+
 /*
  * Activate the first processor.
  */
@@ -3164,6 +3177,7 @@ void __init boot_cpu_init(void)
 	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
 	set_cpu_online(cpu, true);
 	set_cpu_active(cpu, true);
+	set_cpu_preferred(cpu, true);
 	set_cpu_present(cpu, true);
 	set_cpu_possible(cpu, true);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f4530eb543f..9e16946c9d62 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8685,6 +8685,9 @@ int sched_cpu_activate(unsigned int cpu)
 	 */
 	sched_set_rq_online(rq, cpu);
 
+	/* preferred is subset of active and follows its state */
+	set_cpu_preferred(cpu, true);
+
 	return 0;
 }
 
@@ -8698,6 +8701,8 @@ int sched_cpu_deactivate(unsigned int cpu)
 	if (ret)
 		return ret;
 
+	set_cpu_preferred(cpu, false);
+
 	/*
 	 * Remove CPU from nohz.idle_cpus_mask to prevent participating in
 	 * load balancing when not active
-- 
2.47.3



* [PATCH v4 05/20] sysfs: Add preferred CPU file
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (3 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 04/20] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Add "preferred" file in /sys/devices/system/cpu

This offers:
- Users can quickly check which CPUs are marked as preferred at this
  moment.
- Userspace tools such as irqbalance could use this mask to steer IRQs
  to preferred CPUs.

For example:
cat /sys/devices/system/cpu/online
0-719
cat /sys/devices/system/cpu/preferred
0-599        <<< Implies 0-599 are preferred for workloads and 600-719
                 should be avoided at this moment.

cat /sys/devices/system/cpu/preferred
0-719        <<< All CPUs are usable. There is no preference.
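
A userspace consumer could parse the cpulist output along the lines of the
sketch below. The function names are made up for this example, and the
parser is deliberately simple: it accepts only plain decimal CPU numbers and
ranges, and reads at most one 4 KiB line.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Count the CPUs in a cpulist string such as "0-599" or "0-3,8,10-11".
 * Returns -1 on malformed input, 0 for an empty list.
 */
static long cpulist_count(const char *s)
{
	long total = 0;

	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s || lo < 0)
			return -1;
		if (*end == '-') {
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s || hi < lo)
				return -1;
		}
		total += hi - lo + 1;
		if (*end == ',')
			end++;			/* move on to the next entry */
		else if (*end && *end != '\n')
			return -1;		/* unexpected character */
		s = end;
	}
	return total;
}

/* Read the first line of a sysfs cpulist file; -1 if it cannot be opened. */
static long cpulist_file_count(const char *path)
{
	char buf[4096];
	FILE *f = fopen(path, "r");
	long n;

	if (!f)
		return -1;
	n = fgets(buf, sizeof(buf), f) ? cpulist_count(buf) : 0;
	fclose(f);
	return n;
}
```

Calling cpulist_file_count("/sys/devices/system/cpu/preferred") and
comparing it with the count from .../online would show how many CPUs are
currently being avoided.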

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 11 +++++++++++
 drivers/base/cpu.c                                 |  8 ++++++++
 2 files changed, 19 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 82d10d556cc8..5fb973d53287 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -806,3 +806,14 @@ Date:		Nov 2022
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:
 		(RO) the list of CPUs that can be brought online.
+
+What:		/sys/devices/system/cpu/preferred
+Date:		Jun 2026
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		(RO) the list of preferred CPUs at this moment.
+		These are the only CPUs meant to be used at the moment.
+		Using a CPU outside of the list could lead to more
+		contention for the underlying physical CPU resource. Changes
+		dynamically based on steal time. With CONFIG_PREFERRED_CPU=n it
+		is the same as active CPUs. See sched-arch.rst for more details.
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 875abdc9942e..0c6647805805 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -391,6 +391,13 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
 }
 #endif
 
+static ssize_t preferred_show(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_preferred_mask));
+}
+static DEVICE_ATTR_RO(preferred);
+
 const struct bus_type cpu_subsys = {
 	.name = "cpu",
 	.dev_name = "cpu",
@@ -532,6 +539,7 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
+	&dev_attr_preferred.attr,
 	NULL
 };
 
-- 
2.47.3



* [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (4 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 05/20] sysfs: Add preferred CPU file Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-18  3:32   ` Yury Norov
  2026-06-18  3:49   ` K Prateek Nayak
  2026-06-17 17:41 ` [PATCH v4 07/20] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
                   ` (13 subsequent siblings)
  19 siblings, 2 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

When possible, pick a preferred CPU.

The push task mechanism uses the stopper thread, which calls
select_fallback_rq; use that path to pick only a preferred CPU.

When a task is affined only to non-preferred CPUs it should continue to
run there. Detect that by checking whether cpus_ptr and cpu_preferred_mask
intersect.

Since is_cpu_allowed can be called directly or repeatedly from
select_fallback_rq, encode in task_struct->has_preferred_cpu_state
whether the call comes via select_fallback_rq, together with the cached
intersection result. This avoids N**2 complexity in the rare cases.
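
The encoding can be sketched in isolation as below. The macro and function
names are invented for this illustration; the bit positions (bit 0 as the
"cached" flag, bit 8 upwards for the result) follow the diff in this patch.
Without the cache, each is_cpu_allowed call in the fallback loop would
repeat an O(N) mask intersection for each of up to N candidate CPUs.

```c
#include <stdbool.h>

#define PREF_STATE_VALID	0x1	/* bit 0: result was cached by the caller */
#define PREF_STATE_SHIFT	8	/* bits 8 and up: cached intersection result */

/* Pack the intersection result together with the "cached" flag. */
static int pref_state_encode(bool intersects)
{
	return ((int)intersects << PREF_STATE_SHIFT) | PREF_STATE_VALID;
}

/*
 * Return the cached answer when the valid bit is set. Otherwise fall back
 * to the slow computation, passed in here as a precomputed value to keep
 * the sketch self-contained.
 */
static bool pref_state_lookup(int state, bool slow_path_result)
{
	if (state & PREF_STATE_VALID)
		return state >> PREF_STATE_SHIFT;
	return slow_path_result;
}
```

A state of 0 means "nothing cached", which is why the field is reset to 0
on fork and when leaving select_fallback_rq.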

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v3->v4:
- Missing case of PF_KTHREAD is avoided.
- Add a new field in task_struct which encodes the intersection of the
  task's affinity and preferred CPUs, and the path it is coming from.

 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 34 ++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  | 18 ++++++++++++++++++
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc6ecb3869dd..2d0b1a6d50ac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1657,6 +1657,7 @@ struct task_struct {
 #ifdef CONFIG_UNWIND_USER
 	struct unwind_task_info		unwind_info;
 #endif
+	int				has_preferred_cpu_state;
 
 	/* CPU-specific state of this task: */
 	struct thread_struct		thread;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e16946c9d62..714816cfa975 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2500,6 +2500,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
  */
 static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 {
+	bool task_check_preferred_cpu = false;
+
 	/* When not in the task's cpumask, no point in looking further. */
 	if (!task_allowed_on_cpu(p, cpu))
 		return false;
@@ -2508,9 +2510,22 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (is_migration_disabled(p))
 		return cpu_online(cpu);
 
+	/*
+	 * This is essential to maintain user affinities when preferred
+	 * CPUs change. A task pinned on non-preferred CPU should continue
+	 * to run there, since this is non-user triggered.
+	 *
+	 * If CPU is non-preferred and task can run on other CPUs which are
+	 * currently preferred, then choose those other CPUs instead
+	 */
+	task_check_preferred_cpu = !cpu_preferred(cpu) && task_has_preferred_cpus(p);
+
 	/* Non kernel threads are not allowed during either online or offline. */
-	if (!(p->flags & PF_KTHREAD))
+	if (!(p->flags & PF_KTHREAD)) {
+		if (task_check_preferred_cpu)
+			return false;
 		return cpu_active(cpu);
+	}
 
 	/* KTHREAD_IS_PER_CPU is always allowed. */
 	if (kthread_is_per_cpu(p))
@@ -2520,6 +2535,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (cpu_dying(cpu))
 		return false;
 
+	/* Try on preferred CPU first if possible */
+	if (task_check_preferred_cpu)
+		return false;
+
 	/* But are allowed during online. */
 	return cpu_online(cpu);
 }
@@ -3549,6 +3568,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 	enum { cpuset, possible, fail } state = cpuset;
 	int dest_cpu;
 
+	/*
+	 * Cache value whether task's affinity spans preferred CPUs.
+	 * This helps to avoid repeating the same for each CPU
+	 * later in the loop. Encode call to is_cpu_allowed coming
+	 * via select_fallback_rq.
+	 */
+	p->has_preferred_cpu_state = task_has_preferred_cpus(p) << 8 | 0x1;
+
 	/*
 	 * If the node that the CPU is on has been offlined, cpu_to_node()
 	 * will return -1. There is no CPU on the node, and we should
@@ -3560,7 +3587,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 		/* Look for allowed, online CPU in same node. */
 		for_each_cpu(dest_cpu, nodemask) {
 			if (is_cpu_allowed(p, dest_cpu))
-				return dest_cpu;
+				goto clear_and_return;
 		}
 	}
 
@@ -3604,6 +3631,8 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 		}
 	}
 
+clear_and_return:
+	p->has_preferred_cpu_state = 0;
 	return dest_cpu;
 }
 
@@ -4612,6 +4641,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
 	init_numa_balancing(clone_flags, p);
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
+	p->has_preferred_cpu_state = 0;
 	init_sched_mm(p);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7c2dea65edd..38fd84b0b8f8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4213,4 +4213,22 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
 
 #include "ext.h"
 
+/*
+ * has_preferred_cpu_state encodes two pieces of information.
+ * The first byte encodes where the call to is_cpu_allowed is coming from.
+ * The second byte encodes the intersection of task affinity
+ * and cpu_preferred_mask.
+ *
+ * If the first byte is set, is_cpu_allowed was called via select_fallback_rq.
+ * That avoids repeated calculation, keeping the time complexity the same.
+ */
+static inline bool task_has_preferred_cpus(struct task_struct *p)
+{
+	int cached_value = p->has_preferred_cpu_state;
+
+	if (cached_value & 0x1)
+		return cached_value >> 8;
+	else
+		return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
+}
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3



* [PATCH v4 07/20] sched/fair: Select preferred CPU at wakeup when possible
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (5 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 08/20] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Update available_idle_cpu to consider preferred CPUs. This takes care of
a lot of decisions at wakeup to use only preferred CPUs, so there is no
need to put explicit checks everywhere.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v3->v4:
- Drop check in sched_balance_find_dst_cpu

 kernel/sched/sched.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 38fd84b0b8f8..f194a5007e3a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1434,6 +1434,9 @@ static inline bool available_idle_cpu(int cpu)
 	if (!idle_rq(cpu_rq(cpu)))
 		return 0;
 
+	if (!cpu_preferred(cpu))
+		return 0;
+
 	if (vcpu_is_preempted(cpu))
 		return 0;
 
-- 
2.47.3



* [PATCH v4 08/20] sched/fair: load balance only among preferred CPUs
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (6 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 07/20] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-18  3:03   ` K Prateek Nayak
  2026-06-17 17:41 ` [PATCH v4 09/20] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Consider only preferred CPUs for load balance.

With this, load balance ends up choosing a preferred CPU for the pull.
This keeps it from fighting the push task mechanism, which runs at tick.
It also stops active balance from running on a non-preferred CPU and
pulling load there.

This means there is no load balancing for tasks pinned only to
non-preferred CPUs. They continue to run where they were running before
the CPUs were marked as non-preferred.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
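Illustration only, not part of the patch: a minimal userspace sketch of
the mask narrowing described above. It assumes a 64-bit integer can
stand in for a cpumask; mask_t and lb_candidate_cpus() are hypothetical
names that do not exist in the kernel.

```c
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t mask_t;

/*
 * Model of the candidate set used by load balance after this patch:
 * start from the domain span, keep only active CPUs, then keep only
 * preferred CPUs.
 */
static mask_t lb_candidate_cpus(mask_t domain_span, mask_t active,
				mask_t preferred)
{
	mask_t cpus = domain_span & active;	/* existing behaviour */

	return cpus & preferred;		/* added by this patch */
}
```

If a domain spans only non-preferred CPUs, the result is empty, which
matches the statement that tasks pinned there are not load balanced.
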
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d78467ec6ee1..3f3c7f0ca489 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13291,6 +13291,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
+	/* Spread load among preferred CPUs */
+	cpumask_and(cpus, cpus, cpu_preferred_mask);
+
 	schedstat_inc(sd->lb_count[idle]);
 
 redo:
-- 
2.47.3



* [PATCH v4 09/20] sched/core: Keep tick on non-preferred CPUs until tasks are out
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (7 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 08/20] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 10/20] sched/core: Push current task from non preferred CPU Shrikanth Hegde
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Keep the tick enabled on a nohz_full CPU when it is marked as non-preferred.
Once no CFS task is queued there, the tick can be stopped again to save
power.

The steal time handling code will call tick_nohz_dep_set_cpu() with
TICK_DEP_BIT_SCHED so that tasks are moved out of the nohz_full CPU quickly.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 714816cfa975..390a4de28b3c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1431,6 +1431,10 @@ bool sched_can_stop_tick(struct rq *rq)
 	if (rq->dl.dl_nr_running)
 		return false;
 
+	/* Keep the tick running until CFS tasks are pushed out */
+	if (!cpu_preferred(rq->cpu) && rq->cfs.h_nr_queued)
+		return false;
+
 	/*
 	 * If there are more than one RR tasks, we need the tick to affect the
 	 * actual RR behaviour.
-- 
2.47.3



* [PATCH v4 10/20] sched/core: Push current task from non preferred CPU
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (8 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 09/20] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-18  4:09   ` K Prateek Nayak
  2026-06-17 17:41 ` [PATCH v4 11/20] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Actively push out the task running on a non-preferred CPU. Since the task
is currently running on the CPU, the CPU has to be stopped in order to push
the task out. However, if the task is pinned only to non-preferred CPUs, it
continues running there. This preserves userspace affinities, unlike CPU
hotplug or isolated cpusets.

Though the code is almost the same as __balance_push_cpu_stop and quite
close to push_cpu_stop, it is kept separate as that gives a cleaner
implementation w.r.t. CONFIG_HOTPLUG_CPU.

Add a push_task_work_done flag to protect the work buffer.
Works only with the FAIR class.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v3->v4:
- Drop irqsave
- Add guard for rq
- Drop RT support
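
Illustration only, not part of the patch: a simplified userspace model
of the checks made at tick before the stopper is queued. struct
task_model and should_queue_push() are hypothetical names, and the
affinity test is reduced to a plain mask intersection, which is an
assumption about task_has_preferred_cpus().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the state the tick-time check reads. */
struct task_model {
	bool is_fair;			/* scheduled by the FAIR class */
	bool per_cpu_kthread;		/* per-CPU kernel thread */
	bool migration_disabled;	/* inside a migrate_disable() section */
	uint64_t allowed;		/* affinity mask, one bit per CPU */
};

/* Returns true when a push of the current task would be queued. */
static bool should_queue_push(const struct task_model *t, uint64_t preferred,
			      bool push_pending)
{
	if (!t->is_fair)
		return false;		/* only FAIR tasks are pushed */
	if (t->per_cpu_kthread || t->migration_disabled)
		return false;		/* task must stay on this CPU */
	if (!(t->allowed & preferred))
		return false;		/* pinned only to non-preferred CPUs */
	if (push_pending)
		return false;		/* stopper work already queued */
	return true;
}
```

The push_pending argument plays the role of rq->push_task_work_done: it
keeps a second tick from reusing the per-CPU work buffer while an
earlier push is still outstanding.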
 
 kernel/sched/core.c  | 81 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  8 +++++
 2 files changed, 89 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 390a4de28b3c..4b835cf2464a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5793,6 +5793,9 @@ void sched_tick(void)
 	unsigned long hw_pressure;
 	u64 resched_latency;
 
+	if (!cpu_preferred(cpu))
+		sched_push_current_non_preferred_cpu(rq);
+
 	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
 		arch_scale_freq_tick();
 
@@ -11302,3 +11305,81 @@ void sched_change_end(struct sched_change_ctx *ctx)
 		p->sched_class->prio_changed(rq, p, ctx->prio);
 	}
 }
+
+#ifdef CONFIG_PREFERRED_CPU
+/* npc - non preferred CPU */
+static DEFINE_PER_CPU(struct cpu_stop_work, npc_push_task_work);
+
+static int sched_non_preferred_cpu_push_stop(void *arg)
+{
+	struct task_struct *p = arg;
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+	int cpu;
+
+	/* sanity check */
+	if (cpu_preferred(rq->cpu))
+		return 0;
+
+	raw_spin_lock_irq(&p->pi_lock);
+	rq_lock(rq, &rf);
+	rq->push_task_work_done = 0;
+
+	update_rq_clock(rq);
+
+	if (task_rq(p) == rq && task_on_rq_queued(p)) {
+		cpu = select_fallback_rq(rq->cpu, p);
+		rq = __migrate_task(rq, &rf, p, cpu);
+	}
+
+	rq_unlock(rq, &rf);
+	raw_spin_unlock_irq(&p->pi_lock);
+	put_task_struct(p);
+
+	return 0;
+}
+
+/*
+ * Push the current task running on non-preferred CPU.
+ * Using this non preferred CPU will lead to more vCPU preemptions
+ * in the host. So it is better not to use this CPU.
+ *
+ * Since task is running, call a stopper to push the task out. This is
+ * similar to how task moves during hotplug. In select_fallback_rq a
+ * preferred CPU will be chosen and henceforth task shouldn't come back to
+ * this CPU again.
+ *
+ * Works for FAIR class only
+ *
+ * If the task is affined only to non-preferred CPUs, it can't be moved out
+ */
+void sched_push_current_non_preferred_cpu(struct rq *rq)
+{
+	struct task_struct *push_task = rq->curr;
+
+	/* Push only if it is FAIR class */
+	if (push_task->sched_class != &fair_sched_class)
+		return;
+
+	if (kthread_is_per_cpu(push_task) ||
+	    is_migration_disabled(push_task))
+		return;
+
+	/* Is there any preferred CPU in the affinity list */
+	if (!task_has_preferred_cpus(push_task))
+		return;
+
+	/* There is already a stopper thread for this. Don't race with it */
+	if (rq->push_task_work_done == 1)
+		return;
+
+	/* sched_tick runs with interrupts disabled. Don't disable again */
+	get_task_struct(push_task);
+
+	scoped_guard (rq_lock, rq)
+		rq->push_task_work_done = 1;
+
+	stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
+			    push_task, this_cpu_ptr(&npc_push_task_work));
+}
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f194a5007e3a..5e9b8aaf9a9a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1274,6 +1274,8 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+	bool			push_task_work_done;
+
 	struct sched_avg	avg_rt;
 	struct sched_avg	avg_dl;
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
@@ -4234,4 +4236,10 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
 	else
 		return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
 }
+
+#ifdef CONFIG_PREFERRED_CPU
+void sched_push_current_non_preferred_cpu(struct rq *rq);
+#else	/* !CONFIG_PREFERRED_CPU */
+static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
+#endif
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3



* [PATCH v4 11/20] sched/debug: Add migration stats due to non preferred CPUs
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (9 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 10/20] sched/core: Push current task from non preferred CPU Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 12/20] sched/debug: Create debugfs folder steal monitor Shrikanth Hegde
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Add a new stat:
- nr_migrations_cpu_non_preferred: number of times the task was migrated
  because its CPU was marked as non-preferred due to high steal time.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 1 +
 kernel/sched/debug.c  | 1 +
 3 files changed, 3 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2d0b1a6d50ac..5f523782ca28 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -554,6 +554,7 @@ struct sched_statistics {
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;
+	u64				nr_migrations_cpu_non_preferred;
 
 	u64				nr_wakeups;
 	u64				nr_wakeups_sync;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4b835cf2464a..33ebe71a0b4b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11330,6 +11330,7 @@ static int sched_non_preferred_cpu_push_stop(void *arg)
 	if (task_rq(p) == rq && task_on_rq_queued(p)) {
 		cpu = select_fallback_rq(rq->cpu, p);
 		rq = __migrate_task(rq, &rf, p, cpu);
+		schedstat_inc(p->stats.nr_migrations_cpu_non_preferred);
 	}
 
 	rq_unlock(rq, &rf);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f3a033b34ba0..106b448cafb6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1363,6 +1363,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
 		P_SCHEDSTAT(nr_forced_migrations);
+		P_SCHEDSTAT(nr_migrations_cpu_non_preferred);
 		P_SCHEDSTAT(nr_wakeups);
 		P_SCHEDSTAT(nr_wakeups_sync);
 		P_SCHEDSTAT(nr_wakeups_migrate);
-- 
2.47.3



* [PATCH v4 12/20] sched/debug: Create debugfs folder steal monitor
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (10 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 11/20] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 13/20] sched/debug: Provide debugfs to enable/disable " Shrikanth Hegde
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Create a debugfs folder called steal_monitor in /sys/kernel/debug/sched.

It will host the debugfs knobs needed by the generic steal monitor that
is introduced in subsequent patches.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/debug.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 106b448cafb6..d1532359fc50 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -633,6 +633,17 @@ static void debugfs_fair_server_init(void)
 	}
 }
 
+#ifdef CONFIG_PREFERRED_CPU
+static void sched_steal_monitor_debugfs_init(void)
+{
+	struct dentry __maybe_unused *sm;
+
+	sm = debugfs_create_dir("steal_monitor", debugfs_sched);
+	if (!sm)
+		return;
+}
+#endif
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa, *llc;
@@ -691,6 +702,10 @@ static __init int sched_init_debug(void)
 	debugfs_ext_server_init();
 #endif
 
+#ifdef CONFIG_PREFERRED_CPU
+	sched_steal_monitor_debugfs_init();
+#endif
+
 	return 0;
 }
 late_initcall(sched_init_debug);
-- 
2.47.3



* [PATCH v4 13/20] sched/debug: Provide debugfs to enable/disable steal monitor
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (11 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 12/20] sched/debug: Create debugfs folder steal monitor Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 14/20] sched/core: Introduce a simple " Shrikanth Hegde
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Add a debugfs "enable" file to enable the steal time monitor.

Periodically computing steal time and acting on it is opt-in for the
user. This avoids any overhead when the feature is disabled.

It is disabled by default.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c  |  1 +
 kernel/sched/debug.c | 31 +++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 3 files changed, 34 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 33ebe71a0b4b..24d4abc74241 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11309,6 +11309,7 @@ void sched_change_end(struct sched_change_ctx *ctx)
 #ifdef CONFIG_PREFERRED_CPU
 /* npc - non preferred CPU */
 static DEFINE_PER_CPU(struct cpu_stop_work, npc_push_task_work);
+DEFINE_STATIC_KEY_FALSE(__sched_sm_enable);
 
 static int sched_non_preferred_cpu_push_stop(void *arg)
 {
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index d1532359fc50..2d62858f9cc0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -634,6 +634,35 @@ static void debugfs_fair_server_init(void)
 }
 
 #ifdef CONFIG_PREFERRED_CPU
+__read_mostly bool sched_sm_wr_enable;
+
+static ssize_t sched_sm_en_write(struct file *filp, const char __user *ubuf,
+				 size_t cnt, loff_t *ppos)
+{
+	bool orig = sched_sm_wr_enable;
+	ssize_t result;
+
+	cpus_read_lock();
+	result = debugfs_write_file_bool(filp, ubuf, cnt, ppos);
+
+	if (sched_sm_wr_enable && !orig) {
+		static_branch_enable(&__sched_sm_enable);
+	} else if (!sched_sm_wr_enable && orig) {
+		static_branch_disable(&__sched_sm_enable);
+		cpumask_copy(&__cpu_preferred_mask, cpu_active_mask);
+	}
+
+	cpus_read_unlock();
+	return result;
+}
+
+static const struct file_operations sched_sm_en_fops = {
+	.read   =	debugfs_read_file_bool,
+	.write	=	sched_sm_en_write,
+	.open   =	simple_open,
+	.llseek =	default_llseek,
+};
+
 static void sched_steal_monitor_debugfs_init(void)
 {
 	struct dentry __maybe_unused *sm;
@@ -641,6 +670,8 @@ static void sched_steal_monitor_debugfs_init(void)
 	sm = debugfs_create_dir("steal_monitor", debugfs_sched);
 	if (!sm)
 		return;
+
+	debugfs_create_file("enable", 0644, sm, &sched_sm_wr_enable, &sched_sm_en_fops);
 }
 #endif
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5e9b8aaf9a9a..9cb006c21090 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4238,6 +4238,8 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
 }
 
 #ifdef CONFIG_PREFERRED_CPU
+DECLARE_STATIC_KEY_FALSE(__sched_sm_enable);
+
 void sched_push_current_non_preferred_cpu(struct rq *rq);
 #else	/* !CONFIG_PREFERRED_CPU */
 static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
-- 
2.47.3



* [PATCH v4 14/20] sched/core: Introduce a simple steal monitor
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (12 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 13/20] sched/debug: Provide debugfs to enable/disable " Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-18  4:30   ` Yury Norov
  2026-06-17 17:41 ` [PATCH v4 15/20] sched/core: Compute steal values at regular intervals Shrikanth Hegde
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Start with a simple steal monitor.

It is meant to look at steal time and decide whether to reduce or
increase the preferred CPUs.

It has:
- a work function to run the steal time calculations and decision making
  periodically.
- low and high thresholds for steal time.
- a sampling period to control the frequency of steal time calculations.
- a cache of the previous decision to avoid oscillations.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v3->v4:
- Drop tmp_mask
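
Illustration only, not part of the patch: a minimal sketch of how the
two thresholds could drive the decision. The enum, the function name
and the strictness of the comparisons are assumptions; the real
decision code arrives in later patches of the series. Values are in
units of 0.01%, matching the defaults of 200 (2%) and 500 (5%).

```c
/* Hypothetical decision values; the series stores an int in previous_decision. */
enum sm_decision {
	SM_HOLD = 0,		/* steal is between the thresholds */
	SM_DEC_PREFERRED,	/* steal is high: shrink the preferred set */
	SM_INC_PREFERRED,	/* steal is low: grow the preferred set */
};

/* steal_ratio, low and high are all in units of 0.01%. */
static enum sm_decision sm_decide(unsigned int steal_ratio,
				  unsigned int low, unsigned int high)
{
	if (steal_ratio > high)
		return SM_DEC_PREFERRED;
	if (steal_ratio < low)
		return SM_INC_PREFERRED;
	return SM_HOLD;
}
```

The band between the low and high thresholds is what keeps the
preferred set stable when steal time hovers around a single value.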

 include/linux/sched.h | 11 +++++++++++
 kernel/sched/core.c   | 23 +++++++++++++++++++++++
 kernel/sched/sched.h  |  3 +++
 3 files changed, 37 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5f523782ca28..ce6bc8a22eb1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2517,4 +2517,15 @@ extern void migrate_enable(void);
 
 DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
 
+#ifdef CONFIG_PREFERRED_CPU
+struct steal_monitor_t {
+	struct work_struct  work;
+	ktime_t prev_time;
+	u64 prev_steal;
+	int previous_decision;
+	unsigned int low_threshold;
+	unsigned int high_threshold;
+	unsigned int sampling_period_ms;
+};
+#endif
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24d4abc74241..cc48632dd42d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9138,6 +9138,8 @@ void __init sched_init(void)
 
 	preempt_dynamic_init();
 
+	sched_init_steal_monitor();
+
 	scheduler_running = 1;
 }
 
@@ -11384,4 +11386,25 @@ void sched_push_current_non_preferred_cpu(struct rq *rq)
 	stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
 			    push_task, this_cpu_ptr(&npc_push_task_work));
 }
+
+struct steal_monitor_t steal_mon;
+
+void sched_init_steal_monitor(void)
+{
+	INIT_WORK(&steal_mon.work, sched_steal_detection_work);
+	steal_mon.low_threshold       = 200;		/* 2% steal time */
+	steal_mon.high_threshold      = 500;		/* 5% steal time */
+	steal_mon.sampling_period_ms  = 1000;		/* once per second */
+}
+
+/* This is only a skeleton. Subsequent patches introduce more of it */
+void sched_steal_detection_work(struct work_struct *work)
+{
+	struct steal_monitor_t *sm = container_of(work, struct steal_monitor_t, work);
+	ktime_t now;
+
+	/* Update the prev_time for next iteration */
+	now = ktime_get();
+	sm->prev_time = now;
+}
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9cb006c21090..984da3827f19 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4240,8 +4240,11 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
 #ifdef CONFIG_PREFERRED_CPU
 DECLARE_STATIC_KEY_FALSE(__sched_sm_enable);
 
+void sched_init_steal_monitor(void);
+void sched_steal_detection_work(struct work_struct *work);
 void sched_push_current_non_preferred_cpu(struct rq *rq);
 #else	/* !CONFIG_PREFERRED_CPU */
 static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
+static inline void sched_init_steal_monitor(void) { }
 #endif
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3



* [PATCH v4 15/20] sched/core: Compute steal values at regular intervals
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (13 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 14/20] sched/core: Introduce a simple " Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-18  4:04   ` Yury Norov
  2026-06-17 17:41 ` [PATCH v4 16/20] sched/core: Introduce default arch handling code for inc/dec preferred CPUs Shrikanth Hegde
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Kick off the work to compute the steal time at regular intervals.
It is gated by the steal monitor static key check to avoid any overhead
when it is disabled.

The sampling period can be changed at runtime using
steal_mon/sampling_period. The default is 1000 milliseconds, i.e. 1 second.

This work is done by the first active housekeeping CPU only. Hence it
doesn't need any complicated synchronization.

Now that sched_steal_mon_enabled() is available, which is a static branch,
add it to hot paths such as wakeup and load balance.
This makes them effectively a nop when the feature is disabled.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v3->v4:
- Add static key check in hotpaths. Could be split into a separate
  patch. Let me know if that's better.
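
Illustration only, not part of the patch: a userspace model of the gate
applied at every tick before the work is scheduled. The function name
and the millisecond timestamps passed as plain integers are
assumptions; the kernel code uses ktime_get() and ktime_ms_delta().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the tick on @cpu should schedule the steal
 * computation: only the first active housekeeping CPU does it, and only
 * once the sampling period has elapsed since the previous run.
 */
static bool sm_should_schedule_work(int cpu, int first_hk_cpu,
				    int64_t now_ms, int64_t prev_ms,
				    unsigned int period_ms)
{
	if (cpu != first_hk_cpu)
		return false;
	if (now_ms - prev_ms < (int64_t)period_ms)
		return false;
	return true;
}
```

Because a single CPU both reads and updates prev_time, the check needs
no locking in this model either.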

 include/linux/sched.h |  2 ++
 kernel/sched/core.c   | 28 +++++++++++++++++++++++++++-
 kernel/sched/debug.c  |  1 +
 kernel/sched/fair.c   |  3 ++-
 kernel/sched/sched.h  | 10 +++++++++-
 5 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ce6bc8a22eb1..5b15353ed7ef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2527,5 +2527,7 @@ struct steal_monitor_t {
 	unsigned int high_threshold;
 	unsigned int sampling_period_ms;
 };
+
+extern struct steal_monitor_t steal_mon;
 #endif
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cc48632dd42d..f1a91021e357 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5793,7 +5793,7 @@ void sched_tick(void)
 	unsigned long hw_pressure;
 	u64 resched_latency;
 
-	if (!cpu_preferred(cpu))
+	if (sched_steal_mon_enabled() && !cpu_preferred(cpu))
 		sched_push_current_non_preferred_cpu(rq);
 
 	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
@@ -5834,6 +5834,9 @@ void sched_tick(void)
 		rq->idle_balance = idle_cpu(cpu);
 		sched_balance_trigger(rq);
 	}
+
+	if (sched_steal_mon_enabled())
+		sched_trigger_steal_computation(cpu);
 }
 
 #ifdef CONFIG_NO_HZ_FULL
@@ -11407,4 +11410,27 @@ void sched_steal_detection_work(struct work_struct *work)
 	now = ktime_get();
 	sm->prev_time = now;
 }
+
+void sched_trigger_steal_computation(int cpu)
+{
+	int first_hk_cpu = cpumask_first_and(housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
+					     cpu_active_mask);
+	ktime_t now;
+
+	/* Done by first active housekeeping CPU only */
+	if (likely(cpu != first_hk_cpu))
+		return;
+
+	/*
+	 * Since everything is updated by the first housekeeping CPU,
+	 * there is no need for complex synchronization.
+	 */
+	now = ktime_get();
+
+	/* Default is once per second */
+	if (likely(ktime_ms_delta(now, steal_mon.prev_time) < steal_mon.sampling_period_ms))
+		return;
+
+	schedule_work_on(first_hk_cpu, &steal_mon.work);
+}
 #endif
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2d62858f9cc0..55b8beb42574 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -649,6 +649,7 @@ static ssize_t sched_sm_en_write(struct file *filp, const char __user *ubuf,
 		static_branch_enable(&__sched_sm_enable);
 	} else if (!sched_sm_wr_enable && orig) {
 		static_branch_disable(&__sched_sm_enable);
+		cancel_work_sync(&steal_mon.work);
 		cpumask_copy(&__cpu_preferred_mask, cpu_active_mask);
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3f3c7f0ca489..b02a414ffaae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13292,7 +13292,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
 	/* Spread load among preferred CPUs */
-	cpumask_and(cpus, cpus, cpu_preferred_mask);
+	if (sched_steal_mon_enabled())
+		cpumask_and(cpus, cpus, cpu_preferred_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 984da3827f19..f3814099cc0b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1060,6 +1060,7 @@ struct root_domain {
 	struct perf_domain __rcu *pd;
 };
 
+static inline bool sched_steal_mon_enabled(void);
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
@@ -1436,7 +1437,7 @@ static inline bool available_idle_cpu(int cpu)
 	if (!idle_rq(cpu_rq(cpu)))
 		return 0;
 
-	if (!cpu_preferred(cpu))
+	if (sched_steal_mon_enabled() && !cpu_preferred(cpu))
 		return 0;
 
 	if (vcpu_is_preempted(cpu))
@@ -4243,8 +4244,15 @@ DECLARE_STATIC_KEY_FALSE(__sched_sm_enable);
 void sched_init_steal_monitor(void);
 void sched_steal_detection_work(struct work_struct *work);
 void sched_push_current_non_preferred_cpu(struct rq *rq);
+void sched_trigger_steal_computation(int cpu);
+static inline bool sched_steal_mon_enabled(void)
+{
+	return static_branch_unlikely(&__sched_sm_enable);
+}
 #else	/* !CONFIG_PREFERRED_CPU */
 static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
 static inline void sched_init_steal_monitor(void) { }
+static inline void sched_trigger_steal_computation(int cpu) { }
+static inline bool sched_steal_mon_enabled(void) { return false; }
 #endif
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3



* [PATCH v4 16/20] sched/core: Introduce default arch handling code for inc/dec preferred CPUs
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (14 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 15/20] sched/core: Compute steal values at regular intervals Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
       [not found]   ` <ajNwy25WYg45AQJX@yury>
  2026-06-17 17:41 ` [PATCH v4 17/20] sched/core: Handle steal values and mark CPUs as preferred Shrikanth Hegde
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Define default handlers for high/low steal time. If an arch has better
decision logic, it may override the default implementation.

- If the steal time is higher than the high threshold, reduce the number
  of preferred CPUs by 1 core. The last core in the intersection of active
  and preferred CPUs will be marked as non-preferred.
  Always ensure at least one core is left as preferred.

- If the steal time is lower than the low threshold, increase the number
  of preferred CPUs by 1 core. The first active core which is not in
  cpu_preferred_mask will be marked as preferred.
  If all cores are already preferred, bail out.

Increase/decrease may need to modify the splicing across NUMA nodes. It is
being kept simple for now.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v3->v4:
- active instead of online
- added comment for enabling tick for nohz_full.
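
Illustration only, not part of the patch: a userspace model of the two
default handlers on 64-bit masks. It assumes 2 SMT threads per core
with consecutive CPU numbers; THREADS_PER_CORE, core_mask(),
dec_preferred() and inc_preferred() are hypothetical names. this_cpu
stands for the CPU running the work.

```c
#include <stdint.h>

#define THREADS_PER_CORE 2	/* assumption: SMT2, siblings numbered together */

/* Mask of all SMT siblings of @cpu under the numbering assumption above. */
static uint64_t core_mask(int cpu)
{
	int first = cpu - (cpu % THREADS_PER_CORE);

	return ((1ULL << THREADS_PER_CORE) - 1) << first;
}

static int first_set(uint64_t m)
{
	for (int i = 0; i < 64; i++)
		if (m & (1ULL << i))
			return i;
	return -1;
}

static int last_set(uint64_t m)
{
	int cpu = -1;

	for (int i = 0; i < 64; i++)
		if (m & (1ULL << i))
			cpu = i;
	return cpu;
}

/* High steal: drop the last preferred core, unless it is this_cpu's core. */
static uint64_t dec_preferred(uint64_t preferred, uint64_t active, int this_cpu)
{
	int last = last_set(preferred);

	if (last < 0 || core_mask(last) == core_mask(this_cpu))
		return preferred;
	return preferred & ~(core_mask(last) & active);
}

/* Low steal: add back the first active core that is not preferred. */
static uint64_t inc_preferred(uint64_t preferred, uint64_t active)
{
	int first = first_set(active & ~preferred);

	if (first < 0)
		return preferred;	/* all active CPUs already preferred */
	return preferred | (core_mask(first) & active);
}
```

With 8 active CPUs (mask 0xFF) and the work running on CPU 0, repeated
decrements shrink the preferred mask one core at a time until only the
core of CPU 0 remains.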

 include/linux/sched.h |  2 ++
 kernel/sched/core.c   | 61 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5b15353ed7ef..e435f3073ffc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2529,5 +2529,7 @@ struct steal_monitor_t {
 };
 
 extern struct steal_monitor_t steal_mon;
+void arch_dec_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio);
+void arch_inc_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio);
 #endif
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1a91021e357..c77045055604 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11400,6 +11400,67 @@ void sched_init_steal_monitor(void)
 	steal_mon.sampling_period_ms  = 1000;		/* once per second */
 }
 
+/*
+ * Default implementation of decrementing the preferred CPUs based on steal
+ * time. This is simple logic and decreases the preferred CPUs by 1 core.
+ * It takes out the last core in the active & preferred.
+ *
+ * Ensure at least one housekeeping core is always kept as preferred
+ *
+ * Could be overridden by arch specific handling.
+ */
+#ifndef arch_dec_preferred_cpus
+void arch_dec_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio)
+{
+	int last_cpu, tmp_cpu;
+	int this_cpu = raw_smp_processor_id();
+
+	last_cpu = cpumask_last(cpu_preferred_mask);
+
+	/*
+	 * If the core belongs to the housekeeping CPUs, no action is
+	 * taken. This leaves at least one core preferred always.
+	 * This ensures at least some CPUs are available to run
+	 */
+	if (cpumask_equal(cpu_smt_mask(last_cpu), cpu_smt_mask(this_cpu)))
+		return;
+
+	/*
+	 * set tick bit for nohz_full CPU to push the task out. Once the tasks
+	 * are pushed out, bit will be cleared
+	 */
+	for_each_cpu_and(tmp_cpu, cpu_smt_mask(last_cpu), cpu_active_mask) {
+		set_cpu_preferred(tmp_cpu, false);
+		if (tick_nohz_full_cpu(tmp_cpu))
+			tick_nohz_dep_set_cpu(tmp_cpu, TICK_DEP_BIT_SCHED);
+	}
+}
+#endif
+
+/*
+ * Default implementation of incrementing preferred CPUs based on steal
+ * time. This is simple logic and increases the preferred CPUs by 1 core.
+ * It adds the first core in active & !preferred
+ *
+ * Nothing to do if active == preferred
+ *
+ * Could be overridden by arch specific handling.
+ */
+#ifndef arch_inc_preferred_cpus
+void arch_inc_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio)
+{
+	int first_cpu, tmp_cpu;
+
+	first_cpu = cpumask_first_andnot(cpu_active_mask, cpu_preferred_mask);
+	/* All CPUs are preferred. Nothing to increase further */
+	if (first_cpu >= nr_cpu_ids)
+		return;
+
+	for_each_cpu_and(tmp_cpu, cpu_smt_mask(first_cpu), cpu_active_mask)
+		set_cpu_preferred(tmp_cpu, true);
+}
+#endif
+
 /* This is only a skeleton. Subsequent patches introduce more of it */
 void sched_steal_detection_work(struct work_struct *work)
 {
-- 
2.47.3



* [PATCH v4 17/20] sched/core: Handle steal values and mark CPUs as preferred
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (15 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 16/20] sched/core: Introduce default arch handling code for inc/dec preferred CPUs Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 18/20] sched/core: Mark the direction of steal values to avoid oscillations Shrikanth Hegde
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

This is the main periodic work which handles the steal time values.

- Compute the steal time by summing CPUTIME_STEAL across all active CPUs.

- Compute the steal ratio as a percentage multiplied by 100, so that
  fractional values survive integer division.

- Invoke callbacks for inc/dec preferred CPUs based on low/high steal
  time.
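
For illustration only (not part of the patch), the scaling can be tried in
isolation with a userspace sketch like the one below. steal_ratio_x100() is
a hypothetical helper name, and the sketch assumes that both the summed
steal delta and the elapsed time are in nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the arithmetic in the patch.
 * Returns the average steal per CPU as percentage * 100,
 * e.g. 250 means 2.50%.
 *
 * Assumption: delta_steal_ns is the steal delta summed over nr_cpus
 * CPUs, and delta_ns is the wall-clock time between two samples.
 */
static uint64_t steal_ratio_x100(uint64_t delta_steal_ns, uint64_t delta_ns,
				 unsigned int nr_cpus)
{
	/* Avoid dividing by zero, similar to the max_t(..., 1) in the patch. */
	if (delta_ns == 0)
		delta_ns = 1;
	if (nr_cpus == 0)
		nr_cpus = 1;

	return (delta_steal_ns * 100 * 100) / (delta_ns * nr_cpus);
}
```

With 4 active CPUs sampled over one second, 100 ms of summed steal gives
250, i.e. 2.5% average steal per CPU.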

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c77045055604..657c36a0e7ca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11461,15 +11461,34 @@ void arch_inc_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio)
 }
 #endif
 
-/* This is only a skeleton. Subsequent patches introduce more of it */
 void sched_steal_detection_work(struct work_struct *work)
 {
 	struct steal_monitor_t *sm = container_of(work, struct steal_monitor_t, work);
+	u64 steal_ratio, delta_steal, delta_ns, steal = 0;
 	ktime_t now;
+	int tmp_cpu;
+
+	for_each_cpu(tmp_cpu, cpu_active_mask)
+		steal += kcpustat_cpu(tmp_cpu).cpustat[CPUTIME_STEAL];
 
 	/* Update the prev_time for next iteration*/
 	now = ktime_get();
+	delta_steal = steal > sm->prev_steal ? steal - sm->prev_steal : 0;
+	delta_ns = max_t(u64, ktime_to_ns(ktime_sub(now, sm->prev_time)), 1);
+
 	sm->prev_time = now;
+	sm->prev_steal = steal;
+
+	/* Multiply by 100 to consider the fractional values of steal time */
+	steal_ratio = (delta_steal * 100 * 100) / (delta_ns * num_active_cpus());
+
+	/* If the steal time values are high, reduce one core from preferred CPUs */
+	if (steal_ratio > sm->high_threshold)
+		arch_dec_preferred_cpus(sm, steal_ratio);
+
+	/* If the steal time values are low, increase one core as preferred CPUs */
+	if (steal_ratio < sm->low_threshold)
+		arch_inc_preferred_cpus(sm, steal_ratio);
 }
 
 void sched_trigger_steal_computation(int cpu)
-- 
2.47.3



* [PATCH v4 18/20] sched/core: Mark the direction of steal values to avoid oscillations
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (16 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 17/20] sched/core: Handle steal values and mark CPUs as preferred Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 19/20] sched/debug: Add debug knobs for steal monitor Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 20/20] sched/core: Add a few check for valid CPU in inc/dec of preferred CPUs Shrikanth Hegde
  19 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Cache the previous decision on steal time, so that preferred CPUs are
decreased/increased only after consecutive high/low steal values.

Also make the low threshold check less than or equal, to handle setting it to
zero.
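
The effect of caching the direction can be sketched outside the kernel as
below. This is only an illustration of the idea, not the patch itself:
struct sm_state, enum sm_action and sm_step() are hypothetical names, and
the sketch assumes low_threshold < high_threshold so that at most one
action fires per sample.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical action codes returned by one sampling step. */
enum sm_action { SM_NONE = 0, SM_DEC = 1, SM_INC = 2 };

/* Hypothetical stand-in for the relevant steal monitor fields. */
struct sm_state {
	uint32_t low_threshold;
	uint32_t high_threshold;
	int previous_decision;	/* 1: high, -1: low, 0: in between */
};

/*
 * One sampling step: act only if the previous sample pointed in the
 * same direction, then record the direction of the current sample.
 */
static enum sm_action sm_step(struct sm_state *sm, uint64_t steal_ratio)
{
	enum sm_action action = SM_NONE;

	if (sm->previous_decision == 1 && steal_ratio > sm->high_threshold)
		action = SM_DEC;
	else if (sm->previous_decision == -1 &&
		 steal_ratio <= sm->low_threshold)
		action = SM_INC;

	if (steal_ratio > sm->high_threshold)
		sm->previous_decision = 1;
	else if (steal_ratio <= sm->low_threshold)
		sm->previous_decision = -1;
	else
		sm->previous_decision = 0;

	return action;
}
```

A single high or low sample therefore only records the direction; a
second consecutive sample in the same direction is needed before the
preferred CPUs change.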

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v3->v4:
- Consider equal or less than for low threshold.
- reset the direction when disabling the feature.

 kernel/sched/core.c  | 12 ++++++++++--
 kernel/sched/debug.c |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 657c36a0e7ca..57d52973ef0d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11483,12 +11483,20 @@ void sched_steal_detection_work(struct work_struct *work)
 	steal_ratio = (delta_steal * 100 * 100) / (delta_ns * num_active_cpus());
 
 	/* If the steal time values are high, reduce one core from preferred CPUs */
-	if (steal_ratio > sm->high_threshold)
+	if (sm->previous_decision == 1 && steal_ratio > sm->high_threshold)
 		arch_dec_preferred_cpus(sm, steal_ratio);
 
 	/* If the steal time values are low, increase one core as preferred CPUs */
-	if (steal_ratio < sm->low_threshold)
+	if (sm->previous_decision == -1 && steal_ratio <= sm->low_threshold)
 		arch_inc_preferred_cpus(sm, steal_ratio);
+
+	/* mark the direction. This helps to avoid ping-pongs */
+	if (steal_ratio > sm->high_threshold)
+		sm->previous_decision = 1;
+	else if (steal_ratio <= sm->low_threshold)
+		sm->previous_decision = -1;
+	else
+		sm->previous_decision = 0;
 }
 
 void sched_trigger_steal_computation(int cpu)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 55b8beb42574..ae7a641931d1 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -650,6 +650,7 @@ static ssize_t sched_sm_en_write(struct file *filp, const char __user *ubuf,
 	} else if (!sched_sm_wr_enable && orig) {
 		static_branch_disable(&__sched_sm_enable);
 		cancel_work_sync(&steal_mon.work);
+		steal_mon.previous_decision = 0;
 		cpumask_copy(&__cpu_preferred_mask, cpu_active_mask);
 	}
 
-- 
2.47.3



* [PATCH v4 19/20] sched/debug: Add debug knobs for steal monitor
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (17 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 18/20] sched/core: Mark the direction of steal values to avoid oscillations Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-17 17:41 ` [PATCH v4 20/20] sched/core: Add a few check for valid CPU in inc/dec of preferred CPUs Shrikanth Hegde
  19 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Add three debug knobs in steal_monitor:

sampling_period - sampling interval in milliseconds.
low_threshold - lower steal threshold value (specify percentage * 100)
high_threshold - higher steal threshold value (specify percentage * 100)

Refer to Documentation/scheduler/sched-debug.rst for detailed info.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/scheduler/sched-debug.rst | 34 +++++++++++++++++++++++++
 kernel/sched/debug.c                    |  3 +++
 2 files changed, 37 insertions(+)

diff --git a/Documentation/scheduler/sched-debug.rst b/Documentation/scheduler/sched-debug.rst
index b5a92a39eccd..95c355b1aebc 100644
--- a/Documentation/scheduler/sched-debug.rst
+++ b/Documentation/scheduler/sched-debug.rst
@@ -52,3 +52,37 @@ rate for each task.
 
 ``scan_size_mb`` is how many megabytes worth of pages are scanned for
 a given scan.
+
+==================================
+Tunables for generic steal monitor
+==================================
+Feature for preferred CPUs logic. Available under CONFIG_PREFERRED_CPU.
+Files are at /sys/kernel/debug/sched/steal_monitor/
+
+enable  - used to enable/disable the steal_monitor feature.
+Disabling needs more than a static branch disable, to maintain the design
+construct that preferred is same as active when the feature is disabled.
+Once enabled, it starts sampling steal time at intervals specified in
+sampling_period and takes action based on high/low thresholds.
+
+sampling_period - sampling interval in milliseconds.
+How often steal values are sampled. This controls how fast the scheduler
+acts on changes to steal time values.
+Default value is 1000 milliseconds.
+
+low_threshold   - lower threshold value in percentage * 100
+This determines what values should be considered as nil/no steal values.
+When the scheduler sees steal times below or equal to this value, it will
+try to increase the preferred CPUs by 1 core. Setting the value to zero
+causes too many oscillations.
+Default value is 200, i.e. 2% steal is considered as low threshold.
+
+high_threshold  - higher threshold value in percentage * 100
+This determines what values should be considered as high steal values.
+When the scheduler sees steal times higher than this value, it will reduce
+the preferred CPUs by 1 core.
+Default value is 500, i.e. 5% steal is considered as high threshold.
+
+Note: When the steal values are between the high and low thresholds, no
+action is taken by the scheduler. This is to avoid oscillations.
+One needs to be CAREFUL when setting the values.
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ae7a641931d1..7a9905009ede 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -674,6 +674,9 @@ static void sched_steal_monitor_debugfs_init(void)
 		return;
 
 	debugfs_create_file("enable", 0644, sm, &sched_sm_wr_enable, &sched_sm_en_fops);
+	debugfs_create_u32("low_threshold", 0644, sm, &steal_mon.low_threshold);
+	debugfs_create_u32("high_threshold", 0644, sm, &steal_mon.high_threshold);
+	debugfs_create_u32("sampling_period", 0644, sm, &steal_mon.sampling_period_ms);
 }
 #endif
 
-- 
2.47.3



* [PATCH v4 20/20] sched/core: Add a few check for valid CPU in inc/dec of preferred CPUs
  2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (18 preceding siblings ...)
  2026-06-17 17:41 ` [PATCH v4 19/20] sched/debug: Add debug knobs for steal monitor Shrikanth Hegde
@ 2026-06-17 17:41 ` Shrikanth Hegde
  2026-06-18  4:21   ` Yury Norov
  19 siblings, 1 reply; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-17 17:41 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Add a few cpumask_check() calls where the CPU is expected to be a valid one.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v3->v4:
- new patch.

 kernel/sched/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 57d52973ef0d..9342aae315ca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11417,6 +11417,9 @@ void arch_dec_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio)
 
 	last_cpu = cpumask_last(cpu_preferred_mask);
 
+	/* mask can't be null */
+	cpumask_check(last_cpu);
+
 	/*
 	 * If the core belongs to the housekeeping CPUs, no action is
 	 * taken. This leaves at least one core preferred always.
@@ -11505,6 +11508,9 @@ void sched_trigger_steal_computation(int cpu)
 					     cpu_active_mask);
 	ktime_t now;
 
+	/* at least one housekeeping CPU must be active */
+	cpumask_check(first_hk_cpu);
+
 	/* Done by first active housekeeping CPU only */
 	if (likely(cpu != first_hk_cpu))
 		return;
-- 
2.47.3



* Re: [PATCH v4 03/20] kconfig: Provide PREFERRED_CPU option
  2026-06-17 17:41 ` [PATCH v4 03/20] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
@ 2026-06-18  0:51   ` Yury Norov
  2026-06-18  3:44     ` Shrikanth Hegde
  0 siblings, 1 reply; 46+ messages in thread
From: Yury Norov @ 2026-06-18  0:51 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, mgorman, bsegall,
	maddy, srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

On Wed, Jun 17, 2026 at 11:11:22PM +0530, Shrikanth Hegde wrote:
> Introduce a new config named PREFERRED_CPU.
> 
> This helps to:
> - Avoid the code bloat when PREFERRED_CPU=n. In that cases preferred
>   is same as active.
> - Avoid the ifdeffery around PREFERRED_CPU in many files.
> 
> Since paravirtulized use case is the main driving force of this
> feature, make it default for kernels with PARAVIRT=y
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  kernel/Kconfig.preempt | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
> index 88c594c6d7fc..0995f5ba66eb 100644
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -192,3 +192,16 @@ config SCHED_CLASS_EXT
>  	  For more information:
>  	    Documentation/scheduler/sched-ext.rst
>  	    https://github.com/sched-ext/scx
> +
> +config PREFERRED_CPU
> +	bool "Dynamic vCPU management based on steal time"
> +	default y if PARAVIRT && SMP
> +	help
> +	This feature helps to reduce the steal time in paravirtualised
> +	environment, there by reducing vCPU preemption. Reducing vCPU
> +	preemption provides improved lock holder preemption and reduces
> +	cost of vCPU preemption in the host.

If it doesn't make sense for non-paravirt kernels, then

        depends on PARAVIRT && SMP
        default y

> +
> +	By default preferred CPUs will be same as active CPUs. Depending
> +	on the steal time when steal monitor is enabled, preferred CPUs
> +	could become subset of active CPUs.
> -- 
> 2.47.3


* Re: [PATCH v4 04/20] cpumask: Introduce cpu_preferred_mask
  2026-06-17 17:41 ` [PATCH v4 04/20] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
@ 2026-06-18  1:29   ` Yury Norov
  2026-06-18  3:53     ` Shrikanth Hegde
  0 siblings, 1 reply; 46+ messages in thread
From: Yury Norov @ 2026-06-18  1:29 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, mgorman, bsegall,
	maddy, srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

On Wed, Jun 17, 2026 at 11:11:23PM +0530, Shrikanth Hegde wrote:
> This patch does
> - Declare and Define cpu_preferred_mask.
> - Get/Set helpers for it.
> 
> Values are set/clear by the scheduler by detecting the steal time values.
> 
> A CPU is set to preferred when it becomes active. Later it may be
> marked as non-preferred depending on steal time values with
> steal monitor being enabled.
> 
> Always maintain design construct of preferred is subset of active.
> i.e. preferred ⊆ active ⊆ online ⊆ present ⊆ possible
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> v3->v4:
> - Make preferred subser of active instead of online.

s/subser/subset

> 
>  include/linux/cpumask.h | 21 ++++++++++++++++++++-
>  kernel/cpu.c            | 14 ++++++++++++++
>  kernel/sched/core.c     |  5 +++++
>  3 files changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 80211900f373..30ea64cc1656 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -120,12 +120,20 @@ extern struct cpumask __cpu_enabled_mask;
>  extern struct cpumask __cpu_present_mask;
>  extern struct cpumask __cpu_active_mask;
>  extern struct cpumask __cpu_dying_mask;
> +
> +#ifdef CONFIG_PREFERRED_CPU
> +extern struct cpumask __cpu_preferred_mask;
> +#else
> +#define __cpu_preferred_mask __cpu_active_mask
> +#endif
> +
>  #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
>  #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
>  #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
>  #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
>  #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
>  #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
> +#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
>  
>  extern atomic_t __num_online_cpus;
>  extern unsigned int __num_possible_cpus;
> @@ -1164,6 +1172,7 @@ void init_cpu_possible(const struct cpumask *src);
>  
>  void set_cpu_online(unsigned int cpu, bool online);
>  void set_cpu_possible(unsigned int cpu, bool possible);
> +void set_cpu_preferred(unsigned int cpu, bool preferred);
>  
>  /**
>   * to_cpumask - convert a NR_CPUS bitmap to a struct cpumask *
> @@ -1256,7 +1265,12 @@ static __always_inline bool cpu_dying(unsigned int cpu)
>  	return cpumask_test_cpu(cpu, cpu_dying_mask);
>  }
>  
> -#else
> +static __always_inline bool cpu_preferred(unsigned int cpu)
> +{
> +	return cpumask_test_cpu(cpu, cpu_preferred_mask);
> +}
> +
> +#else	/* NR_CPUS <= 1 */
>  
>  #define num_online_cpus()	1U
>  #define num_possible_cpus()	1U
> @@ -1294,6 +1308,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
>  	return false;
>  }
>  
> +static __always_inline bool cpu_preferred(unsigned int cpu)
> +{
> +	return cpu == 0;
> +}
> +
>  #endif /* NR_CPUS > 1 */
>  
>  #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index bc4f7a9ba64e..c196ba5d8b2a 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -3107,6 +3107,11 @@ EXPORT_SYMBOL(__cpu_dying_mask);
>  atomic_t __num_online_cpus __read_mostly;
>  EXPORT_SYMBOL(__num_online_cpus);
>  
> +#ifdef CONFIG_PREFERRED_CPU
> +struct cpumask __cpu_preferred_mask __read_mostly;
> +EXPORT_SYMBOL(__cpu_preferred_mask);
> +#endif
> +
>  void init_cpu_present(const struct cpumask *src)
>  {
>  	cpumask_copy(&__cpu_present_mask, src);
> @@ -3154,6 +3159,14 @@ void set_cpu_possible(unsigned int cpu, bool possible)
>  	}
>  }
>  
> +void set_cpu_preferred(unsigned int cpu, bool preferred)
> +{
> +	if (!IS_ENABLED(CONFIG_PREFERRED_CPU))
> +		return;
> +
> +	assign_cpu((cpu), &__cpu_preferred_mask, (preferred));
> +}

set_cpu_xxx() is a macro on purpose - it improves code generation
quite a lot. See 5c563ee90a22d. Can you keep set_cpu_preferred aligned
with the other set_cpu(), i.e. make it a macro?

> +
>  /*
>   * Activate the first processor.
>   */
> @@ -3164,6 +3177,7 @@ void __init boot_cpu_init(void)
>  	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
>  	set_cpu_online(cpu, true);
>  	set_cpu_active(cpu, true);
> +	set_cpu_preferred(cpu, true);
>  	set_cpu_present(cpu, true);
>  	set_cpu_possible(cpu, true);
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2f4530eb543f..9e16946c9d62 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8685,6 +8685,9 @@ int sched_cpu_activate(unsigned int cpu)
>  	 */
>  	sched_set_rq_online(rq, cpu);
>  
> +	/* preferred is subset of active and follows its state */
> +	set_cpu_preferred(cpu, true);

Did you put it at the end of the function on purpose? If yes, please
add a comment. If no, I'd prefer to have it immediately after
set_cpu_active().

> +
>  	return 0;
>  }
>  
> @@ -8698,6 +8701,8 @@ int sched_cpu_deactivate(unsigned int cpu)
>  	if (ret)
>  		return ret;
>  
> +	set_cpu_preferred(cpu, false);
> +
>  	/*
>  	 * Remove CPU from nohz.idle_cpus_mask to prevent participating in
>  	 * load balancing when not active
> -- 
> 2.47.3


* Re: [PATCH v4 08/20] sched/fair: load balance only among preferred CPUs
  2026-06-17 17:41 ` [PATCH v4 08/20] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
@ 2026-06-18  3:03   ` K Prateek Nayak
  2026-06-18  3:54     ` Shrikanth Hegde
  0 siblings, 1 reply; 46+ messages in thread
From: K Prateek Nayak @ 2026-06-18  3:03 UTC (permalink / raw)
  To: Shrikanth Hegde, linux-kernel, mingo, peterz, juri.lelli,
	vincent.guittot, yury.norov, iii
  Cc: tglx, gregkh, pbonzini, seanjc, vschneid, huschle, rostedt,
	dietmar.eggemann, mgorman, bsegall, maddy, srikar, hdanton,
	chleroy, vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael

Hello Shrikanth,

On 6/17/2026 11:11 PM, Shrikanth Hegde wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d78467ec6ee1..3f3c7f0ca489 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13291,6 +13291,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>  
>  	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>  
> +	/* Spread load among preferred CPUs */
> +	cpumask_and(cpus, cpus, cpu_preferred_mask);

Since "cpu_preferred_mask" is a subset of "cpu_active_mask", just:

    cpumask_and(cpus, sched_domain_span(sd), cpu_preferred_mask);

is sufficient. We don't need that redundant cpumask_and() with
cpu_active_mask before.
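
As a quick sanity check of the subset argument, here is a userspace sketch
that uses plain 64-bit bitmasks in place of struct cpumask. The helper
names are hypothetical; the only assumption is that preferred is a subset
of active.

```c
#include <assert.h>
#include <stdint.h>

/* Two-step form from the patch: (span & active) & preferred. */
static uint64_t lb_mask_two_step(uint64_t span, uint64_t active,
				 uint64_t preferred)
{
	uint64_t cpus = span & active;

	return cpus & preferred;
}

/* Single-step form: span & preferred. */
static uint64_t lb_mask_one_step(uint64_t span, uint64_t preferred)
{
	return span & preferred;
}
```

Both helpers return the same mask whenever every bit set in preferred is
also set in active.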

> +
>  	schedstat_inc(sd->lb_count[idle]);
>  
>  redo:

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed
  2026-06-17 17:41 ` [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
@ 2026-06-18  3:32   ` Yury Norov
       [not found]     ` <c4546759-b316-47e7-aa97-408e20d0f6ed@linux.ibm.com>
  2026-06-18  3:49   ` K Prateek Nayak
  1 sibling, 1 reply; 46+ messages in thread
From: Yury Norov @ 2026-06-18  3:32 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, mgorman, bsegall,
	maddy, srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

On Wed, Jun 17, 2026 at 11:11:25PM +0530, Shrikanth Hegde wrote:
> When possible, choose a preferred CPUs to pick.
> 
> Push task mechanism uses stopper thread which going to call
> select_fallback_rq and use this mechanism to pick only a preferred CPU.
> 
> When task is affined only to non-preferred CPUs it should continue to
> run there. Detect that by checking if cpus_ptr and cpu_preferred_mask
> intersect or not.
> 
> Since is_cpu_allowed can be called directly or repeatedly in
> select_fallback_rq, encode the info in task_struct->has_preferred_cpu_state
> if the path is via select_fallback_rq or not.
> This helps to avoid N**2 complexity for the rare cases.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> v3->v4:
> - Missing case of PF_KTHREAD is avoided.
> - Add a new field in task_struct which encodes intersection of
>   tasks affinity and preferred CPUs and path its coming from.
> 
>  include/linux/sched.h |  1 +
>  kernel/sched/core.c   | 34 ++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h  | 18 ++++++++++++++++++
>  3 files changed, 51 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index fc6ecb3869dd..2d0b1a6d50ac 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1657,6 +1657,7 @@ struct task_struct {
>  #ifdef CONFIG_UNWIND_USER
>  	struct unwind_task_info		unwind_info;
>  #endif
> +	int				has_preferred_cpu_state;

Shouldn't this be protected with the config?

>  
>  	/* CPU-specific state of this task: */
>  	struct thread_struct		thread;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9e16946c9d62..714816cfa975 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2500,6 +2500,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
>   */
>  static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  {
> +	bool task_check_preferred_cpu = false;

Initialization is not needed.

> +
>  	/* When not in the task's cpumask, no point in looking further. */
>  	if (!task_allowed_on_cpu(p, cpu))
>  		return false;
> @@ -2508,9 +2510,22 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  	if (is_migration_disabled(p))
>  		return cpu_online(cpu);
>  
> +	/*
> +	 * This is essential to maintain user affinities when preferred
> +	 * CPUs change. A task pinned on non-preferred CPU should continue
> +	 * to run there, since this is non-user triggered.
> +	 *
> +	 * If CPU is non-preferred and task can run on other CPUs which are
> +	 * currently preferred, then choose those other CPUs instead
> +	 */
> +	task_check_preferred_cpu = !cpu_preferred(cpu) && task_has_preferred_cpus(p);
> +
>  	/* Non kernel threads are not allowed during either online or offline. */
> -	if (!(p->flags & PF_KTHREAD))
> +	if (!(p->flags & PF_KTHREAD)) {
> +		if (task_check_preferred_cpu)
> +			return false;
>  		return cpu_active(cpu);
> +	}
>  
>  	/* KTHREAD_IS_PER_CPU is always allowed. */
>  	if (kthread_is_per_cpu(p))
> @@ -2520,6 +2535,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  	if (cpu_dying(cpu))
>  		return false;
>  
> +	/* Try on preferred CPU first if possible*/
> +	if (task_check_preferred_cpu)
> +		return false;
> +
>  	/* But are allowed during online. */
>  	return cpu_online(cpu);
>  }
> @@ -3549,6 +3568,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>  	enum { cpuset, possible, fail } state = cpuset;
>  	int dest_cpu;
>  
> +	/*
> +	 * Cache value whether task's affinity spans preferred CPUs.

Because it's cached, it should go inside is_cpu_allowed(), I think.

> +	 * This helps to avoid repeating the same for each CPU
> +	 * later in the loop. Encode call to is_cpu_allowed coming
> +	 * via select_fallback_rq.
> +	 */
> +	p->has_preferred_cpu_state = task_has_preferred_cpus(p) << 8 | 0x1;

This looks weird. Your intention is to store three states: not cached, has
preferred CPUs, and has no preferred CPUs.

Why don't you create an enum for it? Or a couple of flags?
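
Something along these lines, as a rough sketch only. The flag and helper
names below are made up for illustration and are not in the patch; they
just replace the "<< 8 | 0x1" encoding with two named bits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag names, not from the patch. */
enum {
	PREF_STATE_CACHED	 = 1 << 0,	/* value below is valid */
	PREF_STATE_HAS_PREFERRED = 1 << 1,	/* affinity spans preferred */
};

/* Build the cached state from the intersection result. */
static inline int pref_state_encode(bool has_preferred)
{
	return PREF_STATE_CACHED |
	       (has_preferred ? PREF_STATE_HAS_PREFERRED : 0);
}

static inline bool pref_state_is_cached(int state)
{
	return state & PREF_STATE_CACHED;
}

static inline bool pref_state_has_preferred(int state)
{
	return state & PREF_STATE_HAS_PREFERRED;
}
```

A state of 0 then means "not cached", and the two cached cases are
distinguished by a named flag instead of a shift.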

> +
>  	/*
>  	 * If the node that the CPU is on has been offlined, cpu_to_node()
>  	 * will return -1. There is no CPU on the node, and we should
> @@ -3560,7 +3587,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>  		/* Look for allowed, online CPU in same node. */
>  		for_each_cpu(dest_cpu, nodemask) {
>  			if (is_cpu_allowed(p, dest_cpu))
> -				return dest_cpu;
> +				goto clear_and_return;
>  		}
>  	}
>  
> @@ -3604,6 +3631,8 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>  		}
>  	}
>  
> +clear_and_return:
> +	p->has_preferred_cpu_state = 0;

Why reset it here? I think it should be zeroed only on update
of preferred cpumask. In other words, to properly implement caching,
you need to have a global counter incremented on each
cpu_preferred_mask update, and in task_has_preferred_cpus() you do:

 {
        if (p->preferred_cpu_updates == atomic_read(&preferred_cpumask_updates))
                return p->has_preferred_cpus;

        p->preferred_cpu_updates = atomic_read(&preferred_cpumask_updates);
        p->has_preferred_cpus = cpumask_intersects(...);
        return p->has_preferred_cpus;
 }

Do you have any numbers that justify this caching? The best practice
is to put performance optimizations at the end of the series and
provide some sort of benchmark supporting it.
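
To make the update-counter idea concrete, here is a self-contained
userspace sketch using C11 atomics and a 64-bit bitmask in place of
struct cpumask. All names are hypothetical, and it deliberately ignores
affinity changes and memory-ordering details that real kernel code would
have to handle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Starts at 1 so that a zero-initialized task is always stale. */
static atomic_uint preferred_mask_gen = 1;
static uint64_t preferred_mask = ~0ULL;

/* Hypothetical per-task cache fields. */
struct task {
	uint64_t affinity;
	unsigned int seen_gen;
	bool has_preferred;
};

/* Every update of the preferred mask bumps the generation. */
static void set_preferred_mask(uint64_t mask)
{
	preferred_mask = mask;
	atomic_fetch_add(&preferred_mask_gen, 1);
}

/* Recompute the intersection only when the cached generation is stale. */
static bool task_has_preferred(struct task *t)
{
	unsigned int gen = atomic_load(&preferred_mask_gen);

	if (t->seen_gen != gen) {
		t->has_preferred = (t->affinity & preferred_mask) != 0;
		t->seen_gen = gen;
	}
	return t->has_preferred;
}
```

With this shape the cached value never needs an explicit reset in
select_fallback_rq(); it goes stale by itself when the mask changes.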

>  	return dest_cpu;
>  }
>  
> @@ -4612,6 +4641,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
>  	init_numa_balancing(clone_flags, p);
>  	p->wake_entry.u_flags = CSD_TYPE_TTWU;
>  	p->migration_pending = NULL;
> +	p->has_preferred_cpu_state = 0;
>  	init_sched_mm(p);
>  }
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c7c2dea65edd..38fd84b0b8f8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -4213,4 +4213,22 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
>  
>  #include "ext.h"
>  
> +/*
> + * has_preferred_cpu_state is encoding two bits of information.
> + * First Byte is to encode where the call to is_cpu_allowed coming from.
> + * Second Byte is to encode the intersection of task affinity
> + * and cpu_preferred_mask.
> + *
> + * If 1st Byte is set, call to is_cpu_allowed coming from select_fallback_rq.
> + * That helps to avoid repeated calculation keeping time complexity same.
> + */
> +static inline bool task_has_preferred_cpus(struct task_struct *p)

This function should be void because you change the task state.

> +{
> +	int cached_value = p->has_preferred_cpu_state;
> +
> +	if (cached_value & 0x1)
> +		return p->has_preferred_cpu_state >> 8;
> +	else
> +		return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
> +}
>  #endif /* _KERNEL_SCHED_SCHED_H */
> -- 
> 2.47.3


* Re: [PATCH v4 03/20] kconfig: Provide PREFERRED_CPU option
  2026-06-18  0:51   ` Yury Norov
@ 2026-06-18  3:44     ` Shrikanth Hegde
  0 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-18  3:44 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc, vschneid,
	huschle, rostedt, dietmar.eggemann, mgorman, bsegall, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Hi Yury, thanks for reviewing the patches.

On 6/18/26 6:21 AM, Yury Norov wrote:
> On Wed, Jun 17, 2026 at 11:11:22PM +0530, Shrikanth Hegde wrote:
>> Introduce a new config named PREFERRED_CPU.
>>
>> This helps to:
>> - Avoid the code bloat when PREFERRED_CPU=n. In that cases preferred
>>    is same as active.
>> - Avoid the ifdeffery around PREFERRED_CPU in many files.
>>
>> Since paravirtulized use case is the main driving force of this
>> feature, make it default for kernels with PARAVIRT=y
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   kernel/Kconfig.preempt | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
>> index 88c594c6d7fc..0995f5ba66eb 100644
>> --- a/kernel/Kconfig.preempt
>> +++ b/kernel/Kconfig.preempt
>> @@ -192,3 +192,16 @@ config SCHED_CLASS_EXT
>>   	  For more information:
>>   	    Documentation/scheduler/sched-ext.rst
>>   	    https://github.com/sched-ext/scx
>> +
>> +config PREFERRED_CPU
>> +	bool "Dynamic vCPU management based on steal time"
>> +	default y if PARAVIRT && SMP
>> +	help
>> +	This feature helps to reduce the steal time in paravirtualised
>> +	environment, there by reducing vCPU preemption. Reducing vCPU
>> +	preemption provides improved lock holder preemption and reduces
>> +	cost of vCPU preemption in the host.
> 
> If it doesn't make sense for non-paravirt kernels, then
> 
>          depends on PARAVIRT && SMP
>          default y
> 

Ok.
Yes paravirt is the only use case today.

>> +
>> +	By default preferred CPUs will be same as active CPUs. Depending
>> +	on the steal time when steal monitor is enabled, preferred CPUs
>> +	could become subset of active CPUs.
>> -- 
>> 2.47.3



* Re: [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed
  2026-06-17 17:41 ` [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
  2026-06-18  3:32   ` Yury Norov
@ 2026-06-18  3:49   ` K Prateek Nayak
  2026-06-18  4:22     ` Shrikanth Hegde
  1 sibling, 1 reply; 46+ messages in thread
From: K Prateek Nayak @ 2026-06-18  3:49 UTC (permalink / raw)
  To: Shrikanth Hegde, linux-kernel, mingo, peterz, juri.lelli,
	vincent.guittot, yury.norov, iii
  Cc: tglx, gregkh, pbonzini, seanjc, vschneid, huschle, rostedt,
	dietmar.eggemann, mgorman, bsegall, maddy, srikar, hdanton,
	chleroy, vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael

Hello Shrikanth,

On 6/17/2026 11:11 PM, Shrikanth Hegde wrote:
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1657,6 +1657,7 @@ struct task_struct {
>  #ifdef CONFIG_UNWIND_USER
>  	struct unwind_task_info		unwind_info;
>  #endif
> +	int				has_preferred_cpu_state;

Since this only really needs 2 bits, perhaps you can use the "u8 __pad"
in the task struct? ...

>  
>  	/* CPU-specific state of this task: */
>  	struct thread_struct		thread;

[...snip...]

> @@ -3549,6 +3568,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>  	enum { cpuset, possible, fail } state = cpuset;
>  	int dest_cpu;
>  
> +	/*
> +	 * Cache value whether task's affinity spans preferred CPUs.
> +	 * This helps to avoid repeating the same for each CPU
> +	 * later in the loop. Encode call to is_cpu_allowed coming
> +	 * via select_fallback_rq.
> +	 */
> +	p->has_preferred_cpu_state = task_has_preferred_cpus(p) << 8 | 0x1;

... Or maybe it can be an s8 and this indicator can be a tri-state like:

 -1:  Cached; preferred CPUs exist
  0:  preferred CPUs not cached
  1:  Cached; preferred CPUs don't exist

and then the comparison can simply be:

    if (cached)
        return cached < 0;

ppc64le and arm64 seem to generate slightly smaller code for the "< 0"
check compared to the current scheme of (& + >>) in
task_has_preferred_cpus().
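
As a rough userspace sketch of the tri-state idea (not kernel code: the
enum names, cache_store(), has_preferred() and the 64-bit toy masks are
made up for illustration, and slow_has_preferred() merely stands in for
the real affinity vs. preferred-mask intersection test):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tri-state values, matching the table above. */
enum {
	PREF_CACHED_YES = -1,	/* cached; preferred CPUs exist */
	PREF_NOT_CACHED =  0,	/* nothing cached, take the slow path */
	PREF_CACHED_NO  =  1,	/* cached; preferred CPUs don't exist */
};

/* The "if (cached)" shortcut relies on zero meaning "not cached". */
static_assert(PREF_NOT_CACHED == 0, "zero must mean not cached");

/* Stand-in for the real intersection test (assumption). */
static bool slow_has_preferred(uint64_t affinity, uint64_t preferred)
{
	return (affinity & preferred) != 0;
}

/* Fill the cache once, e.g. before looping over candidate CPUs. */
static int8_t cache_store(uint64_t affinity, uint64_t preferred)
{
	return slow_has_preferred(affinity, preferred) ?
		PREF_CACHED_YES : PREF_CACHED_NO;
}

/* Use the cached answer when present, otherwise compute it. */
static bool has_preferred(int8_t cached, uint64_t affinity, uint64_t preferred)
{
	if (cached)
		return cached < 0;

	return slow_has_preferred(affinity, preferred);
}
```

With this layout a single sign test answers the cached case, and the
uncached case falls through to the full check.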

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v4 04/20] cpumask: Introduce cpu_preferred_mask
  2026-06-18  1:29   ` Yury Norov
@ 2026-06-18  3:53     ` Shrikanth Hegde
  0 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-18  3:53 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc, vschneid,
	huschle, rostedt, dietmar.eggemann, mgorman, bsegall, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

Hi Yury.

>> +void set_cpu_preferred(unsigned int cpu, bool preferred)
>> +{
>> +	if (!IS_ENABLED(CONFIG_PREFERRED_CPU))
>> +		return;
>> +
>> +	assign_cpu((cpu), &__cpu_preferred_mask, (preferred));
>> +}
> 
> set_cpu_xxx() is a macro on purpose - it improves code generation
> quite a lot. See 5c563ee90a22d. Can you keep set_cpu_preferred aligned
> with the other set_cpu(), i.e. make it a macro?
>

Ok. The only reason was to avoid ifdeffery there.

Do you mean something like below?

#ifdef CONFIG_PREFERRED_CPU
#else
#endif


  
>> +
>>   /*
>>    * Activate the first processor.
>>    */
>> @@ -3164,6 +3177,7 @@ void __init boot_cpu_init(void)
>>   	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
>>   	set_cpu_online(cpu, true);
>>   	set_cpu_active(cpu, true);
>> +	set_cpu_preferred(cpu, true);
>>   	set_cpu_present(cpu, true);
>>   	set_cpu_possible(cpu, true);
>>   
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2f4530eb543f..9e16946c9d62 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8685,6 +8685,9 @@ int sched_cpu_activate(unsigned int cpu)
>>   	 */
>>   	sched_set_rq_online(rq, cpu);
>>   
>> +	/* preferred is subset of active and follows its state */
>> +	set_cpu_preferred(cpu, true);
> 
> Did you put it at the end of the function on purpose? If yes, please
> add a comment. If no, I'd prefer to have it immediately after
> set_cpu_active().

         /*
          * Put the rq online, if not already. This happens:
          *
          * 1) In the early boot process, because we build the real domains
          *    after all CPUs have been brought up.

          * 2) At runtime, if cpuset_cpu_active() fails to rebuild the
          *    domains.
          */
         sched_set_rq_online(rq, cpu);


This was the reason: I wanted the active-related functionality to complete first.

> 
>> +
>>   	return 0;
>>   }
>>   
>> @@ -8698,6 +8701,8 @@ int sched_cpu_deactivate(unsigned int cpu)
>>   	if (ret)
>>   		return ret;
>>   
>> +	set_cpu_preferred(cpu, false);
>> +
>>   	/*
>>   	 * Remove CPU from nohz.idle_cpus_mask to prevent participating in
>>   	 * load balancing when not active
>> -- 
>> 2.47.3



* Re: [PATCH v4 08/20] sched/fair: load balance only among preferred CPUs
  2026-06-18  3:03   ` K Prateek Nayak
@ 2026-06-18  3:54     ` Shrikanth Hegde
  0 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-18  3:54 UTC (permalink / raw)
  To: K Prateek Nayak, linux-kernel, mingo, peterz, juri.lelli,
	vincent.guittot, yury.norov, iii
  Cc: tglx, gregkh, pbonzini, seanjc, vschneid, huschle, rostedt,
	dietmar.eggemann, mgorman, bsegall, maddy, srikar, hdanton,
	chleroy, vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael

Hi Prateek, thanks for going through the patches.

On 6/18/26 8:33 AM, K Prateek Nayak wrote:
> Hello Shrikanth,
> 
> On 6/17/2026 11:11 PM, Shrikanth Hegde wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d78467ec6ee1..3f3c7f0ca489 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -13291,6 +13291,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>   
>>   	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>>   
>> +	/* Spread load among preferred CPUs */
>> +	cpumask_and(cpus, cpus, cpu_preferred_mask);
> 
> Since "cpu_preferred_mask" is a subset of "cpu_active_mask", just:
> 
>      cpumask_and(cpus, sched_domain_span(sd), cpu_preferred_mask);
> 
> is sufficient. We don't need that redundant cpumask_and() with
> cpu_active_mask before.
> 

Ack.
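
For the record, a small userspace sketch of why the extra AND is
redundant as long as preferred stays a subset of active. The 64-bit
toy_mask_t and the helper names are illustrative assumptions, not the
kernel cpumask API:

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a cpumask: bit n set means CPU n is in the mask. */
typedef uint64_t toy_mask_t;

/* True when every bit of 'sub' is also set in 'super'. */
static int is_subset(toy_mask_t sub, toy_mask_t super)
{
	return (sub & ~super) == 0;
}

/* Two-step form from the patch: (span & active) & preferred. */
static toy_mask_t lb_cpus_two_step(toy_mask_t span, toy_mask_t active,
				   toy_mask_t preferred)
{
	toy_mask_t cpus = span & active;

	/* The equivalence below only holds under this invariant. */
	assert(is_subset(preferred, active));

	return cpus & preferred;
}

/* One-step form suggested in review: span & preferred. */
static toy_mask_t lb_cpus_one_step(toy_mask_t span, toy_mask_t preferred)
{
	return span & preferred;
}
```

Both forms give the same mask whenever the subset invariant holds,
which is what makes the first cpumask_and() droppable.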

>> +
>>   	schedstat_inc(sd->lb_count[idle]);
>>   
>>   redo:
> 



* Re: [PATCH v4 15/20] sched/core: Compute steal values at regular intervals
  2026-06-17 17:41 ` [PATCH v4 15/20] sched/core: Compute steal values at regular intervals Shrikanth Hegde
@ 2026-06-18  4:04   ` Yury Norov
  2026-06-18  5:39     ` Shrikanth Hegde
  0 siblings, 1 reply; 46+ messages in thread
From: Yury Norov @ 2026-06-18  4:04 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, mgorman, bsegall,
	maddy, srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

On Wed, Jun 17, 2026 at 11:11:34PM +0530, Shrikanth Hegde wrote:
> Kick off the work to compute the steal time at regular intervals.
> Gated with the steal monitor enabled static key check to avoid any
> overhead when it's disabled.
> 
> The sampling period can be changed at runtime using
> steal_mon/sampling_period. The default is 1000 milliseconds, i.e. 1 second.
> 
> This work is done by the first active housekeeping CPU only. Hence it
> won't need any complicated synchronization.
> 
> Now that sched_steal_mon_enabled(), which is a static branch, is
> available, add it to hotpaths such as wakeup and load balance.
> This makes them effectively a nop when the feature is disabled.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> v3->v4:
> - Add static key check in hotpaths. Could be split into a separate
>   patch. Let me know if that's better.
> 
>  include/linux/sched.h |  2 ++
>  kernel/sched/core.c   | 28 +++++++++++++++++++++++++++-
>  kernel/sched/debug.c  |  1 +
>  kernel/sched/fair.c   |  3 ++-
>  kernel/sched/sched.h  | 10 +++++++++-
>  5 files changed, 41 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ce6bc8a22eb1..5b15353ed7ef 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2527,5 +2527,7 @@ struct steal_monitor_t {
>  	unsigned int high_threshold;
>  	unsigned int sampling_period_ms;
>  };
> +
> +extern struct steal_monitor_t steal_mon;
>  #endif
>  #endif
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index cc48632dd42d..f1a91021e357 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5793,7 +5793,7 @@ void sched_tick(void)
>  	unsigned long hw_pressure;
>  	u64 resched_latency;
>  
> -	if (!cpu_preferred(cpu))
> +	if (sched_steal_mon_enabled() && !cpu_preferred(cpu))
>  		sched_push_current_non_preferred_cpu(rq);

It looks like a CPU can be non-preferred only if the steal monitor is
enabled. To implement that properly, mark all active CPUs as preferred
when the steal monitor is disabled. That way you don't need to
complicate the condition.

>  
>  	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> @@ -5834,6 +5834,9 @@ void sched_tick(void)
>  		rq->idle_balance = idle_cpu(cpu);
>  		sched_balance_trigger(rq);
>  	}
> +
> +	if (sched_steal_mon_enabled())
> +		sched_trigger_steal_computation(cpu);
>  }
>  
>  #ifdef CONFIG_NO_HZ_FULL
> @@ -11407,4 +11410,27 @@ void sched_steal_detection_work(struct work_struct *work)
>  	now = ktime_get();
>  	sm->prev_time = now;
>  }
> +
> +void sched_trigger_steal_computation(int cpu)
> +{
> +	int first_hk_cpu = cpumask_first_and(housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
> +					     cpu_active_mask);
> +	ktime_t now;
> +
> +	/* Done by first active housekeeping CPU only */
> +	if (likely(cpu != first_hk_cpu))
> +		return;
> +
> +	/*
> +	 * Since everything is updated by first housekeeping CPU,
> +	 * There is no need for complex syncronization.
> +	 */
> +	now = ktime_get();
> +
> +	/* Default is once per second */
> +	if (likely(ktime_ms_delta(now, steal_mon.prev_time) < steal_mon.sampling_period_ms))
> +		return;
> +
> +	schedule_work_on(first_hk_cpu, &steal_mon.work);

I think there should be a better way to schedule work at a regular
interval...

Maybe steal_mon.work would schedule itself? So, the first time it's
scheduled on steal monitor enablement, and then just reschedules
itself. This way you'll avoid polluting sched_tick().


> +}
>  #endif
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 2d62858f9cc0..55b8beb42574 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -649,6 +649,7 @@ static ssize_t sched_sm_en_write(struct file *filp, const char __user *ubuf,
>  		static_branch_enable(&__sched_sm_enable);
>  	} else if (!sched_sm_wr_enable && orig) {
>  		static_branch_disable(&__sched_sm_enable);
> +		cancel_work_sync(&steal_mon.work);
>  		cpumask_copy(&__cpu_preferred_mask, cpu_active_mask);
>  	}
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3f3c7f0ca489..b02a414ffaae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13292,7 +13292,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>  	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>  
>  	/* Spread load among preferred CPUs */
> -	cpumask_and(cpus, cpus, cpu_preferred_mask);
> +	if (sched_steal_mon_enabled())
> +		cpumask_and(cpus, cpus, cpu_preferred_mask);

Again, if you do cpumask_copy(preferred, active) when the steal
monitor is disabled, you don't need to complicate core logic here and
there.

>  
>  	schedstat_inc(sd->lb_count[idle]);
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 984da3827f19..f3814099cc0b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1060,6 +1060,7 @@ struct root_domain {
>  	struct perf_domain __rcu *pd;
>  };
>  
> +static inline bool sched_steal_mon_enabled(void);
>  extern void init_defrootdomain(void);
>  extern int sched_init_domains(const struct cpumask *cpu_map);
>  extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
> @@ -1436,7 +1437,7 @@ static inline bool available_idle_cpu(int cpu)
>  	if (!idle_rq(cpu_rq(cpu)))
>  		return 0;
>  
> -	if (!cpu_preferred(cpu))
> +	if (sched_steal_mon_enabled() && !cpu_preferred(cpu))
>  		return 0;
>  
>  	if (vcpu_is_preempted(cpu))
> @@ -4243,8 +4244,15 @@ DECLARE_STATIC_KEY_FALSE(__sched_sm_enable);
>  void sched_init_steal_monitor(void);
>  void sched_steal_detection_work(struct work_struct *work);
>  void sched_push_current_non_preferred_cpu(struct rq *rq);
> +void sched_trigger_steal_computation(int cpu);
> +static inline bool sched_steal_mon_enabled(void)
> +{
> +	return static_branch_unlikely(&__sched_sm_enable);
> +}
>  #else	/* !CONFIG_PREFERRED_CPU */
>  static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
>  static inline void sched_init_steal_monitor(void) { }
> +static inline void sched_trigger_steal_computation(int cpu) { }
> +static inline bool sched_steal_mon_enabled(void) { return false; }
>  #endif
>  #endif /* _KERNEL_SCHED_SCHED_H */
> -- 
> 2.47.3


* Re: [PATCH v4 10/20] sched/core: Push current task from non preferred CPU
  2026-06-17 17:41 ` [PATCH v4 10/20] sched/core: Push current task from non preferred CPU Shrikanth Hegde
@ 2026-06-18  4:09   ` K Prateek Nayak
  2026-06-18  6:05     ` Shrikanth Hegde
  0 siblings, 1 reply; 46+ messages in thread
From: K Prateek Nayak @ 2026-06-18  4:09 UTC (permalink / raw)
  To: Shrikanth Hegde, linux-kernel, mingo, peterz, juri.lelli,
	vincent.guittot, yury.norov, iii
  Cc: tglx, gregkh, pbonzini, seanjc, vschneid, huschle, rostedt,
	dietmar.eggemann, mgorman, bsegall, maddy, srikar, hdanton,
	chleroy, vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael

Hello Shrikanth,

On 6/17/2026 11:11 PM, Shrikanth Hegde wrote:
> +static int sched_non_preferred_cpu_push_stop(void *arg)
> +{
> +	struct task_struct *p = arg;
> +	struct rq *rq = this_rq();
> +	struct rq_flags rf;
> +	int cpu;
> +
> +	/* sanity check */
> +	if (cpu_preferred(rq->cpu))
> +		return 0;

I think this early return path should also clear the "push_task_work_done"
indicator; otherwise we will fail to schedule the stopper on this CPU
next time.

Also, we may need to add a context_unsafe_alias(rq) call here to keep
the context analysis bits happy similar to migration_cpu_stop().

> +
> +	raw_spin_lock_irq(&p->pi_lock);
> +	rq_lock(rq, &rf);
> +	rq->push_task_work_done = 0;
> +
> +	update_rq_clock(rq);
> +
> +	if (task_rq(p) == rq && task_on_rq_queued(p)) {
> +		cpu = select_fallback_rq(rq->cpu, p);

Do we need a task_has_preferred_cpus() sanity check here?

If the affinity changed before the stopper grabbed the p->pi_lock, and
there are no preferred CPUs to run on anymore, might as well keep the
task here instead of migrating it away.

> +		rq = __migrate_task(rq, &rf, p, cpu);
> +	}
> +
> +	rq_unlock(rq, &rf);
> +	raw_spin_unlock_irq(&p->pi_lock);
> +	put_task_struct(p);
> +
> +	return 0;
> +}

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v4 20/20] sched/core: Add a few check for valid CPU in inc/dec of preferred CPUs
  2026-06-17 17:41 ` [PATCH v4 20/20] sched/core: Add a few check for valid CPU in inc/dec of preferred CPUs Shrikanth Hegde
@ 2026-06-18  4:21   ` Yury Norov
  2026-06-18  4:40     ` Shrikanth Hegde
  0 siblings, 1 reply; 46+ messages in thread
From: Yury Norov @ 2026-06-18  4:21 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, mgorman, bsegall,
	maddy, srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

On Wed, Jun 17, 2026 at 11:11:39PM +0530, Shrikanth Hegde wrote:
> Add a few cpumask_check() calls where the CPU is expected to be a valid one.

This should be embedded in the corresponding patches, no need to make
it a separate patch.

> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> v3->v4:
> - new patch.
> 
>  kernel/sched/core.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 57d52973ef0d..9342aae315ca 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11417,6 +11417,9 @@ void arch_dec_preferred_cpus(struct steal_monitor_t *sm, u64 steal_ratio)
>  
>  	last_cpu = cpumask_last(cpu_preferred_mask);
>  
> +	/* mask can't be null */
> +	cpumask_check(last_cpu);

cpumask_check() is disabled by default. If you need to enforce the
check, use WARN_ON().

> +
>  	/*
>  	 * If the core belongs to the housekeeping CPUs, no action is
>  	 * taken. This leaves at least one core preferred always.
> @@ -11505,6 +11508,9 @@ void sched_trigger_steal_computation(int cpu)
>  					     cpu_active_mask);
>  	ktime_t now;
>  
> +	/* at least one housekeeping CPU must be active */
> +	cpumask_check(first_hk_cpu);
> +
>  	/* Done by first active housekeeping CPU only */
>  	if (likely(cpu != first_hk_cpu))
>  		return;
> -- 
> 2.47.3


* Re: [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed
  2026-06-18  3:49   ` K Prateek Nayak
@ 2026-06-18  4:22     ` Shrikanth Hegde
  0 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-18  4:22 UTC (permalink / raw)
  To: K Prateek Nayak, linux-kernel, mingo, peterz, juri.lelli,
	vincent.guittot, yury.norov, iii
  Cc: tglx, gregkh, pbonzini, seanjc, vschneid, huschle, rostedt,
	dietmar.eggemann, mgorman, bsegall, maddy, srikar, hdanton,
	chleroy, vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael

Hi Prateek.

On 6/18/26 9:19 AM, K Prateek Nayak wrote:
> Hello Shrikanth,
> 
> On 6/17/2026 11:11 PM, Shrikanth Hegde wrote:
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1657,6 +1657,7 @@ struct task_struct {
>>   #ifdef CONFIG_UNWIND_USER
>>   	struct unwind_task_info		unwind_info;
>>   #endif
>> +	int				has_preferred_cpu_state;
> 
> Since this only really needs 2 bits, perhaps you can use the "u8 __pad"
> in the task struct? ...
> 
>>   
>>   	/* CPU-specific state of this task: */
>>   	struct thread_struct		thread;
> 
> [...snip...]
> 
>> @@ -3549,6 +3568,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>>   	enum { cpuset, possible, fail } state = cpuset;
>>   	int dest_cpu;
>>   
>> +	/*
>> +	 * Cache value whether task's affinity spans preferred CPUs.
>> +	 * This helps to avoid repeating the same for each CPU
>> +	 * later in the loop. Encode call to is_cpu_allowed coming
>> +	 * via select_fallback_rq.
>> +	 */
>> +	p->has_preferred_cpu_state = task_has_preferred_cpus(p) << 8 | 0x1;
> 
> ... Or maybe it can be an s8 and this indicator can be a tri-state like:
> 
>   -1:  Cached; preferred CPUs exist
>    0:  preferred CPUs not cached
>    1:  Cached; preferred CPUs don't exist
> 

Yes. The current encoding has three cases:

1. Call came from select_fallback_rq and preferred CPUs exist.
2. Call came from select_fallback_rq and preferred CPUs don't exist.
3. Call didn't come from select_fallback_rq; evaluate cpumask_intersects().
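
To make the three cases concrete, here is a userspace sketch of the
bit layout implied by "task_has_preferred_cpus(p) << 8 | 0x1". The
macro and function names are made up for illustration; only the bit
positions come from the patch:

```c
#include <assert.h>
#include <stdbool.h>

#define STATE_FROM_FALLBACK	0x1	/* bit 0: set on the select_fallback_rq path */
#define STATE_VALUE_SHIFT	8	/* bit 8: cached intersection result */

/* The flag bit and the value bit must not overlap. */
static_assert((1 << STATE_VALUE_SHIFT) != STATE_FROM_FALLBACK,
	      "flag and value bits overlap");

/* Mirrors the assignment done at the top of select_fallback_rq. */
static int encode_state(bool has_preferred)
{
	return (has_preferred << STATE_VALUE_SHIFT) | STATE_FROM_FALLBACK;
}

/* Returns 1, 2 or 3, following the numbered cases above. */
static int classify_state(int state)
{
	if (!(state & STATE_FROM_FALLBACK))
		return 3;	/* not cached: caller must evaluate the masks */

	return (state >> STATE_VALUE_SHIFT) ? 1 : 2;
}
```

The mask-then-shift in classify_state() is the "(& + >>)" sequence that
a signed tri-state would replace with a single comparison.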

> and then the comparison can simply be:
> 
>      if (cached)
>          return cached < 0;
> 
> ppc64le and arm64 seem to generate slightly smaller code for the "< 0"
> check compared to the current scheme of (& + >>) in
> task_has_preferred_cpus().
> 

Yes, this seems better. I will give it a try.


* Re: [PATCH v4 14/20] sched/core: Introduce a simple steal monitor
  2026-06-17 17:41 ` [PATCH v4 14/20] sched/core: Introduce a simple " Shrikanth Hegde
@ 2026-06-18  4:30   ` Yury Norov
  2026-06-18  4:44     ` Shrikanth Hegde
  0 siblings, 1 reply; 46+ messages in thread
From: Yury Norov @ 2026-06-18  4:30 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, mgorman, bsegall,
	maddy, srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

On Wed, Jun 17, 2026 at 11:11:33PM +0530, Shrikanth Hegde wrote:
> Start with a simple steal monitor.
> 
> It is meant to look at steal time and make the decision to
> reduce/increase the preferred CPUs.
> 
> It has
> - work function to execute the steal time calculations and decision
>   making periodically.
> - low and high thresholds for steal time.
> - sampling period to control the frequency of steal time calculations.
> - cache the previous decision to avoid oscillations

This monitor is one implementation out of many possible ones, right?
I don't think it should live in the core scheduler files; it should be
a module.
 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> v3->v4:
> - Drop tmp_mask
> 
>  include/linux/sched.h | 11 +++++++++++
>  kernel/sched/core.c   | 23 +++++++++++++++++++++++
>  kernel/sched/sched.h  |  3 +++
>  3 files changed, 37 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5f523782ca28..ce6bc8a22eb1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2517,4 +2517,15 @@ extern void migrate_enable(void);
>  
>  DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>  
> +#ifdef CONFIG_PREFERRED_CPU
> +struct steal_monitor_t {
> +	struct work_struct  work;
> +	ktime_t prev_time;
> +	u64 prev_steal;
> +	int previous_decision;
> +	unsigned int low_threshold;
> +	unsigned int high_threshold;
> +	unsigned int sampling_period_ms;
> +};
> +#endif
>  #endif
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 24d4abc74241..cc48632dd42d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9138,6 +9138,8 @@ void __init sched_init(void)
>  
>  	preempt_dynamic_init();
>  
> +	sched_init_steal_monitor();
> +
>  	scheduler_running = 1;
>  }
>  
> @@ -11384,4 +11386,25 @@ void sched_push_current_non_preferred_cpu(struct rq *rq)
>  	stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
>  			    push_task, this_cpu_ptr(&npc_push_task_work));
>  }
> +
> +struct steal_monitor_t steal_mon;
> +
> +void sched_init_steal_monitor(void)
> +{
> +	INIT_WORK(&steal_mon.work, sched_steal_detection_work);
> +	steal_mon.low_threshold       = 200;		/* 2% steal time */
> +	steal_mon.high_threshold      = 500;		/* 5% steal time */
> +	steal_mon.sampling_period_ms  = 1000;		/* once per second */
> +}
> +
> +/* This is only a skeleton. Subsequent patches introduce more of it */
> +void sched_steal_detection_work(struct work_struct *work)
> +{
> +	struct steal_monitor_t *sm = container_of(work, struct steal_monitor_t, work);
> +	ktime_t now;
> +
> +	/* Update the prev_time for next iteration*/
> +	now = ktime_get();
> +	sm->prev_time = now;
> +}
>  #endif
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9cb006c21090..984da3827f19 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -4240,8 +4240,11 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
>  #ifdef CONFIG_PREFERRED_CPU
>  DECLARE_STATIC_KEY_FALSE(__sched_sm_enable);
>  
> +void sched_init_steal_monitor(void);
> +void sched_steal_detection_work(struct work_struct *work);
>  void sched_push_current_non_preferred_cpu(struct rq *rq);
>  #else	/* !CONFIG_PREFERRED_CPU */
>  static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
> +static inline void sched_init_steal_monitor(void) { }
>  #endif
>  #endif /* _KERNEL_SCHED_SCHED_H */
> -- 
> 2.47.3


* Re: [PATCH v4 20/20] sched/core: Add a few check for valid CPU in inc/dec of preferred CPUs
  2026-06-18  4:21   ` Yury Norov
@ 2026-06-18  4:40     ` Shrikanth Hegde
  0 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-18  4:40 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc, vschneid,
	huschle, rostedt, dietmar.eggemann, mgorman, bsegall, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael



On 6/18/26 9:51 AM, Yury Norov wrote:
> On Wed, Jun 17, 2026 at 11:11:39PM +0530, Shrikanth Hegde wrote:
>> Add a few cpumask_check() calls where the CPU is expected to be a valid one.
> 
> This should be embedded in the corresponding patches, no need to make
> it a separate patch.

Ack.


* Re: [PATCH v4 16/20] sched/core: Introduce default arch handling code for inc/dec preferred CPUs
       [not found]   ` <ajNwy25WYg45AQJX@yury>
@ 2026-06-18  4:42     ` Shrikanth Hegde
  0 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-18  4:42 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc, vschneid,
	huschle, rostedt, dietmar.eggemann, mgorman, bsegall, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael



On 6/18/26 9:45 AM, Yury Norov wrote:
> On Wed, Jun 17, 2026 at 11:11:35PM +0530, Shrikanth Hegde wrote:
>> Define default handlers for high/low steal time. An arch with better
>> decision logic may override the default implementation.
>>
>> - If the steal time is higher than the threshold, reduce the number of
>>    preferred CPUs by 1 core. The last core in the intersection of active
>>    and preferred CPUs will be marked as non-preferred.
>>    Ensure at least one core is always left as preferred.
>>
>> - If the steal time is lower than the threshold, increase the number of
>>    preferred CPUs by 1 core. The first active core which is not in
>>    cpu_preferred_mask will be marked as preferred.
>>    If all cores are already set to preferred, bail out.
> 
> And the code below does nothing of that.

Agreed, my bad. This changelog needs an update.
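
For reference, the behaviour the changelog text describes can be
modelled in userspace roughly as below. This is only a sketch of the
description, not of the posted code: adjust_preferred() is a made-up
name, it tracks a core count rather than a cpumask, and the threshold
values are the defaults from the steal monitor patch (200 = 2%,
500 = 5%, i.e. units of 0.01%).

```c
#include <assert.h>

/* Default thresholds from the steal monitor patch, in units of 0.01%. */
#define LOW_THRESHOLD	200	/* 2% steal time */
#define HIGH_THRESHOLD	500	/* 5% steal time */

/*
 * Return the new number of preferred cores, given the current count,
 * the number of active cores and the measured steal ratio.
 */
static unsigned int adjust_preferred(unsigned int preferred,
				     unsigned int active,
				     unsigned int steal_ratio)
{
	assert(preferred >= 1 && preferred <= active);

	/* High steal: drop one core, but always keep at least one. */
	if (steal_ratio > HIGH_THRESHOLD)
		return preferred > 1 ? preferred - 1 : preferred;

	/* Low steal: add one core, unless all are already preferred. */
	if (steal_ratio < LOW_THRESHOLD)
		return preferred < active ? preferred + 1 : preferred;

	/* Between the thresholds: leave the set unchanged. */
	return preferred;
}
```

The gap between the two thresholds acts as a dead band, so a steal
ratio hovering around one value does not flip the set every sample.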

>   
>> Increase/Decrease may need to modify the splicing across NUMA nodes. It is
>> being kept simple for now.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>


* Re: [PATCH v4 14/20] sched/core: Introduce a simple steal monitor
  2026-06-18  4:30   ` Yury Norov
@ 2026-06-18  4:44     ` Shrikanth Hegde
  2026-06-18  5:32       ` K Prateek Nayak
  0 siblings, 1 reply; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-18  4:44 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc, vschneid,
	huschle, rostedt, dietmar.eggemann, mgorman, bsegall, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael



On 6/18/26 10:00 AM, Yury Norov wrote:
> On Wed, Jun 17, 2026 at 11:11:33PM +0530, Shrikanth Hegde wrote:
>> Start with a simple steal monitor.
>>
>> It is meant to look at steal time and make the decision to
>> reduce/increase the preferred CPUs.
>>
>> It has
>> - work function to execute the steal time calculations and decision
>>    making periodically.
>> - low and high thresholds for steal time.
>> - sampling period to control the frequency of steal time calculations.
>> - cache the previous decision to avoid oscillations
> 
> This monitor is one implementation out of many possible ones, right?
> I don't think it should live in the core scheduler files; it should be
> a module.

You mean similar to drivers/cpuidle/? A new one under drivers/steal_monitor/?

>   
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>> v3->v4:
>> - Drop tmp_mask
>>
>>   include/linux/sched.h | 11 +++++++++++
>>   kernel/sched/core.c   | 23 +++++++++++++++++++++++
>>   kernel/sched/sched.h  |  3 +++
>>   3 files changed, 37 insertions(+)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 5f523782ca28..ce6bc8a22eb1 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2517,4 +2517,15 @@ extern void migrate_enable(void);
>>   
>>   DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>>   
>> +#ifdef CONFIG_PREFERRED_CPU
>> +struct steal_monitor_t {
>> +	struct work_struct  work;
>> +	ktime_t prev_time;
>> +	u64 prev_steal;
>> +	int previous_decision;
>> +	unsigned int low_threshold;
>> +	unsigned int high_threshold;
>> +	unsigned int sampling_period_ms;
>> +};
>> +#endif
>>   #endif
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 24d4abc74241..cc48632dd42d 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -9138,6 +9138,8 @@ void __init sched_init(void)
>>   
>>   	preempt_dynamic_init();
>>   
>> +	sched_init_steal_monitor();
>> +
>>   	scheduler_running = 1;
>>   }
>>   
>> @@ -11384,4 +11386,25 @@ void sched_push_current_non_preferred_cpu(struct rq *rq)
>>   	stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
>>   			    push_task, this_cpu_ptr(&npc_push_task_work));
>>   }
>> +
>> +struct steal_monitor_t steal_mon;
>> +
>> +void sched_init_steal_monitor(void)
>> +{
>> +	INIT_WORK(&steal_mon.work, sched_steal_detection_work);
>> +	steal_mon.low_threshold       = 200;		/* 2% steal time */
>> +	steal_mon.high_threshold      = 500;		/* 5% steal time */
>> +	steal_mon.sampling_period_ms  = 1000;		/* once per second */
>> +}
>> +
>> +/* This is only a skeleton. Subsequent patches introduce more of it */
>> +void sched_steal_detection_work(struct work_struct *work)
>> +{
>> +	struct steal_monitor_t *sm = container_of(work, struct steal_monitor_t, work);
>> +	ktime_t now;
>> +
>> +	/* Update the prev_time for next iteration*/
>> +	now = ktime_get();
>> +	sm->prev_time = now;
>> +}
>>   #endif
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 9cb006c21090..984da3827f19 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -4240,8 +4240,11 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
>>   #ifdef CONFIG_PREFERRED_CPU
>>   DECLARE_STATIC_KEY_FALSE(__sched_sm_enable);
>>   
>> +void sched_init_steal_monitor(void);
>> +void sched_steal_detection_work(struct work_struct *work);
>>   void sched_push_current_non_preferred_cpu(struct rq *rq);
>>   #else	/* !CONFIG_PREFERRED_CPU */
>>   static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
>> +static inline void sched_init_steal_monitor(void) { }
>>   #endif
>>   #endif /* _KERNEL_SCHED_SCHED_H */
>> -- 
>> 2.47.3



* Re: [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed
       [not found]     ` <c4546759-b316-47e7-aa97-408e20d0f6ed@linux.ibm.com>
@ 2026-06-18  4:49       ` Yury Norov
  2026-06-18  5:14         ` Shrikanth Hegde
  0 siblings, 1 reply; 46+ messages in thread
From: Yury Norov @ 2026-06-18  4:49 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Yury Norov, linux-kernel, mingo, peterz, juri.lelli,
	vincent.guittot, kprateek.nayak, iii, tglx, gregkh, pbonzini,
	seanjc, vschneid, huschle, rostedt, dietmar.eggemann, mgorman,
	bsegall, maddy, srikar, hdanton, chleroy, vineeth, frederic,
	arighi, pauld, christian.loehle, tj, tommaso.cucinotta, maz,
	rafael

On Thu, Jun 18, 2026 at 09:47:49AM +0530, Shrikanth Hegde wrote:
> 
> 
> On 6/18/26 9:02 AM, Yury Norov wrote:
> > On Wed, Jun 17, 2026 at 11:11:25PM +0530, Shrikanth Hegde wrote:
> > > When possible, choose a preferred CPUs to pick.
> > > 
> > > Push task mechanism uses stopper thread which going to call
> > > select_fallback_rq and use this mechanism to pick only a preferred CPU.
> > > 
> > > When task is affined only to non-preferred CPUs it should continue to
> > > run there. Detect that by checking if cpus_ptr and cpu_preferred_mask
> > > intersect or not.
> > > 
> > > Since is_cpu_allowed can be called directly or repeatedly in
> > > select_fallback_rq, encode the info in task_struct->has_preferred_cpu_state
> > > if the path is via select_fallback_rq or not.
> > > This helps to avoid N**2 complexity for the rare cases.
> > > 
> > > Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> > > ---
> > > v3->v4:
> > > - Missing case of PF_KTHREAD is avoided.
> > > - Add a new field in task_struct which encodes intersection of
> > >    tasks affinity and preferred CPUs and path its coming from.
> > > 
> > >   include/linux/sched.h |  1 +
> > >   kernel/sched/core.c   | 34 ++++++++++++++++++++++++++++++++--
> > >   kernel/sched/sched.h  | 18 ++++++++++++++++++
> > >   3 files changed, 51 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index fc6ecb3869dd..2d0b1a6d50ac 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1657,6 +1657,7 @@ struct task_struct {
> > >   #ifdef CONFIG_UNWIND_USER
> > >   	struct unwind_task_info		unwind_info;
> > >   #endif
> > > +	int				has_preferred_cpu_state;
> > 
> > Shouldn't this be protected with the config?
> 
> Since preferred is defined always, i don;t see a reason to add it again here.
> 
> > 
> > >   	/* CPU-specific state of this task: */
> > >   	struct thread_struct		thread;
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 9e16946c9d62..714816cfa975 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -2500,6 +2500,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
> > >    */
> > >   static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
> > >   {
> > > +	bool task_check_preferred_cpu = false;
> > 
> > Initialization is not needed.
> 
> ok
> 
> > 
> > > +
> > >   	/* When not in the task's cpumask, no point in looking further. */
> > >   	if (!task_allowed_on_cpu(p, cpu))
> > >   		return false;
> > > @@ -2508,9 +2510,22 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
> > >   	if (is_migration_disabled(p))
> > >   		return cpu_online(cpu);
> > > +	/*
> > > +	 * This is essential to maintain user affinities when preferred
> > > +	 * CPUs change. A task pinned on non-preferred CPU should continue
> > > +	 * to run there, since this is non-user triggered.
> > > +	 *
> > > +	 * If CPU is non-preferred and task can run on other CPUs which are
> > > +	 * currently preferred, then choose those other CPUs instead
> > > +	 */
> > > +	task_check_preferred_cpu = !cpu_preferred(cpu) && task_has_preferred_cpus(p);
> > > +
> > >   	/* Non kernel threads are not allowed during either online or offline. */
> > > -	if (!(p->flags & PF_KTHREAD))
> > > +	if (!(p->flags & PF_KTHREAD)) {
> > > +		if (task_check_preferred_cpu)
> > > +			return false;
> > >   		return cpu_active(cpu);
> > > +	}
> > >   	/* KTHREAD_IS_PER_CPU is always allowed. */
> > >   	if (kthread_is_per_cpu(p))
> > > @@ -2520,6 +2535,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
> > >   	if (cpu_dying(cpu))
> > >   		return false;
> > > +	/* Try on preferred CPU first if possible*/
> > > +	if (task_check_preferred_cpu)
> > > +		return false;
> > > +
> > >   	/* But are allowed during online. */
> > >   	return cpu_online(cpu);
> > >   }
> > > @@ -3549,6 +3568,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
> > >   	enum { cpuset, possible, fail } state = cpuset;
> > >   	int dest_cpu;
> > > +	/*
> > > +	 * Cache value whether task's affinity spans preferred CPUs.
> > 
> > Because it's cached, it should go inside is_cpu_allowed(), I think.
> > 
> > > +	 * This helps to avoid repeating the same for each CPU
> > > +	 * later in the loop. Encode call to is_cpu_allowed coming
> > > +	 * via select_fallback_rq.
> > > +	 */
> > > +	p->has_preferred_cpu_state = task_has_preferred_cpus(p) << 8 | 0x1;
> > 
> > This looks weird. Your intention is to store three states: not cached, has
> > preferred CPUs and has not preferred CPUs,
> > 
> > Why don't you create an enum for it? Or a couple of flags?
> 
> I think what prateek suggested in other thread looks same. I will give that a try.
> 
> > 
> > > +
> > >   	/*
> > >   	 * If the node that the CPU is on has been offlined, cpu_to_node()
> > >   	 * will return -1. There is no CPU on the node, and we should
> > > @@ -3560,7 +3587,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
> > >   		/* Look for allowed, online CPU in same node. */
> > >   		for_each_cpu(dest_cpu, nodemask) {
> > >   			if (is_cpu_allowed(p, dest_cpu))
> > > -				return dest_cpu;
> > > +				goto clear_and_return;
> > >   		}
> > >   	}
> > > @@ -3604,6 +3631,8 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
> > >   		}
> > >   	}
> > > +clear_and_return:
> > > +	p->has_preferred_cpu_state = 0;
> > 
> 
> It is reset to indicate that any subsequent direct calls to is_cpu_allowed can't use the
> old cached value of select_fallback_rq.
> 
> So events could be,
> 
> - cpu marked as non preferred - select_fallback_rq (sets the p->has_preferred_cpu_state)
>   Lets say CPU(300-450) are marked as non-preferred and Task affinity is (200-350)
> - task moved out. Now either task's affinity changed or preferred_mask has changed.
>   while CPU(400) maybe still marked as non-preferred but CPU(340) is marked as preferred.
> - Subsequent call to is_cpu_allowed (CPU=340) can't assume the old value.

Please, no top-posting.

My point is: outside the scope of select_fallback_rq(),
p->has_preferred_cpu_state is always 0 because of the line above. That
means it doesn't belong in task_struct; it belongs to the current
scope.

So either make this caching really survive the scope exit, or make
it a local variable.

Not sure I understood the passage above about the possible events, but
variables that are always zero out of the function scope should not be
placed in global structures.
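
To illustrate the local-variable option, here is a minimal userspace
sketch. It assumes a CPU mask fits in one 64-bit word, and the names
model_task, model_is_cpu_allowed() and model_select_fallback() are
hypothetical stand-ins that only model the shape of the kernel code.
The intersection is computed once in the caller and passed down, so
nothing has to be stored in task_struct:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
struct model_task {
	uint64_t cpus_allowed;	/* stands in for p->cpus_ptr */
};

static uint64_t model_preferred_mask;	/* stands in for cpu_preferred_mask */

/*
 * The caller passes in whether the task's affinity intersects the
 * preferred mask, so the check is not repeated for every CPU.
 */
static bool model_is_cpu_allowed(const struct model_task *p, int cpu,
				 bool has_preferred)
{
	uint64_t bit = UINT64_C(1) << cpu;

	if (!(p->cpus_allowed & bit))
		return false;
	/* Skip a non-preferred CPU only if a preferred one is available. */
	if (!(model_preferred_mask & bit) && has_preferred)
		return false;
	return true;
}

/* Returns the first acceptable CPU, or -1 if there is none. */
static int model_select_fallback(const struct model_task *p)
{
	/* Computed once; lives only for the duration of this call. */
	bool has_preferred = (p->cpus_allowed & model_preferred_mask) != 0;
	int cpu;

	for (cpu = 0; cpu < 64; cpu++) {
		if (model_is_cpu_allowed(p, cpu, has_preferred))
			return cpu;
	}
	return -1;
}
```

With CPUs 4-5 preferred, a task allowed on CPUs 2-5 lands on CPU 4,
while a task pinned to CPUs 2-3 stays on CPU 2.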
 
> > What for resetting it here? I think it should be zeroed only on update
> > of preferred cpumask. In other words, to properly implement caching,
> > you need to have a global counter incremented on each
> > cpu_preferred_mask update, and in task_has_preferred_cpus() you do:
> > 
> >   {
> >          if (p->preferred_cpu_updates == atomic_read(preferred_cpumask_updates))
> >                  return p->has_preferred_cpus;
> > 
> >          p->preferred_cpu_updates = atomic_read(preferred_cpumask_updates);
> >          p->has_preferred_cpus = cpumask_intersects(...);
> >   }
> > 
> > Do you have any numbers that justify this caching? The best practice
> > is to put performance optimizations at the end of the series and
> > provide some sort of benchmark supporting it.
> > 
> 
> This was to avoid N**2 aspect that was there in select_fallback_rq.
> Its more of the functional aspect which i mentioned above which this needs
> to take care as well.

Please, collect the performance data first, then optimize your code,
not vice-versa.
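
For reference, the generation-counter caching from the quoted snippet
could look roughly like this in a userspace model, with the final
return added. This is a sketch, not kernel code: model_task,
model_set_preferred() and the 64-bit mask are assumptions, C11 atomics
stand in for atomic_read(), and it ignores affinity changes, which
would need to invalidate the cache as well.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
static uint64_t model_preferred_mask;
/* Bumped on every update of the preferred mask; starts at 1. */
static atomic_uint model_preferred_updates = 1;
/* Counts how often the intersection is actually recomputed. */
static unsigned int model_intersect_calls;

struct model_task {
	uint64_t cpus_allowed;
	unsigned int preferred_cpu_updates;	/* 0 means "never cached" */
	bool has_preferred_cpus;
};

static void model_set_preferred(uint64_t mask)
{
	model_preferred_mask = mask;
	atomic_fetch_add(&model_preferred_updates, 1);
}

static bool model_task_has_preferred_cpus(struct model_task *p)
{
	unsigned int gen = atomic_load(&model_preferred_updates);

	/* Cache hit: the mask has not changed since the last lookup. */
	if (p->preferred_cpu_updates == gen)
		return p->has_preferred_cpus;

	/* Cache miss: recompute and remember the generation. */
	model_intersect_calls++;
	p->has_preferred_cpus = (p->cpus_allowed & model_preferred_mask) != 0;
	p->preferred_cpu_updates = gen;
	return p->has_preferred_cpus;
}
```

Repeated lookups between two mask updates recompute the intersection
only once per task.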
 
> > >   	return dest_cpu;
> > >   }
> > > @@ -4612,6 +4641,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
> > >   	init_numa_balancing(clone_flags, p);
> > >   	p->wake_entry.u_flags = CSD_TYPE_TTWU;
> > >   	p->migration_pending = NULL;
> > > +	p->has_preferred_cpu_state = 0;
> > >   	init_sched_mm(p);
> > >   }
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index c7c2dea65edd..38fd84b0b8f8 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -4213,4 +4213,22 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
> > >   #include "ext.h"
> > > +/*
> > > + * has_preferred_cpu_state is encoding two bits of information.
> > > + * First Byte is to encode where the call to is_cpu_allowed coming from.
> > > + * Second Byte is to encode the intersection of task affinity
> > > + * and cpu_preferred_mask.
> > > + *
> > > + * If 1st Byte is set, call to is_cpu_allowed coming from select_fallback_rq.
> > > + * That helps to avoid repeated calculation keeping time complexity same.
> > > + */
> > > +static inline bool task_has_preferred_cpus(struct task_struct *p)
> > 
> > This function should be void because you change the task state.
> > 
> 
> It doesn't alter p->has_preferred_cpu_state. No?

It doesn't, but it should.

> > > +{
> > > +	int cached_value = p->has_preferred_cpu_state;
> > > +
> > > +	if (cached_value & 0x1)
> > > +		return p->has_preferred_cpu_state >> 8;
> > > +	else
> > > +		return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
> > > +}
> > >   #endif /* _KERNEL_SCHED_SCHED_H */
> > > -- 
> > > 2.47.3


* Re: [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed
  2026-06-18  4:49       ` Yury Norov
@ 2026-06-18  5:14         ` Shrikanth Hegde
  0 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-18  5:14 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc, vschneid,
	huschle, rostedt, dietmar.eggemann, mgorman, bsegall, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael



On 6/18/26 10:19 AM, Yury Norov wrote:
> On Thu, Jun 18, 2026 at 09:47:49AM +0530, Shrikanth Hegde wrote:
>>
>> It is reset to indicate that any subsequent direct calls to is_cpu_allowed can't use the
>> old cached value of select_fallback_rq.
>>
>> So events could be,
>>
>> - cpu marked as non preferred - select_fallback_rq (sets the p->has_preferred_cpu_state)
>>    Lets say CPU(300-450) are marked as non-preferred and Task affinity is (200-350)
>> - task moved out. Now either task's affinity changed or preferred_mask has changed.
>>    while CPU(400) maybe still marked as non-preferred but CPU(340) is marked as preferred.
>> - Subsequent call to is_cpu_allowed (CPU=340) can't assume the old value.
> 
> Please, no top-posting.
> 
> My point is: out of the scope of the select_fallback_rq(), the
> p->has_preferred_cpu_state is always 0 because of the line above. It
> means, it doesn't belong to the task_struct, it belongs the current
> scope.
> 
> So either make this caching really surviving the scope exit, or make
> it a local variable.
> 
> Not sure I understood the passage above about the possible events, but
> variables that are always zero out of the function scope should not be
> placed in global structures.

That was the other way: add one more variable to is_cpu_allowed().

>   
>>> What for resetting it here? I think it should be zeroed only on update
>>> of preferred cpumask. In other words, to properly implement caching,
>>> you need to have a global counter incremented on each
>>> cpu_preferred_mask update, and in task_has_preferred_cpus() you do:
>>>
>>>    {
>>>           if (p->preferred_cpu_updates == atomic_read(preferred_cpumask_updates))
>>>                   return p->has_preferred_cpus;
>>>
>>>           p->preferred_cpu_updates = atomic_read(preferred_cpumask_updates);
>>>           p->has_preferred_cpus = cpumask_intersects(...);
>>>    }
>>>
>>> Do you have any numbers that justify this caching? The best practice
>>> is to put performance optimizations at the end of the series and
>>> provide some sort of benchmark supporting it.
>>>
>>
>> This was to avoid N**2 aspect that was there in select_fallback_rq.
>> Its more of the functional aspect which i mentioned above which this needs
>> to take care as well.
> 
> Please, collect the performance data first, then optimize your code,
> not vice-versa.

It did make sense to cache it, since select_fallback_rq() does repeated
calculations with pi_lock held and IRQs disabled.

>   
>>>>    	return dest_cpu;
>>>>    }
>>>> @@ -4612,6 +4641,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
>>>>    	init_numa_balancing(clone_flags, p);
>>>>    	p->wake_entry.u_flags = CSD_TYPE_TTWU;
>>>>    	p->migration_pending = NULL;
>>>> +	p->has_preferred_cpu_state = 0;
>>>>    	init_sched_mm(p);
>>>>    }
>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>>> index c7c2dea65edd..38fd84b0b8f8 100644
>>>> --- a/kernel/sched/sched.h
>>>> +++ b/kernel/sched/sched.h
>>>> @@ -4213,4 +4213,22 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
>>>>    #include "ext.h"
>>>> +/*
>>>> + * has_preferred_cpu_state is encoding two bits of information.
>>>> + * First Byte is to encode where the call to is_cpu_allowed coming from.
>>>> + * Second Byte is to encode the intersection of task affinity
>>>> + * and cpu_preferred_mask.
>>>> + *
>>>> + * If 1st Byte is set, call to is_cpu_allowed coming from select_fallback_rq.
>>>> + * That helps to avoid repeated calculation keeping time complexity same.
>>>> + */
>>>> +static inline bool task_has_preferred_cpus(struct task_struct *p)
>>>
>>> This function should be void because you change the task state.
>>>
>>
>> It doesn't alter p->has_preferred_cpu_state. No?
> 
> It doesn't, but it should.

Not sure. I think this can be just a wrapper that reports whether this
CPU should be used or not, based on task affinity.

> 
>>>> +{
>>>> +	int cached_value = p->has_preferred_cpu_state;
>>>> +
>>>> +	if (cached_value & 0x1)
>>>> +		return p->has_preferred_cpu_state >> 8;
>>>> +	else
>>>> +		return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
>>>> +}
>>>>    #endif /* _KERNEL_SCHED_SCHED_H */
>>>> -- 
>>>> 2.47.3



* Re: [PATCH v4 14/20] sched/core: Introduce a simple steal monitor
  2026-06-18  4:44     ` Shrikanth Hegde
@ 2026-06-18  5:32       ` K Prateek Nayak
  2026-06-18  6:01         ` Shrikanth Hegde
  0 siblings, 1 reply; 46+ messages in thread
From: K Prateek Nayak @ 2026-06-18  5:32 UTC (permalink / raw)
  To: Shrikanth Hegde, Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, iii,
	tglx, gregkh, pbonzini, seanjc, vschneid, huschle, rostedt,
	dietmar.eggemann, mgorman, bsegall, maddy, srikar, hdanton,
	chleroy, vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael

Hello Shrikanth, Yury,

On 6/18/2026 10:14 AM, Shrikanth Hegde wrote:
> On 6/18/26 10:00 AM, Yury Norov wrote:
>> On Wed, Jun 17, 2026 at 11:11:33PM +0530, Shrikanth Hegde wrote:
>>> Start with a simple steal monitor.
>>>
>>> It is meant to look at steal time and make the decision to
>>> reduce/increase the preferred CPUs.
>>>
>>> It has
>>> - work function to execute the steal time calculations and decision
>>>    making periodically.
>>> - low and high thresholds for steal time.
>>> - sampling period to control the frequency of steal time calculations.
>>> - cache the previous decision to avoid oscillations
>>
>> This monitor is the one implementation out of quite many possible,
>> right? I don't think it should live in the core scheduler files, it
>> should be a module.

I agree that this tight an integration with the sched bits might not
be required.

> 
> You mean similar to drivers/cpuidle/? a new one drivers/steal_monitor/ ?

Since steal time is a virtualization concept, somewhere in drivers/virt/
probably makes more sense unless we need some scheduler internal API to
implement it which shouldn't be the case.

All the driver has to do is track steal-time (which should be available
via kcpustat_cpu_fetch()) periodically (using a workqueue?) and should
do set_cpu_preferred() (which needs to be made available for other use
cases anyways) so it should be possible.

Since you mentioned you get an interrupt in LPAR before vCPU is
scheduled out due to contention, perhaps this also allows for a way to
add governors and other heuristics along those lines.
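
As a rough illustration of the decision step such a driver would run
each sampling period, here is a userspace sketch. It assumes steal and
elapsed time are sampled in the same unit and that thresholds are in
hundredths of a percent, matching the defaults in patch 14 (200 for 2%,
500 for 5%). The names steal_decision and model_steal_decide() are
hypothetical, and whether the comparisons are inclusive is also an
assumption; in a real driver the inputs would come from the per-CPU
steal accounting and the result would drive set_cpu_preferred().

```c
#include <stdint.h>

enum steal_decision {
	STEAL_SHRINK,	/* steal is high: reduce the preferred CPUs */
	STEAL_KEEP,	/* between thresholds: leave the mask alone */
	STEAL_GROW,	/* steal is low: increase the preferred CPUs */
};

/*
 * steal_delta and total_delta are time accumulated since the previous
 * sample, in the same unit. Thresholds are in 1/100ths of a percent.
 */
static enum steal_decision model_steal_decide(uint64_t steal_delta,
					      uint64_t total_delta,
					      unsigned int low_threshold,
					      unsigned int high_threshold)
{
	uint64_t ratio;

	if (total_delta == 0)
		return STEAL_KEEP;	/* nothing elapsed, no information */

	/* Scale to 1/100ths of a percent: 10000 == 100%. */
	ratio = steal_delta * 10000 / total_delta;

	if (ratio >= high_threshold)
		return STEAL_SHRINK;
	if (ratio <= low_threshold)
		return STEAL_GROW;
	return STEAL_KEEP;
}
```

The gap between the two thresholds acts as a dead band, so a steal
ratio hovering around a single value does not flip the mask back and
forth on every sample.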

-- 
Thanks and Regards,
Prateek


* Re: [PATCH v4 15/20] sched/core: Compute steal values at regular intervals
  2026-06-18  4:04   ` Yury Norov
@ 2026-06-18  5:39     ` Shrikanth Hegde
  0 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-18  5:39 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, tglx, gregkh, pbonzini, seanjc, vschneid,
	huschle, rostedt, dietmar.eggemann, mgorman, bsegall, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael



On 6/18/26 9:34 AM, Yury Norov wrote:
> On Wed, Jun 17, 2026 at 11:11:34PM +0530, Shrikanth Hegde wrote:
>> Kick off the work to compute the steal time at regular interval.
>> Gated with steal monitor enabled static key check to avoid any overhead
>> when its disabled.
>>
>> The sampling period can changed at runtime using steal_mon/sampling_period.
>> By default is 1000 milliseconds. I.e. 1 second
>>
>> This work is done by first active housekeeping CPU only. Hence it won't
>> need any complicated synchronization.
>>
>> Now, that sched_steal_mon_enabled() is available which is a static branch,
>> add this to hotpath such as wakeup and load balance.
>> This will make them effectively nop when the feature is disabled.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>> v3->v4:
>> - Add static key check in hotpaths. Could be split into a separate
>>    patch. Let me know if thats better.
>>
>>   include/linux/sched.h |  2 ++
>>   kernel/sched/core.c   | 28 +++++++++++++++++++++++++++-
>>   kernel/sched/debug.c  |  1 +
>>   kernel/sched/fair.c   |  3 ++-
>>   kernel/sched/sched.h  | 10 +++++++++-
>>   5 files changed, 41 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index ce6bc8a22eb1..5b15353ed7ef 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2527,5 +2527,7 @@ struct steal_monitor_t {
>>   	unsigned int high_threshold;
>>   	unsigned int sampling_period_ms;
>>   };
>> +
>> +extern struct steal_monitor_t steal_mon;
>>   #endif
>>   #endif
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index cc48632dd42d..f1a91021e357 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5793,7 +5793,7 @@ void sched_tick(void)
>>   	unsigned long hw_pressure;
>>   	u64 resched_latency;
>>   
>> -	if (!cpu_preferred(cpu))
>> +	if (sched_steal_mon_enabled() && !cpu_preferred(cpu))
>>   		sched_push_current_non_preferred_cpu(rq);
> 
> This looks like CPU can be non-preferred only if steal monitor is
> enabled. To properly implement it, you need to mark all active CPUs
> as preferred during the steal monitor disabling. That way you don't
> need to complicate the condition.
> 

That is done when disabling the feature [PATCH 13/20]:

	if (sched_sm_wr_enable && !orig) {
		static_branch_enable(&__sched_sm_enable);
	} else if (!sched_sm_wr_enable && orig) {
		static_branch_disable(&__sched_sm_enable);
		cpumask_copy(&__cpu_preferred_mask, cpu_active_mask);

>>   
>>   	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
>> @@ -5834,6 +5834,9 @@ void sched_tick(void)
>>   		rq->idle_balance = idle_cpu(cpu);
>>   		sched_balance_trigger(rq);
>>   	}
>> +
>> +	if (sched_steal_mon_enabled())
>> +		sched_trigger_steal_computation(cpu);
>>   }
>>   
>>   #ifdef CONFIG_NO_HZ_FULL
>> @@ -11407,4 +11410,27 @@ void sched_steal_detection_work(struct work_struct *work)
>>   	now = ktime_get();
>>   	sm->prev_time = now;
>>   }
>> +
>> +void sched_trigger_steal_computation(int cpu)
>> +{
>> +	int first_hk_cpu = cpumask_first_and(housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
>> +					     cpu_active_mask);
>> +	ktime_t now;
>> +
>> +	/* Done by first active housekeeping CPU only */
>> +	if (likely(cpu != first_hk_cpu))
>> +		return;
>> +
>> +	/*
>> +	 * Since everything is updated by first housekeeping CPU,
>> +	 * There is no need for complex syncronization.
>> +	 */
>> +	now = ktime_get();
>> +
>> +	/* Default is once per second */
>> +	if (likely(ktime_ms_delta(now, steal_mon.prev_time) < steal_mon.sampling_period_ms))
>> +		return;
>> +
>> +	schedule_work_on(first_hk_cpu, &steal_mon.work);
> 
> I think, there should be a better way to schedule a work on regular
> interval...
> 
> Maybe steal_mon.work would schedule itself? So, the first time it's
> scheduled on steal monitor enablement, and then just reschedules
> itself. This way you'll avoid polluting sched_tick().
> 

That's a good idea too.
All it needs is a periodic call at the granularity of the sampling period.

Maybe start an hrtimer when enabling the feature.
The hrtimer kicks off the work.
The work function re-arms the hrtimer.
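
The control flow of that idea, modelled in userspace with a virtual
clock. This is not kernel code: model_timer, model_enable(),
model_work_fn(), model_disable() and model_advance() are hypothetical
stand-ins for the hrtimer, the enable/disable paths, the work function
and the passage of time. Only the 1000 ms default period comes from the
patch. The point is that the tick path no longer has to poll, because
the work re-arms its own timer.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_timer {
	bool armed;
	uint64_t expires_ms;	/* absolute expiry on the virtual clock */
};

struct model_monitor {
	struct model_timer timer;
	uint64_t period_ms;
	uint64_t prev_time_ms;
	unsigned int runs;	/* how many times the work function ran */
};

static void model_arm(struct model_monitor *sm, uint64_t now_ms)
{
	sm->timer.expires_ms = now_ms + sm->period_ms;
	sm->timer.armed = true;
}

/* Work function: do the sampling, then re-arm the timer. */
static void model_work_fn(struct model_monitor *sm, uint64_t now_ms)
{
	sm->runs++;
	sm->prev_time_ms = now_ms;
	model_arm(sm, now_ms);
}

/* Enable path: arm the timer once; the work keeps it going. */
static void model_enable(struct model_monitor *sm, uint64_t now_ms)
{
	sm->period_ms = 1000;
	sm->runs = 0;
	sm->prev_time_ms = now_ms;
	model_arm(sm, now_ms);
}

/* Disable path: with no timer armed, no further work is kicked off. */
static void model_disable(struct model_monitor *sm)
{
	sm->timer.armed = false;
}

/* Advance the virtual clock to now_ms, firing any expiries on the way. */
static void model_advance(struct model_monitor *sm, uint64_t now_ms)
{
	while (sm->timer.armed && sm->timer.expires_ms <= now_ms) {
		uint64_t fired_at = sm->timer.expires_ms;

		sm->timer.armed = false;	/* timers are one-shot */
		model_work_fn(sm, fired_at);
	}
}
```

Enabling at t=0 and advancing to t=3500 ms runs the work three times
(at 1000, 2000 and 3000 ms); after model_disable() nothing runs.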

> 
>> +}
>>   #endif
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 2d62858f9cc0..55b8beb42574 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -649,6 +649,7 @@ static ssize_t sched_sm_en_write(struct file *filp, const char __user *ubuf,
>>   		static_branch_enable(&__sched_sm_enable);
>>   	} else if (!sched_sm_wr_enable && orig) {
>>   		static_branch_disable(&__sched_sm_enable);
>> +		cancel_work_sync(&steal_mon.work);
>>   		cpumask_copy(&__cpu_preferred_mask, cpu_active_mask);
>>   	}
>>   
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3f3c7f0ca489..b02a414ffaae 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -13292,7 +13292,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>   	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>>   
>>   	/* Spread load among preferred CPUs */
>> -	cpumask_and(cpus, cpus, cpu_preferred_mask);
>> +	if (sched_steal_mon_enabled())
>> +		cpumask_and(cpus, cpus, cpu_preferred_mask);
> 
> Again, if you mark do cpumask_copy(preferred, active) on the steal
> monitor disablement, you don't need to complicate core logic here and
> there.
> 
>>   
>>   	schedstat_inc(sd->lb_count[idle]);
>>   
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 984da3827f19..f3814099cc0b 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1060,6 +1060,7 @@ struct root_domain {
>>   	struct perf_domain __rcu *pd;
>>   };
>>   
>> +static inline bool sched_steal_mon_enabled(void);
>>   extern void init_defrootdomain(void);
>>   extern int sched_init_domains(const struct cpumask *cpu_map);
>>   extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
>> @@ -1436,7 +1437,7 @@ static inline bool available_idle_cpu(int cpu)
>>   	if (!idle_rq(cpu_rq(cpu)))
>>   		return 0;
>>   
>> -	if (!cpu_preferred(cpu))
>> +	if (sched_steal_mon_enabled() && !cpu_preferred(cpu))
>>   		return 0;
>>   
>>   	if (vcpu_is_preempted(cpu))
>> @@ -4243,8 +4244,15 @@ DECLARE_STATIC_KEY_FALSE(__sched_sm_enable);
>>   void sched_init_steal_monitor(void);
>>   void sched_steal_detection_work(struct work_struct *work);
>>   void sched_push_current_non_preferred_cpu(struct rq *rq);
>> +void sched_trigger_steal_computation(int cpu);
>> +static inline bool sched_steal_mon_enabled(void)
>> +{
>> +	return static_branch_unlikely(&__sched_sm_enable);
>> +}
>>   #else	/* !CONFIG_PREFERRED_CPU */
>>   static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
>>   static inline void sched_init_steal_monitor(void) { }
>> +static inline void sched_trigger_steal_computation(int cpu) { }
>> +static inline bool sched_steal_mon_enabled(void) { return false; }
>>   #endif
>>   #endif /* _KERNEL_SCHED_SCHED_H */
>> -- 
>> 2.47.3



* Re: [PATCH v4 14/20] sched/core: Introduce a simple steal monitor
  2026-06-18  5:32       ` K Prateek Nayak
@ 2026-06-18  6:01         ` Shrikanth Hegde
  2026-06-18  6:39           ` Yury Norov
  0 siblings, 1 reply; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-18  6:01 UTC (permalink / raw)
  To: K Prateek Nayak, Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, iii,
	tglx, gregkh, pbonzini, seanjc, vschneid, huschle, rostedt,
	dietmar.eggemann, mgorman, bsegall, maddy, srikar, hdanton,
	chleroy, vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael



On 6/18/26 11:02 AM, K Prateek Nayak wrote:
> Hello Shrikanth, Yury,
> 
> On 6/18/2026 10:14 AM, Shrikanth Hegde wrote:
>> On 6/18/26 10:00 AM, Yury Norov wrote:
>>> On Wed, Jun 17, 2026 at 11:11:33PM +0530, Shrikanth Hegde wrote:
>>>> Start with a simple steal monitor.
>>>>
>>>> It is meant to look at steal time and make the decision to
>>>> reduce/increase the preferred CPUs.
>>>>
>>>> It has
>>>> - work function to execute the steal time calculations and decision
>>>>     making periodically.
>>>> - low and high thresholds for steal time.
>>>> - sampling period to control the frequency of steal time calculations.
>>>> - cache the previous decision to avoid oscillations
>>>
>>> This monitor is the one implementation out of quite many possible,
>>> right? I don't think it should live in the core scheduler files, it
>>> should be a module.
> 
> I agree that this tight of an integration with the sched bits might not
> not be required.
> 
>>
>> You mean similar to drivers/cpuidle/? a new one drivers/steal_monitor/ ?
> 
> Since steal time is a virtualization concept, somewhere in drivers/virt/
> probably makes more sense unless we need some scheduler internal API to
> implement it which shouldn't be the case.
> 
> All the driver has to do is track steal-time (which should be available
> via kcpustat_cpu_fetch()) periodically (using a workqueue?) and should
> do set_cpu_preferred() (which needs to be made available for other use
> cases anyways) so it should be possible.

Yes. Seems doable.

Do you think it would make sense to keep the debugfs in sched still?

> 
> Since you mentioned you get an interrupt in LPAR before vCPU is
> scheduled out due to contention, perhaps this also allows for a way to
> add governors, and other heuristic along the line.
> 

No. When a vCPU gets scheduled out, there is no such interrupt. The guest
vCPU doesn't know.


* Re: [PATCH v4 10/20] sched/core: Push current task from non preferred CPU
  2026-06-18  4:09   ` K Prateek Nayak
@ 2026-06-18  6:05     ` Shrikanth Hegde
  0 siblings, 0 replies; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-18  6:05 UTC (permalink / raw)
  To: K Prateek Nayak, linux-kernel, mingo, peterz, juri.lelli,
	vincent.guittot, yury.norov, iii
  Cc: tglx, gregkh, pbonzini, seanjc, vschneid, huschle, rostedt,
	dietmar.eggemann, mgorman, bsegall, maddy, srikar, hdanton,
	chleroy, vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael



On 6/18/26 9:39 AM, K Prateek Nayak wrote:
> Hello Shrikanth,
> 
> On 6/17/2026 11:11 PM, Shrikanth Hegde wrote:
>> +static int sched_non_preferred_cpu_push_stop(void *arg)
>> +{
>> +	struct task_struct *p = arg;
>> +	struct rq *rq = this_rq();
>> +	struct rq_flags rf;
>> +	int cpu;
>> +
>> +	/* sanity check */
>> +	if (cpu_preferred(rq->cpu))
>> +		return 0;
> 
> I think this early return path should also clear "push_task_work_done"
> indicator, otherwise, we will fail to schedule the stopper on this CPU
> next time.

Ok.

> 
> Also, we may need to add a context_unsafe_alias(rq) call here to keep
> the context analysis bits happy similar to migration_cpu_stop().
> 

Let me see that.

>> +
>> +	raw_spin_lock_irq(&p->pi_lock);
>> +	rq_lock(rq, &rf);
>> +	rq->push_task_work_done = 0;
>> +
>> +	update_rq_clock(rq);
>> +
>> +	if (task_rq(p) == rq && task_on_rq_queued(p)) {
>> +		cpu = select_fallback_rq(rq->cpu, p);
> 
> Do we need a task_has_preferred_cpus() sanity check here?
> 
> If the affinity changed before the stopper grabbed the p->pi_lock, and
> there are no preferred CPUs to run on anymore, might as well keep the
> task here instead of migrating it away.
> 

No, I think.

If the affinity changed before the stopper started running,
the subsequent select_fallback_rq() will return a CPU the task can run on.
If it has no choice but a non-preferred CPU, it will return a non-preferred CPU too.

>> +		rq = __migrate_task(rq, &rf, p, cpu);
>> +	}
>> +
>> +	rq_unlock(rq, &rf);
>> +	raw_spin_unlock_irq(&p->pi_lock);
>> +	put_task_struct(p);
>> +
>> +	return 0;
>> +}
> 



* Re: [PATCH v4 14/20] sched/core: Introduce a simple steal monitor
  2026-06-18  6:01         ` Shrikanth Hegde
@ 2026-06-18  6:39           ` Yury Norov
  2026-06-18  6:45             ` Shrikanth Hegde
  0 siblings, 1 reply; 46+ messages in thread
From: Yury Norov @ 2026-06-18  6:39 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: K Prateek Nayak, Yury Norov, linux-kernel, mingo, peterz,
	juri.lelli, vincent.guittot, iii, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, mgorman, bsegall,
	maddy, srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

On Thu, Jun 18, 2026 at 11:31:17AM +0530, Shrikanth Hegde wrote:
> 
> 
> On 6/18/26 11:02 AM, K Prateek Nayak wrote:
> > Hello Shrikanth, Yury,
> > 
> > On 6/18/2026 10:14 AM, Shrikanth Hegde wrote:
> > > On 6/18/26 10:00 AM, Yury Norov wrote:
> > > > On Wed, Jun 17, 2026 at 11:11:33PM +0530, Shrikanth Hegde wrote:
> > > > > Start with a simple steal monitor.
> > > > > 
> > > > > It is meant to look at steal time and make the decision to
> > > > > reduce/increase the preferred CPUs.
> > > > > 
> > > > > It has
> > > > > - work function to execute the steal time calculations and decision
> > > > >     making periodically.
> > > > > - low and high thresholds for steal time.
> > > > > - sampling period to control the frequency of steal time calculations.
> > > > > - cache the previous decision to avoid oscillations
> > > > 
> > > > This monitor is the one implementation out of quite many possible,
> > > > right? I don't think it should live in the core scheduler files, it
> > > > should be a module.
> > 
> > I agree that this tight of an integration with the sched bits might not
> > not be required.
> > 
> > > 
> > > You mean similar to drivers/cpuidle/? a new one drivers/steal_monitor/ ?
> > 
> > Since steal time is a virtualization concept, somewhere in drivers/virt/
> > probably makes more sense unless we need some scheduler internal API to
> > implement it which shouldn't be the case.
> > 
> > All the driver has to do is track steal-time (which should be available
> > via kcpustat_cpu_fetch()) periodically (using a workqueue?) and should
> > do set_cpu_preferred() (which needs to be made available for other use
> > cases anyways) so it should be possible.
> 
> Yes. Seems like doable.
> 
> Do you think it would make sense to keep the debugfs in sched still?

The enable/disable part will be replaced with insmod/rmmod. The
statistics part - IDK. It is nice to have all stats at the same
place. On the other hand, without the driver loaded it would
always read zeroes. Anyway, it is just a single line in sched/core.c,
not a big deal.

Thanks,
Yury


* Re: [PATCH v4 14/20] sched/core: Introduce a simple steal monitor
  2026-06-18  6:39           ` Yury Norov
@ 2026-06-18  6:45             ` Shrikanth Hegde
  2026-06-18  7:16               ` Yury Norov
  0 siblings, 1 reply; 46+ messages in thread
From: Shrikanth Hegde @ 2026-06-18  6:45 UTC (permalink / raw)
  To: Yury Norov
  Cc: K Prateek Nayak, linux-kernel, mingo, peterz, juri.lelli,
	vincent.guittot, iii, tglx, gregkh, pbonzini, seanjc, vschneid,
	huschle, rostedt, dietmar.eggemann, mgorman, bsegall, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael




Hi Yury.

On 6/18/26 12:09 PM, Yury Norov wrote:
> On Thu, Jun 18, 2026 at 11:31:17AM +0530, Shrikanth Hegde wrote:
>>
>>
>> On 6/18/26 11:02 AM, K Prateek Nayak wrote:
>>> Hello Shrikanth, Yury,
>>>
>>> On 6/18/2026 10:14 AM, Shrikanth Hegde wrote:
>>>> On 6/18/26 10:00 AM, Yury Norov wrote:
>>>>> On Wed, Jun 17, 2026 at 11:11:33PM +0530, Shrikanth Hegde wrote:
>>>>>> Start with a simple steal monitor.
>>>>>>
>>>>>> It is meant to look at steal time and make the decision to
>>>>>> reduce/increase the preferred CPUs.
>>>>>>
>>>>>> It has
>>>>>> - work function to execute the steal time calculations and decision
>>>>>>      making periodically.
>>>>>> - low and high thresholds for steal time.
>>>>>> - sampling period to control the frequency of steal time calculations.
>>>>>> - cache the previous decision to avoid oscillations
>>>>>
>>>>> This monitor is the one implementation out of quite many possible,
>>>>> right? I don't think it should live in the core scheduler files, it
>>>>> should be a module.
>>>
>>> I agree that this tight of an integration with the sched bits might not
>>> be required.
>>>
>>>>
>>>> You mean similar to drivers/cpuidle/? a new one drivers/steal_monitor/ ?
>>>
>>> Since steal time is a virtualization concept, somewhere in drivers/virt/
>>> probably makes more sense unless we need some scheduler internal API to
>>> implement it which shouldn't be the case.
>>>
>>> All the driver has to do is track steal-time (which should be available
>>> via kcpustat_cpu_fetch()) periodically (using a workqueue?) and should
>>> do set_cpu_preferred() (which needs to be made available for other use
>>> cases anyways) so it should be possible.
>>
>> Yes. Seems doable.
>>
>> Do you think it would make sense to keep the debugfs in sched still?
> 
> The enable/disable part will be replaced with insmod/rmmod. The
> statistics part - I don't know. It is nice to have all stats in the
> same place. On the other hand, without the driver loaded it would
> always read zeroes. It is just a single line in sched/core.c anyway,
> not a big deal.
> 

I was asking about these debugfs knobs.

steal_monitor/high_threshold:500
steal_monitor/low_threshold:200
steal_monitor/sampling_period:1000


* Re: [PATCH v4 14/20] sched/core: Introduce a simple steal monitor
  2026-06-18  6:45             ` Shrikanth Hegde
@ 2026-06-18  7:16               ` Yury Norov
  0 siblings, 0 replies; 46+ messages in thread
From: Yury Norov @ 2026-06-18  7:16 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Yury Norov, K Prateek Nayak, linux-kernel, mingo, peterz,
	juri.lelli, vincent.guittot, iii, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, mgorman, bsegall,
	maddy, srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael

On Thu, Jun 18, 2026 at 12:15:16PM +0530, Shrikanth Hegde wrote:
> 
> 
> 
> Hi Yury.
> 
> On 6/18/26 12:09 PM, Yury Norov wrote:
> > On Thu, Jun 18, 2026 at 11:31:17AM +0530, Shrikanth Hegde wrote:
> > > 
> > > 
> > > On 6/18/26 11:02 AM, K Prateek Nayak wrote:
> > > > Hello Shrikanth, Yury,
> > > > 
> > > > On 6/18/2026 10:14 AM, Shrikanth Hegde wrote:
> > > > > On 6/18/26 10:00 AM, Yury Norov wrote:
> > > > > > On Wed, Jun 17, 2026 at 11:11:33PM +0530, Shrikanth Hegde wrote:
> > > > > > > Start with a simple steal monitor.
> > > > > > > 
> > > > > > > It is meant to look at steal time and make the decision to
> > > > > > > reduce/increase the preferred CPUs.
> > > > > > > 
> > > > > > > It has
> > > > > > > - work function to execute the steal time calculations and decision
> > > > > > >      making periodically.
> > > > > > > - low and high thresholds for steal time.
> > > > > > > - sampling period to control the frequency of steal time calculations.
> > > > > > > - cache the previous decision to avoid oscillations
> > > > > > 
> > > > > > This monitor is the one implementation out of quite many possible,
> > > > > > right? I don't think it should live in the core scheduler files, it
> > > > > > should be a module.
> > > > 
> > > > I agree that this tight of an integration with the sched bits might not
> > > > > be required.
> > > > 
> > > > > 
> > > > > You mean similar to drivers/cpuidle/? a new one drivers/steal_monitor/ ?
> > > > 
> > > > Since steal time is a virtualization concept, somewhere in drivers/virt/
> > > > probably makes more sense unless we need some scheduler internal API to
> > > > implement it which shouldn't be the case.
> > > > 
> > > > All the driver has to do is track steal-time (which should be available
> > > > via kcpustat_cpu_fetch()) periodically (using a workqueue?) and should
> > > > do set_cpu_preferred() (which needs to be made available for other use
> > > > cases anyways) so it should be possible.
> > > 
> > > Yes. Seems doable.
> > > 
> > > Do you think it would make sense to keep the debugfs in sched still?
> > 
> > > The enable/disable part will be replaced with insmod/rmmod. The
> > > statistics part - I don't know. It is nice to have all stats in the
> > > same place. On the other hand, without the driver loaded it would
> > > always read zeroes. It is just a single line in sched/core.c anyway,
> > > not a big deal.
> > 
> 
> I was asking about these debugfs knobs.
> 
> steal_monitor/high_threshold:500
> steal_monitor/low_threshold:200
> steal_monitor/sampling_period:1000

Those are the driver defaults. And if you want to override one, then:

insmod steal_monitor.ko high_threshold=400
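
For illustration, the threshold handling discussed in this thread can be
sketched in plain userspace C. This is a minimal sketch and not code from
the series: the names steal_cfg, steal_action and steal_decide() are
hypothetical, the strict comparisons are an assumption, and the unit of
the steal value is left abstract (it only has to match the unit of the
thresholds). The cached direction that the series uses to avoid
oscillations is not modeled here.

```c
#include <assert.h>

/* Hypothetical names; none of these come from the patch series. */
enum steal_action {
	STEAL_KEEP,	/* steal lies between the thresholds: change nothing */
	STEAL_REDUCE,	/* steal is high: shrink the set of preferred CPUs */
	STEAL_INCREASE,	/* steal is low: grow the set of preferred CPUs */
};

struct steal_cfg {
	unsigned int high_threshold;
	unsigned int low_threshold;
	unsigned int sampling_period;	/* unit is not stated in the thread */
};

/* Default values as listed in the thread (500 / 200 / 1000). */
const struct steal_cfg steal_cfg_default = {
	.high_threshold = 500,
	.low_threshold = 200,
	.sampling_period = 1000,
};

/*
 * Map one steal sample to a decision. Samples between the two
 * thresholds fall into a dead band and leave the preferred set alone.
 */
enum steal_action steal_decide(const struct steal_cfg *cfg, unsigned int steal)
{
	assert(cfg->low_threshold <= cfg->high_threshold);

	if (steal > cfg->high_threshold)
		return STEAL_REDUCE;
	if (steal < cfg->low_threshold)
		return STEAL_INCREASE;
	return STEAL_KEEP;
}
```

With the defaults, a sample of 450 falls into the dead band and changes
nothing. Lowering high_threshold to 400, as in the insmod example above,
would turn the same sample into a reduce decision.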


Thread overview: 46+ messages
2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 01/20] sched/debug: Remove unused schedstats Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 02/20] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 03/20] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
2026-06-18  0:51   ` Yury Norov
2026-06-18  3:44     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 04/20] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
2026-06-18  1:29   ` Yury Norov
2026-06-18  3:53     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 05/20] sysfs: Add preferred CPU file Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
2026-06-18  3:32   ` Yury Norov
     [not found]     ` <c4546759-b316-47e7-aa97-408e20d0f6ed@linux.ibm.com>
2026-06-18  4:49       ` Yury Norov
2026-06-18  5:14         ` Shrikanth Hegde
2026-06-18  3:49   ` K Prateek Nayak
2026-06-18  4:22     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 07/20] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 08/20] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
2026-06-18  3:03   ` K Prateek Nayak
2026-06-18  3:54     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 09/20] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 10/20] sched/core: Push current task from non preferred CPU Shrikanth Hegde
2026-06-18  4:09   ` K Prateek Nayak
2026-06-18  6:05     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 11/20] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 12/20] sched/debug: Create debugfs folder steal monitor Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 13/20] sched/debug: Provide debugfs to enable/disable " Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 14/20] sched/core: Introduce a simple " Shrikanth Hegde
2026-06-18  4:30   ` Yury Norov
2026-06-18  4:44     ` Shrikanth Hegde
2026-06-18  5:32       ` K Prateek Nayak
2026-06-18  6:01         ` Shrikanth Hegde
2026-06-18  6:39           ` Yury Norov
2026-06-18  6:45             ` Shrikanth Hegde
2026-06-18  7:16               ` Yury Norov
2026-06-17 17:41 ` [PATCH v4 15/20] sched/core: Compute steal values at regular intervals Shrikanth Hegde
2026-06-18  4:04   ` Yury Norov
2026-06-18  5:39     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 16/20] sched/core: Introduce default arch handling code for inc/dec preferred CPUs Shrikanth Hegde
     [not found]   ` <ajNwy25WYg45AQJX@yury>
2026-06-18  4:42     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 17/20] sched/core: Handle steal values and mark CPUs as preferred Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 18/20] sched/core: Mark the direction of steal values to avoid oscillations Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 19/20] sched/debug: Add debug knobs for steal monitor Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 20/20] sched/core: Add a few check for valid CPU in inc/dec of preferred CPUs Shrikanth Hegde
2026-06-18  4:21   ` Yury Norov
2026-06-18  4:40     ` Shrikanth Hegde
