public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff
@ 2026-04-07 19:19 Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 01/17] sched/debug: Remove unused schedstats Shrikanth Hegde
                   ` (18 more replies)
  0 siblings, 19 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

In virtualized environments there is often vCPU overcommit, i.e. the sum
of CPUs across all guests (virtual CPUs, aka vCPUs) exceeds the number of
underlying physical CPUs (managed by the host, aka pCPUs).

When many guests ask for CPU at the same time, the host/hypervisor cannot
satisfy all of them and has to preempt one vCPU to run another. If the
guests co-ordinate and ask for less CPU overall, fewer vCPU threads are
runnable on the host, and vCPU preemption goes down.

Steal time is an indication of the underlying contention. If the guests
reduce their vCPU demand proportionally to it, the desired outcome is
achieved.
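For reference, a guest can read its own accumulated steal time from the
aggregate "cpu" line of /proc/stat (see proc(5)):

```shell
# Print the guest's aggregate steal time in clock ticks (USER_HZ).
# On the first /proc/stat line the fields after "cpu" are:
# user nice system idle iowait irq softirq steal guest guest_nice
awk '/^cpu /{print "steal ticks:", $9}' /proc/stat
```

On an uncontended host this stays near zero; under pCPU contention it
keeps growing over time.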

An added advantage is reduced lock-holder preemption. A vCPU may be
holding a spinlock and still get preempted. Such cases become rarer since
there is less vCPU preemption, and the lock holder runs to completion
because it has disabled preemption in the guest. A workload could run
with time-slice extension to reduce lock-holder preemption for userspace
locks, and this series could help reduce lock-holder preemption caused by
vCPU preemption even for kernel-space locks.

Currently there is no infrastructure in the scheduler which moves tasks
off some CPUs without breaking userspace affinities. CPU hotplug and
isolated cpusets can move tasks off CPUs at runtime, but if a task is
affined to specific CPUs, taking those CPUs away resets its affinity
list. That breaks user affinities, and since here the move is driven by
the scheduler rather than the user, that is not acceptable. Hence a new,
preferably lightweight, infrastructure is needed.

The core idea is:
- Maintain the set of CPUs which can be used by the workload, denoted
  cpu_preferred_mask.
- Periodically compute the steal time. If it is above/below the
  thresholds, reduce/increase the number of preferred CPUs.
- If a CPU is marked as non-preferred, push away the task running on it
  if possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
  within preferred CPUs.
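As a toy userspace model of the idea above (not the kernel
implementation; HIGH/LOW/MAX are made-up values standing in for the
series' debug knobs), the per-period decision looks roughly like:

```shell
# Toy model: one steal sample per period; shrink the preferred set by
# one core when steal is above HIGH, grow it again when below LOW.
HIGH=200
LOW=100
MAX=8
cores=$MAX
for steal in 250 250 50 50; do
    if [ "$steal" -gt "$HIGH" ] && [ "$cores" -gt 1 ]; then
        cores=$((cores - 1))    # high contention: back off
    elif [ "$steal" -lt "$LOW" ] && [ "$cores" -lt "$MAX" ]; then
        cores=$((cores + 1))    # contention gone: grow back
    fi
    echo "steal=$steal preferred_cores=$cores"
done
```

With these sample values preferred_cores walks down from 8 to 6 and back
up to 8, which is the backoff/recovery behaviour described above.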

The host kernel sees no steal time, so its preferred CPUs never change.
The series therefore affects only guest kernels.

The current series implements a simple steal time monitor, which
reduces/increases the number of cores by 1 depending on the steal time.
It also implements a very simple method to avoid oscillations. If more
complex mechanisms are needed, implementing them via steal time
governors may be an idea. The sched feature STEAL_MONITOR must be
enabled for the steal time values to be processed and the preferred
CPUs to be set. On most systems, where there is no steal time,
preferred CPUs will be the same as online CPUs.
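Assuming STEAL_MONITOR is wired into the usual sched features debugfs
file like other sched features (hypothetical usage; requires debugfs to
be mounted), it would be toggled like this:

```shell
# Enable the steal monitor via the standard sched features interface,
# and disable it again with the NO_ prefix (assumed plumbing).
echo STEAL_MONITOR > /sys/kernel/debug/sched/features
echo NO_STEAL_MONITOR > /sys/kernel/debug/sched/features
```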

I will attach the irqbalance patch which detects changes in this mask
and re-adjusts the irq affinities. The series doesn't address the
irqbalance=n case, assuming most distros enable irqbalance by default.

Discussion at LPC 2025:
https://www.youtube.com/watch?v=sZKpHVUUy1g

*** Please provide your suggestions and comments ***

=====================================================================
Patch Layout:
PATCH    01: Remove stale schedstats. Independent of the series.
PATCH 02-04: Introduce cpu_preferred_mask.
PATCH 05-09: Make scheduler aware of this mask.
PATCH    10: Push the current task in sched_tick if cpu is non-preferred.
PATCH    11: Add a new schedstat.
PATCH    12: Add a new sched feature: STEAL_MONITOR
PATCH 13-17: Periodically calculating steal time and take appropriate
             action.

======================================================================
Performance Numbers:
baseline: tip/master at 8a5f70eb7e4f (Merge branch into tip/master: 'x86/tdx')

On PowerPC with the PowerVM hypervisor:
+++++++++
Daytrader
+++++++++ 
It is a database workload which simulates live stock trading.
There are two VMs. The same workload is run in both VMs at the same time.
VM1 is bigger than VM2.

Note: VM1 sees 20% steal time, and VM2 sees 10% steal time with
baseline.


(with series: STEAL_MONITOR=y and Default debug steal_mon values)
On VM1:
			baseline		with_series
Throughput		1x			1.3x 
On VM2:
                        baseline                with_series
Throughput              1x                      1.1x


(with series: STEAL_MONITOR=y and Period 100, High 200, Low 100)
On VM1:
                        baseline                with_series
Throughput:             1x                      1.45x
On VM2:
                        baseline                with_series
Throughput:             1x                      1.13x

Verdict: Shows good improvement with default values. Even better when
the debug knobs are tuned.

+++++++++
Hackbench 
+++++++++
(with series: STEAL_MONITOR=y and Period 100, High 200, Low 100)
On VM1:
			baseline		with_series
10 groups		10.3			 8.5
30 groups		40.8			25.5
60 groups		77.2			47.8

on VM2:
			baseline		with_series
10 groups		 8.4			 7.5
30 groups		25.3			19.8
60 groups		41.7			36.3

Verdict: With tuned values, shows very good improvement.

==========================================================================
Since v1:
- A new name: Preferred CPUs and cpu_preferred_mask.
  I had initially used the name "Usable CPUs", but this seemed better.
  I considered pv_preferred too, but dropped it as it could be too long.

- Arch-independent code. Everything happens in the scheduler. Steal
  time is a generic construct, and this helps avoid each architecture
  doing more or less the same thing. Dropped the powerpc code.

- Removed hacks around wakeups. Made it part of available_idle_cpu,
  which takes care of many of the wakeup decisions. Same for the rt code.

- Implemented a work function to calculate the steal times and enforce
  the policy decisions. This ensures sched_tick doesn't suffer any major
  latency.

- Steal time computation is gated behind the sched feature STEAL_MONITOR
  to avoid any overhead on systems which don't have vCPU overcommit.
  The feature is disabled by default.

- CPU_CAPACITY=1 was not considered since one needs the state of all
  CPUs which have this special value, and computing that in the hot path
  is not ideal.

- Using cpuset was not considered since it is quite tricky, given there
  are different versions and cgroups is natively user driven.

v1: https://lore.kernel.org/all/20251119124449.1149616-1-sshegde@linux.ibm.com/#t
earlier versions: https://lore.kernel.org/all/236f4925-dd3c-41ef-be04-47708c9ce129@linux.ibm.com/

TODO:
- Splicing of CPUs across NUMA nodes when CPUs aren't split equally.
- irq affinity when irqbalance=n. Not sure if this is worth it.
- Avoid running unbound housekeeping work on non-preferred CPUs, such
  as in find_new_ilb. Tried it, but it showed a small regression in the
  no-noise case, so it wasn't included.
- This currently works only for kernels built with CONFIG_SCHED_SMT.
  Didn't want to sprinkle too many ifdefs there. Not sure if there is
  any system which needs this feature but !SMT; if so, let me know.
  Seeing those ifdefs makes me wonder: maybe CONFIG_SCHED_SMT could be
  cleaned up with cpumask_of(cpu) in the !SMT case?
- Performance numbers for KVM on x86 and s390.

Sorry for sending it this late. This is the series meant for
discussion at OSPM 2026.


Shrikanth Hegde (17):
  sched/debug: Remove unused schedstats
  sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  cpumask: Introduce cpu_preferred_mask
  sysfs: Add preferred CPU file
  sched/core: allow only preferred CPUs in is_cpu_allowed
  sched/fair: Select preferred CPU at wakeup when possible
  sched/fair: load balance only among preferred CPUs
  sched/rt: Select a preferred CPU for wakeup and pulling rt task
  sched/core: Keep tick on non-preferred CPUs until tasks are out
  sched/core: Push current task from non preferred CPU
  sched/debug: Add migration stats due to non preferred CPUs
  sched/feature: Add STEAL_MONITOR feature
  sched/core: Introduce a simple steal monitor
  sched/core: Compute steal values at regular intervals
  sched/core: Handle steal values and mark CPUs as preferred
  sched/core: Mark the direction of steal values to avoid oscillations
  sched/debug: Add debug knobs for steal monitor

 .../ABI/testing/sysfs-devices-system-cpu      |  11 +
 Documentation/scheduler/sched-arch.rst        |  48 ++++
 Documentation/scheduler/sched-debug.rst       |  27 +++
 drivers/base/cpu.c                            |  12 +
 include/linux/cpumask.h                       |  22 ++
 include/linux/sched.h                         |   4 +-
 kernel/cpu.c                                  |   6 +
 kernel/sched/core.c                           | 219 +++++++++++++++++-
 kernel/sched/cpupri.c                         |   4 +
 kernel/sched/debug.c                          |  10 +-
 kernel/sched/fair.c                           |   8 +-
 kernel/sched/features.h                       |   3 +
 kernel/sched/rt.c                             |   4 +
 kernel/sched/sched.h                          |  41 ++++
 14 files changed, 409 insertions(+), 10 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 01/17] sched/debug: Remove unused schedstats
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 02/17] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

nr_migrations_cold, nr_wakeups_passive and nr_wakeups_idle are not
updated anywhere, so remove them.

These are per-process stats, so updating the schedstats version isn't
necessary.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/sched.h | 3 ---
 kernel/sched/debug.c  | 3 ---
 2 files changed, 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ec3b6d7d718..1eb3825bcaeb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,7 +550,6 @@ struct sched_statistics {
 	s64				exec_max;
 	u64				slice_max;
 
-	u64				nr_migrations_cold;
 	u64				nr_failed_migrations_affine;
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
@@ -563,8 +562,6 @@ struct sched_statistics {
 	u64				nr_wakeups_remote;
 	u64				nr_wakeups_affine;
 	u64				nr_wakeups_affine_attempts;
-	u64				nr_wakeups_passive;
-	u64				nr_wakeups_idle;
 
 #ifdef CONFIG_SCHED_CORE
 	u64				core_forceidle_sum;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 74c1617cf652..f8a43fc13564 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1301,7 +1301,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(wait_count);
 		PN_SCHEDSTAT(iowait_sum);
 		P_SCHEDSTAT(iowait_count);
-		P_SCHEDSTAT(nr_migrations_cold);
 		P_SCHEDSTAT(nr_failed_migrations_affine);
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
@@ -1313,8 +1312,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_wakeups_remote);
 		P_SCHEDSTAT(nr_wakeups_affine);
 		P_SCHEDSTAT(nr_wakeups_affine_attempts);
-		P_SCHEDSTAT(nr_wakeups_passive);
-		P_SCHEDSTAT(nr_wakeups_idle);
 
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
-- 
2.47.3



* [PATCH v2 02/17] sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 01/17] sched/debug: Remove unused schedstats Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 03/17] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

Add documentation for the new cpumask called cpu_preferred_mask. This
should help users understand the mask and the concept behind it.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/scheduler/sched-arch.rst | 48 ++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..2e926c7afc8f 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,54 @@ Your cpu_idle routines need to obey the following rules:
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
 
+Preferred CPUs
+==============
+
+In virtualised environments it is possible to overcommit CPU resources,
+i.e. the sum of virtual CPUs (vCPUs) of all VMs is greater than the number
+of physical CPUs (pCPUs). Under such conditions, when all or many VMs have
+high utilization, the hypervisor cannot satisfy the CPU requirement and has
+to context switch within or across VMs, i.e. it needs to preempt one vCPU
+to run another. This is called vCPU preemption and is more expensive than
+a task context switch within a vCPU.
+
+In such cases it is better that VMs co-ordinate among themselves and ask
+for less CPU by not using some of the vCPUs. vCPUs where the workload can
+be safely scheduled without increasing contention for pCPUs are called
+"Preferred CPUs".
+
+In most cases preferred CPUs will be the same as online CPUs. When there is
+pCPU contention, the preferred CPUs will shrink based on the amount of steal
+time. When the pCPU contention goes away, as indicated by steal time, the
+preferred CPUs will become the same as online CPUs again. This is done by a
+new scheduler feature called STEAL_MONITOR.
+
+Scheduling decisions such as wakeup, task pushing etc. need this CPU
+state information. It is maintained in cpu_preferred_mask.
+
+vCPUs which are not in cpu_preferred_mask should be treated as CPUs that
+should not be used at this moment, provided that doesn't break user
+affinity. This is achieved by:
+1. Selecting only a preferred CPU at wakeup.
+2. Pushing the task away from a non-preferred CPU at tick.
+3. Selecting only preferred CPUs for load balance.
+
+This works only for SCHED_RT and SCHED_NORMAL. SCHED_EXT and userspace can
+make choices using cpu_preferred_mask.
+
+/sys/devices/system/cpu/preferred prints the current cpu_preferred_mask in
+cpulist format.
+
+Notes:
+1. This feature is available under CONFIG_PARAVIRT.
+2. Preferred CPUs are the same as online CPUs until STEAL_MONITOR is enabled.
+3. A pinned task which can't be moved to preferred CPUs will continue
+   to run based on its affinity, but no load balancing happens for it.
+4. If needed, steal time based governors or arch dependent methods
+   could be used to cater to different CPU counts.
+5. The decision to use or not use a CPU is driven by the kernel, hence it
+   must not break user affinities. This is one of the main reasons why CPU
+   hotplug or isolated cpuset partitions were not a solution.
 
 Possible arch/ problems
 =======================
-- 
2.47.3



* [PATCH v2 03/17] cpumask: Introduce cpu_preferred_mask
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 01/17] sched/debug: Remove unused schedstats Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 02/17] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 20:27   ` Yury Norov
  2026-04-07 19:19 ` [PATCH v2 04/17] sysfs: Add preferred CPU file Shrikanth Hegde
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

This patch:
- Declares and defines cpu_preferred_mask.
- Adds get/set helpers for it.

The values are set/cleared by the scheduler based on the steal time
values.

A CPU is marked preferred when it comes online. It may later be marked
non-preferred depending on steal time values when STEAL_MONITOR is
enabled.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/cpumask.h | 22 ++++++++++++++++++++++
 kernel/cpu.c            |  6 ++++++
 kernel/sched/core.c     |  5 +++++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 80211900f373..80c5cc13b8ad 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -1296,6 +1296,28 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 
 #endif /* NR_CPUS > 1 */
 
+/*
+ * All related wrappers kept together to avoid too many ifdefs
+ * See Documentation/scheduler/sched-arch.rst for details
+ */
+#ifdef CONFIG_PARAVIRT
+extern struct cpumask __cpu_preferred_mask;
+#define cpu_preferred_mask    ((const struct cpumask *)&__cpu_preferred_mask)
+#define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
+
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+	return cpumask_test_cpu(cpu, cpu_preferred_mask);
+}
+#else
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+	return true;
+}
+
+static __always_inline void set_cpu_preferred(unsigned int cpu, bool preferred) { }
+#endif
+
 #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
 
 #if NR_CPUS <= BITS_PER_LONG
diff --git a/kernel/cpu.c b/kernel/cpu.c
index bc4f7a9ba64e..2d4d037680d4 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3137,6 +3137,12 @@ void set_cpu_online(unsigned int cpu, bool online)
 		if (cpumask_test_and_clear_cpu(cpu, &__cpu_online_mask))
 			atomic_dec(&__num_online_cpus);
 	}
+
+	/*
+	 * An online CPU is by default assumed to be preferred,
+	 * until STEAL_MONITOR changes it.
+	 */
+	set_cpu_preferred(cpu, online);
 }
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f351296922ac..7ea05a7a717b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11228,3 +11228,8 @@ void sched_change_end(struct sched_change_ctx *ctx)
 		p->sched_class->prio_changed(rq, p, ctx->prio);
 	}
 }
+
+#ifdef CONFIG_PARAVIRT
+struct cpumask __cpu_preferred_mask __read_mostly;
+EXPORT_SYMBOL(__cpu_preferred_mask);
+#endif
-- 
2.47.3



* [PATCH v2 04/17] sysfs: Add preferred CPU file
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (2 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 03/17] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 05/17] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

Add a "preferred" file in /sys/devices/system/cpu.

This allows:
- Users to quickly check which CPUs are marked as preferred at this
  moment.
- Userspace tools such as irqbalance to use this mask to direct irqs to
  preferred CPUs.

For example:
cat /sys/devices/system/cpu/online
0-719
cat /sys/devices/system/cpu/preferred
0-599        <<< Implies 0-599 are preferred for workloads and 600-719
                 should be avoided at this moment.

cat /sys/devices/system/cpu/preferred
0-719        <<< All CPUs are usable. There is no preference.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 11 +++++++++++
 drivers/base/cpu.c                                 | 12 ++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 3a05604c21bf..ffa05605923b 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -788,3 +788,14 @@ Date:		Nov 2022
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:
 		(RO) the list of CPUs that can be brought online.
+
+What:		/sys/devices/system/cpu/preferred
+Date:		Apr 2026
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		(RO) the list of preferred CPUs at this moment. These are
+		the only CPUs meant to be used right now. Using a CPU
+		outside of this list could lead to more contention for the
+		underlying physical CPU resources. The list changes
+		dynamically to reflect the current situation via the
+		STEAL_MONITOR scheduler feature. Requires CONFIG_PARAVIRT=y.
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 875abdc9942e..0a6cf37f2001 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -391,6 +391,15 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
 }
 #endif
 
+#ifdef CONFIG_PARAVIRT
+static ssize_t preferred_show(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_preferred_mask));
+}
+static DEVICE_ATTR_RO(preferred);
+#endif
+
 const struct bus_type cpu_subsys = {
 	.name = "cpu",
 	.dev_name = "cpu",
@@ -531,6 +540,9 @@ static struct attribute *cpu_root_attrs[] = {
 #endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
+#endif
+#ifdef CONFIG_PARAVIRT
+	&dev_attr_preferred.attr,
 #endif
 	NULL
 };
-- 
2.47.3



* [PATCH v2 05/17] sched/core: allow only preferred CPUs in is_cpu_allowed
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (3 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 04/17] sysfs: Add preferred CPU file Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-08  1:05   ` Yury Norov
  2026-04-07 19:19 ` [PATCH v2 06/17] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

When possible, pick a preferred CPU.

The push task mechanism uses the stopper thread, which calls
select_fallback_rq and hence uses this check to pick only a preferred CPU.

When a task is affined only to non-preferred CPUs, it should continue to
run there. Detect that by checking whether cpus_ptr and
cpu_preferred_mask intersect.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c  | 17 ++++++++++++++---
 kernel/sched/sched.h | 12 ++++++++++++
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7ea05a7a717b..336e7c694eb7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2463,9 +2463,16 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (is_migration_disabled(p))
 		return cpu_online(cpu);
 
-	/* Non kernel threads are not allowed during either online or offline. */
-	if (!(p->flags & PF_KTHREAD))
-		return cpu_active(cpu);
+	/*
+	 * Non kernel threads are not allowed during either online or offline.
+	 * Ensure it is a preferred CPU to avoid further contention
+	 */
+	if (!(p->flags & PF_KTHREAD)) {
+		if (!cpu_active(cpu))
+			return false;
+		if (!cpu_preferred(cpu) && task_can_run_on_preferred_cpu(p))
+			return false;
+	}
 
 	/* KTHREAD_IS_PER_CPU is always allowed. */
 	if (kthread_is_per_cpu(p))
@@ -2475,6 +2482,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (cpu_dying(cpu))
 		return false;
 
+	/* Try on preferred CPU first */
+	if (!cpu_preferred(cpu) && task_can_run_on_preferred_cpu(p))
+		return false;
+
 	/* But are allowed during online. */
 	return cpu_online(cpu);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 88e0c93b9e21..7271af2ca64f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4130,4 +4130,16 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
 
 #include "ext.h"
 
+#ifdef CONFIG_PARAVIRT
+static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
+{
+	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
+}
+#else
+static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
+{
+	return true;
+}
+#endif
+
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3



* [PATCH v2 06/17] sched/fair: Select preferred CPU at wakeup when possible
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (4 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 05/17] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 07/17] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

Update available_idle_cpu to consider preferred CPUs. This takes care
of many of the wakeup decisions to use only preferred CPUs, so there is
no need to put explicit checks everywhere.

The only other place where a non-preferred prev_cpu could be returned
was sched_balance_find_dst_cpu. Put the check there.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c  | 3 ++-
 kernel/sched/sched.h | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 86ef9ce39b61..22010afb4c1d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7698,7 +7698,8 @@ static inline int sched_balance_find_dst_cpu(struct sched_domain *sd, struct tas
 {
 	int new_cpu = cpu;
 
-	if (!cpumask_intersects(sched_domain_span(sd), p->cpus_ptr))
+	if (!cpumask_intersects(sched_domain_span(sd), p->cpus_ptr) &&
+	    cpu_preferred(prev_cpu))
 		return prev_cpu;
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7271af2ca64f..4c45092b2fce 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1412,6 +1412,9 @@ static inline bool available_idle_cpu(int cpu)
 	if (!idle_rq(cpu_rq(cpu)))
 		return 0;
 
+	if (!cpu_preferred(cpu))
+		return 0;
+
 	if (vcpu_is_preempted(cpu))
 		return 0;
 
-- 
2.47.3



* [PATCH v2 07/17] sched/fair: load balance only among preferred CPUs
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (5 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 06/17] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 08/17] sched/rt: Select a preferred CPU for wakeup and pulling rt task Shrikanth Hegde
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

Consider only preferred CPUs for load balance.

With this, load balance will end up choosing a preferred CPU to pull
to. This makes it not fight against the push task mechanism which
happens at tick. It also stops active balance from happening with a
non-preferred CPU pulling the load.

This means there is no load balancing if a task is pinned only to
non-preferred CPUs. It will continue to run where it was running before
those CPUs were marked as non-preferred.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22010afb4c1d..e4571bd71a44 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12058,6 +12058,11 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
+#ifdef CONFIG_PARAVIRT
+	/* Spread load among preferred CPUs */
+	cpumask_and(cpus, cpus, cpu_preferred_mask);
+#endif
+
 	schedstat_inc(sd->lb_count[idle]);
 
 redo:
-- 
2.47.3



* [PATCH v2 08/17] sched/rt: Select a preferred CPU for wakeup and pulling rt task
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (6 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 07/17] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 09/17] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

For the RT class:
- During wakeup, choose a preferred CPU.
- For the push_rt framework, limit pushing to preferred CPUs.
- Pull an rt task only if this CPU is preferred.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/cpupri.c | 4 ++++
 kernel/sched/rt.c     | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 8f2237e8b484..3e3106690ff3 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -104,6 +104,10 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
 		cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
 		cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
 
+#ifdef CONFIG_PARAVIRT
+		cpumask_and(lowest_mask, lowest_mask, cpu_preferred_mask);
+#endif
+
 		/*
 		 * We have to ensure that we have at least one bit
 		 * still set in the array, since the map could have
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a48e86794913..0c8cc8555287 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2262,6 +2262,10 @@ static void pull_rt_task(struct rq *this_rq)
 	if (likely(!rt_overload_count))
 		return;
 
+	/* No point in pulling the load, just to push it next tick again */
+	if (!cpu_preferred(this_cpu))
+		return;
+
 	/*
 	 * Match the barrier from rt_set_overloaded; this guarantees that if we
 	 * see overloaded we must also see the rto_mask bit.
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 09/17] sched/core: Keep tick on non-preferred CPUs until tasks are out
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (7 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 08/17] sched/rt: Select a preferred CPU for wakeup and pulling rt task Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 10/17] sched/core: Push current task from non preferred CPU Shrikanth Hegde
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

Enable the tick on a nohz_full CPU when it is marked as non-preferred.
If there is no CFS/RT task running there, disable the tick to save power.

Steal time handling code will call tick_nohz_dep_set_cpu() with
TICK_DEP_BIT_SCHED to move tasks off the nohz_full CPU quickly.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 336e7c694eb7..c7f046443dc5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1407,6 +1407,10 @@ bool sched_can_stop_tick(struct rq *rq)
 	if (rq->dl.dl_nr_running)
 		return false;
 
+	/* Keep the tick running until both RT and CFS are pushed out */
+	if (!cpu_preferred(rq->cpu) && (rq->rt.rt_nr_running || rq->cfs.h_nr_queued))
+		return false;
+
 	/*
 	 * If there are more than one RR tasks, we need the tick to affect the
 	 * actual RR behaviour.
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 10/17] sched/core: Push current task from non preferred CPU
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (8 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 09/17] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 11/17] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

Actively push out an RT/CFS task running on a non-preferred CPU. Since
the task is running on the CPU, we need to stop the CPU and push the
task out. However, if the task is pinned only to non-preferred CPUs, it
will continue running there. This helps maintain userspace affinities,
unlike CPU hotplug or isolated cpusets.

Though the code is almost the same as __balance_push_cpu_stop and quite
close to push_cpu_stop, it is kept separate as it provides a cleaner
implementation w.r.t. the PARAVIRT config.

Add a push_task_work_done flag to protect the work buffer.
This currently works only for the fair and rt classes.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c  | 84 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  8 +++++
 2 files changed, 92 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c7f046443dc5..b375c500d49e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5652,6 +5652,10 @@ void sched_tick(void)
 	unsigned long hw_pressure;
 	u64 resched_latency;
 
+	/* Push the current CFS/RT task out if it's on a non-preferred CPU */
+	if (!cpu_preferred(cpu))
+		sched_push_current_non_preferred_cpu(rq);
+
 	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
 		arch_scale_freq_tick();
 
@@ -11247,4 +11251,84 @@ void sched_change_end(struct sched_change_ctx *ctx)
 #ifdef CONFIG_PARAVIRT
 struct cpumask __cpu_preferred_mask __read_mostly;
 EXPORT_SYMBOL(__cpu_preferred_mask);
+
+/* npc - non preferred CPU */
+static DEFINE_PER_CPU(struct cpu_stop_work, npc_push_task_work);
+
+static int sched_non_preferred_cpu_push_stop(void *arg)
+{
+	struct task_struct *p = arg;
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+	int cpu;
+
+	raw_spin_lock_irq(&p->pi_lock);
+	rq_lock(rq, &rf);
+	rq->push_task_work_done = 0;
+
+	update_rq_clock(rq);
+
+	if (task_rq(p) == rq && task_on_rq_queued(p)) {
+		cpu = select_fallback_rq(rq->cpu, p);
+		rq = __migrate_task(rq, &rf, p, cpu);
+	}
+
+	rq_unlock(rq, &rf);
+	raw_spin_unlock_irq(&p->pi_lock);
+	put_task_struct(p);
+
+	return 0;
+}
+
+/*
+ * Using this CPU will lead to more hypervisor preemptions,
+ * so it is better not to use it.  In case any task is
+ * scheduled on such a CPU, move it out; select_fallback_rq()
+ * will choose a preferred CPU, and henceforth the task
+ * shouldn't come back to this CPU.
+ */
+void sched_push_current_non_preferred_cpu(struct rq *rq)
+{
+	struct task_struct *push_task = rq->curr;
+	unsigned long flags;
+	struct rq_flags rf;
+
+	/* sanity check */
+	if (cpu_preferred(rq->cpu))
+		return;
+
+	/* Idle task can't be pushed out */
+	if (rq->curr == rq->idle)
+		return;
+
+	/* Do this only for SCHED_NORMAL and RT for now */
+	if (push_task->sched_class != &fair_sched_class &&
+	    push_task->sched_class != &rt_sched_class)
+		return;
+
+	if (kthread_is_per_cpu(push_task) ||
+	    is_migration_disabled(push_task))
+		return;
+
+	/* Is there any preferred CPU in the affinity list */
+	if (!task_can_run_on_preferred_cpu(push_task))
+		return;
+
+	/* There is already a stopper thread for this. Don't race with it */
+	if (rq->push_task_work_done == 1)
+		return;
+
+	local_irq_save(flags);
+
+	get_task_struct(push_task);
+
+	rq_lock(rq, &rf);
+	rq->push_task_work_done = 1;
+	rq_unlock(rq, &rf);
+
+	stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
+			    push_task, this_cpu_ptr(&npc_push_task_work));
+	local_irq_restore(flags);
+}
+
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c45092b2fce..c1d037f11c62 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1239,6 +1239,10 @@ struct rq {
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
 
+#ifdef CONFIG_PARAVIRT
+	bool			push_task_work_done;
+#endif
+
 	unsigned long		misfit_task_load;
 
 	/* For active balancing */
@@ -4138,11 +4142,15 @@ static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
 {
 	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
 }
+
+void sched_push_current_non_preferred_cpu(struct rq *rq);
 #else
 static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
 {
 	return true;
 }
+
+static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
 #endif
 
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 11/17] sched/debug: Add migration stats due to non preferred CPUs
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (9 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 10/17] sched/core: Push current task from non preferred CPU Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 12/17] sched/feature: Add STEAL_MONITOR feature Shrikanth Hegde
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

Add new stats.
- nr_migrations_cpu_non_preferred: number of migrations that happened
  because a CPU was marked as non-preferred due to high steal time.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 1 +
 kernel/sched/debug.c  | 1 +
 3 files changed, 3 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1eb3825bcaeb..6c0d5d36f21c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -554,6 +554,7 @@ struct sched_statistics {
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;
+	u64				nr_migrations_cpu_non_preferred;
 
 	u64				nr_wakeups;
 	u64				nr_wakeups_sync;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b375c500d49e..7a9442439eb2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11321,6 +11321,7 @@ void sched_push_current_non_preferred_cpu(struct rq *rq)
 	local_irq_save(flags);
 
 	get_task_struct(push_task);
+	schedstat_inc(push_task->stats.nr_migrations_cpu_non_preferred);
 
 	rq_lock(rq, &rf);
 	rq->push_task_work_done = 1;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f8a43fc13564..482c86a0ff80 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1305,6 +1305,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
 		P_SCHEDSTAT(nr_forced_migrations);
+		P_SCHEDSTAT(nr_migrations_cpu_non_preferred);
 		P_SCHEDSTAT(nr_wakeups);
 		P_SCHEDSTAT(nr_wakeups_sync);
 		P_SCHEDSTAT(nr_wakeups_migrate);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 12/17] sched/feature: Add STEAL_MONITOR feature
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (10 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 11/17] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 13/17] sched/core: Introduce a simple steal monitor Shrikanth Hegde
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

Add a new sched feature to gate steal time calculation.
Computing steal time and acting on it periodically are opt-in for the
user. This helps avoid any overhead when the feature is disabled.

It is disabled by default.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/features.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 84c4fe3abd74..08208bec3dd2 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -134,3 +134,6 @@ SCHED_FEAT(LATENCY_WARN, false)
  */
 SCHED_FEAT(NI_RANDOM, true)
 SCHED_FEAT(NI_RATE, true)
+
+/* Generic steal time monitor to act on cpu_preferred_mask */
+SCHED_FEAT(STEAL_MONITOR, false)
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 13/17] sched/core: Introduce a simple steal monitor
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (11 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 12/17] sched/feature: Add STEAL_MONITOR feature Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 14/17] sched/core: Compute steal values at regular intervals Shrikanth Hegde
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

Start with a simple steal monitor.

It is meant to look at steal time and decide whether to reduce or
increase the preferred CPUs.

It has:
- a work function to execute the steal time calculations and decision
  making periodically.
- a temporary cpumask used in the work function. This avoids cpumask
  allocation in the periodic work function.
- low and high thresholds for steal time.
- a sampling period to control the frequency of steal time calculations.
- a cache of the previous decision to avoid oscillations.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c  | 23 +++++++++++++++++++++++
 kernel/sched/sched.h | 14 ++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7a9442439eb2..8c80600ddd28 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9083,6 +9083,8 @@ void __init sched_init(void)
 
 	preempt_dynamic_init();
 
+	sched_init_steal_monitor();
+
 	scheduler_running = 1;
 }
 
@@ -11332,4 +11334,25 @@ void sched_push_current_non_preferred_cpu(struct rq *rq)
 	local_irq_restore(flags);
 }
 
+struct steal_monitor_t steal_mon;
+
+void sched_init_steal_monitor(void)
+{
+	INIT_WORK(&steal_mon.work, sched_steal_detection_work);
+	zalloc_cpumask_var(&steal_mon.tmp_mask, GFP_NOWAIT);
+	steal_mon.low_threshold       = 200;		/* 2% steal time */
+	steal_mon.high_threshold      = 500;		/* 5% steal time */
+	steal_mon.sampling_period_ms  = 1000;		/* once per second */
+}
+
+/* This is only a skeleton. Subsequent patches introduce more of it */
+void sched_steal_detection_work(struct work_struct *work)
+{
+	struct steal_monitor_t *sm = container_of(work, struct steal_monitor_t, work);
+	ktime_t now;
+
+	/* Update the prev_time for next iteration */
+	now = ktime_get();
+	sm->prev_time = now;
+}
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c1d037f11c62..c0fbfb04eda3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4138,12 +4138,25 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
 #include "ext.h"
 
 #ifdef CONFIG_PARAVIRT
+struct steal_monitor_t {
+	struct work_struct  work;
+	cpumask_var_t tmp_mask;
+	ktime_t prev_time;
+	u64 prev_steal;
+	int previous_decision;
+	unsigned int low_threshold;
+	unsigned int high_threshold;
+	unsigned int sampling_period_ms;
+};
+
 static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
 {
 	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
 }
 
 void sched_push_current_non_preferred_cpu(struct rq *rq);
+void sched_init_steal_monitor(void);
+void sched_steal_detection_work(struct work_struct *work);
 #else
 static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
 {
@@ -4151,6 +4164,7 @@ static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
 }
 
 static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
+static inline void sched_init_steal_monitor(void) { }
 #endif
 
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 14/17] sched/core: Compute steal values at regular intervals
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (12 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 13/17] sched/core: Introduce a simple steal monitor Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 15/17] sched/core: Handle steal values and mark CPUs as preferred Shrikanth Hegde
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

Kick off the work to compute steal time at regular intervals.
It is gated by the sched feature STEAL_MONITOR to avoid any overhead on
systems that are not interested in it.

The sampling period can be configured at runtime using steal_mon_period.
The default is 1000 milliseconds, i.e. 1 second.

This work is done by the first online housekeeping CPU only, hence it
doesn't need any complicated synchronization.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c  | 27 +++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 29 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8c80600ddd28..1c6fcf1ae4fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5694,6 +5694,10 @@ void sched_tick(void)
 		rq->idle_balance = idle_cpu(cpu);
 		sched_balance_trigger(rq);
 	}
+
+	/* This feature currently works only on SMT systems */
+	if (sched_feat(STEAL_MONITOR) && IS_ENABLED(CONFIG_SCHED_SMT))
+		sched_trigger_steal_computation(cpu);
 }
 
 #ifdef CONFIG_NO_HZ_FULL
@@ -11355,4 +11359,27 @@ void sched_steal_detection_work(struct work_struct *work)
 	now = ktime_get();
 	sm->prev_time = now;
 }
+
+void sched_trigger_steal_computation(int cpu)
+{
+	int first_hk_cpu = cpumask_first_and(housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
+					     cpu_online_mask);
+	ktime_t now;
+
+	/* Done by first online housekeeping CPU only */
+	if (likely(cpu != first_hk_cpu))
+		return;
+
+	/*
+	 * Since everything is updated by the first housekeeping CPU,
+	 * there is no need for complex synchronization.
+	 */
+	now = ktime_get();
+
+	/* Default is once per second */
+	if (likely((now - steal_mon.prev_time) < steal_mon.sampling_period_ms * NSEC_PER_MSEC))
+		return;
+
+	schedule_work_on(first_hk_cpu, &steal_mon.work);
+}
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c0fbfb04eda3..337357e48a83 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4157,6 +4157,7 @@ static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
 void sched_push_current_non_preferred_cpu(struct rq *rq);
 void sched_init_steal_monitor(void);
 void sched_steal_detection_work(struct work_struct *work);
+void sched_trigger_steal_computation(int cpu);
 #else
 static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
 {
@@ -4165,6 +4166,7 @@ static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
 
 static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
 static inline void sched_init_steal_monitor(void) { }
+static inline void sched_trigger_steal_computation(int cpu) { }
 #endif
 
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 15/17] sched/core: Handle steal values and mark CPUs as preferred
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (13 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 14/17] sched/core: Compute steal values at regular intervals Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 16/17] sched/core: Mark the direction of steal values to avoid oscillations Shrikanth Hegde
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

This is the main periodic work which handles the steal time values.

- Compute the steal time by looking CPUTIME_STEAL across all online CPUs

- Compute steal ratio. It is multiplied by 100 to handle the fractional
  values.

- If the steal time higher than threshold, reduce the number of preferred
  CPUs by 1 core. The last core in the intersection of online and 
  preferred CPUs will be marked as non-preferred.
  Ensure at least one core is left as preferred always.

- If the steal time lower than threshold, increase the number of preferred
  CPUs by 1 core. First online core which is not in cpu_preferred_mask will
  be marked as preferred.
  If all cores are aleady set to preferred, bail out.

Increase/Decrease may need to modify the splicing across NUMA nodes. It is
being kept simple for now.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c | 52 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1c6fcf1ae4fe..6e2b733adf45 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11349,15 +11349,65 @@ void sched_init_steal_monitor(void)
 	steal_mon.sampling_period_ms  = 1000;		/* once per second */
 }
 
-/* This is only a skeleton. Subsequent patches introduce more of it */
 void sched_steal_detection_work(struct work_struct *work)
 {
 	struct steal_monitor_t *sm = container_of(work, struct steal_monitor_t, work);
+	int this_cpu = raw_smp_processor_id();
+	u64 delta_steal, delta_ns, steal = 0;
+	u64 steal_ratio;
 	ktime_t now;
+	int tmp_cpu;
+
+	for_each_cpu(tmp_cpu, cpu_online_mask)
+		steal += kcpustat_cpu(tmp_cpu).cpustat[CPUTIME_STEAL];
 
 	/* Update the prev_time for next iteration */
 	now = ktime_get();
+	delta_steal = steal > sm->prev_steal ? steal - sm->prev_steal : 0;
+	delta_ns = max_t(u64, ktime_to_ns(ktime_sub(now, sm->prev_time)), 1);
+
 	sm->prev_time = now;
+	sm->prev_steal = steal;
+
+#ifdef CONFIG_SCHED_SMT
+	/* Multiply by 100 to consider the fractional values of steal time */
+	steal_ratio = (delta_steal * 100 * 100) / (delta_ns * num_online_cpus());
+
+	/* If the steal time values are high, reduce one core from preferred CPUs */
+	if (steal_ratio > sm->high_threshold) {
+		int last_cpu;
+
+		cpumask_and(sm->tmp_mask, cpu_online_mask, cpu_preferred_mask);
+		last_cpu = cpumask_last(sm->tmp_mask);
+
+		/*
+		 * If the core belongs to the housekeeping CPUs, no action is
+		 * taken. This always leaves at least one core preferred,
+		 * ensuring that some CPUs are available to run tasks.
+		 */
+		if (cpumask_equal(cpu_smt_mask(last_cpu), cpu_smt_mask(this_cpu)))
+			return;
+
+		for_each_cpu(tmp_cpu, cpu_smt_mask(last_cpu)) {
+			set_cpu_preferred(tmp_cpu, false);
+			if (tick_nohz_full_cpu(tmp_cpu))
+				tick_nohz_dep_set_cpu(tmp_cpu, TICK_DEP_BIT_SCHED);
+		}
+	}
+
+	/* If the steal time values are low, increase one core as preferred CPUs */
+	if (steal_ratio < sm->low_threshold) {
+		int first_cpu;
+
+		first_cpu = cpumask_first_andnot(cpu_online_mask, cpu_preferred_mask);
+		/* All CPUs are preferred. Nothing to increase further */
+		if (first_cpu >= nr_cpu_ids)
+			return;
+
+		for_each_cpu(tmp_cpu, cpu_smt_mask(first_cpu))
+			set_cpu_preferred(tmp_cpu, true);
+	}
+#endif
 }
 
 void sched_trigger_steal_computation(int cpu)
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 16/17] sched/core: Mark the direction of steal values to avoid oscillations
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (14 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 15/17] sched/core: Handle steal values and mark CPUs as preferred Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 17/17] sched/debug: Add debug knobs for steal monitor Shrikanth Hegde
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

Cache the previous decision on steal time, so that the preferred CPUs
are increased/decreased only on consecutive high or consecutive low
readings.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6e2b733adf45..cb9110f95ebf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11374,7 +11374,7 @@ void sched_steal_detection_work(struct work_struct *work)
 	steal_ratio = (delta_steal * 100 * 100) / (delta_ns * num_online_cpus());
 
 	/* If the steal time values are high, reduce one core from preferred CPUs */
-	if (steal_ratio > sm->high_threshold) {
+	if (sm->previous_decision == 1 && steal_ratio > sm->high_threshold) {
 		int last_cpu;
 
 		cpumask_and(sm->tmp_mask, cpu_online_mask, cpu_preferred_mask);
@@ -11396,7 +11396,7 @@ void sched_steal_detection_work(struct work_struct *work)
 	}
 
 	/* If the steal time values are low, increase one core as preferred CPUs */
-	if (steal_ratio < sm->low_threshold) {
+	if (sm->previous_decision == -1 && steal_ratio < sm->low_threshold) {
 		int first_cpu;
 
 		first_cpu = cpumask_first_andnot(cpu_online_mask, cpu_preferred_mask);
@@ -11407,6 +11407,14 @@ void sched_steal_detection_work(struct work_struct *work)
 		for_each_cpu(tmp_cpu, cpu_smt_mask(first_cpu))
 			set_cpu_preferred(tmp_cpu, true);
 	}
+
+	/* Mark the direction. This helps avoid ping-ponging */
+	if (steal_ratio > sm->high_threshold)
+		sm->previous_decision = 1;
+	else if (steal_ratio < sm->low_threshold)
+		sm->previous_decision = -1;
+	else
+		sm->previous_decision = 0;
 #endif
 }
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 17/17] sched/debug: Add debug knobs for steal monitor
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (15 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 16/17] sched/core: Mark the direction of steal values to avoid oscillations Shrikanth Hegde
@ 2026-04-07 19:19 ` Shrikanth Hegde
  2026-04-07 19:50 ` [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
  2026-04-08 10:14 ` Hillf Danton
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

Add three debug knobs:

steal_mon_period - sampling period in milliseconds.
steal_mon_low - lower threshold value (specify percentage * 100).
steal_mon_high - higher threshold value (specify percentage * 100).

Refer to Documentation/scheduler/sched-debug.rst for detailed info.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/scheduler/sched-debug.rst | 27 +++++++++++++++++++++++++
 kernel/sched/debug.c                    |  6 ++++++
 kernel/sched/sched.h                    |  2 ++
 3 files changed, 35 insertions(+)

diff --git a/Documentation/scheduler/sched-debug.rst b/Documentation/scheduler/sched-debug.rst
index b5a92a39eccd..288cd2c63224 100644
--- a/Documentation/scheduler/sched-debug.rst
+++ b/Documentation/scheduler/sched-debug.rst
@@ -52,3 +52,30 @@ rate for each task.
 
 ``scan_size_mb`` is how many megabytes worth of pages are scanned for
 a given scan.
+
+==================================
+Tunables for generic steal monitor
+==================================
+
+Generic steal time monitor can be enabled by selecting STEAL_MONITOR in
+sched features. It is disabled by default.
+
+steal_mon_period - sampling period in milliseconds.
+How often sampling of steal values happens. This controls how fast the
+scheduler acts on changes to steal time values.
+The default value is 1000 milliseconds.
+
+steal_mon_low - lower threshold value in percentage * 100.
+This determines what values should be considered as nil/no steal.
+When the scheduler sees steal time below this value, it will try to
+increase the preferred CPUs by 1 core. Zero causes too much oscillation.
+The default value is 200, i.e. 2% steal is the low threshold.
+
+steal_mon_high - higher threshold value in percentage * 100.
+This determines what values should be considered as high steal.
+When the scheduler sees steal time higher than this value, it will
+reduce the preferred CPUs by 1 core.
+The default value is 500, i.e. 5% steal is the high threshold.
+
+Note: when steal values are between the high and low thresholds, no
+action is taken by the scheduler. This avoids excessive oscillation.
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 482c86a0ff80..9a6c1ada2cec 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -612,6 +612,12 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("migration_cost_ns", 0644, debugfs_sched, &sysctl_sched_migration_cost);
 	debugfs_create_u32("nr_migrate", 0644, debugfs_sched, &sysctl_sched_nr_migrate);
 
+#ifdef CONFIG_PARAVIRT
+	debugfs_create_u32("steal_mon_low", 0644, debugfs_sched, &steal_mon.low_threshold);
+	debugfs_create_u32("steal_mon_high", 0644, debugfs_sched, &steal_mon.high_threshold);
+	debugfs_create_u32("steal_mon_period", 0644, debugfs_sched, &steal_mon.sampling_period_ms);
+#endif
+
 	sched_domains_mutex_lock();
 	update_sched_domain_debugfs();
 	sched_domains_mutex_unlock();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 337357e48a83..850d944b22f4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4149,6 +4149,8 @@ struct steal_monitor_t {
 	unsigned int sampling_period_ms;
 };
 
+extern struct steal_monitor_t steal_mon;
+
 static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
 {
 	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (16 preceding siblings ...)
  2026-04-07 19:19 ` [PATCH v2 17/17] sched/debug: Add debug knobs for steal monitor Shrikanth Hegde
@ 2026-04-07 19:50 ` Shrikanth Hegde
  2026-04-08 10:14 ` Hillf Danton
  18 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:50 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
	dietmar.eggemann, mgorman, bsegall, maddy, srikar, hdanton,
	chleroy, vineeth, joelagnelf


> I will attach the irqbalance patch which detects the changes in this
> mask and re-adjusts the irq affinities. Series doesn't address when
> irqbalance=n. Assuming many distros have irqbalance=y by default.
> 

Subject: [PATCH] irqbalance: Check for changes in cpu_preferred_mask

---
  cputree.c    | 28 +++++++++++++++++++++++++++-
  irqbalance.c |  2 ++
  irqbalance.h |  1 +
  3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/cputree.c b/cputree.c
index 9baa264..1db3422 100644
--- a/cputree.c
+++ b/cputree.c
@@ -56,6 +56,11 @@ cpumask_t banned_cpus;
  
  cpumask_t cpu_online_map;
  
+/* This can change dynamically. If any change in the mask is
+ * detected, trigger a rebuild.
+ */
+cpumask_t cpu_preferred_mask;
+
  /*
     it's convenient to have the complement of banned_cpus available so that
     the AND operator can be used to mask out unwanted cpus
@@ -506,15 +511,36 @@ void clear_work_stats(void)
  	for_each_object(numa_nodes, clear_obj_stats, NULL);
  }
  
+void parse_preferred_cpus(void)
+{
+	cpumask_t preferred;
+	char *path = NULL;
+
+	path = "/sys/devices/system/cpu/preferred";
+	cpus_clear(preferred);
+	process_one_line(path, get_mask_from_cpulist, &preferred);
+
+	/* Did anything change compared to the previous scan? */
+	if (!cpus_equal(preferred, cpu_preferred_mask)) {
+		log(TO_CONSOLE, LOG_INFO, "cpu preferred mask changed\n");
+		need_rebuild = 1;
+	}
+
+	cpus_copy(cpu_preferred_mask, preferred);
+}
  
  void parse_cpu_tree(void)
  {
  	DIR *dir;
  	struct dirent *entry;
+	char buffer[4096];
  
  	setup_banned_cpus();
  
-	cpus_complement(unbanned_cpus, banned_cpus);
+	cpus_andnot(unbanned_cpus, cpu_preferred_mask, banned_cpus);
+
+	cpumask_scnprintf(buffer, 4096, unbanned_cpus);
+	log(TO_CONSOLE, LOG_INFO, "Unbanned CPUs: %s\n", buffer);
  
  	dir = opendir("/sys/devices/system/cpu");
  	if (!dir)
diff --git a/irqbalance.c b/irqbalance.c
index f80244c..f3d46b8 100644
--- a/irqbalance.c
+++ b/irqbalance.c
@@ -229,6 +229,7 @@ static void parse_command_line(int argc, char **argv)
  static void build_object_tree(void)
  {
  	build_numa_node_list();
+	parse_preferred_cpus();
  	parse_cpu_tree();
  	rebuild_irq_db();
  }
@@ -275,6 +276,7 @@ gboolean scan(gpointer data __attribute__((unused)))
  	log(TO_CONSOLE, LOG_INFO, "\n\n\n-----------------------------------------------------------------------------\n");
  	clear_work_stats();
  	parse_proc_interrupts();
+	parse_preferred_cpus();
  
  
  	/* cope with cpu hotplug -- detected during /proc/interrupts parsing */
diff --git a/irqbalance.h b/irqbalance.h
index 47e40cc..593b183 100644
--- a/irqbalance.h
+++ b/irqbalance.h
@@ -57,6 +57,7 @@ void migrate_irq_obj(struct topo_obj *from, struct topo_obj *to, struct irq_info
  void activate_mappings(void);
  void clear_cpu_tree(void);
  void free_cpu_topo(gpointer data);
+extern void parse_preferred_cpus(void);
  /*===================NEW BALANCER FUNCTIONS============================*/
  
  /*
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 03/17] cpumask: Introduce cpu_preferred_mask
  2026-04-07 19:19 ` [PATCH v2 03/17] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
@ 2026-04-07 20:27   ` Yury Norov
  2026-04-08  9:16     ` Shrikanth Hegde
  0 siblings, 1 reply; 24+ messages in thread
From: Yury Norov @ 2026-04-07 20:27 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh, pbonzini, seanjc, kprateek.nayak, vschneid,
	iii, huschle, rostedt, dietmar.eggemann, mgorman, bsegall, maddy,
	srikar, hdanton, chleroy, vineeth, joelagnelf

On Wed, Apr 08, 2026 at 12:49:36AM +0530, Shrikanth Hegde wrote:
> This patch does
> - Declare and Define cpu_preferred_mask.
> - Get/Set helpers for it.
> 
> Values are set/cleared by the scheduler based on the steal time values.
> 
> A CPU is set to preferred when it comes online. Later it may be
> marked as non-preferred depending on steal time values with 
> STEAL_MONITOR enabled.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  include/linux/cpumask.h | 22 ++++++++++++++++++++++
>  kernel/cpu.c            |  6 ++++++
>  kernel/sched/core.c     |  5 +++++
>  3 files changed, 33 insertions(+)
> 
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 80211900f373..80c5cc13b8ad 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -1296,6 +1296,28 @@ static __always_inline bool cpu_dying(unsigned int cpu)
>  
>  #endif /* NR_CPUS > 1 */
>  
> +/*
> + * All related wrappers kept together to avoid too many ifdefs
> + * See Documentation/scheduler/sched-arch.rst for details
> + */
> +#ifdef CONFIG_PARAVIRT
> +extern struct cpumask __cpu_preferred_mask;
> +#define cpu_preferred_mask    ((const struct cpumask *)&__cpu_preferred_mask)
> +#define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
> +
> +static __always_inline bool cpu_preferred(unsigned int cpu)
> +{
> +	return cpumask_test_cpu(cpu, cpu_preferred_mask);
> +}
> +#else
> +static __always_inline bool cpu_preferred(unsigned int cpu)
> +{
> +	return true;
> +}

This doesn't look consistent, probably not correct. What if
I pass an offline CPU here? Is it still preferred?

Later you say that preferred CPU is online + STEAL-approved one.
So in non-paravirtualized case, I believe, you should consider
that only online CPUs are preferred. What about dying CPUs? Can
they be preferred too?

At least, please run cpumask_check() on the argument.

There's a top-comment describing all the system cpumasks. Except for
cpu_dying, it's nice and complete. Can you describe your new creature
there?

Finally, I don't think that __cpu_preferred_mask should depend on 
PARAVIRT config. Consider cpu_present_mask. It mirrors cpu_possible_mask
if hotplug is disabled, but it's still a real mask even in that case.
The way you're doing it, you spread CONFIG_PARAVIRT ifdefery pretty
much anywhere where people might want to use this new mask for anything
except for testing a bit.

Thanks,
Yury

> +static __always_inline void set_cpu_preferred(unsigned int cpu, bool preferred) { }
> +#endif
> +
>  #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
>  
>  #if NR_CPUS <= BITS_PER_LONG
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index bc4f7a9ba64e..2d4d037680d4 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -3137,6 +3137,12 @@ void set_cpu_online(unsigned int cpu, bool online)
>  		if (cpumask_test_and_clear_cpu(cpu, &__cpu_online_mask))
>  			atomic_dec(&__num_online_cpus);
>  	}
> +
> +	/*
> +	 * An online CPU is by default assumed to be preferred
> +	 * Unitl STEAL_MONITOR changes it
> +	 */
> +	set_cpu_preferred(cpu, online);
>  }
>  
>  /*
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f351296922ac..7ea05a7a717b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11228,3 +11228,8 @@ void sched_change_end(struct sched_change_ctx *ctx)
>  		p->sched_class->prio_changed(rq, p, ctx->prio);
>  	}
>  }
> +
> +#ifdef CONFIG_PARAVIRT
> +struct cpumask __cpu_preferred_mask __read_mostly;
> +EXPORT_SYMBOL(__cpu_preferred_mask);
> +#endif
> -- 
> 2.47.3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 05/17] sched/core: allow only preferred CPUs in is_cpu_allowed
  2026-04-07 19:19 ` [PATCH v2 05/17] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
@ 2026-04-08  1:05   ` Yury Norov
  2026-04-08 12:56     ` Shrikanth Hegde
  0 siblings, 1 reply; 24+ messages in thread
From: Yury Norov @ 2026-04-08  1:05 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh, pbonzini, seanjc, kprateek.nayak, vschneid,
	iii, huschle, rostedt, dietmar.eggemann, mgorman, bsegall, maddy,
	srikar, hdanton, chleroy, vineeth, joelagnelf

On Wed, Apr 08, 2026 at 12:49:38AM +0530, Shrikanth Hegde wrote:
> 
> When possible, choose a preferred CPU to pick.
> 
> The push task mechanism uses the stopper thread, which is going to call
> select_fallback_rq and use this mechanism to pick only a preferred CPU.
> 
> When a task is affined only to non-preferred CPUs it should continue to
> run there. Detect that by checking if cpus_ptr and cpu_preferred_mask
> intersect or not.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  kernel/sched/core.c  | 17 ++++++++++++++---
>  kernel/sched/sched.h | 12 ++++++++++++
>  2 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7ea05a7a717b..336e7c694eb7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2463,9 +2463,16 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  	if (is_migration_disabled(p))
>  		return cpu_online(cpu);
>  
> -	/* Non kernel threads are not allowed during either online or offline. */
> -	if (!(p->flags & PF_KTHREAD))
> -		return cpu_active(cpu);
> +	/*
> +	 * Non kernel threads are not allowed during either online or offline.
> +	 * Ensure it is a preferred CPU to avoid further contention
> +	 */
> +	if (!(p->flags & PF_KTHREAD)) {
> +		if (!cpu_active(cpu))
> +			return false;
> +		if (!cpu_preferred(cpu) && task_can_run_on_preferred_cpu(p))
> +			return false;
> +	}
>  
>  	/* KTHREAD_IS_PER_CPU is always allowed. */
>  	if (kthread_is_per_cpu(p))
> @@ -2475,6 +2482,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  	if (cpu_dying(cpu))
>  		return false;
>  
> +	/* Try on preferred CPU first */
> +	if (!cpu_preferred(cpu) && task_can_run_on_preferred_cpu(p))
> +		return false;

You repeat this for the 2nd time. The cpu_preferred() call should go
inside task_can_run_on_preferred_cpu().

And can you please pick some shorter name?

> +
>  	/* But are allowed during online. */
>  	return cpu_online(cpu);
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 88e0c93b9e21..7271af2ca64f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -4130,4 +4130,16 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
>  
>  #include "ext.h"
>  
> +#ifdef CONFIG_PARAVIRT
> +static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
> +{
> +	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);

This makes is_cpu_allowed() O(N). Even if CONFIG_PARAVIRT is enabled,
I think some people would prefer to avoid this. Also, select_fallback_rq()
calls it in a loop, and this makes it O(N^2).


       /* Any allowed, online CPU? */
       for_each_cpu(dest_cpu, p->cpus_ptr) {
               if (!is_cpu_allowed(p, dest_cpu))
                       continue;

               goto out;
       }

You can keep it O(N):
       for_each_cpu_and(dest_cpu, p->cpus_ptr, cpu_preferred_mask) {
                ...
       }

Not sure how critical that path is, but this looks suspicious.

> +}
> +#else
> +static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
> +{
> +	return true;
> +}
> +#endif

Same comment as in patch 3. I believe, it's worth to declare cpu_preferred_mask
unrelated to CONFIG_PARAVIRT, so that you'll not have to spread this
ifdefery around. 

> +
>  #endif /* _KERNEL_SCHED_SCHED_H */
> -- 
> 2.47.3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 03/17] cpumask: Introduce cpu_preferred_mask
  2026-04-07 20:27   ` Yury Norov
@ 2026-04-08  9:16     ` Shrikanth Hegde
  0 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-08  9:16 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh, pbonzini, seanjc, kprateek.nayak, vschneid,
	iii, huschle, rostedt, dietmar.eggemann, mgorman, bsegall, maddy,
	srikar, hdanton, chleroy, vineeth, joelagnelf

Hi Yury. Thanks for going through the series.

On 4/8/26 1:57 AM, Yury Norov wrote:
> On Wed, Apr 08, 2026 at 12:49:36AM +0530, Shrikanth Hegde wrote:
>> This patch does
>> - Declare and Define cpu_preferred_mask.
>> - Get/Set helpers for it.
>>
>> Values are set/cleared by the scheduler based on the steal time values.
>>
>> A CPU is set to preferred when it comes online. Later it may be
>> marked as non-preferred depending on steal time values with
>> STEAL_MONITOR enabled.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   include/linux/cpumask.h | 22 ++++++++++++++++++++++
>>   kernel/cpu.c            |  6 ++++++
>>   kernel/sched/core.c     |  5 +++++
>>   3 files changed, 33 insertions(+)
>>
>> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
>> index 80211900f373..80c5cc13b8ad 100644
>> --- a/include/linux/cpumask.h
>> +++ b/include/linux/cpumask.h
>> @@ -1296,6 +1296,28 @@ static __always_inline bool cpu_dying(unsigned int cpu)
>>   
>>   #endif /* NR_CPUS > 1 */
>>   
>> +/*
>> + * All related wrappers kept together to avoid too many ifdefs
>> + * See Documentation/scheduler/sched-arch.rst for details
>> + */
>> +#ifdef CONFIG_PARAVIRT
>> +extern struct cpumask __cpu_preferred_mask;
>> +#define cpu_preferred_mask    ((const struct cpumask *)&__cpu_preferred_mask)
>> +#define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
>> +
>> +static __always_inline bool cpu_preferred(unsigned int cpu)
>> +{
>> +	return cpumask_test_cpu(cpu, cpu_preferred_mask);
>> +}
>> +#else
>> +static __always_inline bool cpu_preferred(unsigned int cpu)
>> +{
>> +	return true;
>> +}
> 
> This doesn't look consistent, probably not correct. What if
> I pass an offline CPU here? Is it still preferred?
> 

The preferred CPU state follows the online state. This was done by the change
below in set_cpu_online. So when a CPU goes offline, it will be removed from
the preferred mask too.

The design principle I wanted is that preferred is always a subset of online:

preferred <= online <= possible.

> Later you say that preferred CPU is online + STEAL-approved one.
> So in non-paravirtualized case, I believe, you should consider

There it would clearly be the same as the online CPUs.

> that only online CPUs are preferred. What about dying CPUs? Can
> they be preferred too?

When there is no CPU hotplug, preferred will be a subset of online.

Let's see the different cases with CPU hotplug,
when STEAL_MONITOR is on and there is high steal time.

Let's say it is a 600 CPU system with SMT.

Case 1:
CPU 500 was offline. It would have its preferred bit=0. After a while
there was high steal time, and preferred_cpus = <0-399>. Once the contention
was gone, since it is using cpu_smt_mask, it would set 500's preferred bit=1, though
it is offline.

Case 2:
All online CPUs were preferred. 500 was offline. After a while there was
high steal, and while iterating through cpu_smt_mask, after say 499 was done,
500 was brought online. That would set it in preferred.
Since it was part of the mask, 500 will be marked preferred=0.
That's ok. It was meant to be anyway.

Case 3:
All online CPUs were preferred. 500 was offline. After a while there was high steal,
and preferred_cpus = <0-399>, and 500 was brought online. That would set it
in preferred. In the next cycle, bringing it online causes more steal time, and since it is
the last CPU in the mask, it will be marked as non-preferred. That's ok.

So Case 1 is the one where the construct is broken.
This is solvable by checking the online state in the steal time handling code.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d3b2bcb6008c..bad091f1f604 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11329,7 +11329,7 @@ void sched_steal_detection_work(struct work_struct *work)
                 if (cpumask_equal(cpu_smt_mask(last_cpu), cpu_smt_mask(this_cpu)))
                         return;
  
-               for_each_cpu(tmp_cpu, cpu_smt_mask(last_cpu)) {
+               for_each_cpu_and(tmp_cpu, cpu_smt_mask(last_cpu), cpu_online_mask) {
                         set_cpu_preferred(tmp_cpu, false);
                         if (tick_nohz_full_cpu(tmp_cpu))
                                 tick_nohz_dep_set_cpu(tmp_cpu, TICK_DEP_BIT_SCHED);
@@ -11345,7 +11345,7 @@ void sched_steal_detection_work(struct work_struct *work)
                 if (first_cpu >= nr_cpu_ids)
                         return;
  
-               for_each_cpu(tmp_cpu, cpu_smt_mask(first_cpu))
+               for_each_cpu_and(tmp_cpu, cpu_smt_mask(first_cpu), cpu_online_mask)
                         set_cpu_preferred(tmp_cpu, true);
         }


I had thought of this scenario. I hadn't seen it from a consistency point of
view. It should be consistent since it is exposed to the user.

Functionality-wise it was okay, since the current code has enough checks to
schedule only on online CPUs. Even is_cpu_allowed returns true only
if the CPU is online. But I get the point, and the above diff should address it.

> 
> At least, please run cpumask_check() on the argument.

It is set either within set_cpu_online or in PATCH 15/17 by iterating through
cpu_smt_mask. That should always yield cpu < nr_cpu_ids.

I didn't get why cpumask_check is needed again.

> 
> There's a top-comment describing all the system cpumasks. Except for
> cpu_dying, it's nice and complete. Can you describe your new creature
> there?

Ok. I can add a comment there.

> 
> Finally, I don't think that __cpu_preferred_mask should depend on
> PARAVIRT config. Consider cpu_present_mask. It mirrors cpu_possible_mask
> if hotplug is disabled, but it's still a real mask even in that case.
> The way you're doing it, you spread CONFIG_PARAVIRT ifdefery pretty
> much anywhere where people might want to use this new mask for anything
> except for testing a bit.
> 

One concern you had raised earlier was bloating of the code for systems with
CONFIG_PARAVIRT=n.

Maybe in some of the hotpaths we could do an IS_ENABLED(CONFIG_PARAVIRT) check,
and that should be ok?

If so, we can get rid of a lot of this ifdefery.

cpu_preferred(cpu) is a bit check and shouldn't be that expensive.

> Thanks,
> Yury
> 
>> +static __always_inline void set_cpu_preferred(unsigned int cpu, bool preferred) { }
>> +#endif
>> +
>>   #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
>>   
>>   #if NR_CPUS <= BITS_PER_LONG
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index bc4f7a9ba64e..2d4d037680d4 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -3137,6 +3137,12 @@ void set_cpu_online(unsigned int cpu, bool online)
>>   		if (cpumask_test_and_clear_cpu(cpu, &__cpu_online_mask))
>>   			atomic_dec(&__num_online_cpus);
>>   	}
>> +
>> +	/*
>> +	 * An online CPU is by default assumed to be preferred
>> +	 * Until STEAL_MONITOR changes it
>> +	 */
>> +	set_cpu_preferred(cpu, online);
>>   }

Here, preferred follows the online state.

>>   
>>   /*
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index f351296922ac..7ea05a7a717b 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -11228,3 +11228,8 @@ void sched_change_end(struct sched_change_ctx *ctx)
>>   		p->sched_class->prio_changed(rq, p, ctx->prio);
>>   	}
>>   }
>> +
>> +#ifdef CONFIG_PARAVIRT
>> +struct cpumask __cpu_preferred_mask __read_mostly;
>> +EXPORT_SYMBOL(__cpu_preferred_mask);
>> +#endif
>> -- 
>> 2.47.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff
  2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
                   ` (17 preceding siblings ...)
  2026-04-07 19:50 ` [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
@ 2026-04-08 10:14 ` Hillf Danton
  18 siblings, 0 replies; 24+ messages in thread
From: Hillf Danton @ 2026-04-08 10:14 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, peterz, Sean Christopherson, vincent.guittot,
	yury.norov, kprateek.nayak

On Wed,  8 Apr 2026 00:49:33 +0530 Shrikanth Hegde wrote:
> In the virtualized environment, often there is vCPU overcommit. i.e. sum
> of CPUs in all guests(virtual CPU aka vCPU) exceed the underlying physical CPU
> (managed by host aka pCPU). 
> 
> When many guests ask for CPU at the same time, host/hypervisor would
> fail to satisfy that ask and has to preempt one vCPU to run another. If
> the guests co-ordinate and ask for less CPU overall, that reduces the
> vCPU threads in host, and vCPU preemption goes down.
> 
> Steal time is an indication of the underlying contention. Based on that,
> if the guests reduce the vCPU request that proportionally, it would achieve
> the desired outcome.
> 
> The added advantage is, it would reduce the lockholder preemption.
> A vCPU may be holding a spinlock, but could still get preempted. Such cases
> will reduce since there is less vCPU preemption and lockholder will run to
> completion since it would have disabled preemption in the guest.
> Workload could run with time-slice extension to reduce lockholder
> preemption for userspace locks, and this could help reduce lockholder
> preemption even for kernelspace due to vCPU preemption.
> 
> Currently there is no infra in scheduler which moves away the task from
> some CPUs without breaking the userspace affinities. CPU hotplug,
> isolated CPUset would achieve moving the task off some CPUs at runtime,
> But if some task is affined to specific CPUs, taking those CPUs away
> results in affinity list being reset. That breaks the user affinities,
> Since this is driven by scheduler rather than user doing so, can't do
> that. So need a new infra. It would be better if it is lightweight.
> 
> Core idea is:
> - Maintain set of CPUs which can be used by workload. It is denoted as
>   cpu_preferred_mask
> - Periodically compute the steal time. If steal time is high/low based
>   on the thresholds, either reduce/increase the preferred CPUs.
> - If a CPU is marked as non-preferred, push the task running on it if
>   possible.
> - Use this CPU state in wakeup and load balance to ensure tasks run
>   within preferred CPUs.
> 
> For the host kernel, there is no steal time, so no changes to its preferred
> CPUs. So series would affect only the guest kernels.
> 
Changes are added to the guest in order to detect if the pCPU is overloaded, and
if that is true (I mean it is a layer violation), why not ask the pCPU governor,
the hypervisor, to monitor the loads on the pCPUs and migrate vCPUs back and
forth if necessary?

> Current series implements a simple steal time monitor, which
> reduces/increases the number of cores by 1 depending on the steal time.
> It also implements a very simple method to avoid oscillations. If there
> is a need for more complex mechanisms for these, then doing them
> via steal time governors may be an idea. One needs to enable the
> feature STEAL_MONITOR to see the steal time values being processed and
> preferred CPUs being set correctly. In most of the systems where there
> is no steal time, preferred CPUs will be same as online CPUs.
> 
> I will attach the irqbalance patch which detects the changes in this
> mask and re-adjusts the irq affinities. Series doesn't address when
> irqbalance=n. Assuming many distros have irqbalance=y by default.
> 
> Discussion at LPC 2025:
> https://www.youtube.com/watch?v=sZKpHVUUy1g
> 
> *** Please provide your suggestions and comments ***
> 
> =====================================================================
> Patch Layout:
> PATCH    01: Remove stale schedstats. Independent of the series.
> PATCH 02-04: Introduce cpu_preferred_mask.
> PATCH 05-09: Make scheduler aware of this mask.
> PATCH    10: Push the current task in sched_tick if cpu is non-preferred.
> PATCH    11: Add a new schedstat.
> PATCH    12: Add a new sched feature: STEAL_MONITOR
> PATCH 13-17: Periodically calculating steal time and take appropriate
>              action.
> 
> ======================================================================
> Performance Numbers:
> baseline: tip/master at 8a5f70eb7e4f (Merge branch into tip/master: 'x86/tdx')
> 
> on PowerPC: powerVM hypervisor:
> +++++++++
> Daytrader
> +++++++++ 
> It is a database workload which simulates stock live trading.
> There are two VMs. The same workload is run in both VMs at the same time.
> VM1 is bigger than VM2.
> 
> Note: VM1 sees 20% steal time, and VM2 sees 10% steal time with
> baseline.
> 
> 
> (with series: STEAL_MONITOR=y and Default debug steal_mon values)
> On VM1:
> 			baseline		with_series
> Throughput		1x			1.3x 
> On VM2:
>                         baseline                with_series
> Throughput              1x                      1.1x
> 
> 
> (with series: STEAL_MONITOR=y and Period 100, High 200, Low 100)
> On VM1:
>                         baseline                with_series
> Throughput:             1x                      1.45x
> On VM2:
>                         baseline                with_series
> Throughput:             1x                      1.13x
> 
> Verdict: Shows good improvement with default values. Even better when
> tuned the debug knobs.
> 
> +++++++++
> Hackbench 
> +++++++++
> (with series: STEAL_MONITOR=y and Period 100, High 200, Low 100)
> On VM1:
> 			baseline		with_series
> 10 groups		10.3			 8.5
> 30 groups		40.8			25.5
> 60 groups		77.2			47.8
> 
> on VM2:
> 			baseline		with_series
> 10 groups		 8.4			 7.5
> 30 groups		25.3			19.8
> 60 groups		41.7			36.3
> 
> Verdict: With tuned values, shows very good improvement.
> 
> ==========================================================================
> Since v1:
> - A new name - Preferred CPUs and cpu_preferred_mask
>   I had initially used the name as "Usable CPUs", but this seemed
>   better. I thought of pv_preferred too, but left it as it could be too long.
> 
> - Arch independent code. Everything happens in scheduler. steal time is
>   generic construct and this would help avoid each architecture doing the
>   same thing more or less. Dropped powerpc code.
> 
> - Removed hacks around wakeups. Made it as part of available_idle_cpu
>   which take care of many of the wakeup decisions. same for rt code.
> 
> - Implement a work function to calculate the steal times and enforce the
>   policy decisions. This ensures sched_tick doesn't suffer any major
>   latency.
> 
> - Steal time computation is gated with sched feature STEAL_MONITOR to
>   avoid any overheads in systems which don't have vCPU overcommit.
>   Feature is disabled by default.
> 
> - CPU_CAPACITY=1 was not considered since one needs the state of all CPUs
>   which have this special value. Computing that in hotpath is not ideal.
> 
> - Using cpuset was not considered since it was quite tricky, given there
>   is different versions and cgroups is natively user driven.
> 
> v1: https://lore.kernel.org/all/20251119124449.1149616-1-sshegde@linux.ibm.com/#t
> earlier versions: https://lore.kernel.org/all/236f4925-dd3c-41ef-be04-47708c9ce129@linux.ibm.com/
> 
> TODO:
> - Splicing of CPUs across NUMA nodes when CPUs aren't split equally.
> - irq affinity when irqbalance=n. Not sure if this is worth it.
> - Avoid running any unbound housekeeping work on non-preferred CPUs 
>   such as in find_new_ilb. Tried, but showed a little regression in 
>   no noise case. So didn't consider.
> - This currently works for kernel built with CONFIG_SCHED_SMT. Didn't
>   want to sprinkle too many ifdefs there. Not sure if there is any
>   system which needs this feature but !SMT. If so, let me know.
> Seeing those ifdefs makes me wonder: maybe we could clean up
> CONFIG_SCHED_SMT with cpumask_of(cpu) in case of !SMT?
> - Performance numbers in KVM with x86, s390. 
> 
> Sorry for sending it this late. This series is the one which is meant
> for discussion at OSPM 2026.
> 
> 
> Shrikanth Hegde (17):
>   sched/debug: Remove unused schedstats
>   sched/docs: Document cpu_preferred_mask and Preferred CPU concept
>   cpumask: Introduce cpu_preferred_mask
>   sysfs: Add preferred CPU file
>   sched/core: allow only preferred CPUs in is_cpu_allowed
>   sched/fair: Select preferred CPU at wakeup when possible
>   sched/fair: load balance only among preferred CPUs
>   sched/rt: Select a preferred CPU for wakeup and pulling rt task
>   sched/core: Keep tick on non-preferred CPUs until tasks are out
>   sched/core: Push current task from non preferred CPU
>   sched/debug: Add migration stats due to non preferred CPUs
>   sched/feature: Add STEAL_MONITOR feature
>   sched/core: Introduce a simple steal monitor
>   sched/core: Compute steal values at regular intervals
>   sched/core: Handle steal values and mark CPUs as preferred
>   sched/core: Mark the direction of steal values to avoid oscillations
>   sched/debug: Add debug knobs for steal monitor
> 
>  .../ABI/testing/sysfs-devices-system-cpu      |  11 +
>  Documentation/scheduler/sched-arch.rst        |  48 ++++
>  Documentation/scheduler/sched-debug.rst       |  27 +++
>  drivers/base/cpu.c                            |  12 +
>  include/linux/cpumask.h                       |  22 ++
>  include/linux/sched.h                         |   4 +-
>  kernel/cpu.c                                  |   6 +
>  kernel/sched/core.c                           | 219 +++++++++++++++++-
>  kernel/sched/cpupri.c                         |   4 +
>  kernel/sched/debug.c                          |  10 +-
>  kernel/sched/fair.c                           |   8 +-
>  kernel/sched/features.h                       |   3 +
>  kernel/sched/rt.c                             |   4 +
>  kernel/sched/sched.h                          |  41 ++++
>  14 files changed, 409 insertions(+), 10 deletions(-)
> 
> -- 
> 2.47.3
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 05/17] sched/core: allow only preferred CPUs in is_cpu_allowed
  2026-04-08  1:05   ` Yury Norov
@ 2026-04-08 12:56     ` Shrikanth Hegde
  0 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-04-08 12:56 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh, pbonzini, seanjc, kprateek.nayak, vschneid,
	iii, huschle, rostedt, dietmar.eggemann, mgorman, bsegall, maddy,
	srikar, hdanton, chleroy, vineeth, joelagnelf

Hi Yury.

On 4/8/26 6:35 AM, Yury Norov wrote:
> On Wed, Apr 08, 2026 at 12:49:38AM +0530, Shrikanth Hegde wrote:
>>
>> When possible, choose a preferred CPU to pick.
>>
>> The push task mechanism uses the stopper thread, which is going to call
>> select_fallback_rq and use this mechanism to pick only a preferred CPU.
>>
>> When a task is affined only to non-preferred CPUs it should continue to
>> run there. Detect that by checking if cpus_ptr and cpu_preferred_mask
>> intersect or not.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   kernel/sched/core.c  | 17 ++++++++++++++---
>>   kernel/sched/sched.h | 12 ++++++++++++
>>   2 files changed, 26 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 7ea05a7a717b..336e7c694eb7 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2463,9 +2463,16 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>>   	if (is_migration_disabled(p))
>>   		return cpu_online(cpu);
>>   
>> -	/* Non kernel threads are not allowed during either online or offline. */
>> -	if (!(p->flags & PF_KTHREAD))
>> -		return cpu_active(cpu);
>> +	/*
>> +	 * Non kernel threads are not allowed during either online or offline.
>> +	 * Ensure it is a preferred CPU to avoid further contention
>> +	 */
>> +	if (!(p->flags & PF_KTHREAD)) {
>> +		if (!cpu_active(cpu))
>> +			return false;
>> +		if (!cpu_preferred(cpu) && task_can_run_on_preferred_cpu(p))
>> +			return false;
>> +	}
>>   
>>   	/* KTHREAD_IS_PER_CPU is always allowed. */
>>   	if (kthread_is_per_cpu(p))
>> @@ -2475,6 +2482,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>>   	if (cpu_dying(cpu))
>>   		return false;
>>   
>> +	/* Try on preferred CPU first */
>> +	if (!cpu_preferred(cpu) && task_can_run_on_preferred_cpu(p))
>> +		return false;
> 

The first one covers regular tasks; this one is for unbound kernel threads.
Both are needed, no?

> You repeat this for the 2nd time. The cpu_preferred() call should go
> inside task_can_run_on_preferred_cpu().

I want to keep the cpu_preferred() check first, the reason being that it
is inexpensive since it is a bit test. Only if it fails does one need to
bother with task_can_run_on_preferred_cpu(), which is O(N) as you said.

I am also using task_can_run_on_preferred_cpu() in the push task mechanism
(PATCH 10/17), where I reach it only on a non-preferred CPU.
So can I keep it as is?
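The ordering argued above can be sketched in userspace (toy 64-bit cpumask; the helper names are hypothetical stand-ins for the kernel's cpu_preferred()/cpumask_intersects(), not the actual implementation):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cpumask_t;	/* toy 64-CPU mask */

static bool cpu_preferred(cpumask_t preferred, int cpu)
{
	return (preferred >> cpu) & 1;		/* O(1) bit test */
}

/* Models cpumask_intersects(p->cpus_ptr, cpu_preferred_mask). */
static bool task_has_preferred_cpus(cpumask_t cpus_ptr, cpumask_t preferred)
{
	return (cpus_ptr & preferred) != 0;	/* O(N) on real cpumasks */
}

static bool is_cpu_allowed_sketch(cpumask_t cpus_ptr, cpumask_t preferred,
				  int cpu)
{
	/* Common case: the cheap bit test succeeds and we are done. */
	if (cpu_preferred(preferred, cpu))
		return true;
	/*
	 * Non-preferred CPU: allow it only when the task's affinity
	 * does not intersect the preferred mask at all.
	 */
	return !task_has_preferred_cpus(cpus_ptr, preferred);
}
```

With preferred CPUs 0-3, a task affined to 0-5 is allowed on CPU 2 but refused on CPU 4, while a task pinned only to CPUs 4-5 keeps running there.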

> 
> And can you please pick some shorter name?
> 

task_has_preferred_cpus?

>> +
>>   	/* But are allowed during online. */
>>   	return cpu_online(cpu);
>>   }
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 88e0c93b9e21..7271af2ca64f 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -4130,4 +4130,16 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
>>   
>>   #include "ext.h"
>>   
>> +#ifdef CONFIG_PARAVIRT
>> +static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
>> +{
>> +	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
> 
> This makes is_cpu_allowed() O(N). Even if CONFIG_PARAVIRT is enabled,
> I think some people would prefer to avoid this. Also, select_fallback_rq()
> calls it in a loop, and this makes it O(N^2).
> 
> 
>         /* Any allowed, online CPU? */
>         for_each_cpu(dest_cpu, p->cpus_ptr) {
>                 if (!is_cpu_allowed(p, dest_cpu))
>                         continue;
> 
>                 goto out;
>         }
> 
> You can keep it O(N):
>         for_each_cpu_and(dest_cpu, p->cpus_ptr, cpu_preferred_mask) {
>                  ...
>         }

This would leave tasks whose affinity covers only non-preferred CPUs without a CPU.

That breaks the following case:
600 CPUs, high steal time, hence the preferred CPUs are 0-399.
In that state, the user runs "taskset -c 500 <stress-ng>"; that task ends up
on the preferred CPUs, since its affinity gets reset in the switch
block later in select_fallback_rq().
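That stranding can be demonstrated with a scaled-down userspace model (toy 64-bit cpumask; cpus_visited_and() is a hypothetical stand-in for a for_each_cpu_and() walk, not kernel code):

```c
#include <stdint.h>

typedef uint64_t cpumask_t;	/* toy 64-CPU mask */

/* Count the CPUs a for_each_cpu_and(dest, a, b) style loop would visit. */
static int cpus_visited_and(cpumask_t a, cpumask_t b)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < 64; cpu++)
		if (((a & b) >> cpu) & 1)
			n++;
	return n;
}
```

With preferred CPUs 0-39 and a task pinned to CPU 50 ("taskset -c 50"), the intersection is empty, the loop visits no CPU, and select_fallback_rq() would fall through to the affinity-reset path; iterating over the full cpus_ptr instead still visits CPU 50.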

> 
> Not sure how critical that path is, but this looks suspicious.
> 

Fair point. But we come here only if cpu_preferred(cpu) == false.

When that happens, we expect the task running there to get preempted and
pushed to a preferred CPU. Once it has migrated to a preferred CPU, it
no longer suffers the additional overhead of task_can_run_on_preferred_cpu().

The case where it would be O(N^2) is tasks with explicit affinity
to non-preferred CPUs only. I see that as a rare case.

For the majority of the cases, it should still be O(N) since cpu_preferred(cpu)
will be true.


Here is benchmark data from a system with zero steal time.
It is a dedicated LPAR (VM).

| Test                            | Baseline | NO STEAL | %diff to base| STEAL     | %diff to base |
|                                 |          | MONITOR  |              | MONITOR   |               |
|---------------------------------|----------|----------|--------------|-----------|---------------|
| HackBench Process 20 groups     |     2.70 |  2.67    |      +1.11%  |   2.66    |       +1.48%  |
| HackBench Process 40 groups     |     5.31 |  5.26    |      +0.94%  |   5.30    |       +0.19%  |
| HackBench Process 60 groups     |     7.95 |  7.82    |      +1.64%  |   7.90    |       +0.63%  |
| HackBench thread 10 Time        |     1.61 |  1.59    |      +1.24%  |   1.56    |       +3.11%  |
| HackBench thread 20 Time        |     2.82 |  2.83    |      -0.35%  |   2.80    |       +0.71%  |
| HackBench Process(Pipe) 20 Time |     1.51 |  1.50    |      +0.66%  |   1.47    |       +2.65%  |
| HackBench Process(Pipe) 40 Time |     2.78 |  2.69    |      +3.24%  |   2.69    |       +3.24%  |
| HackBench Process(Pipe) 60 Time |     3.73 |  3.76    |      -0.80%  |   3.64    |       +2.41%  |
| HackBench thread(Pipe) 10 Time  |     0.94 |  0.91    |      +3.19%  |   0.91    |       +3.19%  |
| HackBench thread(Pipe) 20 Time  |     1.58 |  1.58    |      -0.00%  |   1.59    |       -0.63%  |




>> +}
>> +#else
>> +static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
>> +{
>> +	return true;
>> +}
>> +#endif
> 
> Same comment as in patch 3. I believe, it's worth to declare cpu_preferred_mask
> unrelated to CONFIG_PARAVIRT, so that you'll not have to spread this
> ifdefery around.
> 

Yes.

>> +
>>   #endif /* _KERNEL_SCHED_SCHED_H */
>> -- 
>> 2.47.3



end of thread, other threads:[~2026-04-08 12:56 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 01/17] sched/debug: Remove unused schedstats Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 02/17] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 03/17] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
2026-04-07 20:27   ` Yury Norov
2026-04-08  9:16     ` Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 04/17] sysfs: Add preferred CPU file Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 05/17] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
2026-04-08  1:05   ` Yury Norov
2026-04-08 12:56     ` Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 06/17] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 07/17] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 08/17] sched/rt: Select a preferred CPU for wakeup and pulling rt task Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 09/17] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 10/17] sched/core: Push current task from non preferred CPU Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 11/17] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 12/17] sched/feature: Add STEAL_MONITOR feature Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 13/17] sched/core: Introduce a simple steal monitor Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 14/17] sched/core: Compute steal values at regular intervals Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 15/17] sched/core: Handle steal values and mark CPUs as preferred Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 16/17] sched/core: Mark the direction of steal values to avoid oscillations Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 17/17] sched/debug: Add debug knobs for steal monitor Shrikanth Hegde
2026-04-07 19:50 ` [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-04-08 10:14 ` Hillf Danton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox