Linux Documentation


* [PATCH v5 20/24] virt/steal_monitor: Provide default method to inc/dec preferred CPUs
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

These methods will be used by the steal_monitor core in subsequent
patches. The default implementations are likely good enough for most archs.

decrease_preferred_cpus() - Called when there is high steal time. It needs
to decide which CPUs to mark as non-preferred and set that state.
increase_preferred_cpus() - Called when there is low steal time. It needs
to decide which CPUs to mark as preferred and set that state.

Default implementations:
decrease_preferred_cpus()
- Get the last CPU in cpu_preferred_mask.
- Check whether that CPU belongs to the first housekeeping core. If so,
  there is nothing to do. This keeps at least one core preferred, which
  is a safeguard for abnormal cases.
- If it is not the first housekeeping core, get its siblings and mark
  them as non-preferred. If they are nohz_full, enable the tick, since
  the push mechanism relies on sched_tick.

increase_preferred_cpus()
- Get the first active non-preferred CPU. This is likely in the set of
  CPUs most recently marked as non-preferred.
- If there is no such CPU, i.e. preferred is the same as active, there
  is nothing further to do.
- If there is, get the siblings of that core and mark them as preferred.
  Note that clearing the tick isn't needed, as that is handled via
  sched_can_stop_tick.

Using cores instead of individual CPUs gives better numbers, as SMT is
quite common and some hypervisors, such as PowerVM, do core scheduling.
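
The default policy described above can be modelled in userspace as
follows. This is only a minimal sketch under stated assumptions, not the
kernel code: CPU masks are plain 64-bit words, every core is assumed to
have THREADS_PER_CORE consecutive hardware threads, and core_mask(),
sketch_decrease() and sketch_increase() are hypothetical names standing
in for topology_sibling_cpumask() and the two __weak functions. Tick
handling for nohz_full CPUs and locking are left out.

```c
#include <stdint.h>

#define THREADS_PER_CORE 4	/* assumption: SMT4, sibling CPUs are consecutive */

/* Hypothetical stand-in for topology_sibling_cpumask(): all threads of cpu's core. */
static uint64_t core_mask(int cpu)
{
	int first = cpu - (cpu % THREADS_PER_CORE);

	return ((UINT64_C(1) << THREADS_PER_CORE) - 1) << first;
}

/* Highest set bit, or -1 if the mask is empty. */
static int last_cpu(uint64_t mask)
{
	int cpu = -1;

	for (int i = 0; i < 64; i++)
		if (mask & (UINT64_C(1) << i))
			cpu = i;
	return cpu;
}

/* Lowest set bit, or -1 if the mask is empty. */
static int first_cpu(uint64_t mask)
{
	for (int i = 0; i < 64; i++)
		if (mask & (UINT64_C(1) << i))
			return i;
	return -1;
}

/* Drop the last preferred core unless it is the first housekeeping core. */
static uint64_t sketch_decrease(uint64_t preferred, uint64_t active,
				uint64_t housekeeping)
{
	int last = last_cpu(preferred);
	int first_hk = first_cpu(housekeeping & active);

	if (last < 0 || first_hk < 0)
		return preferred;
	if (core_mask(last) == core_mask(first_hk))
		return preferred;
	return preferred & ~(core_mask(last) & active);
}

/* Add back the first active core that is not preferred, if any. */
static uint64_t sketch_increase(uint64_t preferred, uint64_t active)
{
	int cpu = first_cpu(active & ~preferred);

	if (cpu < 0)
		return preferred;
	return preferred | (core_mask(cpu) & active);
}
```

With 16 active CPUs in four SMT4 cores and all of them housekeeping,
repeated calls to sketch_decrease() shrink the preferred mask from
0xffff to 0x0fff, 0x00ff and 0x000f, and then stop because only the
first housekeeping core is left.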

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Modified for steal_monitor

 drivers/virt/steal_monitor/defaults.c | 68 +++++++++++++++++++++++++++
 drivers/virt/steal_monitor/sm_core.h  |  4 ++
 2 files changed, 72 insertions(+)

diff --git a/drivers/virt/steal_monitor/defaults.c b/drivers/virt/steal_monitor/defaults.c
index 17f57afacbe6..90ede838491f 100644
--- a/drivers/virt/steal_monitor/defaults.c
+++ b/drivers/virt/steal_monitor/defaults.c
@@ -25,3 +25,71 @@ u64 __weak get_system_steal_time(void)
 
 	return total_steal;
 }
+
+/*
+ * Default implementation of decrementing the preferred CPUs based on steal
+ * time. This is simple logic that decreases the preferred CPUs by 1 core.
+ * It takes out the last core in active & preferred.
+ *
+ * Ensure at least one housekeeping core is always kept as preferred.
+ *
+ * Can be overridden by arch-specific handling. Arch must ensure
+ * preferred is always a subset of active.
+ */
+
+#define get_core_mask(cpu) topology_sibling_cpumask(cpu)
+
+void __weak decrease_preferred_cpus(struct steal_monitor *ctx)
+{
+	int last_cpu, tmp_cpu;
+	int first_hk_cpu;
+
+	guard(cpus_read_lock)();
+
+	last_cpu = cpumask_last(cpu_preferred_mask);
+	first_hk_cpu = cpumask_first_and(housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
+					 cpu_active_mask);
+	/*
+	 * If the core belongs to the first housekeeping CPUs, no action is
+	 * taken. This leaves at least one core preferred always.
+	 * This ensures at least some CPUs are available to run.
+	 */
+	if (cpumask_equal(get_core_mask(last_cpu), get_core_mask(first_hk_cpu)))
+		return;
+
+	/*
+	 * Set the tick bit for nohz_full CPUs to push the tasks out. Once the
+	 * tasks are pushed out, the bit will be cleared if there are no tasks.
+	 */
+
+	for_each_cpu_and(tmp_cpu, get_core_mask(last_cpu), cpu_active_mask) {
+		set_cpu_preferred(tmp_cpu, false);
+		if (tick_nohz_full_cpu(tmp_cpu))
+			tick_nohz_dep_set_cpu(tmp_cpu, TICK_DEP_BIT_SCHED);
+	}
+}
+
+/*
+ * Default implementation of incrementing preferred CPUs based on steal
+ * time. This is simple logic that increases the preferred CPUs by 1 core.
+ * It adds the first core in active & !preferred.
+ *
+ * Nothing to do if active == preferred.
+ *
+ * Can be overridden by arch-specific handling. Arch must ensure
+ * preferred is a subset of active.
+ */
+void __weak increase_preferred_cpus(struct steal_monitor *ctx)
+{
+	int first_cpu, tmp_cpu;
+
+	guard(cpus_read_lock)();
+
+	first_cpu = cpumask_first_andnot(cpu_active_mask, cpu_preferred_mask);
+	/* All CPUs are preferred. Nothing to increase further */
+	if (first_cpu >= nr_cpu_ids)
+		return;
+
+	for_each_cpu_and(tmp_cpu, get_core_mask(first_cpu), cpu_active_mask)
+		set_cpu_preferred(tmp_cpu, true);
+}
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index e09745a2b813..1857d6a9a295 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -10,6 +10,8 @@
 #include <linux/cpumask.h>
 #include <linux/workqueue.h>
 #include <linux/kernel_stat.h>
+#include <linux/tick.h>
+#include <linux/sched/isolation.h>
 
 struct steal_monitor {
 	struct delayed_work	work;
@@ -24,4 +26,6 @@ struct steal_monitor {
 extern struct steal_monitor sm_core_ctx;
 
 u64 get_system_steal_time(void);
+void increase_preferred_cpus(struct steal_monitor *ctx);
+void decrease_preferred_cpus(struct steal_monitor *ctx);
 #endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3



* [PATCH v5 19/24] virt/steal_monitor: Provide default method to get systemwide steal time
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

The steal monitor takes a global view of steal time instead of looking
at individual vCPUs. For this, collect the overall steal values across
all the vCPUs or the vCPUs of interest.

The default implementation sums steal time across all active CPUs.
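
As a rough userspace illustration of the summation, assuming per-CPU
steal counters are available in an array: steal_ns[] and active[] below
are hypothetical stand-ins for kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL]
and cpu_active_mask, and sketch_system_steal() is not a kernel function.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sum the steal time of every CPU flagged in active[].
 * CPUs that are not active do not contribute to the total.
 */
static uint64_t sketch_system_steal(const uint64_t *steal_ns,
				    const int *active, size_t nr_cpus)
{
	uint64_t total = 0;

	for (size_t cpu = 0; cpu < nr_cpus; cpu++)
		if (active[cpu])
			total += steal_ns[cpu];
	return total;
}
```

An arch override that only cares about a subset of vCPUs would differ
only in which CPUs it walks.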

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- new patch

 drivers/virt/steal_monitor/Makefile   |  2 +-
 drivers/virt/steal_monitor/defaults.c | 27 +++++++++++++++++++++++++++
 drivers/virt/steal_monitor/sm_core.h  |  2 ++
 3 files changed, 30 insertions(+), 1 deletion(-)
 create mode 100644 drivers/virt/steal_monitor/defaults.c

diff --git a/drivers/virt/steal_monitor/Makefile b/drivers/virt/steal_monitor/Makefile
index 24cee55342ce..7c16f8cf9583 100644
--- a/drivers/virt/steal_monitor/Makefile
+++ b/drivers/virt/steal_monitor/Makefile
@@ -11,4 +11,4 @@
 #
 obj-$(subst y,m,$(CONFIG_PREFERRED_CPU)) += steal_monitor.o
 
-steal_monitor-y := sm_core.o
+steal_monitor-y := sm_core.o defaults.o
diff --git a/drivers/virt/steal_monitor/defaults.c b/drivers/virt/steal_monitor/defaults.c
new file mode 100644
index 000000000000..17f57afacbe6
--- /dev/null
+++ b/drivers/virt/steal_monitor/defaults.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Base file contains the default implementations.
+ * These are defined as __weak so that arch may define
+ * strong symbols to override.
+ *
+ * Copyright (C) 2026 IBM
+ * Author: Shrikanth Hegde <sshegde@linux.ibm.com>
+ */
+#include "sm_core.h"
+
+/*
+ * Compute steal time of the full system.
+ *
+ * Default implementation returns steal time across all active CPUs
+ */
+
+u64 __weak get_system_steal_time(void)
+{
+	int tmp_cpu;
+	u64 total_steal = 0;
+
+	for_each_cpu(tmp_cpu, cpu_active_mask)
+		total_steal += kcpustat_cpu(tmp_cpu).cpustat[CPUTIME_STEAL];
+
+	return total_steal;
+}
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index d50138ad8c42..e09745a2b813 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -9,6 +9,7 @@
 #include <linux/init.h>
 #include <linux/cpumask.h>
 #include <linux/workqueue.h>
+#include <linux/kernel_stat.h>
 
 struct steal_monitor {
 	struct delayed_work	work;
@@ -22,4 +23,5 @@ struct steal_monitor {
 
 extern struct steal_monitor sm_core_ctx;
 
+u64 get_system_steal_time(void);
 #endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3



* [PATCH v5 18/24] virt/steal_monitor: Compute work at regular intervals
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

Trigger periodic work at the interval specified in interval_ms.
schedule_delayed_work is chosen since this work need not happen at this
instant and this variant is safe w.r.t. CPU hotplug.

Reset interval_ms to the default if it is set to 0, to avoid workqueue
stalls.
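
The zero-interval guard can be pictured with the small userspace sketch
below. It only mirrors the fallback logic; sketch_next_interval_ms() and
SKETCH_DEFAULT_INTERVAL_MS are hypothetical names, and the real code
additionally warns and re-arms the delayed work.

```c
#define SKETCH_DEFAULT_INTERVAL_MS 1000u	/* default used by this series */

/*
 * Return the delay for the next sample. A value of 0 would make the
 * work re-arm itself with no delay, so it is replaced by the default
 * and written back to the tunable.
 */
static unsigned int sketch_next_interval_ms(unsigned int *interval_ms)
{
	if (*interval_ms == 0)
		*interval_ms = SKETCH_DEFAULT_INTERVAL_MS;
	return *interval_ms;
}
```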

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Modified for steal_monitor

 drivers/virt/steal_monitor/sm_core.c | 26 +++++++++++++++++++++++++-
 drivers/virt/steal_monitor/sm_core.h |  2 ++
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index b95b37e37a16..fac8f4d5dac7 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -32,15 +32,39 @@ module_param_named(low_threshold, sm_core_ctx.low_threshold, uint, 0644);
 MODULE_PARM_DESC(low_threshold,
 		 "Low steal threshold (default: 200 i.e 2%)");
 
+static void compute_preferred_cpus_work(struct work_struct *work)
+{
+	/* At least one core is kept as preferred */
+	WARN_ON(cpumask_empty(cpu_preferred_mask));
+
+	/* Warn if interval_ms is set to 0, that might cause lockup. */
+	if (unlikely(sm_core_ctx.interval_ms == 0)) {
+		WARN_ON(1);
+		sm_core_ctx.interval_ms = 1000; /* Fallback to default */
+	}
+
+	/* Trigger for next sampling */
+	schedule_delayed_work(&sm_core_ctx.work,
+			      msecs_to_jiffies(sm_core_ctx.interval_ms));
+}
+
 static int __init steal_monitor_init(void)
 {
-	pr_info("steal_monitor is enabled\n");
+	pr_info("steal_monitor is enabled. interval: %ums, high_threshold: %u, low_threshold: %u\n",
+		sm_core_ctx.interval_ms, sm_core_ctx.high_threshold, sm_core_ctx.low_threshold);
+
+	INIT_DELAYED_WORK(&sm_core_ctx.work, compute_preferred_cpus_work);
+
+	schedule_delayed_work(&sm_core_ctx.work,
+			      msecs_to_jiffies(sm_core_ctx.interval_ms));
+
 	return 0;
 }
 
 static void __exit steal_monitor_exit(void)
 {
 	pr_info("steal_monitor is disabled\n");
+	cancel_delayed_work_sync(&sm_core_ctx.work);
 	cpumask_copy(&__cpu_preferred_mask, cpu_active_mask);
 }
 
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index a4e813319680..d50138ad8c42 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -7,6 +7,8 @@
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/init.h>
+#include <linux/cpumask.h>
+#include <linux/workqueue.h>
 
 struct steal_monitor {
 	struct delayed_work	work;
-- 
2.47.3



* Re: [PATCH] Docs/translations/it_IT: update current minimal requirements
From: Jonathan Corbet @ 2026-06-25 12:49 UTC
  To: Doehyun Baek; +Cc: linux-doc, Shuah Khan, Federico Vaga
In-Reply-To: <CAN-j9UqSuCWikosJzu+kcU=cOwnfRzYkr85hXS24AqSf=qwVwQ@mail.gmail.com>

Doehyun Baek <doehyunbaek@gmail.com> writes:

> Hi Jonathan,
>
> Gentle ping on this patch. Federico replied that it looks good to him.
>
> If nothing else is needed, could this be applied to docs-next?

We're in the merge window, which slows things down.

In this case, as well, the patch doesn't apply to docs-next.  Please
send a version that does, and I'll apply it after the merge window
closes.

Thanks,

jon


* [PATCH v5 17/24] virt/steal_monitor: Add control knobs for handling steal values
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

These are the knobs to control the steal_monitor.

interval_ms:
How often the steal monitor checks for steal time.
(Default: 1000, i.e. 1 second)

This controls how fast the steal monitor driver reacts to changes in
the contention for physical CPUs. Since it does a fair amount of
work, setting it too low adds overhead. If set to 0, it is reset
to the default on the next work run.

low_threshold:
Lower threshold value in percentage * 100.
(Default: 200, i.e. 2% steal is considered the low threshold)

This determines which values are considered nil/no steal.
When the steal monitor sees steal time at or below this value, it
increases the preferred CPUs by 1 core. Setting this to zero
might cause too many oscillations.

high_threshold:
Higher threshold value in percentage * 100.
(Default: 500, i.e. 5% steal is considered the high threshold)

This determines which values are considered high steal.
When the steal monitor sees steal time above this value, it
reduces the preferred CPUs by 1 core.

Also available at: Documentation/driver-api/steal-monitor.rst
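
To make the "percentage * 100" unit concrete, here is a hedged userspace
sketch of how a sampled steal delta could be compared against the two
thresholds. The ratio formula is an assumption made for illustration
(steal summed over nr_cpus CPUs, divided by the CPU time available in
the window); the actual computation lives in a later patch of the
series. sketch_steal_ratio() and sketch_direction() are hypothetical
names.

```c
#include <stdint.h>

/*
 * Steal over one sampling window in units of percent * 100.
 * Assumption: steal_delta_ns is summed over nr_cpus CPUs, so the CPU
 * time available in the window is window_ns * nr_cpus.
 */
static unsigned int sketch_steal_ratio(uint64_t steal_delta_ns,
				       uint64_t window_ns,
				       unsigned int nr_cpus)
{
	uint64_t capacity_ns = window_ns * nr_cpus;

	if (capacity_ns == 0)
		return 0;
	return (unsigned int)(steal_delta_ns * 10000 / capacity_ns);
}

/* -1: drop one preferred core, +1: add one core, 0: leave as is. */
static int sketch_direction(unsigned int ratio, unsigned int low,
			    unsigned int high)
{
	if (ratio > high)
		return -1;
	if (ratio <= low)
		return 1;
	return 0;
}
```

With the defaults (low 200, high 500), 240 ms of steal across 4 CPUs in
a 1 second window is a ratio of 600, i.e. 6%, which is above the high
threshold. A ratio between the two thresholds leaves the preferred set
unchanged, which is what dampens oscillation.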

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Modified for steal_monitor

 drivers/virt/steal_monitor/sm_core.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index 3feb686dd3c4..b95b37e37a16 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -14,7 +14,23 @@

 #include "sm_core.h"

-struct steal_monitor sm_core_ctx;
+struct steal_monitor sm_core_ctx = {
+	.interval_ms = 1000,	/* 1 second */
+	.high_threshold = 500,	/* 5% */
+	.low_threshold = 200,	/* 2% */
+};
+
+module_param_named(interval_ms, sm_core_ctx.interval_ms, uint, 0644);
+MODULE_PARM_DESC(interval_ms,
+		 "Sampling frequency for steal values in milliseconds (default: 1000)");
+
+module_param_named(high_threshold, sm_core_ctx.high_threshold, uint, 0644);
+MODULE_PARM_DESC(high_threshold,
+		 "High steal threshold (default: 500 i.e 5%)");
+
+module_param_named(low_threshold, sm_core_ctx.low_threshold, uint, 0644);
+MODULE_PARM_DESC(low_threshold,
+		 "Low steal threshold (default: 200 i.e 2%)");

 static int __init steal_monitor_init(void)
 {
-- 
2.47.3


* [PATCH v5 16/24] virt/steal_monitor: Define steal_monitor structure
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

Main structure of the steal monitor. It has:
- work: deferred periodic work function.
- prev_steal, prev_time: used to calculate the delta in the periodic work.
- interval_ms, high_threshold, low_threshold: debug knobs of
  steal_monitor.

sm_core_ctx is the instance used by the core code.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Modified for steal_monitor

 drivers/virt/steal_monitor/sm_core.c |  2 ++
 drivers/virt/steal_monitor/sm_core.h | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index b1865fcdff93..3feb686dd3c4 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -14,6 +14,8 @@
 
 #include "sm_core.h"
 
+struct steal_monitor sm_core_ctx;
+
 static int __init steal_monitor_init(void)
 {
 	pr_info("steal_monitor is enabled\n");
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index 684a258526e1..a4e813319680 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -8,4 +8,16 @@
 #include <linux/kernel.h>
 #include <linux/init.h>
 
+struct steal_monitor {
+	struct delayed_work	work;
+	u64			prev_steal;
+	int			prev_direction;
+	unsigned int		interval_ms;
+	unsigned int		high_threshold;
+	unsigned int		low_threshold;
+	ktime_t			prev_time;
+};
+
+extern struct steal_monitor sm_core_ctx;
+
 #endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3



* [PATCH v5 15/24] virt/steal_monitor: Restore to active on module disable
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

When the module is not in use, the preferred CPUs must be the same
as the active CPUs.

Even if the module is disabled during high steal time, the preferred
CPUs are still restored to match the active CPUs, which keeps the
disable path simple.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Modified for steal_monitor

 drivers/virt/steal_monitor/sm_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
index e320559c6576..b1865fcdff93 100644
--- a/drivers/virt/steal_monitor/sm_core.c
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -23,6 +23,7 @@ static int __init steal_monitor_init(void)
 static void __exit steal_monitor_exit(void)
 {
 	pr_info("steal_monitor is disabled\n");
+	cpumask_copy(&__cpu_preferred_mask, cpu_active_mask);
 }
 
 module_init(steal_monitor_init);
-- 
2.47.3



* [PATCH v5 14/24] virt: Introduce steal monitor driver
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

Introduce a new driver in virt named steal_monitor. This driver
will compute the steal time and drive the policy decisions for the
preferred CPU state.

More details can be found in Documentation/driver-api/steal-monitor.rst.
This patch introduces the skeleton code.

There is no new kconfig option. The driver depends on CONFIG_PREFERRED_CPU.
- If CONFIG_PREFERRED_CPU=y, it gets compiled as a module. It is not
  loaded by default.
- If CONFIG_PREFERRED_CPU=n, the module isn't compiled.

The file layout of the driver is designed to accommodate arch-specific
files in the future.

- sm_core.c - contains the main driver code. This includes the periodic
  work function, which takes action based on steal time.
- defaults.c - contains the default implementations, defined as __weak
  symbols.
- sm_core.h - header file which includes the data structure.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- new patch

Please let me know if the placing is not right.

 drivers/virt/Makefile                |  1 +
 drivers/virt/steal_monitor/Makefile  | 14 ++++++++++++
 drivers/virt/steal_monitor/sm_core.c | 33 ++++++++++++++++++++++++++++
 drivers/virt/steal_monitor/sm_core.h | 11 ++++++++++
 4 files changed, 59 insertions(+)
 create mode 100644 drivers/virt/steal_monitor/Makefile
 create mode 100644 drivers/virt/steal_monitor/sm_core.c
 create mode 100644 drivers/virt/steal_monitor/sm_core.h

diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
index f29901bd7820..aff715cea42d 100644
--- a/drivers/virt/Makefile
+++ b/drivers/virt/Makefile
@@ -9,4 +9,5 @@ obj-y				+= vboxguest/
 
 obj-$(CONFIG_NITRO_ENCLAVES)	+= nitro_enclaves/
 obj-$(CONFIG_ACRN_HSM)		+= acrn/
+obj-$(CONFIG_PREFERRED_CPU)	+= steal_monitor/
 obj-y				+= coco/
diff --git a/drivers/virt/steal_monitor/Makefile b/drivers/virt/steal_monitor/Makefile
new file mode 100644
index 000000000000..24cee55342ce
--- /dev/null
+++ b/drivers/virt/steal_monitor/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Steal time monitor to alter preferred CPU state.
+#
+# Arch can implement strong function definitions and override the
+# default by adding them in arch specific file. It must ensure
+# that preferred is always subset of active.
+#
+# It is always compiled as module if CONFIG_PREFERRED_CPU=y
+# One has to enable the module.
+#
+obj-$(subst y,m,$(CONFIG_PREFERRED_CPU)) += steal_monitor.o
+
+steal_monitor-y := sm_core.o
diff --git a/drivers/virt/steal_monitor/sm_core.c b/drivers/virt/steal_monitor/sm_core.c
new file mode 100644
index 000000000000..e320559c6576
--- /dev/null
+++ b/drivers/virt/steal_monitor/sm_core.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Steal time Monitor.
+ *
+ * Periodically compute steal time. Based on the thresholds either
+ * reduce/increase the preferred CPUs which can be made use
+ * by the workload to avoid vCPU preemption to an extent possible.
+ *
+ * Available as module with CONFIG_PREFERRED_CPU=y
+ *
+ * Copyright (C) 2026 IBM
+ * Author: Shrikanth Hegde <sshegde@linux.ibm.com>
+ */
+
+#include "sm_core.h"
+
+static int __init steal_monitor_init(void)
+{
+	pr_info("steal_monitor is enabled\n");
+	return 0;
+}
+
+static void __exit steal_monitor_exit(void)
+{
+	pr_info("steal_monitor is disabled\n");
+}
+
+module_init(steal_monitor_init);
+module_exit(steal_monitor_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("IBM Corporation");
+MODULE_DESCRIPTION("Virtualization Steal Time Monitor");
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
new file mode 100644
index 000000000000..684a258526e1
--- /dev/null
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __VIRT_STEAL_CORE_H
+#define __VIRT_STEAL_CORE_H
+
+#include <linux/types.h>
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+
+#endif /* __VIRT_STEAL_CORE_H */
-- 
2.47.3



* [PATCH v5 13/24] virt/steal_monitor: Add documentation
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

Document this module named steal_monitor and its parameters.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- new patch

Please let me know if the placing is not right.

 Documentation/driver-api/index.rst         |  1 +
 Documentation/driver-api/steal-monitor.rst | 93 ++++++++++++++++++++++
 2 files changed, 94 insertions(+)
 create mode 100644 Documentation/driver-api/steal-monitor.rst

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index eaf7161ff957..ec12f396a5e6 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -138,6 +138,7 @@ Subsystem-specific APIs
    sm501
    soundwire/index
    spi
+   steal-monitor
    surface_aggregator/index
    switchtec
    sync_file
diff --git a/Documentation/driver-api/steal-monitor.rst b/Documentation/driver-api/steal-monitor.rst
new file mode 100644
index 000000000000..997a22d0812c
--- /dev/null
+++ b/Documentation/driver-api/steal-monitor.rst
@@ -0,0 +1,93 @@
+.. SPDX-License-Identifier: GPL-2.0
+=============
+Steal Monitor
+=============
+
+:Author: Shrikanth Hegde
+
+Introduction:
+=============
+
+Steal monitor is a driver aimed at solving the Noisy Neighbour problem
+in virtualized environments, i.e. the performance of a workload
+running in one VM is affected significantly by other VMs, and
+combined they make slower forward progress.
+
+When CPU resources are overcommitted, i.e. the sum of virtual CPUs (vCPUs)
+of all VMs is greater than the number of physical CPUs (pCPUs), and
+all or many VMs have high utilization, the hypervisor cannot satisfy
+the CPU demand and has to context switch within or across VMs,
+i.e. the hypervisor needs to preempt one vCPU to run another.
+This is called vCPU preemption. It is more expensive than a
+task context switch within a vCPU.
+
+In such cases it is better to reduce the combined vCPU demand of all VMs
+by not using some of the vCPUs. vCPUs on which the workload can be safely
+scheduled without increasing contention for pCPUs are called
+"Preferred CPUs".
+
+See more on "Preferred CPUs" in Documentation/scheduler/sched-arch.rst.
+
+This driver helps in setting/clearing the CPUs in the "Preferred CPUs" list.
+This list is obtained using cpu_preferred_mask.
+
+Core idea:
+==========
+steal time is an indication available today in Guest which shows contention
+for underlying physical CPU. Use it as a hint in the guest to fold the
+workload to a reduced set of vCPUs. When there is contention, steal time
+will show up in all the guests. When each guest honors the hint and folds
+the workload to a smaller set of vCPUs(Preferred CPUs), it reduces the
+contention and thereby reduces vCPU preemption.
+This is achieved without any cross-guest communication.
+
+Steal monitor driver effectively does:
+
+1. Periodically computes steal time across the system.
+
+2. If steal time is greater than high threshold, reduce the number of
+   preferred CPUs by 1 core. Ensure at least one core is left always.
+   This avoids running into extreme cases.
+
+3. If steal time is lower or equal to low threshold, increase the
+   number of preferred CPUs by 1 core. If preferred is same as active,
+   nothing to be done.
+
+4. Ensure preferred CPUs is always subset of active CPUs.
+   On feature disable it is same as active CPUs.
+
+Module Parameters:
+==================
+interval_ms
+-----------
+How often the steal monitor checks for steal time.
+(Default: 1000, i.e. 1 second)
+
+This controls how fast the steal monitor driver reacts to changes in
+the contention for physical CPUs. Since it does a fair amount of
+work, setting it too low adds overhead. If set to 0, it is reset
+to the default on the next work run.
+
+low_threshold
+-------------
+Lower threshold value in percentage * 100.
+(Default: 200, i.e. 2% steal is considered the low threshold)
+
+This determines which values are considered nil/no steal.
+When the steal monitor sees steal time at or below this value, it
+increases the preferred CPUs by 1 core. Setting this to zero
+might cause too many oscillations.
+
+high_threshold
+--------------
+Higher threshold value in percentage * 100.
+(Default: 500, i.e. 5% steal is considered the high threshold)
+
+This determines which values are considered high steal.
+When the steal monitor sees steal time above this value, it
+reduces the preferred CPUs by 1 core.
+
+Notes:
+======
+This is available under CONFIG_PREFERRED_CPU. Selecting that includes
+this module. Module is not loaded by default.
-- 
2.47.3



* [PATCH v5 05/24] sysfs: Add preferred CPU file
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

Add a "preferred" file in /sys/devices/system/cpu.

This offers:
- Users can quickly check which CPUs are marked as preferred at this
  moment.
- Userspace tools such as irqbalance could use this mask to steer IRQs
  to preferred CPUs.

For example:
cat /sys/devices/system/cpu/online
0-719
cat /sys/devices/system/cpu/preferred
0-599        <<< Implies 0-599 are preferred for workloads and 600-719
                 should be avoided at this moment.

cat /sys/devices/system/cpu/preferred
0-719        <<< All CPUs are usable. There is no preference.
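
A userspace consumer would have to parse the CPU list format shown
above. The sketch below counts the CPUs in such a string; it is only an
illustration, sketch_cpulist_weight() is a hypothetical helper, and
reading the sysfs file into a buffer (for example with fgets()) is left
to the caller.

```c
#include <stdlib.h>

/*
 * Count the CPUs in a cpulist string such as "0-599" or "0-3,8,10-11".
 * A trailing newline, as emitted by sysfs, is accepted.
 * Returns -1 on malformed input.
 */
static long sketch_cpulist_weight(const char *list)
{
	long count = 0;
	const char *p = list;

	while (*p && *p != '\n') {
		char *end;
		unsigned long lo = strtoul(p, &end, 10);
		unsigned long hi = lo;

		if (end == p)
			return -1;	/* no digits where a CPU number is expected */
		if (*end == '-') {
			p = end + 1;
			hi = strtoul(p, &end, 10);
			if (end == p || hi < lo)
				return -1;	/* missing or reversed range end */
		}
		count += (long)(hi - lo + 1);
		if (*end == ',')
			end++;
		else if (*end && *end != '\n')
			return -1;	/* unexpected separator */
		p = end;
	}
	return count;
}
```

Comparing the weight of "preferred" with that of "online" gives a quick
view of how many CPUs are currently being avoided: in the example above
that is 720 - 600 = 120.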

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 11 +++++++++++
 drivers/base/cpu.c                                 |  8 ++++++++
 2 files changed, 19 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 82d10d556cc8..5fb973d53287 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -806,3 +806,14 @@ Date:		Nov 2022
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:
 		(RO) the list of CPUs that can be brought online.
+
+What:		/sys/devices/system/cpu/preferred
+Date:		Jun 2026
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		(RO) the list of preferred CPUs at this moment.
+		These are the only CPUs meant to be used at the moment.
+		Using CPU outside of the list could lead to more
+		contention of underlying physical CPU resource. Dynamically
+		changes based on steal time. With CONFIG_PREFERRED_CPU=n it
+		is same as active CPUs. See sched-arch.rst for more details.
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 875abdc9942e..0c6647805805 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -391,6 +391,13 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
 }
 #endif
 
+static ssize_t preferred_show(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_preferred_mask));
+}
+static DEVICE_ATTR_RO(preferred);
+
 const struct bus_type cpu_subsys = {
 	.name = "cpu",
 	.dev_name = "cpu",
@@ -532,6 +539,7 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
+	&dev_attr_preferred.attr,
 	NULL
 };
 
-- 
2.47.3



* [PATCH v5 12/24] sched/debug: Add migration stats due to non preferred CPUs
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

Add a new stat.
- nr_migrations_cpu_non_preferred: number of migrations that happened
  because a CPU was marked as non-preferred due to high steal time.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 4 +++-
 kernel/sched/debug.c  | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 27dbf676113e..32f4743326f8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -554,6 +554,7 @@ struct sched_statistics {
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;
+	u64				nr_migrations_cpu_non_preferred;
 
 	u64				nr_wakeups;
 	u64				nr_wakeups_sync;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1e42078251d5..fdfefdca74dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11336,8 +11336,10 @@ static int sched_non_preferred_cpu_push_stop(void *arg)
 
 	context_unsafe_alias(rq);
 
-	if (task_rq(p) == rq && task_on_rq_queued(p))
+	if (task_rq(p) == rq && task_on_rq_queued(p)) {
 		rq = __migrate_task(rq, &rf, p, cpu);
+		schedstat_inc(p->stats.nr_migrations_cpu_non_preferred);
+	}
 
 	rq_unlock(rq, &rf);
 	raw_spin_unlock_irq(&p->pi_lock);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f3a033b34ba0..106b448cafb6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1363,6 +1363,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
 		P_SCHEDSTAT(nr_forced_migrations);
+		P_SCHEDSTAT(nr_migrations_cpu_non_preferred);
 		P_SCHEDSTAT(nr_wakeups);
 		P_SCHEDSTAT(nr_wakeups_sync);
 		P_SCHEDSTAT(nr_wakeups_migrate);
-- 
2.47.3



* [PATCH v5 11/24] sched/core: Push current task from non preferred CPU
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

Actively push out the task running on a non-preferred CPU. Since the task
is currently running on the CPU, a stopper is needed to stop the CPU and
push the task out. However, if the task is pinned only to non-preferred
CPUs, it will continue running there. This helps maintain userspace
affinities, unlike CPU hotplug or isolated cpusets.

Though the code is almost the same as __balance_push_cpu_stop and quite
close to push_cpu_stop, it is kept separate as that provides a cleaner
implementation w.r.t. CONFIG_HOTPLUG_CPU.

Add push_task_work_done flag to protect the work buffer.
Works only with the FAIR class.
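
For readers following along, the role of the flag can be sketched in
userspace C. This is only a rough illustration under stated assumptions,
not the kernel code: struct work, struct runqueue, tick_try_push() and
stopper_run() are hypothetical stand-ins, the model is single-threaded,
and the rq lock taken around the flag in the patch is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU cpu_stop_work buffer. */
struct work {
	void *arg;
	bool queued;
};

/* Hypothetical stand-in for struct rq; only the fields used here. */
struct runqueue {
	bool push_work_pending;   /* models rq->push_task_work_done */
	struct work push_work;    /* single buffer, must not be reused early */
	int nr_queued;            /* how many times work was actually queued */
};

/* Models the tick path: queue push work unless one is already in flight. */
static bool tick_try_push(struct runqueue *rq, void *task)
{
	if (rq->push_work_pending)
		return false;     /* buffer is still owned by in-flight work */

	rq->push_work_pending = true;
	rq->push_work.arg = task;
	rq->push_work.queued = true;
	rq->nr_queued++;
	return true;
}

/* Models the stopper callback: clear the flag so the buffer can be reused. */
static void stopper_run(struct runqueue *rq)
{
	rq->push_work.queued = false;
	rq->push_work_pending = false;
}
```

In this model a second tick that arrives before stopper_run() has run
returns early, so the single work buffer is never queued twice.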

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Move select_fallback_rq outside of rq_lock (Sashiko)
- Add context_unsafe_alias (K Prateek Nayak)
- Cleanup properly on early exit.

 kernel/sched/core.c  | 87 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  8 ++++
 2 files changed, 95 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c0391e7897f5..1e42078251d5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5794,6 +5794,9 @@ void sched_tick(void)
 	unsigned long hw_pressure;
 	u64 resched_latency;
 
+	if (!cpu_preferred(cpu))
+		sched_push_current_non_preferred_cpu(rq);
+
 	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
 		arch_scale_freq_tick();
 
@@ -11303,3 +11306,87 @@ void sched_change_end(struct sched_change_ctx *ctx)
 		p->sched_class->prio_changed(rq, p, ctx->prio);
 	}
 }
+
+#ifdef CONFIG_PREFERRED_CPU
+/* npc - non preferred CPU */
+static DEFINE_PER_CPU(struct cpu_stop_work, npc_push_task_work);
+
+static int sched_non_preferred_cpu_push_stop(void *arg)
+{
+	struct task_struct *p = arg;
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+	int cpu;
+
+	/* sanity check and clear */
+	if (cpu_preferred(rq->cpu)) {
+		scoped_guard (rq_lock, rq)
+			rq->push_task_work_done = 0;
+		put_task_struct(p);
+		return 0;
+	}
+
+	raw_spin_lock_irq(&p->pi_lock);
+
+	/* This could take rq lock. So call it before rq lock is taken */
+	cpu = select_fallback_rq(rq->cpu, p);
+	rq_lock(rq, &rf);
+	rq->push_task_work_done = 0;
+	update_rq_clock(rq);
+
+	context_unsafe_alias(rq);
+
+	if (task_rq(p) == rq && task_on_rq_queued(p))
+		rq = __migrate_task(rq, &rf, p, cpu);
+
+	rq_unlock(rq, &rf);
+	raw_spin_unlock_irq(&p->pi_lock);
+	put_task_struct(p);
+
+	return 0;
+}
+
+/*
+ * Push the current task running on non-preferred CPU.
+ * Using this non preferred CPU will lead to more vCPU preemptions
+ * in the host. So it is better not to use this CPU.
+ *
+ * Since task is running, call a stopper to push the task out. This is
+ * similar to how task moves during hotplug. In select_fallback_rq a
+ * preferred CPU will be chosen and henceforth task shouldn't come back to
+ * this CPU again.
+ *
+ * Works for FAIR class only
+ *
+ * If task is affined only to non-preferred CPUs, it can't be moved out
+ */
+void sched_push_current_non_preferred_cpu(struct rq *rq)
+{
+	struct task_struct *push_task = rq->curr;
+
+	/* Push only if it is FAIR class */
+	if (push_task->sched_class != &fair_sched_class)
+		return;
+
+	if (kthread_is_per_cpu(push_task) ||
+	    is_migration_disabled(push_task))
+		return;
+
+	/* Is there any preferred CPU in the affinity list */
+	if (!task_has_preferred_cpus(push_task))
+		return;
+
+	/* There is already a stopper thread for this. Dont race with it */
+	if (rq->push_task_work_done == 1)
+		return;
+
+	/* sched_tick runs with interrupts disabled. Don't disable again */
+	get_task_struct(push_task);
+
+	scoped_guard (rq_lock, rq)
+		rq->push_task_work_done = 1;
+
+	stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
+			    push_task, this_cpu_ptr(&npc_push_task_work));
+}
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 148fe6145f1a..316d3ccefc48 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1274,6 +1274,8 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+	bool			push_task_work_done;
+
 	struct sched_avg	avg_rt;
 	struct sched_avg	avg_dl;
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
@@ -4241,4 +4243,10 @@ static inline bool task_has_preferred_cpus(struct task_struct *p)
 	else
 		return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
 }
+
+#ifdef CONFIG_PREFERRED_CPU
+void sched_push_current_non_preferred_cpu(struct rq *rq);
+#else	/* !CONFIG_PREFERRED_CPU */
+static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
+#endif
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3



* [PATCH v5 10/24] sched/core: Keep tick on non-preferred CPUs until tasks are out
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

Enable the tick on a nohz_full CPU when it is marked as non-preferred.
If there is no CFS task running there, disable the tick to save power.

Steal time handling code will call tick_nohz_dep_set_cpu with
TICK_DEP_BIT_SCHED so that tasks are moved out of the nohz_full CPU quickly.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Move it below rt checks. (Sashiko)

 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 281715a6e88f..c0391e7897f5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1473,6 +1473,10 @@ bool sched_can_stop_tick(struct rq *rq)
 			return false;
 	}
 
+	/* Keep the tick running until CFS tasks are pushed out*/
+	if (!cpu_preferred(rq->cpu) && rq->cfs.h_nr_queued)
+		return false;
+
 	return true;
 }
 #endif /* CONFIG_NO_HZ_FULL */
-- 
2.47.3



* [PATCH v5 09/24] sched/fair: Pull the load on preferred CPU
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

When a CPU is marked as non-preferred, pulling load towards it is
pointless since the task will be pushed out again at the next tick.

Since load balancing only happens among preferred CPUs, should_we_balance
will bail out. But for NEWIDLE and IDLE balance, this bailout can
happen even earlier.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- new patch

 kernel/sched/fair.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44a0d9736b67..fda8966d9d87 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -14196,6 +14196,10 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 		if (!idle_cpu(balance_cpu))
 			continue;
 
+		/* There is no point in pulling the load, just to push it out next */
+		if (!cpu_preferred(balance_cpu))
+			continue;
+
 		/*
 		 * If this CPU gets work to do, stop the load balancing
 		 * work being done for other CPUs. Next load
@@ -14375,6 +14379,10 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 	if (!cpu_active(this_cpu))
 		return 0;
 
+	/* Do not pull to a !preferred CPU just to push it out next */
+	if (!cpu_preferred(this_cpu))
+		return 0;
+
 	/*
 	 * This is OK, because current is on_cpu, which avoids it being picked
 	 * for load-balance and preemption/IRQs are still disabled avoiding
-- 
2.47.3



* [PATCH v5 08/24] sched/fair: load balance only among preferred CPUs
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

Consider only preferred CPUs for load balance.

With this, load balance will end up choosing a preferred CPU for the pull.
This keeps it from fighting against the push task mechanism, which runs
at tick. It also stops active balance from happening on a non-preferred
CPU and pulling load there.

This means there is no load balancing for tasks pinned only to
non-preferred CPUs. They will continue to run where they were running
before the CPUs were marked as non-preferred.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Remove previous cpumask_and (K Prateek Nayak)

 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d78467ec6ee1..44a0d9736b67 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13289,7 +13289,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	};
 	bool need_unlock = false;
 
-	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+	/* Spread load among preferred CPUs */
+	cpumask_and(cpus, sched_domain_span(sd), cpu_preferred_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
-- 
2.47.3



* [PATCH v5 07/24] sched/fair: Select preferred CPU at wakeup when possible
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

Update available_idle_cpu to consider preferred CPUs. This takes care of
a lot of wakeup decisions so that only preferred CPUs are used. There is no
need to put explicit checks everywhere.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/sched.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5d009c2529b2..148fe6145f1a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1434,6 +1434,9 @@ static inline bool available_idle_cpu(int cpu)
 	if (!idle_rq(cpu_rq(cpu)))
 		return 0;
 
+	if (!cpu_preferred(cpu))
+		return 0;
+
 	if (vcpu_is_preempted(cpu))
 		return 0;
 
-- 
2.47.3



* [PATCH v5 06/24] sched/core: allow only preferred CPUs in is_cpu_allowed
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

When possible, pick a preferred CPU.

The push task mechanism uses a stopper thread, which calls
select_fallback_rq and relies on this check to pick only a preferred CPU.

When a task is affined only to non-preferred CPUs, it should continue to
run there. Detect that by checking whether cpus_ptr and
cpu_preferred_mask intersect.

Since is_cpu_allowed can be called directly or repeatedly from
select_fallback_rq, task_struct->has_preferred_cpu_state encodes whether
the call comes via select_fallback_rq and, if so, caches the result.
This helps to avoid O(N**2) complexity in the rare cases.

The additional O(N) overhead in is_cpu_allowed is incurred only when the
CPU is not preferred. So in normal scenarios the overhead is only a bit
check.
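
The caching scheme described above can be sketched in userspace C. This
is a minimal illustration under stated assumptions, not the kernel
implementation: CPU masks are modelled as 64-bit bitmaps, and
struct task, mask_intersects(), task_has_preferred(), cpu_allowed() and
pick_fallback() are hypothetical names. Hotplug, kthread and scheduling
class checks are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for task_struct; one bit per CPU, up to 64 CPUs. */
struct task {
	uint64_t cpus_allowed;   /* models p->cpus_ptr */
	int8_t pref_state;       /* 1: has preferred, -1: none, 0: not cached */
};

static uint64_t preferred_mask;      /* models cpu_preferred_mask */
static unsigned long nr_intersects;  /* counts the O(N) mask walks */

static bool mask_intersects(uint64_t a, uint64_t b)
{
	nr_intersects++;
	return (a & b) != 0;
}

/* Use the cached answer if there is one, else walk the masks. */
static bool task_has_preferred(const struct task *t)
{
	if (t->pref_state)
		return t->pref_state > 0;
	return mask_intersects(t->cpus_allowed, preferred_mask);
}

static bool cpu_allowed(const struct task *t, int cpu)
{
	uint64_t bit = UINT64_C(1) << cpu;

	if (!(t->cpus_allowed & bit))
		return false;
	/* Reject a non-preferred CPU only if a preferred one is available. */
	if (!(preferred_mask & bit) && task_has_preferred(t))
		return false;
	return true;
}

/* Models the fallback loop: cache once, test each CPU, then clear. */
static int pick_fallback(struct task *t, int nr_cpus)
{
	int cpu, dest = -1;

	t->pref_state = task_has_preferred(t) ? 1 : -1;
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (cpu_allowed(t, cpu)) {
			dest = cpu;
			break;
		}
	}
	t->pref_state = 0;
	return dest;
}
```

In this model the masks are walked once per pick_fallback() call instead
of once per candidate CPU, and a task pinned only to non-preferred CPUs
still gets one of its own CPUs back.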

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Do simple encoding of -1,0,1 instead (K Prateek Nayak)
- Make it s8 (K Prateek Nayak)
- Update changelog to address sashiko concerns of overhead.

 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 35 +++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  | 25 +++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc6ecb3869dd..27dbf676113e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1657,6 +1657,7 @@ struct task_struct {
 #ifdef CONFIG_UNWIND_USER
 	struct unwind_task_info		unwind_info;
 #endif
+	s8				has_preferred_cpu_state;
 
 	/* CPU-specific state of this task: */
 	struct thread_struct		thread;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e16946c9d62..281715a6e88f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2500,6 +2500,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
  */
 static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 {
+	bool task_check_preferred_cpu;
+
 	/* When not in the task's cpumask, no point in looking further. */
 	if (!task_allowed_on_cpu(p, cpu))
 		return false;
@@ -2508,9 +2510,23 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (is_migration_disabled(p))
 		return cpu_online(cpu);
 
+	/*
+	 * This is essential to maintain user affinities when preferred
+	 * CPUs change. A task pinned on non-preferred CPU should continue
+	 * to run there, since this is non-user triggered.
+	 *
+	 * If CPU is non-preferred and task can run on other CPUs which are
+	 * currently preferred, then choose those other CPUs instead.
+	 * Overhead is minimal when CPU is preferred.
+	 */
+	task_check_preferred_cpu = !cpu_preferred(cpu) && task_has_preferred_cpus(p);
+
 	/* Non kernel threads are not allowed during either online or offline. */
-	if (!(p->flags & PF_KTHREAD))
+	if (!(p->flags & PF_KTHREAD)) {
+		if (task_check_preferred_cpu)
+			return false;
 		return cpu_active(cpu);
+	}
 
 	/* KTHREAD_IS_PER_CPU is always allowed. */
 	if (kthread_is_per_cpu(p))
@@ -2520,6 +2536,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (cpu_dying(cpu))
 		return false;
 
+	/* Try on preferred CPU first if possible*/
+	if (task_check_preferred_cpu)
+		return false;
+
 	/* But are allowed during online. */
 	return cpu_online(cpu);
 }
@@ -3549,6 +3569,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 	enum { cpuset, possible, fail } state = cpuset;
 	int dest_cpu;
 
+	/*
+	 * Cache the value whether task's affinity spans preferred CPUs.
+	 * This helps to avoid repeating the same for each CPU
+	 * later in the loop. Encode call to is_cpu_allowed coming
+	 * via select_fallback_rq.
+	 */
+	p->has_preferred_cpu_state = task_has_preferred_cpus(p) ? 1 : -1;
+
 	/*
 	 * If the node that the CPU is on has been offlined, cpu_to_node()
 	 * will return -1. There is no CPU on the node, and we should
@@ -3560,7 +3588,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 		/* Look for allowed, online CPU in same node. */
 		for_each_cpu(dest_cpu, nodemask) {
 			if (is_cpu_allowed(p, dest_cpu))
-				return dest_cpu;
+				goto clear_and_return;
 		}
 	}
 
@@ -3604,6 +3632,8 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 		}
 	}
 
+clear_and_return:
+	p->has_preferred_cpu_state = 0;
 	return dest_cpu;
 }
 
@@ -4612,6 +4642,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
 	init_numa_balancing(clone_flags, p);
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
+	p->has_preferred_cpu_state = 0;
 	init_sched_mm(p);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7c2dea65edd..5d009c2529b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4213,4 +4213,29 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
 
 #include "ext.h"
 
+/*
+ * has_preferred_cpu_state could have the value cached from
+ * select_fallback_rq. It is set/cleared while holding pi_lock
+ * and irq disabled.
+ *
+ *  1: Cached and preferred CPUs exist in task's affinity.
+ *  0: Not cached and need to evaluate.
+ * -1: Cached and no preferred CPU exists in task's affinity.
+ *
+ * Only affects FAIR task.
+ */
+static inline bool task_has_preferred_cpus(struct task_struct *p)
+{
+	int cached;
+
+	/* Only FAIR tasks honor preferred CPU state */
+	if (unlikely(p->sched_class != &fair_sched_class))
+		return false;
+
+	cached = READ_ONCE(p->has_preferred_cpu_state);
+	if (cached)
+		return cached > 0;
+	else
+		return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
+}
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3



* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Lorenzo Stoakes @ 2026-06-25 12:48 UTC (permalink / raw)
  To: Xin Zhao
  Cc: brauner, mjguzik, pfalcato, ebiederm, viro, jack, jlayton,
	chuck.lever, alex.aring, arnd, keescook, mcgrof, j.granados,
	allen.lkml, linux-fsdevel, linux-kernel, linux-arch,
	Jonathan Corbet, Andrew Morton, David Hildenbrand, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Liam R. Howlett,
	linux-doc, linux-mm
In-Reply-To: <20260624145552.70143-1-jackzxcui1989@163.com>

+cc missing maintainers, lists.

NAK.

This is un-upstreamable for numerous reasons.

The stuff you're doing in mm is broken, wrong and invasive and you've not
even bothered to cc mm people. I'm annoyed by this.

You're also making incredibly silly mistakes at v4 of something that should have
been an RFC.

You don't seem to understand the concept of patch _series_ (break it up into
smaller patches!!!) and you haven't bothered cc'ing maintainers whose subsystems
you're radically altering.

I'm annoyed as you have a history where you were told not to add insane hacks
before ([0], my reply at [1]).

[0]:https://lore.kernel.org/all/20260116042817.3790405-1-jackzxcui1989@163.com/
[1]:https://lore.kernel.org/all/14110b70-19e7-474d-b0dd-ba80e8bed9b0@lucifer.local/

Was I wasting my time there? Am I wasting my time responding now?

And how hard is it to run a simple perl script?

Let me run it for you for _just_ the maintainers:

$ scripts/get_maintainer.pl --nogit --nogit-fallback --nor your_patch.patch
Jonathan Corbet <corbet@lwn.net> (maintainer:DOCUMENTATION)
Alexander Viro <viro@zeniv.linux.org.uk> (maintainer:FILESYSTEMS (VFS and infrastructure))
Christian Brauner <brauner@kernel.org> (maintainer:FILESYSTEMS (VFS and infrastructure))
Andrew Morton <akpm@linux-foundation.org> (maintainer:MEMORY MANAGEMENT - CORE)
David Hildenbrand <david@kernel.org> (maintainer:MEMORY MANAGEMENT - CORE)
Arnd Bergmann <arnd@arndb.de> (maintainer:GENERIC INCLUDE/ASM HEADER FILES)
Ingo Molnar <mingo@redhat.com> (maintainer:SCHEDULER)
Peter Zijlstra <peterz@infradead.org> (maintainer:SCHEDULER)
Juri Lelli <juri.lelli@redhat.com> (maintainer:SCHEDULER)
Vincent Guittot <vincent.guittot@linaro.org> (maintainer:SCHEDULER)
Kees Cook <kees@kernel.org> (maintainer:EXEC & BINFMT API, ELF)
"Liam R. Howlett" <liam@infradead.org> (maintainer:MEMORY MAPPING)
Lorenzo Stoakes <ljs@kernel.org> (maintainer:MEMORY MAPPING)
linux-doc@vger.kernel.org (open list:DOCUMENTATION)
linux-kernel@vger.kernel.org (open list)
linux-fsdevel@vger.kernel.org (open list:PROC FILESYSTEM)
linux-mm@kvack.org (open list:MEMORY MANAGEMENT - CORE)
linux-arch@vger.kernel.org (open list:GENERIC INCLUDE/ASM HEADER FILES)
EXEC & BINFMT API, ELF status: Supported

You're missing the majority of these. That's _not OK_.

On Wed, Jun 24, 2026 at 10:55:52PM +0800, Xin Zhao wrote:
> A coredump typically takes some time to complete. If we happen to hold a
> write lock with flock just before triggering the coredump, that write lock
> will not be released during the entire coredump process. As a result,
> other processes attempting to acquire the same write lock may experience
> significant delays. Another typical scenario is that shared memory, such
> as dma-buf, remains occupied and is not released for a long time due to
> core dumps.
>
> To address this, add /proc/<pid>/coredump_pre_exit node so that people can

This is a horrible idea.

> specify which resources they want to release before dumping core. This
> patch implements the early release of two types of resources: flock files
> and file-backed shared memory. Default settings are NOT pre-exit anything.

What, people set this ahead of time? For a dynamic thing like files?

>
> A temporary bit, O_TMPCLOS, is added to mark vma->vm_file->f_flags during
> the execution of the newly introduced exit_mmap_mapped_shared() function.
> In this way, the subsequent exit_files_pre_exit() function does not need
> to find the corresponding vma through the file to check for the VM_SHARED
> attribute, thereby reducing the traversal cost.

This sentence doesn't even make sense?

And also !VM_SHARED means !vma->vm_file so your code would NULL deref if you
didn't check that. But !VM_SHARED VMAs can absolutely be file-backed...

>
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
> ---
>
> Change in v4:
> - Christian pointed out that the coredump process will traverse file
>   descriptors (fd), so certain fds should not be closed by default.
>   Rework the whole feature, add /proc/<pid>/coredump_pre_exit for user
>   pre-exit resources selection, default is NOT pre-exit anything.
> - Mateusz suggested that walking the fd table and release the file-lock is
>   reasonable. No longer release all the fd(s). Based on user config, only
>   the flock fd(s) and the fd(s) correspondent to file-backed shared memory
>   will be released at most.
>
> Change in v3:
> - Add comment and commit-log to explain why do the MMF_DUMP_MAPPED_SHARED
>   mm_flags_test() check, note that memory mapped files keep their own
>   separate references to the files. The case to work around is that early
>   unlocking a flock on a file allows other processes to lock and modify
>   the mapped data protected by the flock,
>   as suggested by Pedro Falcato.
> - Link to v3: https://lore.kernel.org/all/20260619122419.3954581-1-jackzxcui1989@163.com/
>
> Change in v2:
> - Get rid of the implement of adding new fcntl API, the issue does not
>   worth inflicting the cost on everyone,
>   as suggested by Al Viro.
> - Call exit_files() in coredump_wait(),
>   as suggested by Eric W. Biederman.
>   Add MMF_DUMP_MAPPED_SHARED mm_flags_test() check to filter cases that
>   need to dump file-backed shared memory.
> - Link to v2: https://lore.kernel.org/lkml/20260618150301.3226517-1-jackzxcui1989@163.com/
>
> v1:
> - Link to v1: https://lore.kernel.org/all/20260618030700.2511668-1-jackzxcui1989@163.com/
> ---
>  .../admin-guide/kernel-parameters.txt         |  5 ++
>  Documentation/filesystems/proc.rst            | 58 +++++++++-----
>  fs/coredump.c                                 | 23 ++++++
>  fs/file.c                                     | 46 +++++++++++
>  fs/proc/base.c                                | 78 +++++++++++++++++++
>  include/linux/mm.h                            |  1 +

No.

>  include/linux/mm_types.h                      |  9 +++

No.

>  include/linux/sched/task.h                    |  1 +
>  include/uapi/asm-generic/fcntl.h              |  4 +
>  kernel/fork.c                                 | 12 +++
>  mm/mmap.c                                     | 21 +++++

No.

>  11 files changed, 238 insertions(+), 20 deletions(-)

This is a completely insane diffstat for a single patch. Ridiculous.

AND YOU HAVEN'T ADDED A SINGLE TEST.

>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f575d4508..bc6d3859f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1024,6 +1024,11 @@ Kernel parameters
>  			/proc/<pid>/coredump_filter.
>  			See also Documentation/filesystems/proc.rst.
>
> +	coredump_pre_exit=
> +			[KNL] Change the default value for
> +			/proc/<pid>/coredump_pre_exit.
> +			See also Documentation/filesystems/proc.rst.
> +
>  	coresight_cpu_debug.enable
>  			[ARM,ARM64]
>  			Format: <bool>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index db6167bef..6a637d31d 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -39,16 +39,17 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
>    3.2	/proc/<pid>/oom_score - Display current oom-killer score
>    3.3	/proc/<pid>/io - Display the IO accounting fields
>    3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
> -  3.5	/proc/<pid>/mountinfo - Information about mounts
> -  3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
> -  3.7   /proc/<pid>/task/<tid>/children - Information about task children
> -  3.8   /proc/<pid>/fdinfo/<fd> - Information about opened file
> -  3.9   /proc/<pid>/map_files - Information about memory mapped files
> -  3.10  /proc/<pid>/timerslack_ns - Task timerslack value
> -  3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> -  3.12	/proc/<pid>/arch_status - Task architecture specific information
> -  3.13  /proc/<pid>/fd - List of symlinks to open files
> -  3.14  /proc/<pid>/ksm_stat - Information about the process's ksm status.
> +  3.5  /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
> +  3.6	/proc/<pid>/mountinfo - Information about mounts
> +  3.7	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
> +  3.8   /proc/<pid>/task/<tid>/children - Information about task children
> +  3.9   /proc/<pid>/fdinfo/<fd> - Information about opened file
> +  3.10   /proc/<pid>/map_files - Information about memory mapped files
> +  3.11  /proc/<pid>/timerslack_ns - Task timerslack value
> +  3.12	/proc/<pid>/patch_state - Livepatch patch operation state
> +  3.13	/proc/<pid>/arch_status - Task architecture specific information
> +  3.14  /proc/<pid>/fd - List of symlinks to open files
> +  3.15  /proc/<pid>/ksm_stat - Information about the process's ksm status.
>
>    4	Configuring procfs
>    4.1	Mount options
> @@ -1961,7 +1962,24 @@ For example::
>    $ echo 0x7 > /proc/self/coredump_filter
>    $ ./some_program
>
> -3.5	/proc/<pid>/mountinfo - Information about mounts
> +3.5 /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
> +---------------------------------------------------------------
> +A coredump typically takes some time to complete. If we happen to hold a write
> +lock with flock just before triggering the coredump, that write lock will not
> +be released during the entire coredump process. As a result, other processes
> +attempting to acquire the same write lock may experience significant delays.
> +Another typical scenario is that shared memory, such as dma-buf, remains
> +occupied and is not released for a long time due to core dumps.
> +
> +/proc/<pid>/coredump_pre_exit allows you to pre-exit some resources before
> +dumping core.
> +
> +The following two types are supported:
> +
> +  - (bit 0) flock files
> +  - (bit 1) file-backed shared memory
> +
> +3.6	/proc/<pid>/mountinfo - Information about mounts
>  --------------------------------------------------------
>
>  This file contains lines of the form::
> @@ -2001,7 +2019,7 @@ For more information on mount propagation see:
>    Documentation/filesystems/sharedsubtree.rst
>
>
> -3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
> +3.7	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
>  --------------------------------------------------------
>  These files provide a method to access a task's comm value. It also allows for
>  a task to set its own or one of its thread siblings comm value. The comm value
> @@ -2010,7 +2028,7 @@ then the kernel's TASK_COMM_LEN (currently 16 chars, including the NUL
>  terminator) will result in a truncated comm value.
>
>
> -3.7	/proc/<pid>/task/<tid>/children - Information about task children
> +3.8	/proc/<pid>/task/<tid>/children - Information about task children
>  -------------------------------------------------------------------------
>  This file provides a fast way to retrieve first level children pids
>  of a task pointed by <pid>/<tid> pair. The format is a space separated
> @@ -2027,7 +2045,7 @@ pids, so one needs to either stop or freeze processes being inspected
>  if precise results are needed.
>
>
> -3.8	/proc/<pid>/fdinfo/<fd> - Information about opened file
> +3.9	/proc/<pid>/fdinfo/<fd> - Information about opened file
>  ---------------------------------------------------------------
>  This file provides information associated with an opened file. The regular
>  files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'.
> @@ -2198,7 +2216,7 @@ VFIO Device files
>  where 'vfio-device-syspath' is the sysfs path corresponding to the VFIO device
>  file.
>
> -3.9	/proc/<pid>/map_files - Information about memory mapped files
> +3.10	/proc/<pid>/map_files - Information about memory mapped files
>  ---------------------------------------------------------------------
>  This directory contains symbolic links which represent memory mapped files
>  the process is maintaining.  Example output::
> @@ -2220,7 +2238,7 @@ time one can open(2) mappings from the listings of two processes and
>  comparing their inode numbers to figure out which anonymous memory areas
>  are actually shared.
>
> -3.10	/proc/<pid>/timerslack_ns - Task timerslack value
> +3.11	/proc/<pid>/timerslack_ns - Task timerslack value
>  ---------------------------------------------------------
>  This file provides the value of the task's timerslack value in nanoseconds.
>  This value specifies an amount of time that normal timers may be deferred
> @@ -2236,7 +2254,7 @@ Valid values are from 0 - ULLONG_MAX
>  An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level
>  permissions on the task specified to change its timerslack_ns value.
>
> -3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> +3.12	/proc/<pid>/patch_state - Livepatch patch operation state
>  -----------------------------------------------------------------
>  When CONFIG_LIVEPATCH is enabled, this file displays the value of the
>  patch state for the task.
> @@ -2253,7 +2271,7 @@ patched.  If the patch is being enabled, then the task has already been
>  patched.  If the patch is being disabled, then the task hasn't been
>  unpatched yet.
>
> -3.12 /proc/<pid>/arch_status - task architecture specific status
> +3.13 /proc/<pid>/arch_status - task architecture specific status
>  -------------------------------------------------------------------
>  When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
>  architecture specific status of the task.
> @@ -2298,7 +2316,7 @@ AVX512_elapsed_ms
>    the task is unlikely an AVX512 user, but depends on the workload and the
>    scheduling scenario, it also could be a false negative mentioned above.
>
> -3.13 /proc/<pid>/fd - List of symlinks to open files
> +3.14 /proc/<pid>/fd - List of symlinks to open files
>  -------------------------------------------------------
>  This directory contains symbolic links which represent open files
>  the process is maintaining.  Example output::
> @@ -2313,7 +2331,7 @@ The number of open files for the process is stored in 'size' member
>  of stat() output for /proc/<pid>/fd for fast access.
>  -------------------------------------------------------
>
> -3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status
> +3.15 /proc/<pid>/ksm_stat - Information about the process's ksm status
>  ----------------------------------------------------------------------
>  When CONFIG_KSM is enabled, each process has this file which displays
>  the information of ksm merging status.
> diff --git a/fs/coredump.c b/fs/coredump.c
> index bb6fdb1f4..e08a8a6c4 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -521,6 +521,27 @@ static int zap_threads(struct task_struct *tsk,
>  	return nr;
>  }
>
> +static void coredump_pre_exit(void)
> +{
> +	struct task_struct *tsk = current;
> +	unsigned long flags = __mm_flags_get_dumpable(tsk->mm);
> +
> +	if (!likely(flags & MMF_DUMP_PRE_EXIT_MASK))
> +		return;
> +
> +	/*
> +	 * Set O_TMPCLOS of file f_flags if file needs to be closed.
> +	 */
> +	if (test_bit(MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED, &flags) &&
> +	    !test_bit(MMF_DUMP_MAPPED_SHARED, &flags))
> +		exit_mmap_mapped_shared(tsk->mm);

What the hell are you doing?

This is not where we unmap VMAs?

This is likely broken in subtle ways.

> +
> +	/*
> +	 * Check O_TMPCLOS of file f_flags to close file and clear it.
> +	 */
> +	exit_files_pre_exit(tsk, mm_flags_test(MMF_DUMP_PRE_EXIT_FLOCK, tsk->mm));
> +}
> +
>  static int coredump_wait(int exit_code, struct core_state *core_state)
>  {
>  	struct task_struct *tsk = current;
> @@ -1100,6 +1121,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
>  		return;
>  	}
>
> +	coredump_pre_exit();
> +
>  	switch (cn->core_type) {
>  	case COREDUMP_FILE:
>  		if (!coredump_file(cn, cprm, binfmt))
> diff --git a/fs/file.c b/fs/file.c
> index 2c81c0b16..a58ffffcc 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -23,6 +23,7 @@
>  #include <linux/file_ref.h>
>  #include <net/sock.h>
>  #include <linux/init_task.h>
> +#include <linux/filelock.h>
>
>  #include "internal.h"
>
> @@ -527,6 +528,51 @@ void exit_files(struct task_struct *tsk)
>  	}
>  }
>
> +void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
> +{
> +	struct files_struct *files = tsk->files;
> +	struct fdtable *fdt;
> +	struct file *file;
> +	unsigned int i, j = 0;
> +
> +	if (!files)
> +		return;
> +
> +	fdt = rcu_dereference_raw(files->fdt);
> +	for (;;) {
> +		unsigned long set;
> +
> +		i = j * BITS_PER_LONG;
> +		if (i >= fdt->max_fds)
> +			break;
> +		set = fdt->open_fds[j++];
> +		while (set) {
> +			if (!(set & 1))
> +				goto next_fd;
> +			file = fdt->fd[i];
> +			if (!file)
> +				goto next_fd;
> +			if (file->f_flags & O_TMPCLOS) {
> +				file->f_flags &= ~O_TMPCLOS;
> +				goto close_fd;
> +			}
> +			if (!checkflock)
> +				goto next_fd;
> +			if (!vfs_inode_has_locks(file_inode(file)))
> +				goto next_fd;
> +
> +close_fd:
> +			fdt->fd[i] = NULL;
> +			filp_close(file, files);
> +			cond_resched();
> +
> +next_fd:
> +			i++;
> +			set >>= 1;
> +		}
> +	}

This code hurts my eyes.

> +}
> +
>  struct files_struct init_files = {
>  	.count		= ATOMIC_INIT(1),
>  	.fdt		= &init_files.fdtab,
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d9acfa89c..99b5f219f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3026,6 +3026,83 @@ static const struct file_operations proc_coredump_filter_operations = {
>  	.write		= proc_coredump_filter_write,
>  	.llseek		= generic_file_llseek,
>  };
> +

No comment, obviously.

> +static ssize_t proc_coredump_pre_exit_read(struct file *file, char __user *buf,
> +					   size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = get_proc_task(file_inode(file));
> +	struct mm_struct *mm;
> +	char buffer[PROC_NUMBUF];
> +	size_t len;
> +	int ret;
> +
> +	if (!task)
> +		return -ESRCH;
> +
> +	ret = 0;
> +	mm = get_task_mm(task);
> +	if (mm) {
> +		unsigned long flags = __mm_flags_get_dumpable(mm);
> +
> +		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
> +			       ((flags & MMF_DUMP_PRE_EXIT_MASK) >>
> +				MMF_DUMP_PRE_EXIT_SHIFT));
> +		mmput(mm);
> +		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
> +	}
> +
> +	put_task_struct(task);
> +
> +	return ret;
> +}
> +

Yeah who needs a comment...

> +static ssize_t proc_coredump_pre_exit_write(struct file *file,
> +					    const char __user *buf,
> +					    size_t count,
> +					    loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned int val;
> +	int ret;
> +	int i;
> +	unsigned long mask;
> +
> +	ret = kstrtouint_from_user(buf, count, 0, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = -ESRCH;
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		goto out_no_task;
> +
> +	mm = get_task_mm(task);
> +	if (!mm)
> +		goto out_no_mm;
> +	ret = 0;
> +
> +	for (i = 0, mask = 1; i < MMF_DUMP_PRE_EXIT_BITS; i++, mask <<= 1) {

What?

> +		if (val & mask)
> +			mm_flags_set(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +		else
> +			mm_flags_clear(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +	}
> +
> +	mmput(mm);
> + out_no_mm:
> +	put_task_struct(task);
> + out_no_task:
> +	if (ret < 0)
> +		return ret;
> +	return count;
> +}
> +
> +static const struct file_operations proc_coredump_pre_exit_operations = {
> +	.read		= proc_coredump_pre_exit_read,
> +	.write		= proc_coredump_pre_exit_write,
> +	.llseek		= generic_file_llseek,
> +};
>  #endif
>
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
> @@ -3391,6 +3468,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  #ifdef CONFIG_ELF_CORE
>  	REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
> +	REG("coredump_pre_exit", S_IRUGO|S_IWUSR, proc_coredump_pre_exit_operations),
>  #endif
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
>  	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index af23453e9..dfd4717c7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4066,6 +4066,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>  extern int __vm_enough_memory(const struct mm_struct *mm, long pages, int cap_sys_admin);
>  extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
>  extern void exit_mmap(struct mm_struct *);
> +extern void exit_mmap_mapped_shared(struct mm_struct *mm);

You don't use extern.

>  bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
>  				 unsigned long addr, bool write);
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c7db35be6..0555aaf50 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1963,6 +1963,15 @@ enum {
>  	(BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
>  	 BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
>
> +/* coredump pre-exit bits */
> +#define MMF_DUMP_PRE_EXIT_FLOCK	11
> +#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12

Err do we have space for this?

You really want to add 2 more bits to mm_struct flags for this insanity?

> +
> +#define MMF_DUMP_PRE_EXIT_SHIFT	(MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
> +#define MMF_DUMP_PRE_EXIT_BITS	2
> +#define MMF_DUMP_PRE_EXIT_MASK	\
> +	(((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)

So are these dumpable bits or not? Why are you not just incrementing
MMF_DUMPABLE_BITS?

> +
>  #ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
>  # define MMF_DUMP_MASK_DEFAULT_ELF	BIT(MMF_DUMP_ELF_HEADERS)
>  #else
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 41ed884cf..b4becbf6c 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -93,6 +93,7 @@ static inline void exit_thread(struct task_struct *tsk)
>  extern __noreturn void do_group_exit(int);
>
>  extern void exit_files(struct task_struct *);
> +extern void exit_files_pre_exit(struct task_struct *, bool);
>  extern void exit_itimers(struct task_struct *);
>
>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285..360604d65 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -95,6 +95,10 @@
>  #define O_NDELAY	O_NONBLOCK
>  #endif
>
> +#ifndef O_TMPCLOS
> +#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
> +#endif
> +
>  #define F_DUPFD		0	/* dup */
>  #define F_GETFD		1	/* get close_on_exec */
>  #define F_SETFD		2	/* set/clear close_on_exec */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a679b2448..84f1ee7f3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>
>  __setup("coredump_filter=", coredump_filter_setup);
>
> +static unsigned long default_dump_pre_exit;
> +
> +static int __init coredump_pre_exit_setup(char *s)
> +{
> +	default_dump_pre_exit =
> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> +		MMF_DUMP_PRE_EXIT_MASK;
> +	return 1;
> +}
> +
> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> +
>  #include <linux/init_task.h>
>
>  static void mm_init_aio(struct mm_struct *mm)
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5754d1c36..b955c47c0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1326,6 +1326,27 @@ void exit_mmap(struct mm_struct *mm)
>  	vm_unacct_memory(nr_accounted);
>  }
>
> +void exit_mmap_mapped_shared(struct mm_struct *mm)
> +{
> +	struct vm_area_struct *vma;
> +	VMA_ITERATOR(vmi, mm, 0);
> +
> +	mmap_write_lock(mm);
> +	lru_add_drain();

Why?

> +
> +	for_each_vma(vmi, vma) {

Literally every single VMA? Including the gate VMA too?

No VMA locks... so that's already broken.

> +		if (vma->vm_flags & VM_HUGETLB)
> +			continue;

That's not how you test for hugetlb.

> +		if (!(vma->vm_flags & VM_SHARED) || !file_inode(vma->vm_file)->i_nlink)

This isn't how we work with flags any more.

> +			continue;
> +		vma->vm_file->f_flags |= O_TMPCLOS;


Not sure directly manipulating file flags like this is valid in any way, shape,
or form.

> +		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);

This is utterly broken, the outer loop will be invalidated by you removing
these, do_munmap() has its own iterator...

And this is just madly inefficient. Why wouldn't you just loop over the VMAs to
alter flags then unmap the whole range?

But this is also introducing a completely separate, duplicative, version of
exit_mmap().

You're not doing any of what that function does. You're just very inefficiently
unmapping everything?
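
To illustrate the general pattern being asked about (not kernel code, and
not a claim that this would make the patch acceptable), here is a minimal
userspace sketch that separates the tagging walk from the structural
removal, so nothing is removed while an iterator is still walking the
container. struct region, tag_shared() and drop_tagged() are hypothetical
names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a mapping; not the kernel's VMA. */
struct region {
	unsigned long start, end;
	bool shared;
	bool doomed;	/* set in pass 1, consumed in pass 2 */
};

/* Pass 1: only tag entries; the array is not modified structurally. */
size_t tag_shared(struct region *r, size_t n)
{
	size_t tagged = 0;

	assert(r || n == 0);
	for (size_t i = 0; i < n; i++) {
		r[i].doomed = r[i].shared;
		tagged += r[i].doomed;
	}
	return tagged;
}

/* Pass 2: drop all tagged entries in one compaction; returns the new count. */
size_t drop_tagged(struct region *r, size_t n)
{
	size_t keep = 0;

	for (size_t i = 0; i < n; i++)
		if (!r[i].doomed)
			r[keep++] = r[i];
	return keep;
}
```

The first pass never changes the number of entries, so its loop index stays
valid throughout; all removal happens afterwards in a single sweep.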

> +		cond_resched();

Of course!

> +	}
> +
> +	mmap_write_unlock(mm);

And VMAs can be mapped again now?

> +}
> +
>  /*
>   * Return true if the calling process may expand its vm space by the passed
>   * number of pages
> --
> 2.34.1
>

I'm not sure if this idea can be made upstreamable in any way. But this patch or
anything that looks like it or fundamentally alters mm is just not acceptable,
sorry.

Lorenzo


* [PATCH v5 04/24] cpumask: Introduce cpu_preferred_mask
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

This patch:
- Declares and defines cpu_preferred_mask.
- Adds get/set helpers for it.

Bits are set and cleared by the scheduler based on the observed steal
time values.

A CPU is set to preferred when it becomes active. Later it may be
marked as non-preferred depending on the steal time values, when the
steal monitor is enabled.

Always maintain the design construct that preferred is a subset of active,
i.e. preferred ⊆ active ⊆ online ⊆ present ⊆ possible

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Make it macro instead (Yury Norov)

 include/linux/cpumask.h | 21 ++++++++++++++++++++-
 kernel/cpu.c            |  6 ++++++
 kernel/sched/core.c     |  5 +++++
 3 files changed, 31 insertions(+), 1 deletion(-)
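
For readers who want to see the subset rule from the changelog in
executable form, here is a minimal userspace sketch. It assumes at most
64 CPUs and uses plain 64-bit integers instead of struct cpumask; the
names struct cpu_masks, mask_subset(), masks_consistent(), cpu_activate()
and cpu_deactivate() are hypothetical and are not part of this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for the kernel cpumasks (max 64 CPUs). */
struct cpu_masks {
	uint64_t possible, present, online, active, preferred;
};

/* True when every bit set in a is also set in b. */
bool mask_subset(uint64_t a, uint64_t b)
{
	return (a & ~b) == 0;
}

/* Check preferred <= active <= online <= present <= possible (as sets). */
bool masks_consistent(const struct cpu_masks *m)
{
	return mask_subset(m->preferred, m->active) &&
	       mask_subset(m->active, m->online) &&
	       mask_subset(m->online, m->present) &&
	       mask_subset(m->present, m->possible);
}

/* Mirrors the activate path: a CPU becomes preferred when it becomes active. */
void cpu_activate(struct cpu_masks *m, unsigned int cpu)
{
	assert(cpu < 64);
	m->active |= UINT64_C(1) << cpu;
	m->preferred |= UINT64_C(1) << cpu;
}

/* Mirrors the deactivate path: clear preferred so the subset rule holds. */
void cpu_deactivate(struct cpu_masks *m, unsigned int cpu)
{
	assert(cpu < 64);
	m->preferred &= ~(UINT64_C(1) << cpu);
	m->active &= ~(UINT64_C(1) << cpu);
}
```

A caller can run masks_consistent() after every state change; it fails as
soon as a CPU is left preferred without being active.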

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 80211900f373..5a643d608ea6 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -120,12 +120,20 @@ extern struct cpumask __cpu_enabled_mask;
 extern struct cpumask __cpu_present_mask;
 extern struct cpumask __cpu_active_mask;
 extern struct cpumask __cpu_dying_mask;
+
+#ifdef CONFIG_PREFERRED_CPU
+extern struct cpumask __cpu_preferred_mask;
+#else
+#define __cpu_preferred_mask __cpu_active_mask
+#endif
+
 #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
 #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
 #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
 #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
 #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
 #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
+#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
 
 extern atomic_t __num_online_cpus;
 extern unsigned int __num_possible_cpus;
@@ -1161,6 +1169,7 @@ void init_cpu_possible(const struct cpumask *src);
 #define set_cpu_present(cpu, present)	assign_cpu((cpu), &__cpu_present_mask, (present))
 #define set_cpu_active(cpu, active)	assign_cpu((cpu), &__cpu_active_mask, (active))
 #define set_cpu_dying(cpu, dying)	assign_cpu((cpu), &__cpu_dying_mask, (dying))
+#define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
 
 void set_cpu_online(unsigned int cpu, bool online);
 void set_cpu_possible(unsigned int cpu, bool possible);
@@ -1256,7 +1265,12 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return cpumask_test_cpu(cpu, cpu_dying_mask);
 }
 
-#else
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+	return cpumask_test_cpu(cpu, cpu_preferred_mask);
+}
+
+#else	/* NR_CPUS <= 1 */
 
 #define num_online_cpus()	1U
 #define num_possible_cpus()	1U
@@ -1294,6 +1308,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return false;
 }
 
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+	return cpu == 0;
+}
+
 #endif /* NR_CPUS > 1 */
 
 #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
diff --git a/kernel/cpu.c b/kernel/cpu.c
index bc4f7a9ba64e..d623a9c5554a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3107,6 +3107,11 @@ EXPORT_SYMBOL(__cpu_dying_mask);
 atomic_t __num_online_cpus __read_mostly;
 EXPORT_SYMBOL(__num_online_cpus);
 
+#ifdef CONFIG_PREFERRED_CPU
+struct cpumask __cpu_preferred_mask __read_mostly;
+EXPORT_SYMBOL(__cpu_preferred_mask);
+#endif
+
 void init_cpu_present(const struct cpumask *src)
 {
 	cpumask_copy(&__cpu_present_mask, src);
@@ -3164,6 +3169,7 @@ void __init boot_cpu_init(void)
 	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
 	set_cpu_online(cpu, true);
 	set_cpu_active(cpu, true);
+	set_cpu_preferred(cpu, true);
 	set_cpu_present(cpu, true);
 	set_cpu_possible(cpu, true);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f4530eb543f..9e16946c9d62 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8685,6 +8685,9 @@ int sched_cpu_activate(unsigned int cpu)
 	 */
 	sched_set_rq_online(rq, cpu);
 
+	/* preferred is subset of active and follows its state */
+	set_cpu_preferred(cpu, true);
+
 	return 0;
 }
 
@@ -8698,6 +8701,8 @@ int sched_cpu_deactivate(unsigned int cpu)
 	if (ret)
 		return ret;
 
+	set_cpu_preferred(cpu, false);
+
 	/*
 	 * Remove CPU from nohz.idle_cpus_mask to prevent participating in
 	 * load balancing when not active
-- 
2.47.3



* [PATCH v5 03/24] kconfig: Provide PREFERRED_CPU option
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

Introduce a new config option named PREFERRED_CPU.

This helps to:
- Avoid code bloat when PREFERRED_CPU=n. In that case preferred
  is the same as active.
- Avoid ifdeffery around PREFERRED_CPU in many files.

Since the paravirtualized use case is the main driving force behind this
feature, make it default for kernels with PARAVIRT=y.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Make it depend on PARAVIRT && SMP instead. (Yury Norov)
- Fix helper indentation (sashiko)

 kernel/Kconfig.preempt | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
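
As an illustration of how a single option can avoid ifdeffery at the call
sites, here is a minimal userspace sketch of the aliasing idea. It assumes
at most 64 CPUs; PREFERRED_CPU, active_mask, preferred_mask and
cpu_is_preferred() are hypothetical stand-ins and not the kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the active mask. */
uint64_t active_mask;

#ifdef PREFERRED_CPU
/* Option on: preferred is tracked in its own mask. */
uint64_t preferred_mask;
#else
/* Option off: "preferred" is just another name for "active". */
#define preferred_mask active_mask
#endif

/* Callers use one spelling regardless of the option, so no #ifdef here. */
int cpu_is_preferred(unsigned int cpu)
{
	assert(cpu < 64);
	return (preferred_mask >> cpu) & 1;
}
```

Built without PREFERRED_CPU defined, every read of preferred_mask is a
read of active_mask, so no extra storage or code is generated.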

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 88c594c6d7fc..b3a543cb44cd 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -192,3 +192,17 @@ config SCHED_CLASS_EXT
 	  For more information:
 	    Documentation/scheduler/sched-ext.rst
 	    https://github.com/sched-ext/scx
+
+config PREFERRED_CPU
+	bool "Dynamic vCPU management based on steal time"
+	depends on PARAVIRT && SMP
+	default y
+	help
+	  This feature helps to reduce the steal time in paravirtualised
+	  environment, there by reducing vCPU preemption. Reducing vCPU
+	  preemption provides improved lock holder preemption and reduces
+	  cost of vCPU preemption in the host.
+
+	  By default preferred CPUs will be same as active CPUs. Depending
+	  on the steal time when steal_monitor driver is enabled,
+	  preferred CPUs could become subset of active CPUs.
-- 
2.47.3



* [PATCH v5 02/24] sched/docs: Document cpu_preferred_mask and Preferred CPU concept
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	kernel test robot
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

Add documentation for the new cpumask, cpu_preferred_mask. This should
help users understand what the mask is and the concept behind it.

Document how to enable it and how it is implemented.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202606180717.yNM0yb41-lkp@intel.com/
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Change text to reflect new driver info.
- Changes suggested by Randy Dunlap.
- Sashiko nitpicks

 Documentation/scheduler/sched-arch.rst | 50 ++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..8fc56edd8e03 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,56 @@ Your cpu_idle routines need to obey the following rules:
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
 
+Preferred CPUs
+==============
+
+In virtualised environments it is possible to overcommit CPU resources.
+i.e sum of virtual CPU(vCPU) of all VMs is greater than number of physical
+CPUs(pCPU). Under such conditions when all or many VMs have high utilization,
+hypervisor won't be able to satisfy the CPU requirement and has to context
+switch within or across VMs. i.e hypervisor needs to preempt one vCPU to run
+another. This is called vCPU preemption. This is more expensive compared to
+task context switch within a vCPU.
+
+In such cases it is better that combined vCPU ask from all VMs is reduced
+by not using some of the vCPUs in each VM. vCPUs where workload can be safely
+scheduled which won't increase any contention for pCPU are called as
+"Preferred CPUs".
+
+Main design construct is preferred CPUs is always subset of active CPUs.
+In most cases preferred CPUs will be same as active CPUs, when there is pCPU
+contention, Preferred CPUs will reduce based on the amount of steal time.
+When the pCPU contention goes away as indicated by steal time, Preferred CPUs
+will become same as active CPUs again. This is done by loading the
+steal_monitor driver available at drivers/virt/steal_monitor.
+
+For scheduling decisions such as wakeup, pushing the task etc, needs this
+CPU state info. This is maintained in cpu_preferred_mask.
+vCPUs which are not in cpu_preferred_mask should be treated as vCPUs which
+should not be used at this moment provided it doesn't break user affinity.
+
+This is achieved by
+1. Selecting a preferred CPU at wakeup.
+2. Push the task away from non-preferred CPU at tick.
+3. Only select preferred CPUs for load balance.
+
+/sys/devices/system/cpu/preferred prints the current cpu_preferred_mask in
+cpulist format.
+
+Notes:
+1. This feature is available under CONFIG_PREFERRED_CPU. This enables
+   steal_monitor driver. On enabling the driver, CPU preferred state
+   can change based on steal time. With CONFIG_PREFERRED_CPU=n,
+   preferred CPUs is same as active CPUs.
+
+2. This feature works for FAIR class only.
+
+3. A task pinned, which can't be moved to preferred CPUs will continue
+   to run based on its affinity. But no load balancing happens.
+
+4. Decision to use/not use is driven by kernel. Hence it shouldn't
+   break user affinities. One of the main reasons why CPU hotplug
+   or Isolated cpuset partitions was not a solution.
 
 Possible arch/ problems
 =======================
-- 
2.47.3



* [PATCH v5 01/24] sched/debug: Remove unused schedstats
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>

nr_migrations_cold, nr_wakeups_passive and nr_wakeups_idle are not
being updated anywhere. So remove them.

These are per-process stats, so updating the schedstats version isn't
necessary.

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/sched.h | 3 ---
 kernel/sched/debug.c  | 3 ---
 2 files changed, 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 35e6183ef615..fc6ecb3869dd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,7 +550,6 @@ struct sched_statistics {
 	s64				exec_max;
 	u64				slice_max;
 
-	u64				nr_migrations_cold;
 	u64				nr_failed_migrations_affine;
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
@@ -563,8 +562,6 @@ struct sched_statistics {
 	u64				nr_wakeups_remote;
 	u64				nr_wakeups_affine;
 	u64				nr_wakeups_affine_attempts;
-	u64				nr_wakeups_passive;
-	u64				nr_wakeups_idle;
 
 #ifdef CONFIG_SCHED_CORE
 	u64				core_forceidle_sum;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 40584b27ea0c..f3a033b34ba0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1359,7 +1359,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(wait_count);
 		PN_SCHEDSTAT(iowait_sum);
 		P_SCHEDSTAT(iowait_count);
-		P_SCHEDSTAT(nr_migrations_cold);
 		P_SCHEDSTAT(nr_failed_migrations_affine);
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
@@ -1371,8 +1370,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_wakeups_remote);
 		P_SCHEDSTAT(nr_wakeups_affine);
 		P_SCHEDSTAT(nr_wakeups_affine_attempts);
-		P_SCHEDSTAT(nr_wakeups_passive);
-		P_SCHEDSTAT(nr_wakeups_idle);
 
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
-- 
2.47.3



* [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc

Very briefly,
- Maintain the set of CPUs which can be used by the workload. It is
  denoted as cpu_preferred_mask.
- Periodically compute the steal time. If the steal time is above the
  high threshold, reduce the preferred CPUs; if it is below the low
  threshold, increase them. This is handled in a new driver called
  steal_monitor.
- If a CPU is marked as non-preferred, push the task running on it if
  possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
  within preferred CPUs.
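
The periodic decision described in the list above can be sketched in
userspace C as follows. This is only an illustration under assumptions:
the percentage-based ratio, the threshold fields low_pct/high_pct and the
names steal_pct(), sm_decide() and enum sm_action are hypothetical, and
the actual computation in the series may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome of one periodic evaluation. */
enum sm_action { SM_KEEP, SM_DECREASE, SM_INCREASE };

/* Hypothetical thresholds, in percent of available CPU time. */
struct sm_thresholds {
	unsigned int low_pct;
	unsigned int high_pct;
};

/* Steal time as a percentage of the CPU time available in one interval. */
unsigned int steal_pct(uint64_t steal_ns, uint64_t interval_ns,
		       unsigned int ncpus)
{
	assert(interval_ns > 0 && ncpus > 0);
	return (unsigned int)(steal_ns * 100 / (interval_ns * ncpus));
}

/* High steal shrinks the preferred set, low steal grows it back. */
enum sm_action sm_decide(unsigned int pct, const struct sm_thresholds *t)
{
	if (pct > t->high_pct)
		return SM_DECREASE;
	if (pct < t->low_pct)
		return SM_INCREASE;
	return SM_KEEP;
}
```

Keeping a gap between the two thresholds gives a band in which the
preferred set is left alone, which avoids flapping between shrinking and
growing on every interval.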

For more details on the idea, problem statement and performance numbers,
please refer to the cover letter of v2[2] and the OSPM talk[1].

*** Please review and provide your feedback!! ***

[1] OSPM talk: https://youtu.be/adxUKFPlOp0
[2] v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@linux.ibm.com/#t
[3] v4: https://lore.kernel.org/all/20260617174139.155540-1-sshegde@linux.ibm.com/#t

Thank you very much for the feedback so far. It has helped the code
evolve towards clear abstraction layers and get simplified
(hopefully). Apologies in advance if I have missed any comment.

base commit:
tip/sched/core at c095741713d1 ("sched/fair: Fix newidle vs core-sched")

v4->v5:
- Move the computation of steal time and decide on preferred CPU state
  to a driver. Drop those changes in core scheduler. (Yury Norov, K Prateek Nayak)
- A new driver called steal_monitor is added in drivers/virt/ (K Prateek Nayak)
  (Please let me know if there is a better place for it. I can move it
  there)
- New driver does periodic computation of steal time and
  increments/decrements the preferred CPUs.
- Debug knobs can be changed via module parameters. (Yury Norov)
- Default implementations are weak symbols. Archs may override them by
  providing strong symbols in their respective arch-specific files.
- Everything is centered around CONFIG_PREFERRED_CPU. No new config
  for new driver. Driver gets added to kernel, but not loaded by
  default.
- Load the driver to enable steal_monitor functionality. Unload to
  remove the same.
- Make CONFIG_PREFERRED_CPU depend on PARAVIRT && SMP (Yury Norov)
- Move set_cpu_preferred to a macro. (Yury Norov)
  With CONFIG_PREFERRED_CPU=n it just acts on the active CPUs mask,
  so it shouldn't alter any functionality.
- Do a simple encoding for has_preferred_cpu_state, which aims to avoid
  repeated cpumask_intersects in is_cpu_allowed.
  (Please let me know if new variable based approach to is_cpu_allowed
  should be done instead).
- Move select_fallback_rq above the rq_lock. (sashiko)
- Few documentation nitpicks (Randy Dunlap, sashiko)
- Avoid any decision for is_cpu_allowed for other classes (sashiko)
- Don't pull load towards non-preferred CPUs in idle and newidle
  balance. (Inferred when seeing sashiko comments)
- Fix leaking of task_struct in push_work_done (K Prateek Nayak)
- Module parameters aren't checked for sane values. One should know
  what one is writing to them. If one writes 0 for interval_ms,
  it gets reset to the default value to avoid a workqueue lockup.
- Added a few design construct related checks in the periodic work
  to ensure any future arch specific implementations follow it.
  1. preferred is subset of active.
  2. preferred cannot be empty.
- Added documentation of steal_monitor in Documentation/driver-api/
  (Let me know if there is a better place for it)
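
Two of the items above, the interval_ms fallback and the design checks in
the periodic work, can be sketched in userspace C like this. It is a
sketch under assumptions: SM_DEFAULT_INTERVAL_MS, sm_rearm_delay_ms() and
sm_state_valid() are hypothetical names, the default of 1000 ms is made
up for the example, and masks are plain integers rather than cpumasks.

```c
#include <assert.h>

/* Hypothetical default; the real value is not given in this mail. */
#define SM_DEFAULT_INTERVAL_MS 1000U

/*
 * Pick the delay used to re-arm the periodic work. A requested value of 0
 * falls back to the default so the work never requeues itself back to back.
 */
unsigned int sm_rearm_delay_ms(unsigned int requested_ms)
{
	unsigned int delay = requested_ms ? requested_ms : SM_DEFAULT_INTERVAL_MS;

	assert(delay != 0);
	return delay;
}

/* The two design checks: preferred is non-empty and a subset of active. */
int sm_state_valid(unsigned long preferred, unsigned long active)
{
	return preferred != 0 && (preferred & ~active) == 0;
}
```

Running the validity check from the periodic work means an arch override
that breaks either rule is caught at the next interval.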

Performance numbers are expected to be the same or slightly better than v2.
With the driver, one major overhead in sched_tick has been removed, i.e.
finding the first housekeeping CPU, which was O(N).

Apologies in advance if any critical information regarding the new
driver is missing, such as policy, documentation or implementation.
Please let me know, and I can make those changes.
I have ensured checkpatch --strict is happy.

Also, I think there should be a MAINTAINERS file entry for the new
driver. I don't see a drivers/virt/* entry.
It could be either a new entry for the driver or a few files in the
SCHEDULER entry. Let me know if and what I should add. I am a bit
cautious about such a change. I am willing to maintain this driver, but
other than that I don't know what else is going to be necessary for it.
I don't have any maintainer experience either :)

PS: Sorry for the long CC list. Please mail me directly if you want to
be dropped from the CC list.

Shrikanth Hegde (24):
  sched/debug: Remove unused schedstats
  sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  kconfig: Provide PREFERRED_CPU option
  cpumask: Introduce cpu_preferred_mask
  sysfs: Add preferred CPU file
  sched/core: allow only preferred CPUs in is_cpu_allowed
  sched/fair: Select preferred CPU at wakeup when possible
  sched/fair: load balance only among preferred CPUs
  sched/fair: Pull the load on preferred CPU
  sched/core: Keep tick on non-preferred CPUs until tasks are out
  sched/core: Push current task from non preferred CPU
  sched/debug: Add migration stats due to non preferred CPUs
  virt/steal_monitor: Add documentation
  virt: Introduce steal monitor driver
  virt/steal_monitor: Restore to active on module disable
  virt/steal_monitor: Define steal_monitor structure
  virt/steal_monitor: Add control knobs for handling steal values
  virt/steal_monitor: Compute work at regular intervals
  virt/steal_monitor: Provide default method to get systemwide steal
    time
  virt/steal_monitor: Provide default method to inc/dec preferred CPUs
  virt/steal_monitor: Provide default method to get num of CPUs for
    steal ratio
  virt/steal_monitor: Act on steal values at regular intervals
  virt/steal_monitor: Add direction control
  virt/steal_monitor: Add design check of preferred subset of active

 .../ABI/testing/sysfs-devices-system-cpu      |  11 ++
 Documentation/driver-api/index.rst            |   1 +
 Documentation/driver-api/steal-monitor.rst    |  93 ++++++++++++
 Documentation/scheduler/sched-arch.rst        |  50 +++++++
 drivers/base/cpu.c                            |   8 ++
 drivers/virt/Makefile                         |   1 +
 drivers/virt/steal_monitor/Makefile           |  14 ++
 drivers/virt/steal_monitor/defaults.c         | 105 ++++++++++++++
 drivers/virt/steal_monitor/sm_core.c          | 124 ++++++++++++++++
 drivers/virt/steal_monitor/sm_core.h          |  32 +++++
 include/linux/cpumask.h                       |  21 ++-
 include/linux/sched.h                         |   5 +-
 kernel/Kconfig.preempt                        |  14 ++
 kernel/cpu.c                                  |   6 +
 kernel/sched/core.c                           | 133 +++++++++++++++++-
 kernel/sched/debug.c                          |   4 +-
 kernel/sched/fair.c                           |  11 +-
 kernel/sched/sched.h                          |  36 +++++
 18 files changed, 659 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/driver-api/steal-monitor.rst
 create mode 100644 drivers/virt/steal_monitor/Makefile
 create mode 100644 drivers/virt/steal_monitor/defaults.c
 create mode 100644 drivers/virt/steal_monitor/sm_core.c
 create mode 100644 drivers/virt/steal_monitor/sm_core.h

-- 
2.47.3



* Re: [PATCH v7 10/42] KVM: guest_memfd: Ensure pages are not in use before conversion
From: David Hildenbrand (Arm) @ 2026-06-25 12:36 UTC (permalink / raw)
  To: Ackerley Tng, Vlastimil Babka (SUSE), aik, andrew.jones,
	binbin.wu, brauner, chao.p.peng, ira.weiny, jmattson, jthoughton,
	michael.roth, oupton, pankaj.gupta, qperret, rick.p.edgecombe,
	rientjes, shivankg, steven.price, tabba, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Jason Gunthorpe
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgHM4a66Jx9++6iioQLpFY-KgPvjY5+bg_X97DfSjpXzRQ@mail.gmail.com>

On 6/19/26 02:17, Ackerley Tng wrote:
> "Vlastimil Babka (SUSE)" <vbabka@kernel.org> writes:
> 
>> On 5/23/26 02:17, Ackerley Tng via B4 Relay wrote:
>>> From: Ackerley Tng <ackerleytng@google.com>
>>>
>>> When converting memory to private in guest_memfd, it is necessary to ensure
>>> that the pages are not currently being accessed by any other part of the
>>> kernel or userspace to avoid any current user writing to guest private
>>> memory.
>>>
>>> guest_memfd checks for unexpected refcounts to determine whether a page is
>>> still in use. The only expected refcounts after unmapping the range
>>> requested for conversion are those that are held by guest_memfd itself.
>>
>> Is it sufficient to only check, and not also freeze the refcount? (i.e.
>> using folio_ref_freeze()), because without freezing, anything (e.g.
>> compaction's pfn-based scanner) could do a speculative folio_try_get() and
>> the checked refcount becomes stale.
>>
> 
> I believe there's no issue here, since the main thing here is to check
> for long-term pins on the folio. Perhaps David can help me verify. :)

I think I raised this in the past as well: ideally, we'd freeze the
refcount; then there is no need to worry about any concurrent access.

However, we could really only get additional page references through PFN walkers
(or speculative references), not through page tables or GUP pins, which is what
we care about.

So if we can tolerate a speculative bump+release of a folio reference, likely
we're good.
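
As a rough userspace illustration of the difference between checking
and freezing (a C11-atomics sketch, not guest_memfd code; folio_sketch
and the ref_* helpers are made-up names that only model the behaviour
of a refcount check, folio_ref_freeze() and a speculative
folio_try_get()):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio; only the refcount is modelled. */
struct folio_sketch {
	atomic_int refcount;
};

/* Check only: a snapshot that can go stale right after it returns. */
bool ref_check(struct folio_sketch *f, int expected)
{
	return atomic_load(&f->refcount) == expected;
}

/* Freeze: atomically swap expected -> 0, or fail if the count differs. */
bool ref_freeze(struct folio_sketch *f, int expected)
{
	int old = expected;

	return atomic_compare_exchange_strong(&f->refcount, &old, 0);
}

/* Unfreeze: restore the count once the critical work is done. */
void ref_unfreeze(struct folio_sketch *f, int count)
{
	atomic_store(&f->refcount, count);
}

/* Speculative get: take a reference unless the count is zero (frozen). */
bool ref_try_get(struct folio_sketch *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference taken with ref_try_get(). */
void ref_put(struct folio_sketch *f)
{
	atomic_fetch_sub(&f->refcount, 1);
}
```

With a plain check, a ref_try_get() can still succeed afterwards and
bump the count temporarily; after a successful ref_freeze(), it fails
until ref_unfreeze().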

-- 
Cheers,

David


* Re: [PATCH] Documentation: landlock: Document fs.resolve_unix audit blocker
From: Günther Noack @ 2026-06-25 12:31 UTC (permalink / raw)
  To: Doehyun Baek
  Cc: Mickaël Salaün, Jonathan Corbet, Shuah Khan,
	Sebastian Andrzej Siewior, linux-security-module, linux-doc,
	linux-kernel
In-Reply-To: <20260625092819.1870049-1-doehyunbaek@gmail.com>

On Thu, Jun 25, 2026 at 09:28:19AM +0000, Doehyun Baek wrote:
> The Landlock audit code can emit fs.resolve_unix as a filesystem blocker
> for pathname UNIX socket resolution denials, but the admin guide's blockers
> list did not mention it.
> 
> Add the missing blocker name and ABI version to keep the audit
> documentation in sync with the emitted records.
> 
> Fixes: ae97330d1bd6 ("landlock: Control pathname UNIX domain socket resolution by path")
> Signed-off-by: Doehyun Baek <doehyunbaek@gmail.com>
> ---
>  Documentation/admin-guide/LSM/landlock.rst | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/Documentation/admin-guide/LSM/landlock.rst b/Documentation/admin-guide/LSM/landlock.rst
> index 314052bbeb0a..8eb85c9381ff 100644
> --- a/Documentation/admin-guide/LSM/landlock.rst
> +++ b/Documentation/admin-guide/LSM/landlock.rst
> @@ -52,6 +52,7 @@ AUDIT_LANDLOCK_ACCESS
>          - fs.refer (ABI 2+)
>          - fs.truncate (ABI 3+)
>          - fs.ioctl_dev (ABI 5+)
> +        - fs.resolve_unix (ABI 9+)
>  
>      **net.*** - Network access rights (ABI 4+):
>          - net.bind_tcp - TCP port binding was denied
> 
> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
> -- 
> 2.43.0
> 

Thanks, good catch!

Reviewed-by: Günther Noack <gnoack@google.com>

