Linux virtualization list


* [PATCH v9 11/11] virt/steal_governor: Enable the driver
From: Shrikanth Hegde @ 2026-07-24 14:07 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	jgross, virtualization
In-Reply-To: <20260724140732.2683314-1-sshegde@linux.ibm.com>

Provide a config option for users to enable the driver.
Since the feature applies only to paravirtualized guests and makes
sense only with SMP, enforce those dependencies. The driver selects
CONFIG_PREFERRED_CPU=y so that the scheduler mechanisms work.

It is recommended to build the driver as a module (=m) rather than
built-in (=y) for the following reasons:
- Module parameters are read-only after init. Building as a module
  allows changing them by unloading and reloading the module.
- The driver can be disabled if it is built as a module.
- As a module, an admin user has to enable it explicitly. Since this
  feature works best when all VMs cooperate, the admin will likely
  enable it in all VMs.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 drivers/virt/Kconfig  | 18 ++++++++++++++++++
 drivers/virt/Makefile |  1 +
 2 files changed, 19 insertions(+)

diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
index 52eb7e4ba71f..5c26a59792c9 100644
--- a/drivers/virt/Kconfig
+++ b/drivers/virt/Kconfig
@@ -41,6 +41,24 @@ config FSL_HV_MANAGER
           4) A kernel interface for receiving callbacks when a managed
 	     partition shuts down.
 
+config STEAL_GOVERNOR
+	tristate "Dynamic vCPU management based on steal time"
+	depends on PARAVIRT && SMP
+	select PREFERRED_CPU
+	default m
+	help
+	  This driver helps to reduce steal time in paravirtualized
+	  environments, thereby reducing vCPU preemption. Less vCPU
+	  preemption mitigates lock holder preemption and reduces the
+	  cost of vCPU preemption in the host.
+
+	  By default, preferred CPUs are the same as active CPUs. When
+	  the steal_governor driver is enabled, preferred CPUs may become
+	  a subset of active CPUs, depending on the steal time.
+
+	  It is recommended to build it as a module and load the module
+	  to enable it.
+
 source "drivers/virt/vboxguest/Kconfig"
 
 source "drivers/virt/nitro_enclaves/Kconfig"
diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
index f29901bd7820..05fb075ef5b8 100644
--- a/drivers/virt/Makefile
+++ b/drivers/virt/Makefile
@@ -5,6 +5,7 @@
 
 obj-$(CONFIG_FSL_HV_MANAGER)	+= fsl_hypervisor.o
 obj-$(CONFIG_VMGENID)		+= vmgenid.o
+obj-$(CONFIG_STEAL_GOVERNOR)	+= steal_governor.o
 obj-y				+= vboxguest/
 
 obj-$(CONFIG_NITRO_ENCLAVES)	+= nitro_enclaves/
-- 
2.47.3



* [PATCH v9 10/11] virt/steal_governor: Implement steal_governor policy loop
From: Shrikanth Hegde @ 2026-07-24 14:07 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	jgross, virtualization
In-Reply-To: <20260724140732.2683314-1-sshegde@linux.ibm.com>

Schedule work at regular intervals to implement the steal_governor
policy: monitor the steal time and act on the state of preferred
CPUs. The interval is determined by the interval_ms parameter.
schedule_delayed_work() is used since interval_ms is in the order of
milliseconds and the work need not happen instantly.

The periodic policy loop essentially does the following:
- Get the total/delta steal values and the number of CPUs needed to
  compute steal_ratio.
- Calculate steal_ratio as below.

       steal_ratio = (delta_steal * 100*100)/(delta_ns * num_cpus())

  It is scaled by 100*100 to capture fractional steal percentages,
  i.e. 10 means 0.1% steal time. To avoid possible overflow, the
  denominator is divided by 10,000 instead of multiplying the
  numerator.
- If steal_ratio is higher than the high threshold, call the method
  to reduce the preferred CPUs.
- If steal_ratio is lower than or equal to the low threshold, call
  the method to increase the preferred CPUs.
- If steal_ratio is in between, no action is taken.
- Save the values for the next delta calculation.
- Ensure the design checks are met:
  1. At least one core/CPU must be in the preferred mask.
  2. Preferred CPUs must be a subset of active CPUs.
  If they are not met, restore preferred CPUs to active and stop
  requeuing the work. The driver is effectively non-functional after
  that.
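
The arithmetic in the loop above can be tried outside the kernel. What
follows is only a userspace sketch, not the driver code: the names
sg_steal_ratio(), sg_decide() and enum sg_action are made up for
illustration, and it assumes the steal delta is given in nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Possible outcomes of one policy-loop iteration (hypothetical names). */
enum sg_action { SG_DECREASE, SG_INCREASE, SG_NONE };

/*
 * Userspace model of the steal_ratio computation. The result is in
 * units of 0.01%, so 500 means 5% steal. The denominator is divided by
 * 10,000 (and clamped to at least 1) instead of multiplying the
 * numerator, which keeps the intermediate values small.
 */
static uint64_t sg_steal_ratio(uint64_t delta_steal_ns, uint64_t delta_ns,
			       unsigned int num_cpus)
{
	uint64_t denom;

	assert(num_cpus > 0);
	denom = (delta_ns * num_cpus) / 10000;
	if (denom == 0)
		denom = 1;

	return delta_steal_ns / denom;
}

/* Map a steal_ratio to the action described in the list above. */
static enum sg_action sg_decide(uint64_t ratio, unsigned int low,
				unsigned int high)
{
	if (ratio > high)
		return SG_DECREASE;
	if (ratio <= low)
		return SG_INCREASE;

	return SG_NONE;
}
```

With a 1 second interval and 8 CPUs, 0.4 seconds of accumulated steal
gives a ratio of 500 (5%). With the default thresholds of 200 and 500
that is "in between", so no action is taken.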


To support the above loop, a few helper functions have been added.

1. get_system_steal_time()
- The steal governor takes a global view of steal time instead of an
  individual vCPU view. Collect the steal values across the vCPUs of
  interest.
- Sum up steal time values across possible CPUs. This keeps the sum a
  monotonically increasing number and avoids spikes due to CPU
  hotplug.

2. decrease_preferred_cpus()
- Called when there is high steal time. It needs to decide which CPUs to
  mark as non-preferred and set that state.
- Get the first housekeeping CPU and its core mask. Mark it as the
  protected core. This keeps at least one core as preferred. The
  kernel ensures at least one housekeeping CPU stays active.
- Find the last CPU outside of this protected core mask (target CPU).
- Get the siblings of that target CPU and mark them as non-preferred.

3. increase_preferred_cpus()
- Called when there is low steal time. It needs to decide which CPUs to
  mark as preferred and set that state.
- Get the first active non-preferred CPU. This is likely part of the
  most recent set of CPUs marked as non-preferred.
- Get the siblings of that CPU and mark them as preferred.

4. get_system_cpus()
- Reports how many CPUs need to be considered for steal_ratio
  calculations.
- Returns the number of possible CPUs, as get_system_steal_time()
  computes steal values across possible CPUs.
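
The core selection done by helpers 2 and 3 can be modelled with plain
bitmasks. This is a userspace sketch under stated assumptions, not the
driver code: at most 64 CPUs, SMT siblings numbered contiguously, and
a fixed SG_THREADS_PER_CORE. All sg_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed topology: contiguous sibling threads, at most 64 CPUs. */
#define SG_THREADS_PER_CORE 2

/* Mask of all SMT siblings of @cpu under the assumed topology. */
static uint64_t sg_sibling_mask(int cpu)
{
	int first = cpu - (cpu % SG_THREADS_PER_CORE);

	return ((UINT64_C(1) << SG_THREADS_PER_CORE) - 1) << first;
}

/* Lowest set bit in @mask, or -1 if the mask is empty. */
static int sg_first_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/* Highest set bit in @mask, or -1 if the mask is empty. */
static int sg_last_cpu(uint64_t mask)
{
	for (int cpu = 63; cpu >= 0; cpu--)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/* Drop the last preferred core, never the first housekeeping core. */
static uint64_t sg_decrease(uint64_t preferred, uint64_t housekeeping)
{
	int hk = sg_first_cpu(preferred & housekeeping);
	int target;

	if (hk < 0)
		return preferred;

	target = sg_last_cpu(preferred & ~sg_sibling_mask(hk));
	if (target < 0)
		return preferred;	/* only the protected core remains */

	return preferred & ~sg_sibling_mask(target);
}

/* Add back the first active core that is currently non-preferred. */
static uint64_t sg_increase(uint64_t preferred, uint64_t active)
{
	int first = sg_first_cpu(active & ~preferred);

	if (first < 0)
		return preferred;	/* everything is already preferred */

	return preferred | (sg_sibling_mask(first) & active);
}
```

Starting from 8 active CPUs (mask 0xFF), repeated decreases give 0x3F,
0x0F and 0x03, after which the protected core stays. The first
increase from 0x03 restores the most recently removed core, giving
0x0F.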

Notes:
1. Using cores instead of individual CPUs performs better, as SMT is
   quite common and some hypervisors such as PowerVM do core
   scheduling.

2. This doesn't do any NUMA splicing, to keep the code simple and the
   overhead minimal. The current code expects CPUs to be spread
   uniformly across NUMA nodes.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 drivers/virt/steal_governor.c | 155 ++++++++++++++++++++++++++++++++++
 1 file changed, 155 insertions(+)

diff --git a/drivers/virt/steal_governor.c b/drivers/virt/steal_governor.c
index fda86777d6f0..991821a82485 100644
--- a/drivers/virt/steal_governor.c
+++ b/drivers/virt/steal_governor.c
@@ -13,13 +13,18 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/cleanup.h>
 #include <linux/cpuhplock.h>
 #include <linux/cpumask.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
+#include <linux/kernel_stat.h>
 #include <linux/kconfig.h>
 #include <linux/ktime.h>
+#include <linux/math64.h>
 #include <linux/module.h>
+#include <linux/sched/isolation.h>
+#include <linux/topology.h>
 #include <linux/types.h>
 #include <linux/workqueue.h>
 
@@ -108,6 +113,151 @@ module_param_named(low_threshold, sg_ctx.low_threshold, uint, 0444);
 MODULE_PARM_DESC(low_threshold,
 		 "Low steal threshold. default: 200 i.e 2%. Must be < high_threshold");
 
+/* Return collective steal time across system. */
+static u64 get_system_steal_time(void)
+{
+	int cpu;
+	u64 total_steal = 0;
+
+	for_each_possible_cpu(cpu)
+		total_steal += kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
+
+	return total_steal;
+}
+
+/* Return number of CPUs to consider steal ratio. */
+static unsigned int get_system_cpus(void)
+{
+	return num_possible_cpus();
+}
+
+/*
+ *
+ * Called when the steal governor detects high physical CPU contention.
+ * It finds the last active core in the preferred mask and marks those
+ * CPUs as non-preferred.
+ *
+ * Must ensure:
+ * - at least one core is always kept as preferred
+ * - preferred is always subset of active.
+ */
+static void decrease_preferred_cpus(void)
+{
+	const struct cpumask *first_hk_core;
+	int target_cpu = nr_cpu_ids;
+	int cpu;
+
+	guard(cpus_read_lock)();
+	cpu = cpumask_first_and(housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
+				cpu_preferred_mask);
+	if (cpu >= nr_cpu_ids)
+		return;
+
+	/* Always leave first housekeeping core as preferred. */
+	first_hk_core = topology_sibling_cpumask(cpu);
+	cpu = cpumask_last(cpu_preferred_mask);
+	if (cpu >= nr_cpu_ids)
+		return;
+
+	/* Find the last CPU which doesn't belong to that first hk_core. */
+	if (!cpumask_test_cpu(cpu, first_hk_core)) {
+		target_cpu = cpu;
+	} else {
+		for_each_cpu_andnot(cpu, cpu_preferred_mask, first_hk_core)
+			target_cpu = cpu;
+	}
+
+	/* Only the first housekeeping core remains */
+	if (target_cpu >= nr_cpu_ids)
+		return;
+
+	for_each_cpu_and(cpu, topology_sibling_cpumask(target_cpu),
+			 cpu_preferred_mask)
+		set_cpu_preferred(cpu, false);
+}
+
+/*
+ * Called when the steal governor detects no/low physical CPU contention.
+ * It finds the first active core outside the preferred mask and marks
+ * those CPUs as preferred.
+ *
+ * Must ensure preferred is subset of active.
+ */
+static void increase_preferred_cpus(void)
+{
+	int first_cpu, cpu;
+
+	guard(cpus_read_lock)();
+	first_cpu = cpumask_first_andnot(cpu_active_mask, cpu_preferred_mask);
+
+	/* All CPUs are preferred. Nothing to increase further */
+	if (first_cpu >= nr_cpu_ids)
+		return;
+
+	for_each_cpu_and(cpu, topology_sibling_cpumask(first_cpu),
+			 cpu_active_mask)
+		set_cpu_preferred(cpu, true);
+}
+
+static bool preferred_cpus_valid(void)
+{
+	if (cpumask_empty(cpu_preferred_mask)) {
+		pr_err("empty preferred mask. stopping\n");
+		return false;
+	}
+
+	if (!cpumask_subset(cpu_preferred_mask, cpu_active_mask)) {
+		pr_err("preferred: %*pbl is not subset of active: %*pbl, stopping\n",
+		       cpumask_pr_args(cpu_preferred_mask),
+		       cpumask_pr_args(cpu_active_mask));
+		return false;
+	}
+
+	return true;
+}
+
+static void compute_preferred_cpus_work(struct work_struct *work)
+{
+	u64 curr_steal, delta_steal, delta_ns, steal_ratio;
+	ktime_t now;
+
+	now = ktime_get();
+	delta_ns = ktime_to_ns(ktime_sub(now, sg_ctx.time));
+
+	if (unlikely(delta_ns < NSEC_PER_MSEC)) {
+		pr_err_ratelimited("work scheduled too soon delta_ns: %llu\n", delta_ns);
+		goto requeue_work;
+	}
+
+	curr_steal = get_system_steal_time();
+	delta_steal = curr_steal > sg_ctx.steal ? curr_steal - sg_ctx.steal : 0;
+	sg_ctx.steal = curr_steal;
+	sg_ctx.time = now;
+
+	/*
+	 * steal_ratio = (delta_steal * 100*100)/(delta_ns * num_cpus())
+	 * To avoid possible overflow, divide the denominator early.
+	 * Note minimum interval is 100ms.
+	 */
+	delta_ns = max_t(u64, div_u64(delta_ns * get_system_cpus(), 10000), 1);
+	steal_ratio = div64_u64(delta_steal, delta_ns);
+
+	if (steal_ratio > sg_ctx.high_threshold)
+		decrease_preferred_cpus();
+	else if (steal_ratio <= sg_ctx.low_threshold)
+		increase_preferred_cpus();
+	else
+		goto requeue_work;
+
+	if (!preferred_cpus_valid()) {
+		restore_preferred_to_active();
+		return;
+	}
+
+requeue_work:
+	schedule_delayed_work(&sg_ctx.work, sg_ctx.delay);
+}
+
 static int __init steal_governor_init(void)
 {
 	if (sg_ctx.low_threshold >= sg_ctx.high_threshold) {
@@ -117,6 +267,10 @@ static int __init steal_governor_init(void)
 	}
 
 	sg_ctx.delay = msecs_to_jiffies(sg_ctx.interval_ms);
+	INIT_DELAYED_WORK(&sg_ctx.work, compute_preferred_cpus_work);
+	sg_ctx.steal = get_system_steal_time();
+	sg_ctx.time = ktime_get();
+	schedule_delayed_work(&sg_ctx.work, sg_ctx.delay);
 	pr_info("enabled. interval: %ums, high_threshold: %u, low_threshold: %u\n",
 		sg_ctx.interval_ms, sg_ctx.high_threshold, sg_ctx.low_threshold);
 
@@ -125,6 +279,7 @@ static int __init steal_governor_init(void)
 
 static void __exit steal_governor_exit(void)
 {
+	disable_delayed_work_sync(&sg_ctx.work);
 	restore_preferred_to_active();
 	pr_info("disabled\n");
 }
-- 
2.47.3



* [PATCH v9 09/11] virt/steal_governor: Add control knobs for handling steal values
From: Shrikanth Hegde @ 2026-07-24 14:07 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	jgross, virtualization
In-Reply-To: <20260724140732.2683314-1-sshegde@linux.ibm.com>

These are the knobs to control the steal_governor.

interval_ms:
How often the steal governor checks for steal time.
(Default: 1000, i.e. 1 second)
This controls how fast the steal governor driver reacts to changes in
the contention for physical CPUs.
Can be set between 100 and 100000, i.e. 100ms to 100 seconds.
100ms is kept as the minimum to ensure a few meaningful steal values
accumulate even with HZ=100.

low_threshold:
Lower threshold value in percentage * 100.
(Default: 200, i.e. 2% steal is considered the low threshold)
This determines what values should be considered as nil/no steal values.
When the steal governor sees steal time below or equal to this value,
it will increase the preferred CPUs by 1 core. Setting the value to
zero might cause oscillations.

high_threshold:
Higher threshold value in percentage * 100.
(Default: 500, i.e. 5% steal is considered the high threshold)
This determines what values should be considered as high steal values.
When the steal governor sees steal time higher than this value, it
will reduce the preferred CPUs by 1 core.

module_param_cb methods are used to validate interval_ms and
high_threshold. This helps to ensure one configures sane values.
Since low and high are dependent on each other, that check is done at
module init.

Parameter values can't be changed at runtime. One has to unload the
module and load it again with the new values. Hence it is recommended
to build it as a module.

This description is also available in
Documentation/driver-api/steal-governor.rst
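
The range checks described above can be modelled in userspace as
follows. This is only a sketch: the sg_* function names are
hypothetical, and strtoul() stands in for the kernel's kstrtouint().

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse an unsigned int; returns 0 on success, -EINVAL on bad input. */
static int sg_parse_uint(const char *val, unsigned int *out)
{
	char *end;
	unsigned long v;

	if (!val || *val == '\0' || *val == '-')
		return -EINVAL;

	errno = 0;
	v = strtoul(val, &end, 0);
	if (errno || *end != '\0' || v > UINT_MAX)
		return -EINVAL;

	*out = (unsigned int)v;
	return 0;
}

/* interval_ms must be within 100..100000 (100ms to 100 seconds). */
static int sg_check_interval_ms(const char *val, unsigned int *out)
{
	unsigned int v;
	int ret = sg_parse_uint(val, &v);

	if (ret)
		return ret;
	if (v < 100 || v > 100000)
		return -EINVAL;

	*out = v;
	return 0;
}

/* high_threshold is in units of 0.01% and must stay below 100%. */
static int sg_check_high_threshold(const char *val, unsigned int *out)
{
	unsigned int v;
	int ret = sg_parse_uint(val, &v);

	if (ret)
		return ret;
	if (v >= 100 * 100)
		return -EINVAL;

	*out = v;
	return 0;
}

/* The cross-parameter check, done once both values are known. */
static int sg_check_thresholds(unsigned int low, unsigned int high)
{
	return low >= high ? -EINVAL : 0;
}
```

Keeping the low/high comparison separate mirrors the split described
above: each parameter is validated on its own when it is set, while
the relation between the two can only be checked once both are final.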

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 drivers/virt/steal_governor.c | 73 ++++++++++++++++++++++++++++++++++-
 1 file changed, 71 insertions(+), 2 deletions(-)

diff --git a/drivers/virt/steal_governor.c b/drivers/virt/steal_governor.c
index d427282966d2..fda86777d6f0 100644
--- a/drivers/virt/steal_governor.c
+++ b/drivers/virt/steal_governor.c
@@ -37,7 +37,11 @@ struct steal_governor {
 	struct delayed_work	work;
 };
 
-static struct steal_governor sg_ctx;
+static struct steal_governor sg_ctx = {
+	.interval_ms	=	1000,	/* 1 second */
+	.high_threshold =	500,	/* 5% */
+	.low_threshold	=	200,	/* 2% */
+};
 
 static void restore_preferred_to_active(void)
 {
@@ -48,9 +52,74 @@ static void restore_preferred_to_active(void)
 		set_cpu_preferred(cpu, true);
 }
 
+static int param_set_interval_ms(const char *val, const struct kernel_param *kp)
+{
+	unsigned int interval;
+	int ret;
+
+	ret = kstrtouint(val, 0, &interval);
+	if (ret)
+		return ret;
+
+	if (interval < 100 || interval > 100000) {
+		pr_err("interval_ms must be between 100 and 100000\n");
+		return -EINVAL;
+	}
+
+	return param_set_uint(val, kp);
+}
+
+static const struct kernel_param_ops interval_ms_ops = {
+	.set = param_set_interval_ms,
+	.get = param_get_uint,
+};
+
+module_param_cb(interval_ms, &interval_ms_ops, &sg_ctx.interval_ms, 0444);
+MODULE_PARM_DESC(interval_ms,
+		 "Sampling frequency in milliseconds. default: 1000");
+
+static int param_set_high_threshold(const char *val, const struct kernel_param *kp)
+{
+	unsigned int threshold;
+	int ret;
+
+	ret = kstrtouint(val, 0, &threshold);
+	if (ret)
+		return ret;
+
+	if (threshold >= 100 * 100) {
+		pr_err("high_threshold (%u) can't be more than 99.99%%\n", threshold);
+		return -EINVAL;
+	}
+
+	return param_set_uint(val, kp);
+}
+
+static const struct kernel_param_ops high_threshold_ops = {
+	.set = param_set_high_threshold,
+	.get = param_get_uint,
+};
+
+module_param_cb(high_threshold, &high_threshold_ops, &sg_ctx.high_threshold, 0444);
+MODULE_PARM_DESC(high_threshold,
+		 "High steal threshold. default: 500 i.e 5%. Must be > low_threshold");
+
+module_param_named(low_threshold, sg_ctx.low_threshold, uint, 0444);
+MODULE_PARM_DESC(low_threshold,
+		 "Low steal threshold. default: 200 i.e 2%. Must be < high_threshold");
+
 static int __init steal_governor_init(void)
 {
-	pr_info("enabled\n");
+	if (sg_ctx.low_threshold >= sg_ctx.high_threshold) {
+		pr_err("low_threshold (%u) must be less than high_threshold (%u)\n",
+		       sg_ctx.low_threshold, sg_ctx.high_threshold);
+		return -EINVAL;
+	}
+
+	sg_ctx.delay = msecs_to_jiffies(sg_ctx.interval_ms);
+	pr_info("enabled. interval: %ums, high_threshold: %u, low_threshold: %u\n",
+		sg_ctx.interval_ms, sg_ctx.high_threshold, sg_ctx.low_threshold);
+
 	return 0;
 }
 
-- 
2.47.3



* [PATCH v9 08/11] virt: Introduce steal governor driver
From: Shrikanth Hegde @ 2026-07-24 14:07 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	jgross, virtualization
In-Reply-To: <20260724140732.2683314-1-sshegde@linux.ibm.com>

Introduce a new driver in virt named steal_governor. This driver
computes the steal time and drives the policy decisions on preferred
CPU state.

More details can be found in
Documentation/driver-api/steal-governor.rst

A new kconfig option, STEAL_GOVERNOR, is introduced in a subsequent
patch. It selects PREFERRED_CPU, so the configs are driven by user
preference. When the driver is disabled, preferred CPUs are the same
as active CPUs.

The file layout of the driver is kept simple. Everything is in
drivers/virt/steal_governor.c and the config is part of
drivers/virt/Kconfig.

The main structure of the steal governor has:
- work, delay: deferred periodic work function
- steal, time: used to calculate the deltas in the periodic work
- interval_ms, high_threshold, low_threshold: control knobs of
  steal_governor

While at it, add a MAINTAINERS entry for this new driver.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/driver-api/index.rst          |   1 +
 Documentation/driver-api/steal-governor.rst | 118 ++++++++++++++++++++
 MAINTAINERS                                 |   9 ++
 drivers/virt/steal_governor.c               |  68 +++++++++++
 4 files changed, 196 insertions(+)
 create mode 100644 Documentation/driver-api/steal-governor.rst
 create mode 100644 drivers/virt/steal_governor.c

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index eaf7161ff957..0a973b59cba3 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -138,6 +138,7 @@ Subsystem-specific APIs
    sm501
    soundwire/index
    spi
+   steal-governor
    surface_aggregator/index
    switchtec
    sync_file
diff --git a/Documentation/driver-api/steal-governor.rst b/Documentation/driver-api/steal-governor.rst
new file mode 100644
index 000000000000..a343ed6c7ff6
--- /dev/null
+++ b/Documentation/driver-api/steal-governor.rst
@@ -0,0 +1,118 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Steal Governor
+==============
+
+:Author: Shrikanth Hegde <sshegde@linux.ibm.com>
+
+Introduction
+============
+
+Steal governor is a driver aimed at solving the Noisy Neighbour problem
+in paravirtualized environments. The performance of workload
+running in one VM gets affected significantly due to other VMs and
+combined they make slower forward progress.
+
+When there is overcommit of CPU resources, i.e. sum of virtual CPUs (vCPUs)
+of all VMs is greater than number of physical CPUs (pCPUs) and
+when all or many VMs have high utilization, hypervisor won't be able
+to satisfy the CPU requirement and has to context switch within or
+across VMs. I.e. the hypervisor needs to preempt one vCPU to run
+another. This is called vCPU preemption.
+This is more expensive compared to task context switch within a vCPU.
+
+In such cases it is better to reduce the combined vCPU demand from all
+VMs by not using some of the vCPUs. vCPUs on which the workload can be
+safely scheduled without increasing contention for pCPUs are called
+"Preferred CPUs".
+
+See more on "Preferred CPUs" in Documentation/scheduler/sched-arch.rst.
+Driver code is available at drivers/virt/steal_governor.c
+
+This driver makes CONFIG_PREFERRED_CPU=y which enables the scheduler core
+infrastructure to move tasks to Preferred CPUs where possible.
+
+Core idea
+=========
+
+Steal time is an indication, available today in the guest, of contention
+for the underlying physical CPUs. Use it as a hint in the guest to fold the
+workload to a reduced set of vCPUs. When there is contention, steal time
+will show up in all the guests. When each guest honors the hint and folds
+the workload to a smaller set of vCPUs (Preferred CPUs), it reduces the
+contention and thereby reduces vCPU preemption.
+This is achieved without any cross-guest communication.
+
+Steal governor driver effectively does:
+
+1. Periodically computes steal time across the system.
+
+2. If steal time is greater than high threshold, reduce the number of
+   preferred CPUs by 1 core. Ensure at least one core is left always.
+
+3. If steal time is lower or equal to low threshold, increase the
+   number of preferred CPUs by 1 core. If preferred is same as active,
+   nothing to be done.
+
+4. Ensure preferred CPUs is always subset of active CPUs.
+   On feature disable it is same as active CPUs.
+
+This feature works best when all the VMs enable it, as it is a
+co-operative scheme. If a specific VM doesn't enable this feature, it
+may end up with more CPUs than others, but this should still lead to
+better performance from a system-wide view.
+Those who enable this driver must ensure it is enabled in all VMs.
+
+Module Parameters
+=================
+
+interval_ms
+-----------
+
+How often steal governor checks for steal time.
+Default: 1000, i.e. 1 second. The value must be between 100ms and 100sec.
+
+This controls how fast the steal governor driver reacts to changes in
+the contention for physical CPUs. Since it does a fair amount of
+work, setting it too low may add overhead. Setting it too
+high might render it ineffective.
+
+low_threshold
+-------------
+
+Lower threshold value in percentage * 100.
+Default: 200, i.e. 2% steal is considered the low threshold.
+Must be lower than high_threshold.
+
+This determines what values should be considered as nil/no steal values.
+When steal governor sees steal time below or equal to this value, it
+will increase the preferred CPUs by 1 core. Setting the value to zero
+might cause oscillations.
+
+high_threshold
+--------------
+
+Higher threshold value in percentage * 100.
+Default: 500, i.e. 5% steal is considered the high threshold.
+Must be higher than low_threshold and less than 10000.
+
+This determines what values should be considered as high steal values.
+When steal governor sees steal time higher than this value, it will
+reduce the preferred CPUs by 1 core.
+
+Notes
+=====
+
+Selecting this driver makes CONFIG_PREFERRED_CPU=y. That makes configs
+driven by user preference.
+
+It is recommended to build CONFIG_STEAL_GOVERNOR=m for the reasons below:
+
+1. Doing periodic work has additional overhead. Enabling this driver
+   in systems where steal time cannot happen is of no use: it only adds
+   overhead with no benefit in such systems.
+
+2. This works well when all VMs work in a co-operative manner. When an
+   administrative user enables it in one VM, they will likely enable
+   it in all VMs.
diff --git a/MAINTAINERS b/MAINTAINERS
index 15011f5752a9..40d46ba48ecd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25914,6 +25914,15 @@ F:	rust/helpers/jump_label.c
 F:	rust/kernel/generated_arch_static_branch_asm.rs.S
 F:	rust/kernel/jump_label.rs
 
+STEAL GOVERNOR DRIVER
+M:	Shrikanth Hegde <sshegde@linux.ibm.com>
+R:	Yury Norov <yury.norov@gmail.com>
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
+F:	Documentation/driver-api/steal-governor.rst
+F:	drivers/virt/steal_governor.c
+
 STI AUDIO (ASoC) DRIVERS
 M:	Arnaud Pouliquen <arnaud.pouliquen@foss.st.com>
 L:	linux-sound@vger.kernel.org
diff --git a/drivers/virt/steal_governor.c b/drivers/virt/steal_governor.c
new file mode 100644
index 000000000000..d427282966d2
--- /dev/null
+++ b/drivers/virt/steal_governor.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Steal time governor driver periodically computes steal time.
+ * Based on the thresholds it either reduces or increases the preferred
+ * CPUs which can be used by the workload, to avoid vCPU preemption
+ * to the extent possible in paravirtualized environments.
+ *
+ * Available with CONFIG_STEAL_GOVERNOR
+ *
+ * Copyright (C) 2026 IBM
+ * Author: Shrikanth Hegde <sshegde@linux.ibm.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/cpuhplock.h>
+#include <linux/cpumask.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/kconfig.h>
+#include <linux/ktime.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/workqueue.h>
+
+#if !IS_ENABLED(CONFIG_PREFERRED_CPU)
+#error "Steal Governor requires CONFIG_PREFERRED_CPU"
+#endif
+
+struct steal_governor {
+	ktime_t			time;
+	u64			steal;
+	unsigned long		delay;
+	unsigned int		interval_ms;
+	unsigned int		high_threshold;
+	unsigned int		low_threshold;
+	struct delayed_work	work;
+};
+
+static struct steal_governor sg_ctx;
+
+static void restore_preferred_to_active(void)
+{
+	int cpu;
+
+	guard(cpus_read_lock)();
+	for_each_cpu(cpu, cpu_active_mask)
+		set_cpu_preferred(cpu, true);
+}
+
+static int __init steal_governor_init(void)
+{
+	pr_info("enabled\n");
+	return 0;
+}
+
+static void __exit steal_governor_exit(void)
+{
+	restore_preferred_to_active();
+	pr_info("disabled\n");
+}
+
+module_init(steal_governor_init);
+module_exit(steal_governor_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("IBM Corporation");
+MODULE_DESCRIPTION("Virtualization Steal Time Governor");
-- 
2.47.3



* [PATCH v9 07/11] sched/debug: Add migration stats due to non preferred CPUs
From: Shrikanth Hegde @ 2026-07-24 14:07 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	jgross, virtualization
In-Reply-To: <20260724140732.2683314-1-sshegde@linux.ibm.com>

Add a new stat:
- nr_migrations_cpu_non_preferred: number of task migrations that
  happened because a CPU was marked as non-preferred due to high
  steal time.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 9 +++++++--
 kernel/sched/debug.c  | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 968b18a7f470..37849d2f1dbd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -554,6 +554,7 @@ struct sched_statistics {
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;
+	u64				nr_migrations_cpu_non_preferred;
 
 	u64				nr_wakeups;
 	u64				nr_wakeups_sync;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 704043531b24..f0d9bcfd35e3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11324,8 +11324,13 @@ static int sched_non_preferred_cpu_push_stop(void *arg)
 	context_unsafe_alias(rq);
 
 	if (task_rq(p) == rq && task_on_rq_queued(p) &&
-	    !is_migration_disabled(p))
-		rq = __migrate_task(rq, &rf, p, cpu);
+	    !is_migration_disabled(p)) {
+		struct rq *dest_rq = __migrate_task(rq, &rf, p, cpu);
+
+		if (rq != dest_rq)
+			schedstat_inc(p->stats.nr_migrations_cpu_non_preferred);
+		rq = dest_rq;
+	}
 
 	rq_unlock(rq, &rf);
 	raw_spin_unlock_irq(&p->pi_lock);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 72236db67983..5ebb2055e6d5 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1446,6 +1446,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
 		P_SCHEDSTAT(nr_forced_migrations);
+		P_SCHEDSTAT(nr_migrations_cpu_non_preferred);
 		P_SCHEDSTAT(nr_wakeups);
 		P_SCHEDSTAT(nr_wakeups_sync);
 		P_SCHEDSTAT(nr_wakeups_migrate);
-- 
2.47.3



* [PATCH v9 06/11] sched/core: Push current task from non preferred CPU
From: Shrikanth Hegde @ 2026-07-24 14:07 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	jgross, virtualization
In-Reply-To: <20260724140732.2683314-1-sshegde@linux.ibm.com>

Actively push out the task running on a non-preferred CPU. Since the
task is running on the CPU, the CPU needs to be stopped to push the
task out. However, if the task is pinned only to non-preferred CPUs, it
will continue running there. This preserves userspace affinities,
unlike CPU hotplug or isolated cpusets.

Though the code is similar to __balance_push_cpu_stop and quite close
to push_cpu_stop, it is kept separate as that provides a cleaner
implementation with CONFIG_PREFERRED_CPU.

Add a push_task_work_done flag to protect the per-CPU work buffer from
being queued again while a push is still pending.
Works only with FAIR class.

For now, only the currently running task is pushed out. This keeps the
code simpler. In the future, an optimization may be done to move all
the queued tasks on the rq.
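
The tick-side decision described above can be modelled in userspace
with bitmasks. This is a sketch only, not the scheduler code: struct
sg_rq and the sg_* functions are hypothetical, CPU numbers are assumed
to be below 64, and push_pending stands in for the
push_task_work_done flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical runqueue state; only the fields this model needs. */
struct sg_rq {
	int cpu;
	bool push_pending;	/* stand-in for push_task_work_done */
};

/* A task may be pushed only if its affinity includes a preferred CPU. */
static bool sg_task_can_use_preferred(uint64_t affinity, uint64_t preferred)
{
	return (affinity & preferred) != 0;
}

/*
 * Tick-side decision: returns true if a push should be queued now.
 * Setting push_pending marks the single work buffer as in use.
 */
static bool sg_tick_should_push(struct sg_rq *rq, uint64_t affinity,
				uint64_t preferred)
{
	if (preferred & (UINT64_C(1) << rq->cpu))
		return false;	/* this CPU is preferred, nothing to do */
	if (!sg_task_can_use_preferred(affinity, preferred))
		return false;	/* pinned to non-preferred CPUs only */
	if (rq->push_pending)
		return false;	/* an earlier push has not run yet */

	rq->push_pending = true;
	return true;
}

/* Stopper-side completion: the work buffer may be reused. */
static void sg_push_done(struct sg_rq *rq)
{
	rq->push_pending = false;
}
```

A second tick before the stopper has run returns false, so the same
work buffer is never queued twice. A task whose affinity covers only
non-preferred CPUs is left where it is.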

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c  | 78 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  8 +++++
 2 files changed, 86 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e8eec4451b6..704043531b24 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5774,6 +5774,9 @@ void sched_tick(void)
 	unsigned long hw_pressure;
 	u64 resched_latency;
 
+	if (!cpu_preferred(cpu))
+		sched_push_current_non_preferred_cpu(rq);
+
 	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
 		arch_scale_freq_tick();
 
@@ -11292,3 +11295,78 @@ void sched_change_end(struct sched_change_ctx *ctx)
 		p->sched_class->prio_changed(rq, p, ctx->prio);
 	}
 }
+
+#ifdef CONFIG_PREFERRED_CPU
+static DEFINE_PER_CPU(struct cpu_stop_work, npc_push_task_work);
+
+static int sched_non_preferred_cpu_push_stop(void *arg)
+{
+	struct task_struct *p = arg;
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+	int cpu;
+
+	if (cpu_preferred(rq->cpu)) {
+		scoped_guard(rq_lock, rq)
+			rq->push_task_work_done = false;
+		put_task_struct(p);
+		return 0;
+	}
+
+	raw_spin_lock_irq(&p->pi_lock);
+
+	/* This could take rq lock. So call it before rq lock is taken */
+	cpu = select_fallback_rq(rq->cpu, p);
+	rq_lock(rq, &rf);
+	rq->push_task_work_done = false;
+	update_rq_clock(rq);
+
+	context_unsafe_alias(rq);
+
+	if (task_rq(p) == rq && task_on_rq_queued(p) &&
+	    !is_migration_disabled(p))
+		rq = __migrate_task(rq, &rf, p, cpu);
+
+	rq_unlock(rq, &rf);
+	raw_spin_unlock_irq(&p->pi_lock);
+	put_task_struct(p);
+
+	return 0;
+}
+
+/*
+ * Push the current task running on non-preferred CPU(npc).
+ * Using this non preferred CPU will lead to more contention
+ * in the host. So it is better not to use this CPU.
+ *
+ * Since task is running, call a stopper to push the task out. This is
+ * similar to how task moves during hotplug. In select_fallback_rq a
+ * preferred CPU will be chosen and henceforth task shouldn't come back to
+ * this CPU again.
+ *
+ * Works for FAIR class only.
+ *
+ * If task is affined only on non-preferred CPUs, no point in moving it out.
+ */
+void sched_push_current_non_preferred_cpu(struct rq *rq)
+{
+	struct task_struct *push_task = rq->curr;
+
+	scoped_guard(rq_lock, rq) {
+		/* Push the task if its explicit affinity allows */
+		if (!task_can_sched_on_preferred(rq->cpu, push_task))
+			return;
+
+		/* There is already a stopper thread. Don't race with it. */
+		if (rq->push_task_work_done)
+			return;
+
+		rq->push_task_work_done = true;
+	}
+
+	/* sched_tick runs with interrupts disabled. */
+	get_task_struct(push_task);
+	stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
+			    push_task, this_cpu_ptr(&npc_push_task_work));
+}
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6de6366f2faa..80c02e2c09eb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1277,6 +1277,8 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+	bool			push_task_work_done;
+
 	struct sched_avg	avg_rt;
 	struct sched_avg	avg_dl;
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
@@ -4242,4 +4244,10 @@ static inline bool task_can_sched_on_preferred(int cpu, struct task_struct *p)
 	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
 }
 
+#ifdef CONFIG_PREFERRED_CPU
+void sched_push_current_non_preferred_cpu(struct rq *rq);
+#else	/* !CONFIG_PREFERRED_CPU */
+static inline void sched_push_current_non_preferred_cpu(struct rq *rq) { }
+#endif
+
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3


^ permalink raw reply related

* [PATCH v9 05/11] sched/fair: Load balance only among preferred CPUs
From: Shrikanth Hegde @ 2026-07-24 14:07 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	jgross, virtualization
In-Reply-To: <20260724140732.2683314-1-sshegde@linux.ibm.com>

When a CPU is marked as non-preferred, pulling load towards it is
pointless, since the task will be pushed out again at the next tick.
So, consider only preferred CPUs for load balance.

This keeps load balancing from fighting the push task mechanism, which
runs at tick. It also stops active balance from pulling load onto a
non-preferred CPU.

This means there is no load balancing for tasks pinned only to
non-preferred CPUs. They will continue to run where they were running
before the CPUs were marked as non-preferred.

Bail out early for NEWIDLE and IDLE balance as load balancing is done
only on preferred CPUs.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8c9c2c7918..12f5b7de28d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13399,7 +13399,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	};
 	bool need_unlock = false;
 
-	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+	cpumask_and(cpus, sched_domain_span(sd), cpu_preferred_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
@@ -14337,7 +14337,8 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 			update_rq_clock(rq);
 			rq_unlock_irqrestore(rq, &rf);
 
-			if (flags & NOHZ_BALANCE_KICK)
+			if (flags & NOHZ_BALANCE_KICK &&
+			    cpu_preferred(balance_cpu))
 				sched_balance_domains(rq, CPU_IDLE);
 		}
 
@@ -14481,10 +14482,8 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 	 */
 	this_rq->idle_stamp = rq_clock(this_rq);
 
-	/*
-	 * Do not pull tasks towards !active CPUs...
-	 */
-	if (!cpu_active(this_cpu))
+	/* Do not pull tasks towards !preferred CPUs */
+	if (!cpu_preferred(this_cpu))
 		return 0;
 
 	/*
-- 
2.47.3


^ permalink raw reply related

* [PATCH v9 04/11] sched/core: Try to use a preferred CPU in is_cpu_allowed
From: Shrikanth Hegde @ 2026-07-24 14:07 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	jgross, virtualization
In-Reply-To: <20260724140732.2683314-1-sshegde@linux.ibm.com>

When possible, pick a preferred CPU.

This is essential to maintain user affinities when preferred
CPUs change. A task pinned on non-preferred CPU should continue
to run there, since this is non-user triggered.

If the CPU is non-preferred and the task can run on other CPUs which are
currently preferred, then choose those other CPUs instead.
This is decided by checking whether cpus_ptr and cpu_preferred_mask
intersect. If they do, the task has other preferred CPUs and can
run there instead.

Overhead is minimal when CPU is preferred.

The push task mechanism uses a stopper thread, which calls
select_fallback_rq and relies on this check to pick only a preferred CPU.

This takes care of the wakeup path for FAIR tasks too.
is_cpu_allowed is called to ensure wakeup happens on preferred CPUs.
With that, additional checks in available_idle_cpu are not necessary.

For the majority of cases this still keeps select_fallback_rq
as O(N). The cpumask_intersects check in task_can_sched_on_preferred,
which is O(N), runs only if !cpu_preferred. The task running there is
then expected to move out, so it should subsequently run on a preferred
CPU. This becomes O(N**2) only for tasks pinned only to non-preferred
CPUs. That is a rare case.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c  | 12 ++++++++++--
 kernel/sched/sched.h | 12 ++++++++++++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a45f7c308329..9e8eec4451b6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2509,8 +2509,12 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 		return cpu_online(cpu);

 	/* Non kernel threads are not allowed during either online or offline. */
-	if (!(p->flags & PF_KTHREAD))
+	if (!(p->flags & PF_KTHREAD)) {
+		/* Try to use preferred CPU if task's affinity allows */
+		if (task_can_sched_on_preferred(cpu, p))
+			return false;
 		return cpu_active(cpu);
+	}

 	/* KTHREAD_IS_PER_CPU is always allowed. */
 	if (kthread_is_per_cpu(p))
@@ -2520,7 +2524,11 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (cpu_dying(cpu))
 		return false;

-	/* But are allowed during online. */
+	/* Try to keep unbound kthreads on a preferred CPU if possible. */
+	if (task_can_sched_on_preferred(cpu, p))
+		return false;
+
+	/* Otherwise, they are allowed to run on online CPU. */
 	return cpu_online(cpu);
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 26ae13c86b69..6de6366f2faa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4230,4 +4230,16 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)

 #include "ext/ext.h"

+static inline bool task_can_sched_on_preferred(int cpu, struct task_struct *p)
+{
+	if (cpu_preferred(cpu))
+		return false;
+
+	/* Only FAIR tasks honor preferred CPU state */
+	if (unlikely(p->sched_class != &fair_sched_class))
+		return false;
+
+	return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
+}
+
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3

^ permalink raw reply related

* [PATCH v9 03/11] sysfs: Add preferred CPU file
From: Shrikanth Hegde @ 2026-07-24 14:07 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	jgross, virtualization
In-Reply-To: <20260724140732.2683314-1-sshegde@linux.ibm.com>

Add "preferred" file in /sys/devices/system/cpu

This would help:
- Users to quickly check which CPUs are marked as preferred.
- Userspace daemons such as irqbalance to use this mask to
  steer IRQs to preferred CPUs.

For example:
cat /sys/devices/system/cpu/online
0-719
cat /sys/devices/system/cpu/preferred
0-599        <<< Implies 0-599 are preferred for workloads and 600-719
                 should be avoided at this moment.

cat /sys/devices/system/cpu/preferred
0-719        <<< All CPUs are usable. There is no preference.
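
A userspace consumer of this file would need to parse the cpulist format
shown above. The following is only a sketch of such a parser; the helper
name cpulist_count is hypothetical and is not part of this series or of
any kernel API. It counts the CPUs named in a string such as "0-599".

```c
#include <stdlib.h>

/*
 * Count the CPUs named in a cpulist string such as "0-599" or "0-3,8-11".
 * Returns -1 on malformed input. Hypothetical helper for illustration.
 */
int cpulist_count(const char *s)
{
	int total = 0;

	while (*s && *s != '\n') {
		char *end;
		unsigned long first = strtoul(s, &end, 10);
		unsigned long last = first;

		if (end == s)
			return -1;	/* no digits where a CPU number was expected */
		if (*end == '-') {
			const char *p = end + 1;

			last = strtoul(p, &end, 10);
			if (end == p || last < first)
				return -1;	/* missing or reversed range end */
		}
		total += (int)(last - first + 1);
		if (*end == ',')
			end++;		/* move on to the next range */
		else if (*end && *end != '\n')
			return -1;	/* unexpected separator */
		s = end;
	}
	return total;
}
```

A daemon could compare the count for "preferred" against the count for
"online" to see how many CPUs are currently being avoided.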

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 14 ++++++++++++++
 drivers/base/cpu.c                                 | 12 ++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 82d10d556cc8..676412c46b2a 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -806,3 +806,17 @@ Date:		Nov 2022
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:
 		(RO) the list of CPUs that can be brought online.
+
+What:		/sys/devices/system/cpu/preferred
+Date:		July 2026
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		(RO) the list of preferred CPUs applicable in
+		paravirtualized environments.
+
+		The steal governor driver dynamically adjusts this mask
+		based on observed steal time. Scheduling tasks on
+		CPUs outside of this list may lead to performance
+		degradations due to underlying physical CPU contention.
+
+		See Documentation/scheduler/sched-arch.rst for more details.
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 19d288a3c80c..5da2a96fb37e 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -391,6 +391,15 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
 }
 #endif
 
+#ifdef CONFIG_PREFERRED_CPU
+static ssize_t preferred_show(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_preferred_mask));
+}
+static DEVICE_ATTR_RO(preferred);
+#endif
+
 const struct bus_type cpu_subsys = {
 	.name = "cpu",
 	.dev_name = "cpu",
@@ -531,6 +540,9 @@ static struct attribute *cpu_root_attrs[] = {
 #endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
+#endif
+#ifdef CONFIG_PREFERRED_CPU
+	&dev_attr_preferred.attr,
 #endif
 	NULL
 };
-- 
2.47.3


^ permalink raw reply related

* [PATCH v9 02/11] cpumask: Introduce cpu_preferred_mask
From: Shrikanth Hegde @ 2026-07-24 14:07 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	jgross, virtualization
In-Reply-To: <20260724140732.2683314-1-sshegde@linux.ibm.com>

Provide cpu_preferred_mask infrastructure. Define get/set macros
which could be used to get/set CPU state as preferred.

PREFERRED_CPU config will be selected by the driver which handles
steal time values. It is going to set/clear preferred CPU state.
This driver will be called steal_governor and it is introduced in
subsequent patches. It periodically samples the steal time and
decides on preferred CPU state.

A CPU is set to preferred when it becomes active. Later it may be
marked as non-preferred depending on steal time values with
steal_governor being enabled.

Always maintain the design construct that preferred is a subset of active,
i.e. preferred ⊆ active ⊆ online ⊆ present ⊆ possible

With PREFERRED_CPU=n, set_cpu_preferred is a nop and the get
method returns the active state.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/cpumask.h | 24 ++++++++++++++++++++++++
 kernel/Kconfig.preempt  |  4 ++++
 kernel/cpu.c            |  6 ++++++
 kernel/sched/core.c     |  5 +++++
 4 files changed, 39 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index d3cda0544954..34d08a3d80e1 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -122,12 +122,20 @@ extern struct cpumask __cpu_enabled_mask;
 extern struct cpumask __cpu_present_mask;
 extern struct cpumask __cpu_active_mask;
 extern struct cpumask __cpu_dying_mask;
+
+#ifdef CONFIG_PREFERRED_CPU
+extern struct cpumask __cpu_preferred_mask;
+#else
+#define __cpu_preferred_mask __cpu_active_mask
+#endif
+
 #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
 #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
 #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
 #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
 #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
 #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
+#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
 
 extern atomic_t __num_online_cpus;
 extern unsigned int __num_possible_cpus;
@@ -1164,6 +1172,12 @@ void init_cpu_possible(const struct cpumask *src);
 #define set_cpu_active(cpu, active)	assign_cpu((cpu), &__cpu_active_mask, (active))
 #define set_cpu_dying(cpu, dying)	assign_cpu((cpu), &__cpu_dying_mask, (dying))
 
+#ifdef CONFIG_PREFERRED_CPU
+#define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
+#else
+#define set_cpu_preferred(cpu, preferred) do { } while (0)
+#endif
+
 void set_cpu_online(unsigned int cpu, bool online);
 void set_cpu_possible(unsigned int cpu, bool possible);
 
@@ -1258,6 +1272,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return cpumask_test_cpu(cpu, cpu_dying_mask);
 }
 
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+	return cpumask_test_cpu(cpu, cpu_preferred_mask);
+}
+
 #else
 
 #define num_online_cpus()	1U
@@ -1296,6 +1315,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return false;
 }
 
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+	return cpu == 0;
+}
+
 #endif /* NR_CPUS > 1 */
 
 #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 88c594c6d7fc..de789b274ba3 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -192,3 +192,7 @@ config SCHED_CLASS_EXT
 	  For more information:
 	    Documentation/scheduler/sched-ext.rst
 	    https://github.com/sched-ext/scx
+
+config PREFERRED_CPU
+	bool
+	depends on SMP && PARAVIRT
diff --git a/kernel/cpu.c b/kernel/cpu.c
index b3c8553d7bd6..376d297a6292 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3103,6 +3103,11 @@ EXPORT_SYMBOL(__cpu_dying_mask);
 atomic_t __num_online_cpus __read_mostly;
 EXPORT_SYMBOL(__num_online_cpus);
 
+#ifdef CONFIG_PREFERRED_CPU
+struct cpumask __cpu_preferred_mask __read_mostly;
+EXPORT_SYMBOL_GPL(__cpu_preferred_mask);
+#endif
+
 void init_cpu_present(const struct cpumask *src)
 {
 	cpumask_copy(&__cpu_present_mask, src);
@@ -3160,6 +3165,7 @@ void __init boot_cpu_init(void)
 	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
 	set_cpu_online(cpu, true);
 	set_cpu_active(cpu, true);
+	set_cpu_preferred(cpu, true);
 	set_cpu_present(cpu, true);
 	set_cpu_possible(cpu, true);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e7cde033a31..a45f7c308329 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8690,6 +8690,9 @@ int sched_cpu_activate(unsigned int cpu)
 	 */
 	sched_set_rq_online(rq, cpu);
 
+	/* preferred is subset of active and follows its state */
+	set_cpu_preferred(cpu, true);
+
 	return 0;
 }
 
@@ -8703,6 +8706,8 @@ int sched_cpu_deactivate(unsigned int cpu)
 	if (ret)
 		return ret;
 
+	set_cpu_preferred(cpu, false);
+
 	/*
 	 * Remove CPU from nohz.idle_cpus_mask to prevent participating in
 	 * load balancing when not active
-- 
2.47.3


^ permalink raw reply related

* [PATCH v9 01/11] sched/docs: Document cpu_preferred_mask and Preferred CPU concept
From: Shrikanth Hegde @ 2026-07-24 14:07 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	jgross, virtualization, kernel test robot
In-Reply-To: <20260724140732.2683314-1-sshegde@linux.ibm.com>

Add documentation for new cpumask called cpu_preferred_mask. This could
help users in understanding what this mask is and the concept behind it.

Document how to enable it and implementation aspects of it.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202606180717.yNM0yb41-lkp@intel.com/
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/scheduler/sched-arch.rst | 58 ++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..0a0a8eadbfd6 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,64 @@ Your cpu_idle routines need to obey the following rules:
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
 
+Preferred CPUs
+==============
+
+In virtualised environments it is possible to overcommit CPU resources. i.e.
+the sum of virtual CPUs (vCPUs) of all VMs is greater than number of physical
+CPUs (pCPUs). Under such conditions when all or many VMs have high utilization,
+hypervisor won't be able to satisfy the CPU requirement and has to context
+switch within or across VMs. The hypervisor needs to preempt one vCPU to run
+another. This is called vCPU preemption. This is more expensive compared to
+task context switch within a vCPU.
+
+In such cases it is better that combined vCPU ask from all VMs is reduced
+by not using some of the vCPUs in each VM. vCPUs where workload can be safely
+scheduled which won't increase any contention for pCPU are called as
+"Preferred CPUs".
+
+Main design construct is preferred CPUs are always a subset of active CPUs.
+In most cases preferred CPUs will be same as active CPUs, when there is pCPU
+contention, Preferred CPUs will reduce based on the amount of steal time.
+When the pCPU contention goes away as indicated by steal time, Preferred CPUs
+will become same as active CPUs again. These policy decisions are taken by
+steal_governor driver available at drivers/virt/steal_governor.c
+
+Scheduling decisions such as wakeup, pushing the task etc, need this
+CPU state info. This is maintained in cpu_preferred_mask.
+vCPUs which are not in cpu_preferred_mask should be treated as vCPUs which
+should not be used at this moment provided it doesn't break user affinity.
+
+This is achieved by:
+
+1. Selecting a preferred CPU at wakeup using fallback mechanism.
+2. Push the task away from non-preferred CPU at tick.
+3. Only select preferred CPUs for load balance.
+
+/sys/devices/system/cpu/preferred prints the current cpu_preferred_mask in
+cpulist format.
+
+Notes:
+
+1. This feature is available under CONFIG_PREFERRED_CPU. It is selected
+   by steal_governor driver (CONFIG_STEAL_GOVERNOR). On enabling the
+   driver, CPU preferred state can change based on steal time. Without that
+   driver, preferred CPUs is same as active CPUs.
+
+2. This feature works for FAIR class only.
+
+3. A task pinned, which can't be moved to preferred CPUs will continue
+   to run based on its affinity. But no load balancing happens.
+
+4. Decision to change the preferred CPU state is driven by kernel.
+   Hence it shouldn't break user affinities. One of the main reasons why
+   CPU hotplug or Isolated cpuset partitions was not a solution.
+
+5. This feature works best only when all the VMs enable the feature as
+   it is a co-operative scheme. If a specific VM doesn't enable this feature
+   it may end up with more CPUs than others, still should lead to better
+   performance when seen from system view.
+   Users who enable this driver must ensure it is enabled in all VMs.
 
 Possible arch/ problems
 =======================
-- 
2.47.3


^ permalink raw reply related

* [PATCH v9 00/11] sched, steal_governor: Introduce preferred CPUs and steal-driven vCPU backoff
From: Shrikanth Hegde @ 2026-07-24 14:07 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet
  Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
	jgross, virtualization

If you have already read the v8 cover letter, see only the v8->v9
section; everything else is the same. :)

This patch series represents the result of multiple iterations, 
redesigns and community feedback. What started as an arch-specific RFC
has evolved into a scheduler mechanism paired with a virtualization
driver.

Special thanks to Yury Norov for the rigorous reviews that greatly
improved the series and to everyone who has provided review
comments so far. Really appreciated! _/\_

I have put detailed context around the problem statement, design, best
practices and performance numbers below. This cover letter is a good
starting point for anyone looking into this solution without the pain of
browsing through all the previous patches/videos.

Apologies in advance if any review comments were missed or if anything
is missing in the implementation of the new driver. If so, it is purely
accidental and not in any way intentional.

Background and Problem Statement
================================

As hardware scales, the density of physical CPUs (pCPUs) per server is
increasing across many architectures. On these massive systems, deploying
a single bare-metal OS for general workloads becomes increasingly difficult
to manage, if not impossible. The natural shift is to deploy
Virtual Machines (VMs). For example, on the IBM PowerPC architecture,
customers frequently deploy Shared Processor LPARs (SPLPARs) to maximize
hardware ROI.

Typical enterprise workloads are a combination of bursty and long-running
jobs; their average CPU utilization is low, but they require high core
counts during peak transactions. To accommodate this, customers often
use CPU overcommit strategies, i.e. configuring VMs with a large number
of virtual CPUs (vCPUs) while backing them with a smaller, shared pool
of physical CPUs (pCPUs). This achieves high server consolidation
and excellent cost efficiency.

However, when multiple such VMs have high utilization simultaneously,
the shared pCPU pool becomes contended. The hypervisor is forced to preempt
one vCPU to run another to maintain fairness. It may schedule a vCPU of
the same VM or of a different VM. If a vCPU is preempted while holding a
lock or inside an irq-disabled section, overall forward progress collapses.
There are some mitigation strategies, such as yielding the vCPU to the
lock holder, but they don't cover all the cases. In addition there are
hidden costs such as cache and TLB misses, the cost of vCPU preemption,
host scheduling overheads etc.

Under heavy contention, the most effective mitigation strategy is for
each guest/VM to voluntarily fold its workload onto a smaller subset
of its vCPUs. By demanding fewer pCPUs, the VMs reduce overall host
contention, which decreases vCPU preemption and improves total throughput
for the system.

Limitations of Existing Approaches
==================================

CPU hotplug, isolated cpusets, cpuset:
- These are heavy, administrative operations that require a topology
  rebuild. Crucially, they break userspace CPU affinities.

Explicit task affinity:
- Very difficult for users to manage, if not impossible.

We need a fast, co-operative backoff mechanism inside the kernel that can
dynamically react to contention without violating user/task affinity
contracts. Since the reaction to contention is not triggered by the user,
it must not violate user affinity contracts.

When there is high contention, fold the workload and use limited vCPUs
and when there is no contention, use all the vCPUs again. This natural
expansion/contraction gives the best possible performance to the users
based on the underlying contention.

Proposed Architecture
=====================

The current design is built on the basis that contention is effectively
quantified by steal time as seen in the guest kernel.
Steal time is already a well-established construct in the
para-virtualization world. All major archs support this feature.
It is an indication of contention for the physical CPU and scales
with the amount of contention. Today it is used by administrative users
for changing the VM configurations. During high contention the steal
time shows up in each guest based on its configuration. The proposed
solution works well when all VMs honor the hint and work in a co-operative
manner. Note there is still no inter-guest communication to achieve this
co-operation. Read the section on best practices on how to get the
best out of this solution.

This series introduces a dynamic vCPU backoff mechanism.
It is separated into a core scheduler mechanism and a loadable
virtualization policy module.

Layer A: The Scheduler Mechanism (preferred CPUs)
=================================================

Series introduces a new CPU state called preferred. It indicates that
vCPU can be safely used and using that vCPU won't increase contention
for underlying physical CPUs. This state info is made available via
cpu_preferred_mask, which is strictly maintained as a subset of
cpu_active_mask.

The scheduler uses this mask as a hint to fold workloads onto preferred
CPUs using a few mechanisms.

1. Wakeup: is_cpu_allowed() checks if the CPU is preferred. If not,
   select_fallback_rq is called, which selects a preferred CPU if the
   task's affinity permits.

2. The Tick (Push): During sched_tick(), if the current CPU is non-preferred,
   the scheduler actively pushes the running task onto a preferred CPU
   using a stopper thread. 

3. Load Balance: sched_balance_rq restricts its domain span to
   cpu_preferred_mask, preventing tasks from being pulled toward
   non-preferred CPUs.

Design Constraint: The scheduler strictly respects user affinities.
If a task is pinned exclusively to non-preferred CPUs, it will remain there.
The kernel will not break user/task affinity contracts.
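
The affinity rule above can be modelled in userspace with plain bitmasks.
This is only a sketch under stated assumptions: the names cpuset64_t,
model_cpu_preferred and model_should_leave_cpu are hypothetical, the model
is limited to 64 CPUs, and it leaves out the FAIR class check that the
real task_can_sched_on_preferred() performs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpuset64_t;

/* True when @cpu is set in the preferred mask. */
bool model_cpu_preferred(cpuset64_t preferred, int cpu)
{
	return (preferred >> cpu) & 1;
}

/*
 * True when the task should be steered away from @cpu: @cpu is
 * non-preferred and the task's affinity still contains at least one
 * preferred CPU. A task pinned only to non-preferred CPUs stays put.
 */
bool model_should_leave_cpu(cpuset64_t preferred, cpuset64_t affinity,
			    int cpu)
{
	if (model_cpu_preferred(preferred, cpu))
		return false;

	return (affinity & preferred) != 0;
}
```

With CPUs 0-3 preferred, a task allowed on CPUs 0-7 and running on CPU 5
would be moved, while a task pinned to CPUs 4-7 would remain on CPU 5.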

Layer B: The Policy Engine (virt/steal_governor)
================================================

The core scheduler should not dictate virtualization policy.
Therefore, the policy is isolated into a new driver: steal_governor
(selected with CONFIG_STEAL_GOVERNOR).
This module builds on the concept that contention is quantified by
steal time. It periodically samples the steal time values across the
system and, depending on high/low steal values, takes appropriate action.

When it sees high steal time, i.e. steal time exceeds high_threshold
(default 5%), the driver reduces the preferred CPUs by 1 core.
When it sees low steal time, i.e. steal time drops below low_threshold
(default 2%), the driver increases the preferred CPUs by 1 core.

This creates a dynamic, self-maintained stepwise loop. The guest automatically
shrinks its pCPU footprint when the host is saturated, and expands it when
the noise clears while requiring zero cross-VM communication.
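
The stepwise loop can be sketched as a small userspace model. This is not
the driver code: struct gov_model and gov_model_step are hypothetical
names, the model only tracks a count of preferred cores rather than a
cpumask, and the threshold unit (1/100 of a percent, so 500 means 5%) is
an assumption.

```c
#include <assert.h>

/* Hypothetical per-guest state; field names are illustrative only. */
struct gov_model {
	unsigned int active_cores;	/* upper bound for preferred cores */
	unsigned int preferred_cores;	/* current number of preferred cores */
	unsigned int low_threshold;	/* assumed unit: 1/100 of a percent */
	unsigned int high_threshold;	/* assumed unit: 1/100 of a percent */
};

/* One sampling period: shrink, grow, or hold the preferred core count. */
unsigned int gov_model_step(struct gov_model *g, unsigned int steal_ratio)
{
	/* Policy constraints: low < high, 1 <= preferred <= active. */
	assert(g->low_threshold < g->high_threshold);
	assert(g->preferred_cores >= 1 &&
	       g->preferred_cores <= g->active_cores);

	if (steal_ratio > g->high_threshold) {
		/* Host is contended: give up one core, but keep at least one. */
		if (g->preferred_cores > 1)
			g->preferred_cores--;
	} else if (steal_ratio < g->low_threshold) {
		/* Contention has cleared: take one core back. */
		if (g->preferred_cores < g->active_cores)
			g->preferred_cores++;
	}
	/* Inside the [low, high] window nothing changes. */

	return g->preferred_cores;
}
```

The gap between the two thresholds acts as hysteresis: a steal ratio that
hovers between them leaves the preferred set unchanged instead of
oscillating on every sample.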

Policy Design Constraints:
- Ensure at least one core is kept as preferred.
- Ensure preferred is always subset of active.

Best Practices
==============
1. Ensure all the VMs run a kernel which has these patches.

2. Keep CONFIG_STEAL_GOVERNOR=m. Build it as a module, but don't load it by
   default. When the administrative user enables it in one VM, they
   will likely enable it in all VMs. Also, module parameters can
   only be changed at module load. Having it as a module also allows one
   to disable it and remove the additional overhead it brings.

3. Keep interval_ms between 500 and 5000, i.e. between 500ms and 5 seconds,
   though the parameter allows a slightly larger range.

4. Fine tune the low and high thresholds depending on your platform for
   best results. Even when there is no contention, very small steal values
   might show up. So it might be better to keep the low threshold higher
   than 0.

Baseline and Revision History
=============================

tip/sched/core at 
commit: '04998aa54848 ("sched/eevdf: Delayed dequeue task can't preempt")'

For a detailed talk on the problem and discussion on this issue, one can also
refer to the OSPM26 talk[1]. 

[1]: https://youtu.be/adxUKFPlOp0
[2]: https://www.ibm.com/support/pages/ibm-power-virtualization-best-practices-guide
[3]: https://www.ibm.com/docs/en/linux-on-systems?topic=bad-daytrader

v8->v9:
- Move to simpler layout. Everything in drivers/virt/steal_governor.c
  (Yury Norov)
- Move design checks into a helper function (Yury Norov)
- Comments update (Yury Norov)
- Requeue work without further checks when steal ratio is within the 
  low/high threshold window. (Yury Norov)
- Renamed sg_core_ctx to sg_ctx.
- refactoring like kcpustat_field_total for CPUTIME_STEAL will be
  picked up post the series.

v7->v8:
- Rename to STEAL_GOVERNOR from STEAL_MONITOR.
- Remove additional defaults.c and move it to core.c (Yury Norov)
- Remove SM_DIR gating for direction control. (Yury Norov)
- Enforce design constraint and restore the state if not met (Yury
  Norov)
- Drop nohz_full tick enable patch.
- Move Kconfig patch as the last patch for enablement. (Yury Norov)
- Use disable_delayed_work_sync to avoid race condition during
  module unload. (Yury Norov)
- Add same kconfig dependency and fail to compile the driver (Yury Norov)
- Make low < high comparison during module init instead as they
  are dependent parameters (Sashiko)
- Update sysfs file helper section (Yury Norov)
- Make preferred sysfs file available only with CONFIG_PREFERRED_CPU=y
  (Yury Norov)
- A few documentation and comments fixes. (Randy Dunlap)
- Fix possible race in sched_push_current_non_preferred_cpu (Yury Norov)
- Move is_migration_disabled check just before actual migration.
- Make 100ms as minimal interval_ms from 10ms.
- Make helper functions static and remove from header file as there
  are no other callers.
- Collapse helper functions and periodic work into one patch.

Short summary on previous versions:
v6->v7:
- Consolidate new driver code to 4-5 patches.
- Defer the arch specific interface.
- Use possible CPUs instead of active for steal value calculations.
- Simplify is_cpu_allowed.
- Make module parameters fixed at module load
- Define CONFIG_STEAL_MONITOR and Make it select CONFIG_PREFERRED_CPU

v5->v6:
- Drop the optimization of caching the preferred state
  in select_fallback_rq
- Drop wakeup patch

v4->v5:
- Move the computation of steal time and decide on preferred CPU state
  to a driver. i.e new driver called STEAL_MONITOR

v3->v4:
- Make preferred subset of active instead of online. 
- Dropped RT patch and Defer sched_ext. Support only FAIR class.

v2->v3:
- Introduce a new config CONFIG_PREFERRED_CPU

v1->v2:
- A new name - Preferred CPUs and cpu_preferred_mask
- Arch independent code. Everything happens in scheduler.
- Steal time computation is gated with sched feature STEAL_MONITOR

RFC v3-> RFC v4:
- Introduced computation of steal time in arch/powerpc.

RFC PATCH v1:
- push task mechanism.
- No steal time computation. Manual sysfs hint for preferred CPUs 

v1: https://lore.kernel.org/all/236f4925-dd3c-41ef-be04-47708c9ce129@linux.ibm.com/
v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@linux.ibm.com/#t
v3: https://lore.kernel.org/all/20260514152204.481115-1-sshegde@linux.ibm.com/#r
v4: https://lore.kernel.org/all/20260617174139.155540-1-sshegde@linux.ibm.com/#t
v5: https://lore.kernel.org/all/20260625124648.802832-1-sshegde@linux.ibm.com/
v6: https://lore.kernel.org/all/20260701141654.500125-1-sshegde@linux.ibm.com/#t
v7: https://lore.kernel.org/all/20260709215648.1246821-1-sshegde@linux.ibm.com/
v8: https://lore.kernel.org/all/20260720172250.2257582-1-sshegde@linux.ibm.com/
Even earlier version:
https://lore.kernel.org/all/236f4925-dd3c-41ef-be04-47708c9ce129@linux.ibm.com/ 

========================================
Performance Numbers (powerpc, x86, s390)
========================================

PowerPC:
===================
VM1: 60VP/30EC and VM2: 30VP/20EC
Shared physical CPU pool size: 50 Cores. Each core is SMT8.
(VP - Virtual Core, EC - Entitled Core) - PowerVM terminology for SPLPAR[2]

Default parameter values: 1000ms interval, 200 low threshold, 500 high threshold
Both VMs are running the same workload. The total throughput/time of VM1+VM2
is reported in all cases.

Hackbench
              baseline    steal_governor        steal_governor
                             disabled               enabled
======================================================================

10 groups        5.20   |    5.40 (-3.85%)  |     4.65 (+10.58%)
20 groups       11.39   |   12.01 (-5.44%)  |     7.09 (+37.75%)
40 groups       20.32   |   19.80 (+2.56%)  |    11.31 (+44.34%)
10 groups(-p)    2.37   |    2.26 (+4.64%)  |     2.06 (+13.08%)
20 groups(-p)    3.34   |    3.28 (+1.80%)  |     3.20 (+4.19%)
40 groups(-p)    4.46   |    4.83 (-8.30%)  |     4.26 (+4.48%)
Remarks: Net improvement with steal_governor, especially at high load points.
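
The percentages in the hackbench table appear to follow the usual
relative-change formula for a lower-is-better metric (elapsed time). A
minimal sketch under that assumption; the helper names improvement_pct()
and print_row() are hypothetical and are not part of this series:

```c
#include <stdio.h>

/*
 * Relative improvement for a lower-is-better metric such as elapsed time.
 * Positive means faster than baseline, negative means slower.
 * Hypothetical helper, for illustration only.
 */
static double improvement_pct(double baseline, double value)
{
	return (baseline - value) / baseline * 100.0;
}

/* Print one row in roughly the same layout as the table above. */
static void print_row(const char *name, double base, double off, double on)
{
	printf("%-14s %6.2f | %6.2f (%+.2f%%) | %6.2f (%+.2f%%)\n",
	       name, base, off, improvement_pct(base, off),
	       on, improvement_pct(base, on));
}
```

For a higher-is-better metric the sign of the numerator would be flipped,
i.e. (value - baseline) / baseline.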

schbench ( -L -n 0 -r 30 -s 0)
              baseline    steal_governor          steal_governor
                             disabled                 enabled
======================================================================
-m 1 -t 128     2475162 |    2621246 (+5.90%)  |     2527299 (+2.11%)
-m 1 -t 256     1467350 |    1470032 (+0.18%)  |      1492372 (+1.71%)
-m 1 -t 512     1408813 |    1454687 (+3.26%)  |      1437605 (+2.04%)
Remarks: Effectively no improvement or regression.

kernbench	baseline    steal_governor     steal_governor
(elapsed time)	               disabled            enabled
======================================================================
-j nr_cpus	231      |      235 (-1.7%) |    199 (+14%)
Remarks: Net improvement in elapsed time.

Daytrader - A real-life workload which is a proxy for trading, based
on db2[3]
              baseline      steal_governor   steal_governor
                              disabled          enabled
======================================================================
Load@30%	1x	|	0.96x	|	 1.53x			
Load@60%	1x	|	0.94x	|	 1.41x
Remarks: Good improvement seen at different load points.
When there is no steal time (such as dedicated LPAR, or only VM2
is running) throughput was same with steal_governor enabled/disabled
which indicates minimal overhead of steal_governor. 

Data from x86 and s390 KVM, collected by Ilya Leoshkevich around the time
of OSPM26. *This was based on v2*. The idea is still the same; numbers are
expected to be better in v9 as some of the overhead has been removed.
Note: Other variations of the benchmarks show no observable
difference.

x86:
====
cascade-lake: 32 threads = 16 cores
Benchmark      #VMs    #CPUs/VM  ΔRPS     (%std)
===============================================
hackbench         8          16  90.73% ± 9.97%
hackbench         4          24  52.67% ± 7.43%
hackbench         4          16  37.96% ± 11.19%
hackbench         4          32  37.82% ± 4.38%
hackbench        12           8  36.90% ± 4.74%
hackbench         8           8  35.30% ± 3.61%
pgbench          16           4  31.77% ± 2.44%
hackbench         2          24  25.85% ± 8.63%
hackbench        16           8  24.87% ± 3.46%
pgbench          16           8  21.83% ± 2.20%
pgbench          12           8  21.35% ± 2.15%
pgbench           8           8  18.46% ± 1.01%
hackbench         2          32  15.56% ± 4.53%
pgbench          12           4  14.28% ± 2.04%
hackbench        16           4  14.07% ± 2.90%
hackbench        12           4  9.60% ± 3.49%
[...]
pgbench           4           8  -1.16% ± 3.60%
hackbench         4           4  -1.80% ± 9.55%
sysbench         12           4  -2.19% ± 0.78%
pgbench           4          24  -2.43% ± 4.38%
pgbench           4          32  -3.21% ± 0.79%
sysbench         16           4  -3.22% ± 1.09%

S390:
=====
z16: 16 threads = 8 cores (SMT-2)
Benchmark      #VMs    #CPUs/VM  ΔRPS    (std%)
===============================================
pgbench           2           8  73.50% ± 35.91%
pgbench          16           4  61.30% ± 4.09%
hackbench        16           4  54.11% ± 4.38%
hackbench        12           4  36.34% ± 4.63%
pgbench          12           4  34.83% ± 2.57%
hackbench         8           4  29.75% ± 5.86%
hackbench         8           8  25.98% ± 5.09%
pgbench           2           4  23.31% ± 33.44%
pgbench           2          16  19.95% ± 17.12%
hackbench         4           8  19.43% ± 9.33%
pgbench           8           4  19.32% ± 4.50%
[...]
schbench          8           8  -0.79% ± 0.33%
sysbench          8           8  -0.81% ± 0.39%
hackbench         4          16  -1.11% ± 5.82%
sysbench          8           4  -1.62% ± 0.49%
sysbench         16           4  -2.70% ± 0.58%
schbench         16           4  -2.73% ± 0.91%
sysbench         12           4  -2.91% ± 0.61%
hackbench         2          24  -4.99% ± 3.31%

Summary:
- Many improvements across archs, especially with real-life workloads.
- No major regressions observed.
- Overhead of steal_governor looks minimal when there is no steal time.
- Overhead when STEAL_GOVERNOR=n is negligible.

Testing and Validation
======================

Apart from performance, to ensure the robustness of the preferred
CPU masking and push mechanisms, the following scenarios were tested:
- CPU Hotplug: bringing CPUs up/down changes the preferred mask
  accordingly, both with and without contention.
- Housekeeping cores: Verified with different combinations of
  nohz_full=<beginning, middle, end set of CPUs> to ensure that the
  policy engine restricts itself to the first housekeeping core in
  extreme cases.
- User Affinity: Confirmed that tasks explicitly pinned to non-preferred
  CPUs via taskset remain on their assigned CPUs.
- Affine Move: Confirmed the affinity move using "taskset -cp" happens
  on all combinations of non-preferred, non-preferred under contention.
- Affinity and hotplug: It works as expected, i.e. affinity gets
  reset if all the CPUs of p->cpus_ptr go offline even if they are
  non-preferred CPUs.
- Extreme load and running threads: for example, with 4800 stress-ng
  threads on a 480-CPU system, tasks still pack onto preferred CPUs.

Known Limitations & Future Work
===============================

To keep this initial implementation clean and minimal, a few optimizations
have been deferred:

- Push all tasks on rq: Currently, the stopper thread only pushes the current
  running task off a non-preferred CPU. Future optimizations may look into
  migrating all queued tasks on that runqueue.

- Sched Classes: This feature currently only works for the FAIR
  class. Real-time (RT) and sched_ext classes are deferred for now,
  as there is no need for it.

- Arch-specific hints and a framework for them have been deferred to
  the future.

- NUMA Splicing: The steal_governor currently removes last active core
  based on CPU number. It does not yet do complex NUMA-aware splicing,
  expecting that CPUs are spread out uniformly across nodes in
  most cases.
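
The low/high threshold window and the "remove the last core by CPU
number" behaviour described in this cover letter can be modelled roughly
as below. This is a userspace sketch under stated assumptions, not the
driver code: all names (sg_params, sg_decide, sg_apply) are hypothetical,
the direction of each action and the "keep at least one core" floor are
assumptions, and the preferred set is modelled as cores [0, nr_preferred).

```c
#include <stdbool.h>

enum sg_action { SG_KEEP, SG_SHRINK, SG_GROW };

struct sg_params {
	unsigned int low;	/* low threshold, same unit as the steal ratio */
	unsigned int high;	/* high threshold, expected to be > low */
};

/* Models the low < high sanity check done once at init time. */
static bool sg_params_valid(const struct sg_params *p)
{
	return p->low < p->high;
}

/* Inside the [low, high] window nothing changes; work is just requeued. */
static enum sg_action sg_decide(const struct sg_params *p, unsigned int steal)
{
	if (steal > p->high)
		return SG_SHRINK;	/* assumed: too much steal, use fewer cores */
	if (steal < p->low)
		return SG_GROW;		/* assumed: little steal, use more cores */
	return SG_KEEP;
}

/*
 * Cores [0, nr_preferred) are preferred, so shrinking drops the
 * highest-numbered preferred core first. No NUMA awareness.
 */
static unsigned int sg_apply(enum sg_action act, unsigned int nr_preferred,
			     unsigned int nr_total)
{
	if (act == SG_SHRINK && nr_preferred > 1)
		return nr_preferred - 1;
	if (act == SG_GROW && nr_preferred < nr_total)
		return nr_preferred + 1;
	return nr_preferred;
}
```

The gap between the two thresholds gives hysteresis, so a steal ratio
hovering around a single cut-off does not flip cores in and out of the
preferred set on every interval.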

Shrikanth Hegde (11):
  sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  cpumask: Introduce cpu_preferred_mask
  sysfs: Add preferred CPU file
  sched/core: Try to use a preferred CPU in is_cpu_allowed
  sched/fair: Load balance only among preferred CPUs
  sched/core: Push current task from non preferred CPU
  sched/debug: Add migration stats due to non preferred CPUs
  virt: Introduce steal governor driver
  virt/steal_governor: Add control knobs for handling steal values
  virt/steal_governor: Implement steal_governor policy loop
  virt/steal_governor: Enable the driver

 .../ABI/testing/sysfs-devices-system-cpu      |  14 +
 Documentation/driver-api/index.rst            |   1 +
 Documentation/driver-api/steal-governor.rst   | 118 +++++++
 Documentation/scheduler/sched-arch.rst        |  58 ++++
 MAINTAINERS                                   |   9 +
 drivers/base/cpu.c                            |  12 +
 drivers/virt/Kconfig                          |  18 ++
 drivers/virt/Makefile                         |   1 +
 drivers/virt/steal_governor.c                 | 292 ++++++++++++++++++
 include/linux/cpumask.h                       |  24 ++
 include/linux/sched.h                         |   1 +
 kernel/Kconfig.preempt                        |   4 +
 kernel/cpu.c                                  |   6 +
 kernel/sched/core.c                           | 100 +++++-
 kernel/sched/debug.c                          |   1 +
 kernel/sched/fair.c                           |  11 +-
 kernel/sched/sched.h                          |  20 ++
 17 files changed, 682 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/driver-api/steal-governor.rst
 create mode 100644 drivers/virt/steal_governor.c

-- 
2.47.3

^ permalink raw reply

* [PATCH] drm/qxl: avoid hangs on an unresponsive SPICE host
From: Mikhail Novosyolov @ 2026-07-24 12:26 UTC (permalink / raw)
  To: dri-devel
  Cc: airlied, kraxel, maarten.lankhorst, mripard, tzimmermann, simona,
	christian.koenig, sumit.semwal, dreaming.about.electric.sheep,
	linux-kernel, virtualization, spice-devel, linux-media,
	m.novosyolov, a.sudakov, ghadi.rahme, timo.lindfors

Symptoms:
  - text/fbcon console (and TTY) of a QEMU VM hangs (does not work)
  - SSH works OK
  - system cannot be rebooted or powered off via systemd

QXL is a paravirtual display device: the guest submits drawing commands
and the host (QEMU's SPICE server) executes them.  Each
submitted batch carries a "release" -- a completion record the host
marks done when the work is finished -- which the guest uses as a
dma_fence.  Before the guest can reuse or evict a piece of video memory
it must wait for the release covering it to be marked done: a to-do
entry that needs its checkbox ticked before the page can be recycled.

The host drains the command ring only while a display client is
attached (someone is watching the screen).  With no client -- the
normal state of an unattended server VM -- the host stops processing,
the checkboxes are never ticked, and the guest's wait never completes.
Heavy console output (fbcon redraw) fills video memory and keeps the
guest waiting on these releases, which surfaces as "[TTM] Buffer
eviction failed" / "failed to allocate VRAM BO"; and during shutdown,
where teardown waits on the same releases with no timeout, as an
uninterruptible hang that leaves systemd and the console callback stuck
in D state and makes the VM impossible to reboot cleanly.

This change makes the guest drive the host forward again while it
waits: periodically notify the host to process and free completed work,
and run the cleanup that ticks the checkboxes -- restoring the
behaviour removed by commit 5a838e5d5825 ("drm/qxl: simplify
qxl_fence_wait").  Two deliberate choices keep the two known upstream
failure modes from coming back:

  - the cleanup is queued asynchronously, never flushed inline (flushing
    here would deadlock -- CVE-2024-36944; details below);
  - the total wait is bounded (30 s by default), so a completely
    unresponsive host can no longer hold the console/modeset locks
    forever; on give-up a ratelimited warning is logged and the caller
    proceeds or retries.

The periodic nudge resolves the eviction failures; the bound resolves
the reboot hang.
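
The "nudge periodically, but give up after a bound" idea can be modelled
in userspace as below. This is only a sketch with simulated time, not the
patch itself: struct fake_host, nudge(), fence_signaled() and
bounded_wait() are hypothetical names, and the real code uses jiffies,
wait_event_timeout() and the qxl helpers shown in the diff.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the host side. */
struct fake_host {
	int nudges_needed;	/* < 0 models a completely unresponsive host */
	int nudges_seen;
};

static void nudge(struct fake_host *h)
{
	h->nudges_seen++;
}

static bool fence_signaled(const struct fake_host *h)
{
	return h->nudges_needed >= 0 && h->nudges_seen >= h->nudges_needed;
}

/*
 * Wait in slices of slice_ms, nudging the host before every slice.
 * Returns the remaining time (>= 1) if the fence signaled, or 0 on
 * caller timeout or after giveup_ms of simulated waiting.
 */
static long bounded_wait(struct fake_host *h, long timeout_ms,
			 long slice_ms, long giveup_ms)
{
	long left = timeout_ms, waited = 0;

	for (;;) {
		long slice = left < slice_ms ? left : slice_ms;

		nudge(h);
		if (fence_signaled(h))
			return left > 0 ? left : 1;

		waited += slice;	/* pretend the slice elapsed */
		left -= slice;
		if (left <= 0)
			return 0;	/* caller's own timeout expired */
		if (waited >= giveup_ms)
			return 0;	/* bound hit: host is unresponsive */
	}
}
```

With a 100 ms slice and a 30000 ms bound, an unresponsive host is nudged
300 times before the wait gives up, whereas a slow but responsive host
returns as soon as one of the nudges takes effect.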

Background -- known discussions of this regression:

  * Ubuntu/Launchpad bug and reproducer (test.sh):
      https://bugs.launchpad.net/bugs/2065153
  * Regression report and analysis:
      https://lore.kernel.org/regressions/ZTgydqRlK6WX_b29@eldamar.lan/
  * Debian #1054514 (the report behind the upstream revert):
      https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
  * Ubuntu kernel-team reproducer thread:
      https://lists.ubuntu.com/archives/kernel-team/2025-July/161304.html

Commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") reduced the
wait to a single wait_event_timeout() whose condition calls
qxl_io_notify_oom() exactly once, when first tested.  The condition is
never re-evaluated unless release_event fires, and with the host not
draining no such wakeup arrives; releases never complete and the fence
is never signalled.

qxl_queue_garbage_collect() is called with flush=false on purpose:
flush_work() in this context deadlocks against the console owner and the
worker pool -- the CVE-2024-36944 regression -- which is why the upstream
revert of commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") did
not stick and was reapplied as commit 3628e0383dd3 ("Reapply "drm/qxl:
simplify qxl_fence_wait"").  flush=false matches what the IRQ handler
already does and cannot deadlock.

Bounding the wait deliberately relaxes the assumption documented in
dma_fence_wait() (include/linux/dma-fence.h) that a MAX_SCHEDULE_TIMEOUT
wait can never time out: a give-up is reported to the caller as a
timeout (0), so the few MAX_SCHEDULE_TIMEOUT callers reachable from qxl
through TTM (the page-fault idle wait, delayed delete, the OOM ghost
swapout and evict_all) continue with an unsignaled fence.  This is not a
new failure class: it only fires after fence_giveup_ms on a host that is
entirely unresponsive -- on a merely slow host the fence signals long
before the bound, thanks to the periodic nudge -- and the pre-regression
implementation, commit 07ed11afb68d ("Revert "drm/qxl: simplify
qxl_fence_wait""), already returned with an unsignaled fence via its
'goto signaled' path once its spin-count limit was reached.

The reboot/shutdown hang is a lock-contagion effect.  An atomic commit
holds a DRM modeset ww-mutex and blocks on a release fence that never
signals, so every later console update then blocks on that same mutex.
Systemd's own console writes are such an update: they trigger fbcon
panning (drm_fb_helper_pan_display() -> drm_atomic_get_plane_state() ->
drm_modeset_lock()), which takes the ww-mutex non-interruptibly
(TASK_UNINTERRUPTIBLE) and therefore lands in D state.  With pid 1
stuck in D state the VM can neither be shut down nor rebooted, and only
a host-side reset clears it.

Example of this lock:
  INFO: task systemd:1 blocked for more than 245 seconds.
  Not tainted 6.12.47-generic-5rosa2021.1-x86_64 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:systemd
  state:D stack:0 pid:1 tgid:1 ppid:0
  flags:0x00000000
Call Trace:
  <TASK>
  __schedule+0x408/0x1360
  schedule+0x27/0xf0
  schedule_preempt_disabled+0x15/0x30
  __ww_mutex_lock.constprop.0+0x526/0x980
  ? __kmalloc_noprof+0x177/0x420
  drm_modeset_lock+0x37/0xc0
  drm_atomic_get_plane_state+0x7e/0x180
  drm_client_modeset_commit_atomic+0xaa/0x230
  drm_client_modeset_commit_locked+0x5a/0x160
  drm_fb_helper_pan_display+0xfd/0x240

30 seconds seems to be a reasonable default value:
  - it should not be triggered on a slow, but responsive host,
    realistic worst-case release completion on a slow host is < 10 sec
  - it must be less than hung_task_timeout_secs=120s
  - it should be less than DefaultTimeoutStopSec=90s in systemd

Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
Assisted-by: Kimi:K3
Assisted-by: Z.AI:GLM-5.2
Signed-off-by: Mikhail Novosyolov <m.novosyolov@rosa.ru>
---
 drivers/gpu/drm/qxl/qxl_release.c | 59 ++++++++++++++++++++++++++++---
 1 file changed, 54 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
index 06979d0e8a9f..75286e6d5ec5 100644
--- a/drivers/gpu/drm/qxl/qxl_release.c
+++ b/drivers/gpu/drm/qxl/qxl_release.c
@@ -21,6 +21,8 @@
  */

 #include <linux/delay.h>
+#include <linux/jiffies.h>
+#include <linux/moduleparam.h>

 #include <drm/drm_print.h>

@@ -56,20 +58,67 @@ static const char *qxl_get_timeline_name(struct dma_fence *fence)
 	return "release";
 }

+/*
+ * Upper bound for a single release fence wait when the host does not
+ * respond at all, in milliseconds.  Prevents uninterruptible sleeps
+ * without a bound when a SPICE host stops draining the QXL rings.
+ */
+static unsigned int fence_giveup_ms = 30000;
+module_param(fence_giveup_ms, uint, 0644);
+MODULE_PARM_DESC(fence_giveup_ms,
+		 "upper bound in ms for a qxl release fence wait with unresponsive host (default 30000)");
+
 static long qxl_fence_wait(struct dma_fence *fence, bool intr,
 			   signed long timeout)
 {
 	struct qxl_device *qdev;
 	unsigned long cur, end = jiffies + timeout;
+	unsigned long giveup = jiffies + msecs_to_jiffies(fence_giveup_ms);
+	signed long left = timeout, slice;

 	qdev = container_of(fence->extern_lock, struct qxl_device,
 			    release_lock);

-	if (!wait_event_timeout(qdev->release_event,
-				(dma_fence_is_signaled(fence) ||
-				 (qxl_io_notify_oom(qdev), 0)),
-				timeout))
-		return 0;
+	/*
+	 * A SPICE host may stop draining the QXL rings, e.g. when no
+	 * display client is attached to the VM.  In that state releases
+	 * complete only if the device is kicked through the OOM notifier
+	 * and garbage collection gets a chance to run, so poke both
+	 * periodically while waiting.
+	 *
+	 * Garbage collection is queued without flushing on purpose:
+	 * flush_work() here can deadlock against the console owner and
+	 * the worker pool (see CVE-2024-36944).
+	 *
+	 * The wait is bounded by fence_giveup_ms even when the caller
+	 * passed an infinite timeout: a completely unresponsive host
+	 * must not hang tasks holding the console lock forever.  The
+	 * bound is reported to the caller as an ordinary timeout,
+	 * letting TTM retry eviction later.
+	 */
+	for (;;) {
+		qxl_io_notify_oom(qdev);
+		qxl_queue_garbage_collect(qdev, false);
+
+		slice = min_t(signed long, left, msecs_to_jiffies(100));
+		if (wait_event_timeout(qdev->release_event,
+				       dma_fence_is_signaled(fence),
+				       slice))
+			break;
+
+		left -= slice;
+		if (left <= 0)
+			return 0;
+
+		if (time_after(jiffies, giveup)) {
+			dev_warn_ratelimited(qdev->ddev.dev,
+					     "fence %llu#%llu not signaled after %u ms, giving up\n",
+					     (unsigned long long)fence->context,
+					     (unsigned long long)fence->seqno,
+					     fence_giveup_ms);
+			return 0;
+		}
+	}

 	cur = jiffies;
 	if (time_after(cur, end))
-- 
2.51.0

^ permalink raw reply related

* [PATCH v6 5/5] vhost/vsock: add VHOST_RESET_OWNER ioctl
From: Andrey Drobyshev @ 2026-07-24 11:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev
In-Reply-To: <20260724114542.734623-1-andrey.drobyshev@virtuozzo.com>

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

This ioctl is needed for QEMU's CPR (checkpoint-restore) migration of
the guest with vhost-vsock device.  For this to work, we need to reset
the device ownership on the source side by calling RESET_OWNER, and then
claim it on the dest side by calling SET_OWNER.  We expect not to lose any
AF_VSOCK connection while this happens.

To that end, unlike the release path, RESET_OWNER keeps the guest CID
hashed: established connections survive, and host sends issued while
the device is between owners simply stay on send_pkt_queue until the
next device start drains them.

Since the device stays reachable through the CID hash, the lockless
send/cancel paths can race with the worker teardown in
vhost_workers_free().  The previous commit ("vhost: synchronize with
RCU readers when freeing workers") makes that safe.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
 drivers/vhost/vsock.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 87309c68af15..906cec7b5afd 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -903,6 +903,36 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
 	return -EFAULT;
 }
 
+static long vhost_vsock_reset_owner(struct vhost_vsock *vsock)
+{
+	struct vhost_iotlb *umem;
+	long err;
+
+	mutex_lock(&vsock->dev.mutex);
+	err = vhost_dev_check_owner(&vsock->dev);
+	if (err)
+		goto done;
+	umem = vhost_dev_reset_owner_prepare();
+	if (!umem) {
+		err = -ENOMEM;
+		goto done;
+	}
+	vhost_vsock_drop_backends(vsock);
+	vhost_vsock_flush(vsock);
+	vhost_dev_stop(&vsock->dev);
+
+	/*
+	 * vsock keeps the CID hashed across RESET_OWNER, so a lockless sender
+	 * can still reach the device while its workers are freed.  Pass
+	 * sync=true so vhost_workers_free() fences RCU readers and flushes
+	 * before freeing them.
+	 */
+	vhost_dev_reset_owner(&vsock->dev, umem, true);
+done:
+	mutex_unlock(&vsock->dev.mutex);
+	return err;
+}
+
 static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
 				  unsigned long arg)
 {
@@ -946,6 +976,8 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
 			return -EOPNOTSUPP;
 		vhost_set_backend_features(&vsock->dev, features);
 		return 0;
+	case VHOST_RESET_OWNER:
+		return vhost_vsock_reset_owner(vsock);
 	default:
 		mutex_lock(&vsock->dev.mutex);
 		r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
-- 
2.47.1


^ permalink raw reply related

* [PATCH v6 4/5] vhost: synchronize with RCU readers when freeing workers
From: Andrey Drobyshev @ 2026-07-24 11:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev
In-Reply-To: <20260724114542.734623-1-andrey.drobyshev@virtuozzo.com>

vhost_vq_work_queue() only holds the RCU read lock while it dereferences
vq->worker and queues work on it.  vhost_workers_free() however clears
the vq->worker pointers and immediately frees the workers, without
waiting for a grace period.  A caller that fetched the worker right
before the pointer was cleared can therefore still be queueing work on
it while it is freed.  And even when the queueing itself wins the race,
the work is never run, so its VHOST_WORK_QUEUED bit stays set and all
future attempts to queue it are silently skipped.
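
The "stuck QUEUED bit" half of the problem can be illustrated with a small
userspace model. This is a sketch only: struct work, struct worker and the
three helpers are hypothetical, the queue is a single slot, and the RCU
grace period is not modelled at all; only the effect of flushing before
freeing is shown.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct work {
	atomic_flag queued;	/* models the VHOST_WORK_QUEUED bit */
	int runs;
};

struct worker {
	struct work *pending;	/* single-slot queue, enough for the model */
};

/* Queue only if the work is not already marked as queued. */
static bool work_queue(struct worker *w, struct work *wk)
{
	if (atomic_flag_test_and_set(&wk->queued))
		return false;	/* silently skipped */
	w->pending = wk;
	return true;
}

/* Run whatever is pending, clearing the queued bit first. */
static void worker_flush(struct worker *w)
{
	struct work *wk = w->pending;

	if (!wk)
		return;
	w->pending = NULL;
	atomic_flag_clear(&wk->queued);
	wk->runs++;
}

/* Without sync, pending work is dropped and its queued bit stays set. */
static void worker_free(struct worker *w, bool sync)
{
	if (sync)
		worker_flush(w);
	w->pending = NULL;
}
```

In this model, freeing a worker without flushing leaves the work item
permanently marked as queued, so a later attempt to queue it on a fresh
worker is skipped; flushing first lets the item run once and be queued
again afterwards.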

None of the current callers can actually hit this: net and scsi stop
their virtqueues before the workers are freed, and vsock unhashes the
device and does synchronize_rcu() of its own in vhost_vsock_dev_release()
before the workers go away.  But the upcoming VHOST_RESET_OWNER support
in vhost-vsock keeps the device hashed while its workers are freed, so
the lockless send/cancel paths become able to race with the teardown.

Fix this by threading a bool 'sync' param through the chain:

  vhost_dev_reset_owner()
    vhost_dev_cleanup()
      vhost_workers_free()

When set, after clearing the vq->worker pointers, wait for a grace
period and flush the workers so any work the last RCU readers queued
runs (clearing VHOST_WORK_QUEUED) before the workers are freed.  Only
the vsock RESET_OWNER path (added in the next patch) passes sync=true;
every other teardown has already quiesced and passes false, so it does
not need to wait for a grace period.

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
 drivers/vhost/net.c   |  4 ++--
 drivers/vhost/scsi.c  |  2 +-
 drivers/vhost/test.c  |  4 ++--
 drivers/vhost/vdpa.c  |  4 ++--
 drivers/vhost/vhost.c | 26 +++++++++++++++++++++-----
 drivers/vhost/vhost.h |  5 +++--
 drivers/vhost/vsock.c |  2 +-
 7 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 3e72b9c6af0c..fd089412c020 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1452,7 +1452,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	vhost_net_stop(n, &tx_sock, &rx_sock);
 	vhost_net_flush(n);
 	vhost_dev_stop(&n->dev);
-	vhost_dev_cleanup(&n->dev);
+	vhost_dev_cleanup(&n->dev, false);
 	vhost_net_vq_reset(n);
 	if (tx_sock)
 		sockfd_put(tx_sock);
@@ -1661,7 +1661,7 @@ static long vhost_net_reset_owner(struct vhost_net *n)
 	vhost_net_stop(n, &tx_sock, &rx_sock);
 	vhost_net_flush(n);
 	vhost_dev_stop(&n->dev);
-	vhost_dev_reset_owner(&n->dev, umem);
+	vhost_dev_reset_owner(&n->dev, umem, false);
 	vhost_net_vq_reset(n);
 done:
 	mutex_unlock(&n->dev.mutex);
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 9a1253b9d8c5..b6002b64cf42 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -2350,7 +2350,7 @@ static int vhost_scsi_release(struct inode *inode, struct file *f)
 	mutex_unlock(&vs->dev.mutex);
 	vhost_scsi_clear_endpoint(vs, &t);
 	vhost_dev_stop(&vs->dev);
-	vhost_dev_cleanup(&vs->dev);
+	vhost_dev_cleanup(&vs->dev, false);
 	kfree(vs->dev.vqs);
 	kfree(vs->vqs);
 	kfree(vs->old_inflight);
diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 24514c8fdee4..84ed9ea81219 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -163,7 +163,7 @@ static int vhost_test_release(struct inode *inode, struct file *f)
 	vhost_test_stop(n, &private);
 	vhost_test_flush(n);
 	vhost_dev_stop(&n->dev);
-	vhost_dev_cleanup(&n->dev);
+	vhost_dev_cleanup(&n->dev, false);
 	kfree(n->dev.vqs);
 	kfree(n);
 	return 0;
@@ -238,7 +238,7 @@ static long vhost_test_reset_owner(struct vhost_test *n)
 	vhost_test_stop(n, &priv);
 	vhost_test_flush(n);
 	vhost_dev_stop(&n->dev);
-	vhost_dev_reset_owner(&n->dev, umem);
+	vhost_dev_reset_owner(&n->dev, umem, false);
 done:
 	mutex_unlock(&n->dev.mutex);
 	return err;
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ac55275fa0d0..427c4a34db2c 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -898,7 +898,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 	case VHOST_SET_OWNER:
 		r = vhost_vdpa_bind_mm(v);
 		if (r)
-			vhost_dev_reset_owner(d, NULL);
+			vhost_dev_reset_owner(d, NULL, false);
 		break;
 	}
 out:
@@ -1396,7 +1396,7 @@ static void vhost_vdpa_cleanup(struct vhost_vdpa *v)
 	}
 
 	vhost_vdpa_free_domain(v);
-	vhost_dev_cleanup(&v->vdev);
+	vhost_dev_cleanup(&v->vdev, false);
 	kfree(v->vdev.vqs);
 	v->vdev.vqs = NULL;
 }
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 4c525b3e16ea..b18429b15d32 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -719,7 +719,7 @@ static void vhost_worker_destroy(struct vhost_dev *dev,
 	kfree(worker);
 }
 
-static void vhost_workers_free(struct vhost_dev *dev)
+static void vhost_workers_free(struct vhost_dev *dev, bool sync)
 {
 	struct vhost_worker *worker;
 	unsigned long i;
@@ -729,6 +729,21 @@ static void vhost_workers_free(struct vhost_dev *dev)
 
 	for (i = 0; i < dev->nvqs; i++)
 		rcu_assign_pointer(dev->vqs[i]->worker, NULL);
+
+	/*
+	 * This path can be reached by lockless work queuers.  In this case
+	 * vhost_vq_work_queue() reads vq->worker under rcu_read_lock(), so a
+	 * RCU reader that fetched a worker before we cleared the pointers above
+	 * may still be queueing work on it.  Wait for those readers to finish,
+	 * then flush so any work they queued runs (clearing VHOST_WORK_QUEUED)
+	 * before the workers are freed.  Notably, that is the case for the
+	 * vsock RESET_OWNER path.
+	 */
+	if (sync) {
+		synchronize_rcu();
+		vhost_dev_flush(dev);
+	}
+
 	/*
 	 * Free the default worker we created and cleanup workers userspace
 	 * created but couldn't clean up (it forgot or crashed).
@@ -1148,11 +1163,12 @@ struct vhost_iotlb *vhost_dev_reset_owner_prepare(void)
 EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare);
 
 /* Caller should have device mutex */
-void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_iotlb *umem)
+void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_iotlb *umem,
+			   bool sync)
 {
 	int i;
 
-	vhost_dev_cleanup(dev);
+	vhost_dev_cleanup(dev, sync);
 
 	dev->fork_owner = fork_from_owner_default;
 	dev->umem = umem;
@@ -1197,7 +1213,7 @@ void vhost_clear_msg(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_clear_msg);
 
-void vhost_dev_cleanup(struct vhost_dev *dev)
+void vhost_dev_cleanup(struct vhost_dev *dev, bool sync)
 {
 	int i;
 
@@ -1221,7 +1237,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	vhost_workers_free(dev);
+	vhost_workers_free(dev, sync);
 	vhost_detach_mm(dev);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0192ade6e749..29fb1f510a34 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -216,8 +216,9 @@ long vhost_dev_set_owner(struct vhost_dev *dev);
 bool vhost_dev_has_owner(struct vhost_dev *dev);
 long vhost_dev_check_owner(struct vhost_dev *);
 struct vhost_iotlb *vhost_dev_reset_owner_prepare(void);
-void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_iotlb *iotlb);
-void vhost_dev_cleanup(struct vhost_dev *);
+void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_iotlb *iotlb,
+			   bool sync);
+void vhost_dev_cleanup(struct vhost_dev *dev, bool sync);
 void vhost_dev_stop(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
 long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index d5022d21120b..87309c68af15 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -824,7 +824,7 @@ static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
 
 	virtio_vsock_skb_queue_purge(&vsock->send_pkt_queue);
 
-	vhost_dev_cleanup(&vsock->dev);
+	vhost_dev_cleanup(&vsock->dev, false);
 	put_net_track(vsock->net, &vsock->ns_tracker);
 	kfree(vsock->dev.vqs);
 	vhost_vsock_free(vsock);
-- 
2.47.1


^ permalink raw reply related

* [PATCH v6 3/5] vhost/vsock: re-scan TX virtqueue on device start
From: Andrey Drobyshev @ 2026-07-24 11:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev
In-Reply-To: <20260724114542.734623-1-andrey.drobyshev@virtuozzo.com>

During QEMU CPR live-update (and VHOST_RESET_OWNER in general) the guest
keeps running while the host drops and later re-attaches vhost backends.
If the guest adds a buffer to the TX virtqueue (guest->host) and kicks
while the backend is temporarily NULL (between vhost_vsock_drop_backends()
and the next vhost_vsock_start()), then the kick is delivered to the
vhost worker, handle_tx_kick() sees a NULL backend and returns, and the
kick signal is consumed.  The buffer is then left in the ring.

Then upon device start vhost_vsock_start() only re-kicks the RX send
worker, never the TX VQ, so the buffer is processed only if the guest
happens to kick again.  But if the guest itself is now waiting for data
from the host, it will never kick TX VQ again, and we end up in a
deadlock.

The issue itself is pre-existing, but it only manifests during a device
pause caused by VHOST_RESET_OWNER.  Namely, the deadlock is reproduced
during active host->guest socat data transfer under multiple consecutive
CPR live-updates.

To fix this, in vhost_vsock_start(), after kicking the RX send worker, also
queue the TX vq poll so any buffers the guest enqueued while we were paused
get scanned.

The VHOST_RESET_OWNER ioctl itself is implemented in the following
patch, thus this patch is a preparation to support VHOST_RESET_OWNER.

Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
---
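Editorial illustration, not part of the patch: a minimal userspace sketch of
the lost-kick window described in the commit message above.  All names
(toy_vq, toy_handle_kick, toy_guest_send, toy_start) are hypothetical and do
not exist in the kernel.  The model assumes only what the message states: a
kick that arrives while the backend is NULL is consumed without draining the
ring, and a re-scan on start picks up whatever was left behind.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a TX virtqueue; not a kernel structure. */
struct toy_vq {
	size_t pending;		/* buffers the guest has placed in the ring */
	bool has_backend;	/* false during the NULL-backend window */
};

/* Models the kick handler: the kick is consumed either way, but buffers
 * are drained only while a backend is attached.  Returns how many
 * buffers were processed.
 */
static size_t toy_handle_kick(struct toy_vq *vq)
{
	size_t done;

	if (!vq->has_backend)
		return 0;

	done = vq->pending;
	vq->pending = 0;
	return done;
}

/* Guest adds one buffer and kicks once. */
static size_t toy_guest_send(struct toy_vq *vq)
{
	vq->pending++;
	return toy_handle_kick(vq);
}

/* Models device start.  'rescan' selects the behaviour before (false)
 * or after (true) this patch.
 */
static size_t toy_start(struct toy_vq *vq, bool rescan)
{
	vq->has_backend = true;
	return rescan ? toy_handle_kick(vq) : 0;
}
```

In this model, a buffer sent during the pause stays pending after
toy_start(vq, false) until the guest kicks again, whereas toy_start(vq, true)
drains it immediately.
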
 drivers/vhost/vsock.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 27169a09e87e..d5022d21120b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -646,6 +646,13 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 	 */
 	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);

+	/* The guest may have added TX buffers while the device was stopped
+	 * (e.g. across VHOST_RESET_OWNER) and their kicks got consumed by
+	 * the NULL-backend window.  Re-scan the TX VQ, mirroring the RX
+	 * send-worker kick above.
+	 */
+	vhost_poll_queue(&vsock->vqs[VSOCK_VQ_TX].poll);
+
 	mutex_unlock(&vsock->dev.mutex);
 	return 0;

-- 
2.47.1

^ permalink raw reply related

* [PATCH v6 0/5] vhost/vsock: add support for VHOST_RESET_OWNER and CPR migration
From: Andrey Drobyshev @ 2026-07-24 11:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev

v5 -> v6:

  * Patch 4:
    - make sync+flush conditional by adding a bool 'sync' param to the
      chain, as suggested by Michael:
        vhost_dev_reset_owner()
          vhost_dev_cleanup()
            vhost_workers_free()

    - reword commit message and comment;
    - drop Fixes;

  * Patch 5:
    - add 'sync' param to vhost_dev_reset_owner() call.

v5: https://lore.kernel.org/virtualization/20260720102241.371610-1-andrey.drobyshev@virtuozzo.com

Andrey Drobyshev (3):
  vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
  vhost/vsock: re-scan TX virtqueue on device start
  vhost: synchronize with RCU readers when freeing workers

Pavel Tikhomirov (2):
  vhost/vsock: split out vhost_vsock_drop_backends helper
  vhost/vsock: add VHOST_RESET_OWNER ioctl

 drivers/vhost/net.c   |  4 +-
 drivers/vhost/scsi.c  |  2 +-
 drivers/vhost/test.c  |  4 +-
 drivers/vhost/vdpa.c  |  4 +-
 drivers/vhost/vhost.c | 26 ++++++++++---
 drivers/vhost/vhost.h |  5 ++-
 drivers/vhost/vsock.c | 89 +++++++++++++++++++++++++++++++++----------
 7 files changed, 100 insertions(+), 34 deletions(-)

-- 
2.47.1


^ permalink raw reply

* [PATCH v6 2/5] vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
From: Andrey Drobyshev @ 2026-07-24 11:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev
In-Reply-To: <20260724114542.734623-1-andrey.drobyshev@virtuozzo.com>

Earlier commit bb26ed5f3a8b ("vhost/vsock: Refuse the connection
immediately when guest isn't ready") added a fast-fail in
vhost_transport_send_pkt().  It rejects every host send with -EHOSTUNREACH
until the destination calls SET_RUNNING(1).  The fast-fail condition checks
whether the device's backends are dropped, and if they are, the guest is
considered not ready.

However, there might be other reasons for backends to be nulled.  In
particular, when QEMU is performing CPR (CheckPoint and Restart) migration,
device ownership is RESET and SET again, which drops and reattaches the
backends.  If we end up connecting during this window, an AF_VSOCK client
gets -EHOSTUNREACH, which is wrong.

Add an 'ever_started' flag which is set once in vhost_vsock_start() and is
never cleared.  The behaviour changes to:

  * When device was never started -> flag is unset -> no listener can
    exist yet -> fast-fail;
  * Once the device starts -> flag is set -> we don't fast-fail ->
    we queue and preserve during any later stop / CPR pause.

The VHOST_RESET_OWNER ioctl is implemented in a following patch, and
without RESET_OWNER the problem we fix here does not manifest; thus
this patch is a preparation to support RESET_OWNER.

Important caveat: after the first start, a connect during any stopped
window is queued instead of fast-failed.  That was the behaviour before
the patch bb26ed5f3a8b, and we're restoring it now.  However we still
keep the behaviour originally intended by that commit (i.e. fast-fail if
there's no real listener yet) while fixing the CPR path.

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
---
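Editorial illustration, not part of the patch: a minimal userspace sketch
contrasting the two fast-fail policies described in the commit message above.
The names (toy_dev, toy_dev_start, toy_dev_pause, toy_send_old, toy_send_new)
are hypothetical; only the 'ever_started' idea and the -EHOSTUNREACH return
value come from the patch.  Queuing is reduced to "return 0" here.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy device state; not the kernel's struct vhost_vsock. */
struct toy_dev {
	bool backend_attached;	/* cleared on every stop / CPR pause */
	bool ever_started;	/* set on first start, never cleared */
};

static void toy_dev_start(struct toy_dev *d)
{
	d->backend_attached = true;
	d->ever_started = true;
}

static void toy_dev_pause(struct toy_dev *d)
{
	d->backend_attached = false;	/* ever_started is left untouched */
}

/* Old policy: fast-fail whenever the backend is currently NULL. */
static int toy_send_old(const struct toy_dev *d)
{
	return d->backend_attached ? 0 : -EHOSTUNREACH;
}

/* New policy: fast-fail only if the device has never been started. */
static int toy_send_new(const struct toy_dev *d)
{
	return d->ever_started ? 0 : -EHOSTUNREACH;
}
```

Both policies agree before the first start and while the device is running;
they differ only in the paused window, where the old policy returns
-EHOSTUNREACH and the new one keeps queuing.
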
 drivers/vhost/vsock.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index b12221ce6faf..27169a09e87e 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -61,6 +61,7 @@ struct vhost_vsock {
 
 	u32 guest_cid;
 	bool seqpacket_allow;
+	bool ever_started; /* set on first SET_RUNNING(1); never cleared */
 };
 
 static u32 vhost_transport_get_local_cid(void)
@@ -302,17 +303,12 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
 		return -ENODEV;
 	}
 
-	/* Fast-fail if the guest hasn't enabled the RX vq yet. Queuing the packet
-	 * and making the caller wait is pointless: even if the guest manages to init
-	 * within the timeout, it'll immediately reply with RST, because there's no
-	 * listener on the port yet.
-	 *
-	 * vhost_vq_get_backend() without vq->mutex is acceptable here: locking
-	 * the mutex would be too expensive in this hot path, and we already have
-	 * all the outcomes covered: if the backend becomes NULL right after the check,
-	 * vhost_transport_do_send_pkt() will check it under the mutex anyway.
+	/* Fast-fail until the guest first enables the device (SET_RUNNING(1)).
+	 * Before that there is no listener, so queuing is pointless.
+	 * 'ever_started' is never cleared, so once we're up we keep queuing
+	 * across later stop / CPR-pause windows.
 	 */
-	if (unlikely(!data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))) {
+	if (unlikely(!READ_ONCE(vsock->ever_started))) {
 		rcu_read_unlock();
 		kfree_skb(skb);
 		return -EHOSTUNREACH;
@@ -640,6 +636,11 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 		mutex_unlock(&vq->mutex);
 	}
 
+	/* Set 'ever_started' flag on the first start; never cleared, so send_pkt
+	 * keeps queuing (instead of fast-failing) on later stop / CPR pauses.
+	 */
+	WRITE_ONCE(vsock->ever_started, true);
+
 	/* Some packets may have been queued before the device was started,
 	 * let's kick the send worker to send them.
 	 */
@@ -728,6 +729,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 
 	vsock->guest_cid = 0; /* no CID assigned yet */
 	vsock->seqpacket_allow = false;
+	vsock->ever_started = false;
 
 	atomic_set(&vsock->queued_replies, 0);
 
-- 
2.47.1


^ permalink raw reply related

* [PATCH v6 1/5] vhost/vsock: split out vhost_vsock_drop_backends helper
From: Andrey Drobyshev @ 2026-07-24 11:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev
In-Reply-To: <20260724114542.734623-1-andrey.drobyshev@virtuozzo.com>

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

Split the actual backend dropping part from vhost_vsock_stop.  We're
going to need it for the VHOST_RESET_OWNER implementation in the
following patch, when vsock->dev.mutex is already taken and owner is
checked.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
---
 drivers/vhost/vsock.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 9aaab6bb8061..b12221ce6faf 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -664,9 +664,24 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 	return ret;
 }
 
-static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
+static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
 {
+	struct vhost_virtqueue *vq;
 	size_t i;
+
+	lockdep_assert_held(&vsock->dev.mutex);
+
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
+		vq = &vsock->vqs[i];
+
+		mutex_lock(&vq->mutex);
+		vhost_vq_set_backend(vq, NULL);
+		mutex_unlock(&vq->mutex);
+	}
+}
+
+static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
+{
 	int ret = 0;
 
 	mutex_lock(&vsock->dev.mutex);
@@ -677,14 +692,7 @@ static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
 			goto err;
 	}
 
-	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
-		struct vhost_virtqueue *vq = &vsock->vqs[i];
-
-		mutex_lock(&vq->mutex);
-		vhost_vq_set_backend(vq, NULL);
-		mutex_unlock(&vq->mutex);
-	}
-
+	vhost_vsock_drop_backends(vsock);
 err:
 	mutex_unlock(&vsock->dev.mutex);
 	return ret;
-- 
2.47.1


^ permalink raw reply related

* Re: [PATCH v2] drm/virtio: fix deadlock in display_info_cb by removing hotplug from dequeue worker
From: Sasha Levin @ 2026-07-24 11:22 UTC (permalink / raw)
  To: Dmitry Osipenko, stable
  Cc: Sasha Levin, Ryosuke Yasuoka, David Airlie, Gerd Hoffmann,
	Gurchetan Singh, Chia-I Wu, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Simona Vetter, Dmitry Baryshkov,
	Javier Martinez Canillas, dri-devel, virtualization, linux-kernel,
	Mikko Rapeli
In-Reply-To: <amG_T7Kb7DW3VTxm@nuoska>

> In Yocto distro, this change released in v7.2-rc4 seems to fix quite severe qemu
> boot hangs seen with 6.18 stable kernels. Details in
> https://bugzilla.yoctoproject.org/show_bug.cgi?id=16217
>
> Please apply this to 6.18 and other stable trees.

Queued for 6.6, 6.12, 6.18, and 7.1, thanks.

-- 
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH 1/3] hw/virtio: Optimize vhost_log_sync_range to avoid N^2 complexity
From: Michael S. Tsirkin @ 2026-07-24  9:56 UTC (permalink / raw)
  To: Weimin Xiong; +Cc: qemu-devel, jasowang, virtualization, Xiong Weimin
In-Reply-To: <20260724093750.810018-1-xiongwm2026@163.com>

On Fri, Jul 24, 2026 at 05:37:50PM +0800, Weimin Xiong wrote:
> From: Xiong Weimin <xiongweimin@kylinos.cn>
> 
> The vhost_log_sync_range() function iterates through all memory sections
> for each call, resulting in O(N^2) complexity when syncing dirty
> bitmaps. This is inefficient for guests with many memory sections.
> 
> Optimize this by:
> 1. Checking if a section overlaps with [first, last] range before
>    calling vhost_sync_dirty_bitmap()
> 2. Early exit when section starts beyond the requested range

As with any optimization, I expect actual perf measurements
showing an improvement to be included, please.

> Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
> ---
>  hw/virtio/vhost.c | 18 +++++++++++++++--
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 1234567890ab..fedcba098765 4321006
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -280,11 +280,21 @@ static void vhost_log_sync(MemoryListener *listener,
>  static void vhost_log_sync_range(struct vhost_dev *dev,
>                                   hwaddr first, hwaddr last)
>  {
> -    int i;
> -    /* FIXME: this is N^2 in number of sections */
> -    for (i = 0; i < dev->n_mem_sections; ++i) {
> -        MemoryRegionSection *section = &dev->mem_sections[i];
> -        vhost_sync_dirty_bitmap(dev, section, first, last);
> +    int i, j;
> +    bool section_dirty = false;
> +
> +    for (i = 0; i < dev->n_mem_sections; ++i) {
> +        MemoryRegionSection *section = &dev->mem_sections[i];
> +        hwaddr section_start, section_end;
> +
> +        /* Skip sections that don't overlap with [first, last] */
> +        section_start = section->offset_within_address_space;
> +        section_end = section_start + int128_get64(section->size) - 1;
> +
> +        if (section_end < first || section_start > last) {
> +            continue;
> +        }
> +
> +        vhost_sync_dirty_bitmap(dev, section, first, last);


So it is still O(N^2) right? It just skips calling it for
some sections.

>      }
>  }
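
Editorial illustration, not part of the thread: a minimal sketch of the point
raised above, under the assumption that each section is a simple
[start, start + size - 1] interval with a nonzero size.  The types and the
function toy_sync_range are hypothetical and are not QEMU APIs.  It counts how
many sections the loop visits versus how many overlap the requested range.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a memory section; size is assumed to be nonzero. */
struct toy_section {
	uint64_t start;
	uint64_t size;
};

struct toy_counts {
	size_t visited;	/* loop iterations */
	size_t synced;	/* sections that overlap [first, last] */
};

/* Same overlap test as the quoted patch, with counters instead of the
 * actual sync call.
 */
static struct toy_counts toy_sync_range(const struct toy_section *s, size_t n,
					uint64_t first, uint64_t last)
{
	struct toy_counts c = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t end = s[i].start + s[i].size - 1;

		c.visited++;
		if (end < first || s[i].start > last)
			continue;
		c.synced++;
	}
	return c;
}
```

In this sketch 'visited' always equals the number of sections, so each call
still walks the whole array; only the per-section work is skipped.
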
>  


^ permalink raw reply

* Re: [PATCH 3/3] hw/virtio: Add error handling for vhost_virtqueue_mask failure
From: Michael S. Tsirkin @ 2026-07-24  9:51 UTC (permalink / raw)
  To: Weimin Xiong; +Cc: qemu-devel, jasowang, virtualization, Xiong Weimin
In-Reply-To: <20260724093759.810108-2-xiongwm2026@163.com>

On Fri, Jul 24, 2026 at 05:37:59PM +0800, Weimin Xiong wrote:
> From: Xiong Weimin <xiongweimin@kylinos.cn>
> 
> In vhost_virtqueue_start(), the call to vhost_virtqueue_mask() has a
> TODO comment indicating errors are not handled. Add proper error
> checking and propagate errors to the caller.
> 
> Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>

why is this part of this patchset?

> ---
>  hw/virtio/vhost.c | 10 +++++++--
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 1234567890ab..fedcba098765 4321006
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1467,9 +1467,13 @@ static int vhost_virtqueue_start(struct VirtIODevice *vdev,
>       * will do it later.
>       */
>      if (!vdev->use_guest_notifier_mask) {
> -        /* TODO: check and handle errors. */
> -        vhost_virtqueue_mask(dev, vdev, idx, false);
> +        int r = vhost_virtqueue_mask(dev, vdev, idx, false);
> +        if (r < 0) {
> +            VHOST_OPS_DEBUG(r, "vhost_virtqueue_mask failed");
> +            r = -1;
> +            goto fail;
> +        }


and then what happens?

>      }
>  
>      if (k->query_guest_notifiers &&


^ permalink raw reply

* Re: [PATCH 2/3] hw/virtio: Implement vhost_log_start and vhost_log_stop
From: Michael S. Tsirkin @ 2026-07-24  9:48 UTC (permalink / raw)
  To: Weimin Xiong; +Cc: qemu-devel, jasowang, virtualization, Xiong Weimin
In-Reply-To: <20260724093759.810108-1-xiongwm2026@163.com>

On Fri, Jul 24, 2026 at 05:37:58PM +0800, Weimin Xiong wrote:
> From: Xiong Weimin <xiongweimin@kylinos.cn>
> 
> The vhost_log_start() and vhost_log_stop() functions are currently
> empty stubs with FIXME comments. Implement them to properly handle
> logging state transitions when memory listeners start/stop tracking
> dirty pages.
> 
> This is needed for proper dirty page logging during live migration.

Is the implication that, in your opinion, dirty page logging during live
migration does not work currently?

> Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
> ---
>  hw/virtio/vhost.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 1234567890ab..fedcba098765 4321006
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1297,16 +1297,32 @@ static void vhost_log_stop(MemoryListener *listener,
>      }
>  }
>  
> +static void vhost_migration_state_changed(void *opaque, int state, void *data)
> +{
> +    struct vhost_dev *dev = opaque;
> +    Error **errp = data;
> +
> +    if (state == MIGRATION_STATUS_ACTIVE) {
> +        /* Migration started - enable logging */
> +        if (dev->log_enabled && dev->vhost_ops->vhost_set_log_dev) {
> +            dev->vhost_ops->vhost_set_log_dev(dev, true);
> +        }
> +    } else if (state == MIGRATION_STATUS_COMPLETED ||
> +               state == MIGRATION_STATUS_FAILED) {
> +        /* Migration finished - disable logging */
> +        if (dev->log_enabled && dev->vhost_ops->vhost_set_log_dev) {
> +            dev->vhost_ops->vhost_set_log_dev(dev, false);
> +        }
> +    }
> +}
> +
>  static void vhost_log_start(MemoryListener *listener,
>                              MemoryRegionSection *section,
>                              int old, int new)
>  {
> -    /* FIXME: implement */
> +    struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> +                                         memory_listener);
> +    /* Enable dirty page tracking for this section */
> +    dev->log_enabled = true;
>  }
>  
>  static void vhost_log_stop(MemoryListener *listener,
>                             MemoryRegionSection *section,
>                             int old, int new)
>  {
> -    /* FIXME: implement */
> +    struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> +                                         memory_listener);
> +    /* Disable dirty page tracking for this section */
> +    dev->log_enabled = false;
>  }
>  
>  /* The vhost driver natively knows how to handle the vrings of non


^ permalink raw reply

* Re: [PATCH] hw/net/virtio-net: Remove dangerous cast in receive_header
From: Michael S. Tsirkin @ 2026-07-24  9:44 UTC (permalink / raw)
  To: Weimin Xiong; +Cc: qemu-devel, jasowang, virtualization, Xiong Weimin
In-Reply-To: <20260724093809.810263-1-xiongwm2026@163.com>

On Fri, Jul 24, 2026 at 05:38:09PM +0800, Weimin Xiong wrote:
> From: Xiong Weimin <xiongweimin@kylinos.cn>
> 
> The receive_header() function contains a "FIXME this cast is evil"
> comment. The code does:
>     void *wbuf = (void *)buf;
> 
> This cast from 'const void *' to 'void *' removes const qualifier which
> could lead to unintended modifications.

... and so you cast to uint8_t * instead, removing const qualifier.

> Instead, use a properly typed
> pointer and pass it correctly to work_around_broken_dhclient().
> 
> Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
> ---
>  hw/net/virtio-net.c | 12 +++++++----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 1234567890ab..fedcba098765 4321006
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -1716,12 +1716,14 @@ static void receive_header(VirtIONet *n, const struct iovec *iov, int iov_cnt,
>                             const void *buf, size_t size)
>  {
>      if (n->has_vnet_hdr) {
> -        /* FIXME this cast is evil */
> -        void *wbuf = (void *)buf;
> -        work_around_broken_dhclient(wbuf, wbuf + n->host_hdr_len,
> -                                    size - n->host_hdr_len);
> +        const uint8_t *wbuf = buf;
> +        uint8_t *payload = (uint8_t *)(wbuf + n->host_hdr_len);
> +

Forgive me, how is this better?
What is properly typed and correct about casting to uint8_t *?
As far as I can see, all you did is remove a comment.

> +        work_around_broken_dhclient(wbuf, payload,
> +                                    size - n->host_hdr_len);
>  
>          if (n->needs_vnet_hdr_swap) {
> -            virtio_net_hdr_swap(VIRTIO_DEVICE(n), wbuf);
> +            virtio_net_hdr_swap(VIRTIO_DEVICE(n), (void *)wbuf);
>          }
>          iov_from_buf(iov, iov_cnt, 0, buf, sizeof(struct virtio_net_hdr));
>      } else {


^ permalink raw reply

* [PATCH] hw/net/virtio-net: Remove dangerous cast in receive_header
From: Weimin Xiong @ 2026-07-24  9:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: jasowang, mst, virtualization, Xiong Weimin

From: Xiong Weimin <xiongweimin@kylinos.cn>

The receive_header() function contains a "FIXME this cast is evil"
comment. The code does:
    void *wbuf = (void *)buf;

This cast from 'const void *' to 'void *' removes const qualifier which
could lead to unintended modifications. Instead, use a properly typed
pointer and pass it correctly to work_around_broken_dhclient().

Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
---
 hw/net/virtio-net.c | 12 +++++++----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 1234567890ab..fedcba098765 4321006
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1716,12 +1716,14 @@ static void receive_header(VirtIONet *n, const struct iovec *iov, int iov_cnt,
                            const void *buf, size_t size)
 {
     if (n->has_vnet_hdr) {
-        /* FIXME this cast is evil */
-        void *wbuf = (void *)buf;
-        work_around_broken_dhclient(wbuf, wbuf + n->host_hdr_len,
-                                    size - n->host_hdr_len);
+        const uint8_t *wbuf = buf;
+        uint8_t *payload = (uint8_t *)(wbuf + n->host_hdr_len);
+
+        work_around_broken_dhclient(wbuf, payload,
+                                    size - n->host_hdr_len);
 
         if (n->needs_vnet_hdr_swap) {
-            virtio_net_hdr_swap(VIRTIO_DEVICE(n), wbuf);
+            virtio_net_hdr_swap(VIRTIO_DEVICE(n), (void *)wbuf);
         }
         iov_from_buf(iov, iov_cnt, 0, buf, sizeof(struct virtio_net_hdr));
     } else {


^ permalink raw reply related
