* [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
@ 2025-11-19 12:44 Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept Shrikanth Hegde
` (20 more replies)
0 siblings, 21 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
Detailed problem statement and some of the implementation choices were
discussed earlier[1].
[1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
This is likely the version which would be used for the LPC2025 discussion on
this topic. Feel free to provide your suggestions; hoping for a solution
that works for different architectures and their use cases.
All the existing alternatives such as CPU hotplug, creating isolated
partitions etc. break user affinity. Since the number of CPUs to use changes
depending on the steal time, it is not driven by the user. Hence it would be
wrong to break the affinity. With this series, if a task is pinned only to
paravirt CPUs, it will continue running there.
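For example, a task that the user has affined (say via taskset or cpusets)
only to CPUs which later get marked paravirt keeps running on those CPUs;
only tasks that have somewhere else to go are pushed out.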
Changes compared to v3[1]:
- Introduced computation of steal time in powerpc code.
- Derive number of CPUs to use and mark the remaining as paravirt based
on steal values.
- Provide debugfs knobs to alter how steal time values are used.
- Removed static key check for paravirt CPUs (Yury)
- Removed preempt_disable/enable while calling stopper (Prateek)
- Made select_idle_sibling and friends aware of paravirt CPUs.
- Removed 3 unused schedstat fields and introduced 2 related to paravirt
handling.
- Handled nohz_full case by enabling tick on it when there is CFS/RT on
it.
- Updated helper patch to override arch behaviour for easier debugging
during development.
- Kept
Changes compared to v4[2]:
- Last two patches were sent out separately instead of being with the series.
That created confusion. Those two patches are debug patches one can
use to check functionality across architectures. Sorry about
that.
- Use DEVICE_ATTR_RW instead (greg)
- Made it a PATCH since arch specific handling completes the
functionality.
[2]: https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
TODO:
- Get performance numbers on PowerPC, x86 and S390. Hopefully by next
week. Didn't want to hold the series till then.
- The logic for choosing which CPUs to mark as paravirt is very simple and
doesn't work when vCPUs aren't spread out uniformly across NUMA nodes. Ideally
the numbers would be split based on how many CPUs each NUMA node has. It is
quite tricky to do, especially since a cpumask can be on the stack too, given
NR_CPUS can be 8192 and nr_possible_nodes 32. Haven't got my head into
solving it yet. Maybe there is an easier way.
- DLPAR Add/Remove needs to call init of EC/VP cores (powerpc specific)
- Userspace tools awareness such as irqbalance.
- Delve into the design of a hint from the hypervisor (HW hint), i.e. the host
informs the guest which/how many CPUs it has to use at this moment. This
interface should work across archs with each arch doing its specific handling.
- Determine the default values for steal time related knobs
empirically and document them.
- Need to check safety against CPU hotplug, especially in process_steal().
Applies cleanly on tip/master:
commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b
Thanks to Srikar for providing the initial code around powerpc steal
time handling. Thanks to all who went through and provided reviews.
PS: I haven't found a better name. Please suggest if you have any.
Shrikanth Hegde (17):
sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
cpumask: Introduce cpu_paravirt_mask
sched/core: Dont allow to use CPU marked as paravirt
sched/debug: Remove unused schedstats
sched/fair: Add paravirt movements for proc sched file
sched/fair: Pass current cpu in select_idle_sibling
sched/fair: Don't consider paravirt CPUs for wakeup and load balance
sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task
sched/core: Add support for nohz_full CPUs
sched/core: Push current task from paravirt CPU
sysfs: Add paravirt CPU file
powerpc: method to initialize ec and vp cores
powerpc: enable/disable paravirt CPUs based on steal time
powerpc: process steal values at fixed intervals
powerpc: add debugfs file for controlling handling on steal values
sysfs: Provide write method for paravirt
sysfs: disable arch handling if paravirt file being written
.../ABI/testing/sysfs-devices-system-cpu | 9 +
Documentation/scheduler/sched-arch.rst | 37 +++
arch/powerpc/include/asm/smp.h | 1 +
arch/powerpc/kernel/smp.c | 1 +
arch/powerpc/platforms/pseries/lpar.c | 223 ++++++++++++++++++
arch/powerpc/platforms/pseries/pseries.h | 1 +
drivers/base/cpu.c | 59 +++++
include/linux/cpumask.h | 20 ++
include/linux/sched.h | 9 +-
kernel/sched/core.c | 106 ++++++++-
kernel/sched/debug.c | 5 +-
kernel/sched/fair.c | 42 +++-
kernel/sched/rt.c | 11 +-
kernel/sched/sched.h | 9 +
14 files changed, 519 insertions(+), 14 deletions(-)
--
2.47.3
* [PATCH 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 02/17] cpumask: Introduce cpu_paravirt_mask Shrikanth Hegde
` (19 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
Add documentation for the new cpumask called cpu_paravirt_mask. This could
help users in understanding what this mask is and the concept behind it.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
Documentation/scheduler/sched-arch.rst | 37 ++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..6972c295013d 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,43 @@ Your cpu_idle routines need to obey the following rules:
arch/x86/kernel/process.c has examples of both polling and
sleeping idle functions.
+Paravirt CPUs
+=============
+
+Under virtualised environments it is possible to overcommit CPU resources,
+i.e. the sum of virtual CPUs (vCPUs) of all VMs is greater than the number of
+physical CPUs (pCPUs). Under such conditions, when all or many VMs have high
+utilization, the hypervisor won't be able to satisfy the CPU requirement and
+has to context switch within or across VMs, i.e. the hypervisor needs to
+preempt one vCPU to run another. This is called vCPU preemption. It is more
+expensive compared to a task context switch within a vCPU.
+
+In such cases it is better that VMs co-ordinate among themselves and ask for
+less CPU by not using some of the vCPUs. Such vCPUs, where workload can be
+avoided at the moment for less vCPU preemption, are called "Paravirt CPUs".
+Note that when the pCPU contention goes away, these vCPUs can be used again
+by the workload.
+
+Arch code needs to set/unset the specific vCPU in cpu_paravirt_mask. When set,
+avoid that vCPU; when unset, use it as usual.
+
+The scheduler will try to avoid paravirt CPUs as much as it can.
+This is achieved by
+1. Not selecting a paravirt CPU at wakeup.
+2. Pushing the task away from a paravirt CPU at tick.
+3. Not selecting a paravirt CPU at load balance.
+
+This works only for SCHED_RT and SCHED_NORMAL. SCHED_EXT and userspace can make
+choices accordingly using cpu_paravirt_mask.
+
+/sys/devices/system/cpu/paravirt prints the current cpu_paravirt_mask in
+cpulist format.
+
+Notes:
+1. A task pinned only on paravirt CPUs will continue to run there.
+2. This feature is available under CONFIG_PARAVIRT.
+3. Refer to PowerPC for the architecture implementation side.
+4. Doesn't push out any task running on isolated CPUs.
Possible arch/ problems
=======================
--
2.47.3
* [PATCH 02/17] cpumask: Introduce cpu_paravirt_mask
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 03/17] sched/core: Dont allow to use CPU marked as paravirt Shrikanth Hegde
` (18 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
This patch does:
- Declare and define cpu_paravirt_mask.
- Add get/set helpers for it.
Values are set by arch code and consumed by the scheduler.
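As a rough usage sketch (illustrative only, not part of this patch), arch
code toggles the mask and scheduler paths test it through the helpers added
here; the hook and helper names below are hypothetical:

  /* Hypothetical arch-side hook: mark/unmark a CPU as paravirt based on
   * some contention signal from the hypervisor.
   */
  static void arch_note_cpu_contended(int cpu, bool contended)
  {
          set_cpu_paravirt(cpu, contended);
  }

  /* Scheduler-side consumers only test the mask; with CONFIG_PARAVIRT=n
   * cpu_paravirt() is a stub returning false, so callers need no #ifdefs.
   */
  static bool avoid_this_cpu(int cpu)
  {
          return cpu_paravirt(cpu);
  }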
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
include/linux/cpumask.h | 20 ++++++++++++++++++++
kernel/sched/core.c | 5 +++++
2 files changed, 25 insertions(+)
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index ff8f41ab7ce6..079903851341 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -1270,6 +1270,26 @@ static __always_inline bool cpu_dying(unsigned int cpu)
#endif /* NR_CPUS > 1 */
+/*
+ * All related wrappers kept together to avoid too many ifdefs
+ * See Documentation/scheduler/sched-arch.rst for details
+ */
+#ifdef CONFIG_PARAVIRT
+extern struct cpumask __cpu_paravirt_mask;
+#define cpu_paravirt_mask ((const struct cpumask *)&__cpu_paravirt_mask)
+#define set_cpu_paravirt(cpu, paravirt) assign_cpu((cpu), &__cpu_paravirt_mask, (paravirt))
+
+static __always_inline bool cpu_paravirt(unsigned int cpu)
+{
+ return cpumask_test_cpu(cpu, cpu_paravirt_mask);
+}
+#else
+static __always_inline bool cpu_paravirt(unsigned int cpu)
+{
+ return false;
+}
+#endif
+
#define cpu_is_offline(cpu) unlikely(!cpu_online(cpu))
#if NR_CPUS <= BITS_PER_LONG
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9f10cfbdc228..40db5e659994 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10852,3 +10852,8 @@ void sched_change_end(struct sched_change_ctx *ctx)
p->sched_class->prio_changed(rq, p, ctx->prio);
}
}
+
+#ifdef CONFIG_PARAVIRT
+struct cpumask __cpu_paravirt_mask __read_mostly;
+EXPORT_SYMBOL(__cpu_paravirt_mask);
+#endif
--
2.47.3
* [PATCH 03/17] sched/core: Dont allow to use CPU marked as paravirt
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 02/17] cpumask: Introduce cpu_paravirt_mask Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 04/17] sched/debug: Remove unused schedstats Shrikanth Hegde
` (17 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
Don't allow a paravirt CPU to be picked while looking for a CPU for a task.
The push-task mechanism uses a stopper thread which is going to call
select_fallback_rq(), and relies on this check to avoid picking a paravirt CPU.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/core.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 40db5e659994..90fc04d84b74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2397,8 +2397,13 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
return cpu_online(cpu);
/* Non kernel threads are not allowed during either online or offline. */
- if (!(p->flags & PF_KTHREAD))
- return cpu_active(cpu);
+ if (!(p->flags & PF_KTHREAD)) {
+ /* A user thread shouldn't be allowed on a paravirt cpu */
+ if (cpu_paravirt(cpu))
+ return false;
+ else
+ return cpu_active(cpu);
+ }
/* KTHREAD_IS_PER_CPU is always allowed. */
if (kthread_is_per_cpu(p))
@@ -2408,6 +2413,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
if (cpu_dying(cpu))
return false;
+ /* Non percpu kthreads should stay away from paravirt cpu */
+ if (cpu_paravirt(cpu))
+ return false;
+
/* But are allowed during online. */
return cpu_online(cpu);
}
--
2.47.3
* [PATCH 04/17] sched/debug: Remove unused schedstats
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (2 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 03/17] sched/core: Dont allow to use CPU marked as paravirt Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 05/17] sched/fair: Add paravirt movements for proc sched file Shrikanth Hegde
` (16 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
nr_migrations_cold, nr_wakeups_passive and nr_wakeups_idle are not
being updated anywhere. So remove them.
This will help to add a couple more stats in the next patch without
bloating the size.
These are per-process stats, so updating the schedstats version isn't
necessary.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
include/linux/sched.h | 3 ---
kernel/sched/debug.c | 3 ---
2 files changed, 6 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb436ee1942d..f802bfd7120f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -545,7 +545,6 @@ struct sched_statistics {
s64 exec_max;
u64 slice_max;
- u64 nr_migrations_cold;
u64 nr_failed_migrations_affine;
u64 nr_failed_migrations_running;
u64 nr_failed_migrations_hot;
@@ -558,8 +557,6 @@ struct sched_statistics {
u64 nr_wakeups_remote;
u64 nr_wakeups_affine;
u64 nr_wakeups_affine_attempts;
- u64 nr_wakeups_passive;
- u64 nr_wakeups_idle;
#ifdef CONFIG_SCHED_CORE
u64 core_forceidle_sum;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680..2cb3ffc653df 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1182,7 +1182,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P_SCHEDSTAT(wait_count);
PN_SCHEDSTAT(iowait_sum);
P_SCHEDSTAT(iowait_count);
- P_SCHEDSTAT(nr_migrations_cold);
P_SCHEDSTAT(nr_failed_migrations_affine);
P_SCHEDSTAT(nr_failed_migrations_running);
P_SCHEDSTAT(nr_failed_migrations_hot);
@@ -1194,8 +1193,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P_SCHEDSTAT(nr_wakeups_remote);
P_SCHEDSTAT(nr_wakeups_affine);
P_SCHEDSTAT(nr_wakeups_affine_attempts);
- P_SCHEDSTAT(nr_wakeups_passive);
- P_SCHEDSTAT(nr_wakeups_idle);
avg_atom = p->se.sum_exec_runtime;
if (nr_switches)
--
2.47.3
* [PATCH 05/17] sched/fair: Add paravirt movements for proc sched file
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (3 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 04/17] sched/debug: Remove unused schedstats Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 06/17] sched/fair: Pass current cpu in select_idle_sibling Shrikanth Hegde
` (15 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
Add a couple of new stats:
- nr_migrations_paravirt: number of migrations due to the current task being
moved out of a paravirt CPU.
- nr_wakeups_paravirt: number of wakeups where the previous CPU was marked
as paravirt and hence the task is being woken up on the current CPU.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
include/linux/sched.h | 2 ++
kernel/sched/debug.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f802bfd7120f..3628edd1468b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -549,6 +549,7 @@ struct sched_statistics {
u64 nr_failed_migrations_running;
u64 nr_failed_migrations_hot;
u64 nr_forced_migrations;
+ u64 nr_migrations_paravirt;
u64 nr_wakeups;
u64 nr_wakeups_sync;
@@ -557,6 +558,7 @@ struct sched_statistics {
u64 nr_wakeups_remote;
u64 nr_wakeups_affine;
u64 nr_wakeups_affine_attempts;
+ u64 nr_wakeups_paravirt;
#ifdef CONFIG_SCHED_CORE
u64 core_forceidle_sum;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2cb3ffc653df..0e7d08514148 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1186,6 +1186,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P_SCHEDSTAT(nr_failed_migrations_running);
P_SCHEDSTAT(nr_failed_migrations_hot);
P_SCHEDSTAT(nr_forced_migrations);
+ P_SCHEDSTAT(nr_migrations_paravirt);
P_SCHEDSTAT(nr_wakeups);
P_SCHEDSTAT(nr_wakeups_sync);
P_SCHEDSTAT(nr_wakeups_migrate);
@@ -1193,6 +1194,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P_SCHEDSTAT(nr_wakeups_remote);
P_SCHEDSTAT(nr_wakeups_affine);
P_SCHEDSTAT(nr_wakeups_affine_attempts);
+ P_SCHEDSTAT(nr_wakeups_paravirt);
avg_atom = p->se.sum_exec_runtime;
if (nr_switches)
--
2.47.3
* [PATCH 06/17] sched/fair: Pass current cpu in select_idle_sibling
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (4 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 05/17] sched/fair: Add paravirt movements for proc sched file Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 07/17] sched/fair: Don't consider paravirt CPUs for wakeup and load balance Shrikanth Hegde
` (14 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
Pattern in select_task_rq_fair:
cpu = smp_processor_id();
new_cpu = prev_cpu;
//May change new_cpu due to wake_affine, otherwise it remains prev_cpu
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
Due to this, often prev_cpu == new_cpu. If the prev_cpu was marked as paravirt
while the task was sleeping, it would be beneficial to choose the current
CPU instead. If the current CPU is paravirt too, then the wakeup will happen
there and at the next tick the task will move out.
So pass the current CPU as well to select_idle_sibling().
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/fair.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1855975b8248..015e00b370c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1048,7 +1048,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
#include "pelt.h"
-static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static int select_idle_sibling(struct task_struct *p, int this_cpu, int prev, int target);
static unsigned long task_h_load(struct task_struct *p);
static unsigned long capacity_of(int cpu);
@@ -7770,7 +7770,7 @@ static inline bool asym_fits_cpu(unsigned long util,
/*
* Try and locate an idle core/thread in the LLC cache domain.
*/
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
+static int select_idle_sibling(struct task_struct *p, int this_cpu, int prev, int target)
{
bool has_idle_core = false;
struct sched_domain *sd;
@@ -8578,7 +8578,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
} else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
- new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+ new_cpu = select_idle_sibling(p, cpu, prev_cpu, new_cpu);
}
rcu_read_unlock();
--
2.47.3
* [PATCH 07/17] sched/fair: Don't consider paravirt CPUs for wakeup and load balance
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (5 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 06/17] sched/fair: Pass current cpu in select_idle_sibling Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 08/17] sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task Shrikanth Hegde
` (13 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
For the CFS load balancer,
- Mask out paravirt CPUs from the list of CPUs to balance.
- This helps to restrict/expand the workload depending on the mask.
At wakeup,
- If prev_cpu is paravirt, see if recent_used_cpu can be chosen.
If not, choose the current CPU.
- For EAS systems, emit a warning if a wakeup happens on a paravirt CPU.
At this point, not expecting any EAS system to have an overcommit of
CPUs.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/fair.c | 36 +++++++++++++++++++++++++++++++++++-
1 file changed, 35 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 015e00b370c9..760813802cb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7358,6 +7358,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
{
int target = nr_cpumask_bits;
+ if (cpu_paravirt(prev_cpu))
+ return this_cpu;
+
if (sched_feat(WA_IDLE))
target = wake_affine_idle(this_cpu, prev_cpu, sync);
@@ -7441,6 +7444,11 @@ static inline int sched_balance_find_dst_cpu(struct sched_domain *sd, struct tas
{
int new_cpu = cpu;
+ if (cpu_paravirt(prev_cpu)) {
+ schedstat_inc(p->stats.nr_wakeups_paravirt);
+ return cpu;
+ }
+
if (!cpumask_intersects(sched_domain_span(sd), p->cpus_ptr))
return prev_cpu;
@@ -7777,10 +7785,25 @@ static int select_idle_sibling(struct task_struct *p, int this_cpu, int prev, in
unsigned long task_util, util_min, util_max;
int i, recent_used_cpu, prev_aff = -1;
+ /* Likely prev and target belong to the same LLC; it is better at wakeup
+ * to move away from them. At best return recent_used_cpu if it
+ * is usable
+ */
+ if (cpu_paravirt(prev) || cpu_paravirt(target)) {
+ schedstat_inc(p->stats.nr_wakeups_paravirt);
+
+ recent_used_cpu = p->recent_used_cpu;
+ if (!cpu_paravirt(recent_used_cpu))
+ return recent_used_cpu;
+ else
+ return this_cpu;
+ }
+
/*
* On asymmetric system, update task utilization because we will check
* that the task fits with CPU's capacity.
*/
+
if (sched_asym_cpucap_active()) {
sync_entity_load_avg(&p->se);
task_util = task_util_est(p);
@@ -8539,8 +8562,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if (!is_rd_overutilized(this_rq()->rd)) {
new_cpu = find_energy_efficient_cpu(p, prev_cpu);
- if (new_cpu >= 0)
+
+ /* A system supporting the Energy model isn't expected
+ * to have a CPU marked as paravirt
+ */
+ if (new_cpu >= 0) {
+ WARN_ON_ONCE(cpu_paravirt(new_cpu));
return new_cpu;
+ }
new_cpu = prev_cpu;
}
@@ -11832,6 +11861,11 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+#ifdef CONFIG_PARAVIRT
+ /* Don't spread load to paravirt CPUs */
+ cpumask_andnot(cpus, cpus, cpu_paravirt_mask);
+#endif
+
schedstat_inc(sd->lb_count[idle]);
redo:
--
2.47.3
* [PATCH 08/17] sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (6 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 07/17] sched/fair: Don't consider paravirt CPUs for wakeup and load balance Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 09/17] sched/core: Add support for nohz_full CPUs Shrikanth Hegde
` (12 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
For RT class,
- During wakeup don't select a paravirt CPU.
- Don't pull a task towards a paravirt CPU.
- Don't push a task to a paravirt CPU.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/rt.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f1867fe8e5c5..0b78c74dbbe3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1552,6 +1552,9 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
if (!test && target != -1 && !rt_task_fits_capacity(p, target))
goto out_unlock;
+ /* Avoid moving to a paravirt CPU */
+ if (cpu_paravirt(target))
+ goto out_unlock;
/*
* Don't bother moving it if the destination CPU is
* not running a lower priority task.
@@ -1876,7 +1879,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
for (tries = 0; tries < RT_MAX_TRIES; tries++) {
cpu = find_lowest_rq(task);
- if ((cpu == -1) || (cpu == rq->cpu))
+ if ((cpu == -1) || (cpu == rq->cpu) || cpu_paravirt(cpu))
break;
lowest_rq = cpu_rq(cpu);
@@ -1974,7 +1977,7 @@ static int push_rt_task(struct rq *rq, bool pull)
return 0;
cpu = find_lowest_rq(rq->curr);
- if (cpu == -1 || cpu == rq->cpu)
+ if (cpu == -1 || cpu == rq->cpu || cpu_paravirt(cpu))
return 0;
/*
@@ -2237,6 +2240,10 @@ static void pull_rt_task(struct rq *this_rq)
if (likely(!rt_overload_count))
return;
+ /* There is no point in pulling the task towards a paravirt cpu */
+ if (cpu_paravirt(this_rq->cpu))
+ return;
+
/*
* Match the barrier from rt_set_overloaded; this guarantees that if we
* see overloaded we must also see the rto_mask bit.
--
2.47.3
* [PATCH 09/17] sched/core: Add support for nohz_full CPUs
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (7 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 08/17] sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-21 3:16 ` K Prateek Nayak
2025-11-19 12:44 ` [PATCH 10/17] sched/core: Push current task from paravirt CPU Shrikanth Hegde
` (11 subsequent siblings)
20 siblings, 1 reply; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
Enable the tick on a nohz_full CPU when it is marked as paravirt.
If there is no CFS/RT running there, disable the tick to save power.
In addition to this, the arch specific code which marks the CPU as paravirt
should call tick_nohz_dep_set_cpu() with TICK_DEP_BIT_SCHED so that the
task is moved out of the nohz_full CPU fast.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90fc04d84b74..73d1d49a3c72 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1336,6 +1336,10 @@ bool sched_can_stop_tick(struct rq *rq)
{
int fifo_nr_running;
+ /* Keep the tick running until both RT and CFS are pushed out */
+ if (cpu_paravirt(rq->cpu) && (rq->rt.rt_nr_running || rq->cfs.h_nr_queued))
+ return false;
+
/* Deadline tasks, even if single, need the tick */
if (rq->dl.dl_nr_running)
return false;
--
2.47.3
* [PATCH 10/17] sched/core: Push current task from paravirt CPU
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (8 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 09/17] sched/core: Add support for nohz_full CPUs Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 11/17] sysfs: Add paravirt CPU file Shrikanth Hegde
` (10 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
Actively push out an RT/CFS task running on a paravirt CPU. Since the task is
running on the CPU, we need to stop the CPU and push the task out.
However, if the task is pinned only to paravirt CPUs, it will continue
running there.
Though the code is almost the same as __balance_push_cpu_stop and quite close
to push_cpu_stop, it provides a cleaner implementation w.r.t. the PARAVIRT
config.
Add a push_task_work_done flag to protect the pv_push_task_work buffer.
This currently works only for FAIR and RT.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/core.c | 83 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 9 +++++
2 files changed, 92 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 73d1d49a3c72..65c247c24191 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5521,6 +5521,10 @@ void sched_tick(void)
unsigned long hw_pressure;
u64 resched_latency;
+ /* push the current task out if a paravirt CPU */
+ if (cpu_paravirt(cpu))
+ push_current_from_paravirt_cpu(rq);
+
if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
arch_scale_freq_tick();
@@ -10869,4 +10873,83 @@ void sched_change_end(struct sched_change_ctx *ctx)
#ifdef CONFIG_PARAVIRT
struct cpumask __cpu_paravirt_mask __read_mostly;
EXPORT_SYMBOL(__cpu_paravirt_mask);
+
+static DEFINE_PER_CPU(struct cpu_stop_work, pv_push_task_work);
+
+static int paravirt_push_cpu_stop(void *arg)
+{
+ struct task_struct *p = arg;
+ struct rq *rq = this_rq();
+ struct rq_flags rf;
+ int cpu;
+
+ raw_spin_lock_irq(&p->pi_lock);
+ rq_lock(rq, &rf);
+ rq->push_task_work_done = 0;
+
+ update_rq_clock(rq);
+
+ if (task_rq(p) == rq && task_on_rq_queued(p)) {
+ cpu = select_fallback_rq(rq->cpu, p);
+ rq = __migrate_task(rq, &rf, p, cpu);
+ }
+
+ rq_unlock(rq, &rf);
+ raw_spin_unlock_irq(&p->pi_lock);
+ put_task_struct(p);
+
+ return 0;
+}
+
+/* A CPU is marked as Paravirt when there is contention for underlying
+ * physical CPU and using this CPU will lead to hypervisor preemptions.
+ * It is better not to use this CPU.
+ *
+ * In case any task is scheduled on such CPU, move it out. In
+ * select_fallback_rq a non paravirt CPU will be chosen and henceforth
+ * task shouldn't come back to this CPU
+ */
+void push_current_from_paravirt_cpu(struct rq *rq)
+{
+ struct task_struct *push_task = rq->curr;
+ unsigned long flags;
+ struct rq_flags rf;
+
+ if (!cpu_paravirt(rq->cpu))
+ return;
+
+ /* Idle task can't be pushed out */
+ if (rq->curr == rq->idle)
+ return;
+
+ /* Do this only for SCHED_NORMAL and RT for now */
+ if (push_task->sched_class != &fair_sched_class &&
+ push_task->sched_class != &rt_sched_class)
+ return;
+
+ if (kthread_is_per_cpu(push_task) ||
+ is_migration_disabled(push_task))
+ return;
+
+ /* Is it affine to only paravirt cpus? */
+ if (cpumask_subset(push_task->cpus_ptr, cpu_paravirt_mask))
+ return;
+
+ /* There is already a stopper thread for this. Don't race with it */
+ if (rq->push_task_work_done == 1)
+ return;
+
+ local_irq_save(flags);
+
+ get_task_struct(push_task);
+ schedstat_inc(push_task->stats.nr_migrations_paravirt);
+
+ rq_lock(rq, &rf);
+ rq->push_task_work_done = 1;
+ rq_unlock(rq, &rf);
+
+ stop_one_cpu_nowait(rq->cpu, paravirt_push_cpu_stop, push_task,
+ this_cpu_ptr(&pv_push_task_work));
+ local_irq_restore(flags);
+}
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b419a4d98461..42984a65384c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1214,6 +1214,9 @@ struct rq {
unsigned char nohz_idle_balance;
unsigned char idle_balance;
+#ifdef CONFIG_PARAVIRT
+ bool push_task_work_done;
+#endif
unsigned long misfit_task_load;
/* For active balancing */
@@ -4017,6 +4020,12 @@ extern bool dequeue_task(struct rq *rq, struct task_struct *p, int flags);
extern struct balance_callback *splice_balance_callbacks(struct rq *rq);
extern void balance_callbacks(struct rq *rq, struct balance_callback *head);
+#ifdef CONFIG_PARAVIRT
+void push_current_from_paravirt_cpu(struct rq *rq);
+#else
+static inline void push_current_from_paravirt_cpu(struct rq *rq) { }
+#endif
+
/*
* The 'sched_change' pattern is the safe, easy and slow way of changing a
* task's scheduling properties. It dequeues a task, such that the scheduler
--
2.47.3
* [PATCH 11/17] sysfs: Add paravirt CPU file
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (9 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 10/17] sched/core: Push current task from paravirt CPU Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 12/17] powerpc: method to initialize ec and vp cores Shrikanth Hegde
` (9 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
Add a paravirt file in /sys/devices/system/cpu.
This offers:
- Users can quickly check which CPUs are marked as paravirt.
- Userspace algorithms such as sched_ext, or setups with isolcpus, could
use the mask and make decisions.
- Daemons such as irqbalance could use this mask and avoid spreading
IRQs onto paravirt CPUs.
For example:
cat /sys/devices/system/cpu/paravirt
600-719 <<< arch marked these as paravirt.
cat /sys/devices/system/cpu/paravirt
<<< No paravirt CPUs at the moment.
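A minimal userspace sketch (illustrative only, not part of this patch) of how
a daemon such as irqbalance might read the mask; parsing of the cpulist
format is left out for brevity:

  #include <stdio.h>

  /* Read the paravirt cpulist (e.g. "600-719", possibly empty) so that
   * the caller can skip those CPUs when placing IRQs or work.
   */
  static int read_paravirt_cpulist(char *buf, size_t len)
  {
          FILE *f = fopen("/sys/devices/system/cpu/paravirt", "r");

          if (!f)
                  return -1;
          if (!fgets(buf, len, f))
                  buf[0] = '\0';
          fclose(f);
          return 0;
  }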
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
Documentation/ABI/testing/sysfs-devices-system-cpu | 9 +++++++++
drivers/base/cpu.c | 12 ++++++++++++
2 files changed, 21 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 8aed6d94c4cd..1da77430b776 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -777,3 +777,12 @@ Date: Nov 2022
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description:
(RO) the list of CPUs that can be brought online.
+
+What: /sys/devices/system/cpu/paravirt
+Date: Sep 2025
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+ (RO) the list of CPUs that are currently marked as paravirt CPUs.
+ These CPUs are not meant to be used at the moment due to
+ contention on the underlying physical CPU resource. Dynamically
+ changes to reflect the current situation.
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index fa0a2eef93ac..c216e13c4e2d 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -374,6 +374,15 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
}
#endif
+#ifdef CONFIG_PARAVIRT
+static ssize_t paravirt_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_paravirt_mask));
+}
+static DEVICE_ATTR_RO(paravirt);
+#endif
+
const struct bus_type cpu_subsys = {
.name = "cpu",
.dev_name = "cpu",
@@ -513,6 +522,9 @@ static struct attribute *cpu_root_attrs[] = {
#endif
#ifdef CONFIG_GENERIC_CPU_AUTOPROBE
&dev_attr_modalias.attr,
+#endif
+#ifdef CONFIG_PARAVIRT
+ &dev_attr_paravirt.attr,
#endif
NULL
};
--
2.47.3
* [PATCH 12/17] powerpc: method to initialize ec and vp cores
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (10 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 11/17] sysfs: Add paravirt CPU file Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-21 8:29 ` kernel test robot
2025-11-21 10:14 ` kernel test robot
2025-11-19 12:44 ` [PATCH 13/17] powerpc: enable/disable paravirt CPUs based on steal time Shrikanth Hegde
` (8 subsequent siblings)
20 siblings, 2 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
During system init, capture the number of EC and VP cores on Shared
Processor LPARs (SPLPARs, aka VMs).
EC - Entitled Cores - the hypervisor (PowerVM) guarantees this many cores'
worth of cycles.
VP - Virtual Processor Cores - total logical cores present in the LPAR.
In SPLPARs there is typically an overcommit of vCPUs, i.e. VP > EC.
These values will be used in subsequent patches to calculate the number of
cores to use when there is steal time.
Note: The DLPAR specific method needs to call this again. Yet to be done.
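As an illustrative example (numbers made up): an SPLPAR with
entitled_cores = 8 and 16 cores' worth of present vCPUs (virtual_procs = 16)
starts with available_cores = max(8, 16) = 16; the steal time handling in the
later patches can then shrink available_cores step by step, but never below
entitled_cores.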
Originally-by: Srikar Dronamraju <srikar@linux.ibm.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
arch/powerpc/include/asm/smp.h | 1 +
arch/powerpc/kernel/smp.c | 1 +
arch/powerpc/platforms/pseries/lpar.c | 30 +++++++++++++++++++++++++++
3 files changed, 32 insertions(+)
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index e41b9ea42122..5a52c6952195 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -266,6 +266,7 @@ extern char __secondary_hold;
extern unsigned int booting_thread_hwid;
extern void __early_start(void);
+void pseries_init_ec_vp_cores(void);
#endif /* __ASSEMBLER__ */
#endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 68edb66c2964..5a3b52dd625b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1732,6 +1732,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
dump_numa_cpu_topology();
build_sched_topology();
+ pseries_init_ec_vp_cores();
}
/*
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 6a415febc53b..935fced6e127 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -2029,3 +2029,33 @@ static int __init vpa_debugfs_init(void)
}
machine_arch_initcall(pseries, vpa_debugfs_init);
#endif /* CONFIG_DEBUG_FS */
+
+#ifdef CONFIG_PARAVIRT
+
+static unsigned int virtual_procs __read_mostly;
+static unsigned int entitled_cores __read_mostly;
+static unsigned int available_cores;
+
+void pseries_init_ec_vp_cores(void)
+{
+ unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
+ int ret;
+
+ if (available_cores && virtual_procs == num_present_cpus() / threads_per_core)
+ return;
+
+ /* Get EC values from hcall */
+ ret = plpar_hcall9(H_GET_PPP, retbuf);
+ WARN_ON_ONCE(ret != 0);
+ if (ret)
+ return;
+
+ entitled_cores = retbuf[0] / 100;
+ virtual_procs = num_present_cpus() / threads_per_core;
+
+ /* Initialize the available cores to all VP initially */
+ available_cores = max(entitled_cores, virtual_procs);
+}
+#else
+void pseries_init_ec_vp_cores(void) { return; }
+#endif
--
2.47.3
* [PATCH 13/17] powerpc: enable/disable paravirt CPUs based on steal time
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (11 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 12/17] powerpc: method to initialize ec and vp cores Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 14/17] powerpc: process steal values at fixed intervals Shrikanth Hegde
` (7 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
available_cores - the number of cores the LPAR (VM) can use at this moment.
The remaining cores will have their CPUs marked as paravirt.
This follows a stepwise approach for reducing/increasing the number of
available_cores.
Very simple logic:
if (steal_time > high_threshold)
	available_cores--
if (steal_time < low_threshold)
	available_cores++
It also checks the previous direction taken to avoid unnecessary ping-pongs.
Note: It works well only when CPUs are spread out in equal numbers across
NUMA nodes.
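As a walk-through of the logic above (steal ratio expressed as percentage *
100, thresholds as defined in this patch): with STEAL_RATIO_HIGH = 400 (4%)
and STEAL_RATIO_LOW = 150 (1.5%), the first interval with steal above 4% only
records the direction (prev_direction = 1); if the next interval is still
above 4%, one core's worth of CPUs is marked paravirt and available_cores is
decremented. Likewise, two consecutive intervals below 1.5% are needed before
a core's worth of CPUs is unmarked and available_cores is incremented.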
Originally-by: Srikar Dronamraju <srikar@linux.ibm.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
arch/powerpc/platforms/pseries/lpar.c | 53 ++++++++++++++++++++++++
arch/powerpc/platforms/pseries/pseries.h | 1 +
2 files changed, 54 insertions(+)
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 935fced6e127..825b5b4e2b43 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -43,6 +43,7 @@
#include <asm/fadump.h>
#include <asm/dtl.h>
#include <asm/vphn.h>
+#include <linux/sched/isolation.h>
#include "pseries.h"
@@ -2056,6 +2057,58 @@ void pseries_init_ec_vp_cores(void)
/* Initialize the available cores to all VP initially */
available_cores = max(entitled_cores, virtual_procs);
}
+
+#define STEAL_RATIO_HIGH 400
+#define STEAL_RATIO_LOW 150
+
+void update_soft_entitlement(unsigned long steal_ratio)
+{
+ static int prev_direction;
+ int cpu;
+
+ if (!entitled_cores)
+ return;
+
+ if (steal_ratio >= STEAL_RATIO_HIGH && prev_direction > 0) {
+ /*
+ * System entitlement was reduced earlier but we continue to
+ * see steal time. Reduce entitlement further.
+ */
+ if (available_cores == entitled_cores)
+ return;
+
+ /* Mark them paravirt, enable tick if it is nohz_full */
+ for (cpu = (available_cores - 1) * threads_per_core;
+ cpu < available_cores * threads_per_core; cpu++) {
+ set_cpu_paravirt(cpu, true);
+ if (tick_nohz_full_cpu(cpu))
+ tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
+ }
+ available_cores--;
+
+ } else if (steal_ratio <= STEAL_RATIO_LOW && prev_direction < 0) {
+ /*
+ * System entitlement was increased but we continue to see
+ * less steal time. Increase entitlement further.
+ */
+ if (available_cores == virtual_procs)
+ return;
+
+ /* mark them available */
+ for (cpu = available_cores * threads_per_core;
+ cpu < (available_cores + 1) * threads_per_core; cpu++)
+ set_cpu_paravirt(cpu, false);
+
+ available_cores++;
+ }
+ if (steal_ratio >= STEAL_RATIO_HIGH)
+ prev_direction = 1;
+ else if (steal_ratio <= STEAL_RATIO_LOW)
+ prev_direction = -1;
+ else
+ prev_direction = 0;
+}
#else
void pseries_init_ec_vp_cores(void) { return; }
+void update_soft_entitlement(unsigned long steal_ratio) { return; }
#endif
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 3968a6970fa8..d1f9ec77ff57 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -115,6 +115,7 @@ int dlpar_workqueue_init(void);
extern u32 pseries_security_flavor;
void pseries_setup_security_mitigations(void);
+void update_soft_entitlement(unsigned long steal_ratio);
#ifdef CONFIG_PPC_64S_HASH_MMU
void pseries_lpar_read_hblkrm_characteristics(void);
--
2.47.3
* [PATCH 14/17] powerpc: process steal values at fixed intervals
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (12 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 13/17] powerpc: enable/disable paravirt CPUs based on steal time Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 15/17] powerpc: add debugfs file for controlling handling on steal values Shrikanth Hegde
` (6 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
Process steal time at regular intervals. The sum of steal time across the
vCPUs is computed against the elapsed time to get the steal ratio.
Only the first online CPU does this work. That reduces the racing issues.
This is done only on SPLPARs (non-KVM guests). This assumes PowerVM is
the hypervisor.
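As a worked example of the computation below (illustrative numbers): with
STEAL_MULTIPLE = 10000, a 1 second window (PURR_UPDATE_TB), 8 online vCPUs and
0.4 seconds of aggregate steal accumulated across them in that window, the
steal ratio is 0.4s * 10000 / (1s * 8) = 500, i.e. 5% average steal per vCPU,
which is then passed to update_soft_entitlement().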
Originally-by: Srikar Dronamraju <srikar@linux.ibm.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
arch/powerpc/platforms/pseries/lpar.c | 59 +++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 825b5b4e2b43..c16d97e1a1fe 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -660,10 +660,58 @@ static int __init vcpudispatch_stats_procfs_init(void)
machine_device_initcall(pseries, vcpudispatch_stats_procfs_init);
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+
+#define STEAL_MULTIPLE 10000
+#define PURR_UPDATE_TB NSEC_PER_SEC
+
+static bool should_cpu_process_steal(int cpu)
+{
+ if (cpu == cpumask_first(cpu_online_mask))
+ return true;
+
+ return false;
+}
+
+static void process_steal(int cpu)
+{
+ static unsigned long next_tb_ns, prev_steal;
+ unsigned long steal_ratio, delta_tb;
+ unsigned long tb_ns = tb_to_ns(mftb());
+ unsigned long steal = 0;
+ unsigned int i;
+
+ if (!should_cpu_process_steal(cpu))
+ return;
+
+ if (tb_ns < next_tb_ns)
+ return;
+
+ for_each_online_cpu(i) {
+ struct lppaca *lppaca = &lppaca_of(i);
+
+ steal += be64_to_cpu(READ_ONCE(lppaca->ready_enqueue_tb));
+ steal += be64_to_cpu(READ_ONCE(lppaca->enqueue_dispatch_tb));
+ }
+
+ steal = tb_to_ns(steal);
+
+ if (next_tb_ns && prev_steal) {
+ delta_tb = max(tb_ns - (next_tb_ns - PURR_UPDATE_TB), 1);
+ steal_ratio = (steal - prev_steal) * STEAL_MULTIPLE;
+ steal_ratio /= (delta_tb * num_online_cpus());
+ update_soft_entitlement(steal_ratio);
+ }
+
+ next_tb_ns = tb_ns + PURR_UPDATE_TB;
+ prev_steal = steal;
+}
+
u64 pseries_paravirt_steal_clock(int cpu)
{
struct lppaca *lppaca = &lppaca_of(cpu);
+ if (is_shared_processor() && !is_kvm_guest())
+ process_steal(cpu);
/*
* VPA steal time counters are reported at TB frequency. Hence do a
* conversion to ns before returning
@@ -2061,6 +2109,17 @@ void pseries_init_ec_vp_cores(void)
#define STEAL_RATIO_HIGH 400
#define STEAL_RATIO_LOW 150
+/*
+ * [0]<----------->[EC]---->{AC}-->[VP]
+ * EC == Entitled Cores. Guaranteed number of cores by hypervsior.
+ * VP == Virtual Processors. Total number of cores. When there is overcommit
+ * this will be higher than EC.
+ * AC == Available Cores Varies between EC <-> VP.
+ *
+ * If Steal time is high, then reduce Available Cores.
+ * If steal time is low, increase Available Cores
+ */
+
void update_soft_entitlement(unsigned long steal_ratio)
{
static int prev_direction;
--
2.47.3
* [PATCH 15/17] powerpc: add debugfs file for controlling handling on steal values
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (13 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 14/17] powerpc: process steal values at fixed intervals Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 16/17] sysfs: Provide write method for paravirt Shrikanth Hegde
` (5 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
Since the low/high thresholds for steal time can change based on the
system, make these values tunable.
Values are to be given as the expected percentage value * 100, i.e. if one
considers say 8% steal time as high, then 800 should be specified as the high
threshold. The same computation holds for the low threshold.
Provide one more tunable to control how often the steal time computation is
done. By default it is 1 second. If one thinks that's too aggressive, it can
be increased. The max value is 10 seconds since one should act relatively
fast based on steal values.
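Illustrative example of how the knobs interact (values made up): with
steal_ratio_high set to 800 (8%) and steal_check_frequency set to 2, available
cores are reduced only after two consecutive 2 second windows whose average
steal per vCPU is above 8%, because of the direction check in
update_soft_entitlement().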
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
arch/powerpc/platforms/pseries/lpar.c | 94 ++++++++++++++++++++++++---
1 file changed, 86 insertions(+), 8 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index c16d97e1a1fe..090e5c48243b 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -662,7 +662,8 @@ machine_device_initcall(pseries, vcpudispatch_stats_procfs_init);
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
#define STEAL_MULTIPLE 10000
-#define PURR_UPDATE_TB NSEC_PER_SEC
+static int steal_check_freq = 1;
+#define PURR_UPDATE_TB (steal_check_freq * NSEC_PER_SEC)
static bool should_cpu_process_steal(int cpu)
{
@@ -2106,9 +2107,6 @@ void pseries_init_ec_vp_cores(void)
available_cores = max(entitled_cores, virtual_procs);
}
-#define STEAL_RATIO_HIGH 400
-#define STEAL_RATIO_LOW 150
-
/*
* [0]<----------->[EC]---->{AC}-->[VP]
* EC == Entitled Cores. Guaranteed number of cores by hypervsior.
@@ -2120,6 +2118,9 @@ void pseries_init_ec_vp_cores(void)
* If steal time is low, increase Available Cores
*/
+static unsigned int steal_ratio_high = 400;
+static unsigned int steal_ratio_low = 150;
+
void update_soft_entitlement(unsigned long steal_ratio)
{
static int prev_direction;
@@ -2128,7 +2129,7 @@ void update_soft_entitlement(unsigned long steal_ratio)
if (!entitled_cores)
return;
- if (steal_ratio >= STEAL_RATIO_HIGH && prev_direction > 0) {
+ if (steal_ratio >= steal_ratio_high && prev_direction > 0) {
/*
* System entitlement was reduced earlier but we continue to
* see steal time. Reduce entitlement further.
@@ -2145,7 +2146,7 @@ void update_soft_entitlement(unsigned long steal_ratio)
}
available_cores--;
- } else if (steal_ratio <= STEAL_RATIO_LOW && prev_direction < 0) {
+ } else if (steal_ratio <= steal_ratio_low && prev_direction < 0) {
/*
* System entitlement was increased but we continue to see
* less steal time. Increase entitlement further.
@@ -2160,13 +2161,90 @@ void update_soft_entitlement(unsigned long steal_ratio)
available_cores++;
}
- if (steal_ratio >= STEAL_RATIO_HIGH)
+ if (steal_ratio >= steal_ratio_high)
prev_direction = 1;
- else if (steal_ratio <= STEAL_RATIO_LOW)
+ else if (steal_ratio <= steal_ratio_low)
prev_direction = -1;
else
prev_direction = 0;
}
+
+/*
+ * Any value above this set threshold will reduce the available cores
+ * Value can't be more than 100% and can't be less than the low threshold value
+ * Specifying 500 means 5% steal time
+ */
+
+static int pv_steal_ratio_high_set(void *data, u64 val)
+{
+ if (val > 10000 || val < steal_ratio_low)
+ return -EINVAL;
+
+ steal_ratio_high = val;
+ return 0;
+}
+
+static int pv_steal_ratio_high_get(void *data, u64 *val)
+{
+ *val = steal_ratio_high;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_steal_ratio_high, pv_steal_ratio_high_get,
+ pv_steal_ratio_high_set, "%llu\n");
+
+static int pv_steal_ratio_low_set(void *data, u64 val)
+{
+ if (val < 1 || val > steal_ratio_high)
+ return -EINVAL;
+
+ steal_ratio_low = val;
+ return 0;
+}
+
+static int pv_steal_ratio_low_get(void *data, u64 *val)
+{
+ *val = steal_ratio_low;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_steal_ratio_low, pv_steal_ratio_low_get,
+ pv_steal_ratio_low_set, "%llu\n");
+
+static int pv_steal_check_freq_set(void *data, u64 val)
+{
+ if (val < 1 || val > 10)
+ return -EINVAL;
+
+ steal_check_freq = val;
+ return 0;
+}
+
+static int pv_steal_check_freq_get(void *data, u64 *val)
+{
+ *val = steal_check_freq;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_steal_check_freq, pv_steal_check_freq_get,
+ pv_steal_check_freq_set, "%llu\n");
+
+static int __init steal_debugfs_init(void)
+{
+ if (!is_shared_processor() || is_kvm_guest())
+ return 0;
+
+ debugfs_create_file("steal_ratio_high", 0600, arch_debugfs_dir,
+ NULL, &fops_pv_steal_ratio_high);
+ debugfs_create_file("steal_ratio_low", 0600, arch_debugfs_dir,
+ NULL, &fops_pv_steal_ratio_low);
+ debugfs_create_file("steal_check_frequency", 0600, arch_debugfs_dir,
+ NULL, &fops_pv_steal_check_freq);
+
+ return 0;
+}
+
+machine_arch_initcall(pseries, steal_debugfs_init);
#else
void pseries_init_ec_vp_cores(void) { return; }
void update_soft_entitlement(unsigned long steal_ratio) { return; }
--
2.47.3
* [PATCH 16/17] sysfs: Provide write method for paravirt
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (14 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 15/17] powerpc: add debugfs file for controlling handling on steal values Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-24 17:04 ` Greg KH
2025-11-19 12:44 ` [PATCH 17/17] sysfs: disable arch handling if paravirt file being written Shrikanth Hegde
` (4 subsequent siblings)
20 siblings, 1 reply; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
This is a debug patch which could be used to set a range of CPUs as
paravirt.
One could make use of this for quick testing of this infra instead of writing
arch specific code. It allows checking some corner cases by providing custom
cpumasks, which isn't possible with the arch method.
echo 100-200,600-700 > /sys/devices/system/cpu/paravirt
cat /sys/devices/system/cpu/paravirt
100-200,600-700
echo > /sys/devices/system/cpu/paravirt
cat /sys/devices/system/cpu/paravirt
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
This is currently not meant to be merged, since the paravirt sysfs file is
meant to be read-only. Hence the documentation hasn't changed. If this method
is really helpful, it can be considered for inclusion depending on the
discussion.
drivers/base/cpu.c | 47 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 46 insertions(+), 1 deletion(-)
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index c216e13c4e2d..766584c85051 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -375,12 +375,57 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
#endif
#ifdef CONFIG_PARAVIRT
+static ssize_t paravirt_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ cpumask_var_t temp_mask;
+ int retval = 0;
+ int cpu;
+
+ if (!alloc_cpumask_var(&temp_mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ retval = cpulist_parse(buf, temp_mask);
+ if (retval)
+ goto free_mask;
+
+ /* ALL cpus can't be marked as paravirt */
+ if (cpumask_equal(temp_mask, cpu_online_mask)) {
+ retval = -EINVAL;
+ goto free_mask;
+ }
+ if (cpumask_weight(temp_mask) > num_online_cpus()) {
+ retval = -EINVAL;
+ goto free_mask;
+ }
+
+ /* No more paravirt cpus */
+ if (cpumask_empty(temp_mask)) {
+ cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
+ } else {
+ cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
+
+ /* Enable tick on nohz_full cpu */
+ for_each_cpu(cpu, temp_mask) {
+ if (tick_nohz_full_cpu(cpu))
+ tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
+ }
+ }
+
+ retval = count;
+
+free_mask:
+ free_cpumask_var(temp_mask);
+ return retval;
+}
+
static ssize_t paravirt_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_paravirt_mask));
}
-static DEVICE_ATTR_RO(paravirt);
+static DEVICE_ATTR_RW(paravirt);
#endif
const struct bus_type cpu_subsys = {
--
2.47.3
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 17/17] sysfs: disable arch handling if paravirt file being written
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (15 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 16/17] sysfs: Provide write method for paravirt Shrikanth Hegde
@ 2025-11-19 12:44 ` Shrikanth Hegde
2025-11-24 17:05 ` [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Greg KH
` (3 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
christophe.leroy
Arch-specific code can update the mask based on the steal time. For
debugging it is desirable to override the arch logic. Do that with this
debug patch.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
This isn't meant to be merged. It is a debug patch that complements the previous
one for easier debugging.
arch/powerpc/platforms/pseries/lpar.c | 3 +++
drivers/base/cpu.c | 2 ++
include/linux/sched.h | 4 ++++
kernel/sched/core.c | 1 +
4 files changed, 10 insertions(+)
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 090e5c48243b..04bc75e22e7b 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -681,6 +681,9 @@ static void process_steal(int cpu)
unsigned long steal = 0;
unsigned int i;
+ if (static_branch_unlikely(&disable_arch_paravirt_handling))
+ return;
+
if (!should_cpu_process_steal(cpu))
return;
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 766584c85051..06a11a69b7c0 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -403,7 +403,9 @@ static ssize_t paravirt_store(struct device *dev,
/* No more paravirt cpus */
if (cpumask_empty(temp_mask)) {
cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
+ static_branch_disable(&disable_arch_paravirt_handling);
} else {
+ static_branch_enable(&disable_arch_paravirt_handling);
cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
/* Enable tick on nohz_full cpu */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3628edd1468b..1afa5dd5b0ae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2427,4 +2427,8 @@ extern void migrate_enable(void);
DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
+#ifdef CONFIG_PARAVIRT
+DECLARE_STATIC_KEY_FALSE(disable_arch_paravirt_handling);
+#endif
+
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 65c247c24191..b65a9898c694 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10873,6 +10873,7 @@ void sched_change_end(struct sched_change_ctx *ctx)
#ifdef CONFIG_PARAVIRT
struct cpumask __cpu_paravirt_mask __read_mostly;
EXPORT_SYMBOL(__cpu_paravirt_mask);
+DEFINE_STATIC_KEY_FALSE(disable_arch_paravirt_handling);
static DEFINE_PER_CPU(struct cpu_stop_work, pv_push_task_work);
--
2.47.3
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 09/17] sched/core: Add support for nohz_full CPUs
2025-11-19 12:44 ` [PATCH 09/17] sched/core: Add support for nohz_full CPUs Shrikanth Hegde
@ 2025-11-21 3:16 ` K Prateek Nayak
2025-11-21 4:40 ` Shrikanth Hegde
0 siblings, 1 reply; 41+ messages in thread
From: K Prateek Nayak @ 2025-11-21 3:16 UTC (permalink / raw)
To: Shrikanth Hegde, linux-kernel, linuxppc-dev
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, srikar, gregkh, pbonzini, seanjc, vschneid, iii, huschle,
rostedt, dietmar.eggemann, christophe.leroy
Hello Shrikanth,
On 11/19/2025 6:14 PM, Shrikanth Hegde wrote:
> Enable tick on nohz full CPU when it is marked as paravirt.
> If there in no CFS/RT running there, disable the tick to save the power.
Wouldn't the task be pinned if it is running on a nohz CPU?
We don't push out pinned tasks so this seems unnecessary.
Am I missing something?
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 09/17] sched/core: Add support for nohz_full CPUs
2025-11-21 3:16 ` K Prateek Nayak
@ 2025-11-21 4:40 ` Shrikanth Hegde
2025-11-24 4:36 ` K Prateek Nayak
0 siblings, 1 reply; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-21 4:40 UTC (permalink / raw)
To: K Prateek Nayak, linux-kernel, linuxppc-dev
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, srikar, gregkh, pbonzini, seanjc, vschneid, iii, huschle,
rostedt, dietmar.eggemann, christophe.leroy
Hi Prateek.
On 11/21/25 8:46 AM, K Prateek Nayak wrote:
> Hello Shrikanth,
>
> On 11/19/2025 6:14 PM, Shrikanth Hegde wrote:
>> Enable tick on nohz full CPU when it is marked as paravirt.
>> If there in no CFS/RT running there, disable the tick to save the power.
>
> Wouldn't the task be pinned if it is running on a nohz CPU?
> We don't push out pinned tasks so this seems unnecessary.
> Am I missing something?
>
No. A task will not be pinned if it is running on nohz_full unless
the user has done a taskset. Those CPUs are still part of the sched domains.
Pinning is usually true for isolcpus. You need to explicitly use taskset
for isolcpus since by default tasks won't go to those CPUs.
On nohz_full it is just that the tick will be disabled when the CPU has
only one task running. If there are more tasks on it, the tick will be
enabled and load balancing can run.
Example: (I have 300-479 as nohz_full)
taskset -c 300-315 stress-ng --cpu=2
(it was initially on 301,302)
10:27:37 PM 301 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:37 PM 302 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Ran hackbench for a brief time. After hackbench completes,
(it runs now on 301,314)
10:27:43 PM 301 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:43 PM 314 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
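For reference, the tick handling described above looks roughly like the below.
This is a minimal sketch: the set path mirrors the hunk in the sysfs write patch,
while the clear path is assumed here for illustration only.

#include <linux/tick.h>

/*
 * Illustrative sketch, not the exact code from the series.
 * A nohz_full CPU that becomes paravirt needs the tick back so that
 * load balancing can run and push tasks away; once it is no longer
 * paravirt the dependency can be dropped and the tick stops again
 * when only one task remains.
 */
static void paravirt_update_tick(int cpu, bool is_paravirt)
{
        if (!tick_nohz_full_cpu(cpu))
                return;

        if (is_paravirt)
                tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
        else
                tick_nohz_dep_clear_cpu(cpu, TICK_DEP_BIT_SCHED);
}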
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 12/17] powerpc: method to initialize ec and vp cores
2025-11-19 12:44 ` [PATCH 12/17] powerpc: method to initialize ec and vp cores Shrikanth Hegde
@ 2025-11-21 8:29 ` kernel test robot
2025-11-21 10:14 ` kernel test robot
1 sibling, 0 replies; 41+ messages in thread
From: kernel test robot @ 2025-11-21 8:29 UTC (permalink / raw)
To: Shrikanth Hegde, linux-kernel, linuxppc-dev
Cc: oe-kbuild-all, sshegde, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, srikar, gregkh,
pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
dietmar.eggemann, christophe.leroy
Hi Shrikanth,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on next-20251121]
[cannot apply to powerpc/next powerpc/fixes driver-core/driver-core-testing driver-core/driver-core-next driver-core/driver-core-linus linus/master v6.18-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Shrikanth-Hegde/sched-docs-Document-cpu_paravirt_mask-and-Paravirt-CPU-concept/20251119-204931
base: tip/sched/core
patch link: https://lore.kernel.org/r/20251119124449.1149616-13-sshegde%40linux.ibm.com
patch subject: [PATCH 12/17] powerpc: method to initialize ec and vp cores
config: powerpc-randconfig-002-20251121 (https://download.01.org/0day-ci/archive/20251121/202511211643.QVn7aDHd-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251121/202511211643.QVn7aDHd-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511211643.QVn7aDHd-lkp@intel.com/
All errors (new ones prefixed by >>):
powerpc-linux-ld: arch/powerpc/kernel/smp.o: in function `smp_cpus_done':
>> arch/powerpc/kernel/smp.c:1735:(.init.text+0xd5c): undefined reference to `pseries_init_ec_vp_cores'
vim +1735 arch/powerpc/kernel/smp.c
1721
1722 void __init smp_cpus_done(unsigned int max_cpus)
1723 {
1724 /*
1725 * We are running pinned to the boot CPU, see rest_init().
1726 */
1727 if (smp_ops && smp_ops->setup_cpu)
1728 smp_ops->setup_cpu(boot_cpuid);
1729
1730 if (smp_ops && smp_ops->bringup_done)
1731 smp_ops->bringup_done();
1732
1733 dump_numa_cpu_topology();
1734 build_sched_topology();
> 1735 pseries_init_ec_vp_cores();
1736 }
1737
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 12/17] powerpc: method to initialize ec and vp cores
2025-11-19 12:44 ` [PATCH 12/17] powerpc: method to initialize ec and vp cores Shrikanth Hegde
2025-11-21 8:29 ` kernel test robot
@ 2025-11-21 10:14 ` kernel test robot
1 sibling, 0 replies; 41+ messages in thread
From: kernel test robot @ 2025-11-21 10:14 UTC (permalink / raw)
To: Shrikanth Hegde, linux-kernel, linuxppc-dev
Cc: llvm, oe-kbuild-all, sshegde, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, srikar, gregkh,
pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
dietmar.eggemann, christophe.leroy
Hi Shrikanth,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on next-20251121]
[cannot apply to powerpc/next powerpc/fixes driver-core/driver-core-testing driver-core/driver-core-next driver-core/driver-core-linus linus/master v6.18-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Shrikanth-Hegde/sched-docs-Document-cpu_paravirt_mask-and-Paravirt-CPU-concept/20251119-204931
base: tip/sched/core
patch link: https://lore.kernel.org/r/20251119124449.1149616-13-sshegde%40linux.ibm.com
patch subject: [PATCH 12/17] powerpc: method to initialize ec and vp cores
config: powerpc-pasemi_defconfig (https://download.01.org/0day-ci/archive/20251121/202511211747.WJdFJoRB-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 9e9fe08b16ea2c4d9867fb4974edf2a3776d6ece)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251121/202511211747.WJdFJoRB-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511211747.WJdFJoRB-lkp@intel.com/
All errors (new ones prefixed by >>):
>> ld.lld: error: undefined symbol: pseries_init_ec_vp_cores
>>> referenced by smp.c
>>> arch/powerpc/kernel/smp.o:(smp_cpus_done) in archive vmlinux.a
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 09/17] sched/core: Add support for nohz_full CPUs
2025-11-21 4:40 ` Shrikanth Hegde
@ 2025-11-24 4:36 ` K Prateek Nayak
0 siblings, 0 replies; 41+ messages in thread
From: K Prateek Nayak @ 2025-11-24 4:36 UTC (permalink / raw)
To: Shrikanth Hegde, linux-kernel, linuxppc-dev
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, srikar, gregkh, pbonzini, seanjc, vschneid, iii, huschle,
rostedt, dietmar.eggemann, christophe.leroy
Hello Shrikanth,
On 11/21/2025 10:10 AM, Shrikanth Hegde wrote:
>> Wouldn't the task be pinned if it is running on a nohz CPU?
>> We don't push out pinned tasks so this seems unnecessary.
>> Am I missing something?
>>
>
> No. Task will not be pinned if it running on nohz_full unless
> user has done a taskset. They are still part of sched domains.
>
> Pinning is usually true for isolcpus. You need to explicity set taskset
> for isolcpus since by default you won't go to those CPUs.
TIL! I was under the impression that nohz_full is a superset of isolcpus
but I was clearly mistaken. Thank you for clarifying.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 16/17] sysfs: Provide write method for paravirt
2025-11-19 12:44 ` [PATCH 16/17] sysfs: Provide write method for paravirt Shrikanth Hegde
@ 2025-11-24 17:04 ` Greg KH
2025-11-24 17:24 ` Steven Rostedt
0 siblings, 1 reply; 41+ messages in thread
From: Greg KH @ 2025-11-24 17:04 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
dietmar.eggemann, christophe.leroy
On Wed, Nov 19, 2025 at 06:14:48PM +0530, Shrikanth Hegde wrote:
> This is a debug patch which could be used to set the range of CPUs as
> paravirt.
>
> One could make use of this for quick testing of this infra instead of writing
> arch specific code. This allows checking some corner cases by providing custom
> cpumasks which isn't possible with arch method.
>
> echo 100-200,600-700 > /sys/devices/system/cpu/paravirt
> cat /sys/devices/system/cpu/paravirt
> 100-200,600-700
>
> echo > /sys/devices/system/cpu/paravirt
> cat /sys/devices/system/cpu/paravirt
>
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> This is currently not meant be merged, since paravirt sysfs file is meant
> to be Read-Only. Hence the documentation hasn't changed. If this method is
> really helpful, then can consider including it depending on the
> discussion.
As you added this to this series, if it is picked up, it WILL be merged
:(
Please try a "Nacked-by:" or something else to keep patches from being
applied. Or better yet, send them as a totally separate series.
thanks,
greg k-h
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (16 preceding siblings ...)
2025-11-19 12:44 ` [PATCH 17/17] sysfs: disable arch handling if paravirt file being written Shrikanth Hegde
@ 2025-11-24 17:05 ` Greg KH
2025-11-25 2:39 ` Shrikanth Hegde
2025-11-27 10:44 ` Shrikanth Hegde
` (2 subsequent siblings)
20 siblings, 1 reply; 41+ messages in thread
From: Greg KH @ 2025-11-24 17:05 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
dietmar.eggemann, christophe.leroy
On Wed, Nov 19, 2025 at 06:14:32PM +0530, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices were
> discussed earlier[1].
>
> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>
> This is likely the version which would be used for LPC2025 discussion on
> this topic. Feel free to provide your suggestion and hoping for a solution
> that works for different architectures and it's use cases.
>
> All the existing alternatives such as cpu hotplug, creating isolated
> partitions etc break the user affinity. Since number of CPUs to use change
> depending on the steal time, it is not driven by User. Hence it would be
> wrong to break the affinity. This series allows if the task is pinned
> only paravirt CPUs, it will continue running there.
>
> Changes compared v3[1]:
There is no "v" for this series :(
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 16/17] sysfs: Provide write method for paravirt
2025-11-24 17:04 ` Greg KH
@ 2025-11-24 17:24 ` Steven Rostedt
2025-11-25 2:49 ` Shrikanth Hegde
0 siblings, 1 reply; 41+ messages in thread
From: Steven Rostedt @ 2025-11-24 17:24 UTC (permalink / raw)
To: Greg KH
Cc: Shrikanth Hegde, linux-kernel, linuxppc-dev, mingo, peterz,
juri.lelli, vincent.guittot, tglx, yury.norov, maddy, srikar,
pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
dietmar.eggemann, christophe.leroy
On Mon, 24 Nov 2025 18:04:48 +0100
Greg KH <gregkh@linuxfoundation.org> wrote:
> As you added this to this series, if it is picked up, it WILL be merged
> :(
>
> Please try a "Nacked-by:" or something else to keep patches from being
> applied. Or better yet, send them as a totally separate series.
Agreed. But when I do this to a patch in a series, I usually add in subject:
[PATCH 16/17][DO NOT APPLY!!!] sysfs: Provide write method for paravirt
in order to make it stand out, and not a footnote after the tags.
-- Steve
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
2025-11-24 17:05 ` [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Greg KH
@ 2025-11-25 2:39 ` Shrikanth Hegde
2025-11-25 7:48 ` Christophe Leroy (CS GROUP)
0 siblings, 1 reply; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-25 2:39 UTC (permalink / raw)
To: Greg KH
Cc: linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
dietmar.eggemann, christophe.leroy
Hi Greg.
On 11/24/25 10:35 PM, Greg KH wrote:
> On Wed, Nov 19, 2025 at 06:14:32PM +0530, Shrikanth Hegde wrote:
>> Detailed problem statement and some of the implementation choices were
>> discussed earlier[1].
>>
>> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>
>> This is likely the version which would be used for LPC2025 discussion on
>> this topic. Feel free to provide your suggestion and hoping for a solution
>> that works for different architectures and it's use cases.
>>
>> All the existing alternatives such as cpu hotplug, creating isolated
>> partitions etc break the user affinity. Since number of CPUs to use change
>> depending on the steal time, it is not driven by User. Hence it would be
>> wrong to break the affinity. This series allows if the task is pinned
>> only paravirt CPUs, it will continue running there.
>>
>> Changes compared v3[1]:
>
> There is no "v" for this series :(
>
I thought about adding v1.
I made it PATCH instead of RFC PATCH since functionally it should
be complete now with the arch bits. Since it is a v1, I remember people
usually send it out without adding the v1 tag; only later versions carry
tags such as v2.
I will use v2 for the next series.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 16/17] sysfs: Provide write method for paravirt
2025-11-24 17:24 ` Steven Rostedt
@ 2025-11-25 2:49 ` Shrikanth Hegde
2025-11-25 15:52 ` Steven Rostedt
0 siblings, 1 reply; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-25 2:49 UTC (permalink / raw)
To: Steven Rostedt, Greg KH
Cc: linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
seanjc, kprateek.nayak, vschneid, iii, huschle, dietmar.eggemann,
christophe.leroy
On 11/24/25 10:54 PM, Steven Rostedt wrote:
> On Mon, 24 Nov 2025 18:04:48 +0100
> Greg KH <gregkh@linuxfoundation.org> wrote:
>
>> As you added this to this series, if it is picked up, it WILL be merged
>> :(
>>
>> Please try a "Nacked-by:" or something else to keep patches from being
>> applied. Or better yet, send them as a totally separate series.
>
> Agreed. But when I do this to a patch in a series, I usually add in subject:
>
> [PATCH 16/17][DO NOT APPLY!!!] sysfs: Provide write method for paravirt
>
> in order to make it stand out, and not a footnote after the tags.
>
> -- Steve
>
I can follow this.
PS: I was being cautious after the mistake with the last series.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
2025-11-25 2:39 ` Shrikanth Hegde
@ 2025-11-25 7:48 ` Christophe Leroy (CS GROUP)
2025-11-25 8:48 ` Shrikanth Hegde
0 siblings, 1 reply; 41+ messages in thread
From: Christophe Leroy (CS GROUP) @ 2025-11-25 7:48 UTC (permalink / raw)
To: Shrikanth Hegde, Greg KH
Cc: linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
dietmar.eggemann
Hi Shrikanth,
On 25/11/2025 at 03:39, Shrikanth Hegde wrote:
> Hi Greg.
>
> On 11/24/25 10:35 PM, Greg KH wrote:
>> On Wed, Nov 19, 2025 at 06:14:32PM +0530, Shrikanth Hegde wrote:
>>> Detailed problem statement and some of the implementation choices were
>>> discussed earlier[1].
>>>
>>> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>>
>>> This is likely the version which would be used for LPC2025 discussion on
>>> this topic. Feel free to provide your suggestion and hoping for a
>>> solution
>>> that works for different architectures and it's use cases.
>>>
>>> All the existing alternatives such as cpu hotplug, creating isolated
>>> partitions etc break the user affinity. Since number of CPUs to use
>>> change
>>> depending on the steal time, it is not driven by User. Hence it would be
>>> wrong to break the affinity. This series allows if the task is pinned
>>> only paravirt CPUs, it will continue running there.
>>>
>>> Changes compared v3[1]:
>>
>> There is no "v" for this series :(
>>
>
> I thought about adding v1.
>
> I made it as PATCH from RFC PATCH since functionally it should
> be complete now with arch bits. Since it is v1, I remember usually
> people send out without adding v1. after v1 had tags such as v2.
>
> I will keep v2 for the next series.
>
But you are listing changes compared to v3, how can it be a v1 ?
Shouldn't it be a v4 ? Or in reality a v5 as you already sent a v4 here [1].
[1]
https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
Christophe
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
2025-11-25 7:48 ` Christophe Leroy (CS GROUP)
@ 2025-11-25 8:48 ` Shrikanth Hegde
0 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-25 8:48 UTC (permalink / raw)
To: Christophe Leroy (CS GROUP), Greg KH
Cc: linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
dietmar.eggemann
Hi Christophe, Greg
>>>
>>> There is no "v" for this series :(
>>>
>>
>> I thought about adding v1.
>>
>> I made it as PATCH from RFC PATCH since functionally it should
>> be complete now with arch bits. Since it is v1, I remember usually
>> people send out without adding v1. after v1 had tags such as v2.
>>
>> I will keep v2 for the next series.
>>
>
> But you are listing changes compared to v3, how can it be a v1 ?
> Shouldn't it be a v4 ? Or in reality a v5 as you already sent a v4 here
> [1].
>
> [1] https://lore.kernel.org/all/20251119062100.1112520-1-
> sshegde@linux.ibm.com/
>
> Christophe
Sorry about the confusion in the numbering. Hopefully the summary below helps with reviewing.
If there are no objections, I will make the next one v2. Please let me know.
Revision logs:
++++++++++++++++++++++++++++++++++++++
RFC PATCH v4 -> PATCH (This series)
++++++++++++++++++++++++++++++++++++++
- Last two patches were sent out separately instead of being with the series.
Sent them as part of the series this time.
- Use DEVICE_ATTR_RW instead (greg)
- Made it as PATCH since arch specific handling completes the
functionality.
+++++++++++++++++++++++++++++++++
RFC PATCH v3 -> RFC PATCH v4
+++++++++++++++++++++++++++++++++
- Introduced computation of steal time in powerpc code.
- Derive number of CPUs to use and mark the remaining as paravirt based
on steal values.
- Provide debugfs knobs to alter how steal time values being used.
- Removed static key check for paravirt CPUs (Yury)
- Removed preempt_disable/enable while calling stopper (Prateek)
- Made select_idle_sibling and friends aware of paravirt CPUs.
- Removed 3 unused schedstat fields and introduced 2 related to paravirt
handling.
- Handled nohz_full case by enabling tick on it when there is CFS/RT on
it.
- Updated debug patch to override arch behavior for easier debugging
during development.
- Kept the method of pushing only the current task out instead of moving all tasks
on the rq, given the complexity of the latter.
+++++++++++++++++++++++++++++++++
RFC v2 -> RFC PATCH v3
+++++++++++++++++++++++++++++++++
- Renamed to paravirt_cpus_mask
- Folded the changes under CONFIG_PARAVIRT.
- Fixed the crash due to work_buf corruption while using
stop_one_cpu_nowait.
- Added sysfs documentation.
- Copied most of __balance_push_cpu_stop into a new function; this helps move
the code out of CONFIG_HOTPLUG_CPU.
- Some of the code movement suggested.
+++++++++++++++++++++++++++++++++
RFC PATCH -> RFC v2
+++++++++++++++++++++++++++++++++
- Renamed to cpu_avoid_mask in place of cpu_parked_mask.
- Used a static key such that no impact to regular case.
- add sysfs file to show avoid CPUs.
- Make RT understand avoid CPUs.
- Add documentation patch
- Took care of reported compile error when NR_CPUS=1
PATCH : https://lore.kernel.org/all/20251119124449.1149616-1-sshegde@linux.ibm.com/
RFC PATCH v4 : https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/#r
RFC PATCH v3 : https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/#r
RFC v2 : https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/#r
RFC PATCH : https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@linux.ibm.com/
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 16/17] sysfs: Provide write method for paravirt
2025-11-25 2:49 ` Shrikanth Hegde
@ 2025-11-25 15:52 ` Steven Rostedt
2025-11-25 16:02 ` Konstantin Ryabitsev
0 siblings, 1 reply; 41+ messages in thread
From: Steven Rostedt @ 2025-11-25 15:52 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Greg KH, linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
seanjc, kprateek.nayak, vschneid, iii, huschle, dietmar.eggemann,
christophe.leroy
On Tue, 25 Nov 2025 08:19:45 +0530
Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> > [PATCH 16/17][DO NOT APPLY!!!] sysfs: Provide write method for paravirt
> >
> > in order to make it stand out, and not a footnote after the tags.
> >
> > -- Steve
> >
>
> I can follow this.
>
> PS: I was skeptical after last series mistake.
You may also want remove the [ ] and use '--' instead:
[PATCH 16/17] -- DO NOT APPLY!!! -- sysfs: Provide write method for paravirt
Because if someone were to do a b4 pull it would strip out the text within
the brackets. Using -- DO NOT APPLY!!! -- instead, would keep it in the
commit message. And then seeing that in the shortlog would be a really big
red flag ;-)
-- Steve
PS. I plan on changing my usages to use '--' instead of '[' ']'
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 16/17] sysfs: Provide write method for paravirt
2025-11-25 15:52 ` Steven Rostedt
@ 2025-11-25 16:02 ` Konstantin Ryabitsev
2025-11-25 16:08 ` Steven Rostedt
0 siblings, 1 reply; 41+ messages in thread
From: Konstantin Ryabitsev @ 2025-11-25 16:02 UTC (permalink / raw)
To: Steven Rostedt
Cc: Shrikanth Hegde, Greg KH, linux-kernel, linuxppc-dev, mingo,
peterz, juri.lelli, vincent.guittot, tglx, yury.norov, maddy,
srikar, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
dietmar.eggemann, christophe.leroy
On Tue, Nov 25, 2025 at 10:52:18AM -0500, Steven Rostedt wrote:
> You may also want remove the [ ] and use '--' instead:
>
> [PATCH 16/17] -- DO NOT APPLY!!! -- sysfs: Provide write method for paravirt
>
> Because if someone were to do a b4 pull it would strip out the text within
> the brackets. Using -- DO NOT APPLY!!! -- instead, would keep it in the
> commit message. And then seeing that in the shortlog would be a really big
> red flag ;-)
Small correction -- it's git itself that strips all content inside [], not b4
specifically.
-K
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 16/17] sysfs: Provide write method for paravirt
2025-11-25 16:02 ` Konstantin Ryabitsev
@ 2025-11-25 16:08 ` Steven Rostedt
0 siblings, 0 replies; 41+ messages in thread
From: Steven Rostedt @ 2025-11-25 16:08 UTC (permalink / raw)
To: Konstantin Ryabitsev
Cc: Shrikanth Hegde, Greg KH, linux-kernel, linuxppc-dev, mingo,
peterz, juri.lelli, vincent.guittot, tglx, yury.norov, maddy,
srikar, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
dietmar.eggemann, christophe.leroy
On Tue, 25 Nov 2025 11:02:38 -0500
Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Small correction -- it's git itself that strips all content inside [], not b4
> specifically.
Yes of course. Sorry, I didn't want to blame b4. I actually download the
patch series from patchwork and then do git am which does the stripping. I
just wanted to state a more common workflow, which is usually b4 followed
by a blind execution of git am. Where it does the stripping isn't of
importance at that point.
-- Steve
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (17 preceding siblings ...)
2025-11-24 17:05 ` [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Greg KH
@ 2025-11-27 10:44 ` Shrikanth Hegde
2025-12-04 13:28 ` Ilya Leoshkevich
2025-12-08 4:47 ` K Prateek Nayak
20 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-11-27 10:44 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, srikar, gregkh, pbonzini, seanjc, kprateek.nayak, vschneid,
iii, huschle, rostedt, dietmar.eggemann, christophe.leroy
On 11/19/25 6:14 PM, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices were
> discussed earlier[1].
Performance data on x86 and PowerPC:
++++++++++++++++++++++++++++++++++++++++++++++++
PowerPC: LPAR(VM) Running on powerVM hypervisor
++++++++++++++++++++++++++++++++++++++++++++++++
Host: 126 cores available in pool.
VM1: 96VP/64EC - 768 CPUs
VM2: 72VP/48EC - 576 CPUs
(VP- Virtual Processor core), (EC - Entitled Cores)
steal_check_frequency:1
steal_ratio_high:400
steal_ratio_low:150
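(A simplified sketch of how these knobs are used; this is not the exact code
from the series, and the units and check-frequency semantics here are
illustrative assumptions only:)

extern void update_soft_entitlement(unsigned long steal_ratio);  /* from the powerpc patches */

/* Mirrors the debugfs knobs listed above; values are the ones used in this run */
static unsigned long steal_ratio_high = 400;    /* act when the steal ratio goes above this */
static unsigned long steal_ratio_low  = 150;    /* act again when it falls below this */
static unsigned long steal_check_freq = 1;      /* evaluate every N sample intervals */

/* Called periodically with the observed steal ratio; illustrative only */
static void process_steal_sample(unsigned long steal_ratio)
{
        static unsigned long samples;

        if (++samples % steal_check_freq)
                return;

        /* Hysteresis: do nothing while the ratio sits between low and high */
        if (steal_ratio > steal_ratio_high || steal_ratio < steal_ratio_low)
                update_soft_entitlement(steal_ratio);
}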
Scenarios:
Scenario 1: (Major improvement)
VM1 is running daytrader[1] and VM2 is running stress-ng --cpu=$(nproc)
Note: High gains. With upstream, the steal time was around 15%. With the series it comes down
to 3%. With further tuning it could be reduced further.
upstream +series
daytrader 1x 1.7x <<- 70% gain
throughput
-----------
Scenario 2: (improves when thread_count < num_cpus)
VM1 is running schbench and VM2 is running stress-ng --cpu=$(nproc)
Note: Values are average of 5 runs and they are wakeup latencies
schbench -t 400 upstream +series
50.0th: 18.00 16.60
90.0th: 174.00 46.80
99.0th: 3197.60 928.80
99.9th: 6203.20 4539.20
average rps: 39665.61 42334.65
schbench -t 600 upstream +series
50.0th: 23.80 19.80
90.0th: 917.20 439.00
99.0th: 5582.40 3869.60
99.9th: 8982.40 6574.40
average rps: 39541.00 40018.11
-----------
Scenario 3: (Improves)
VM1 is running hackbench and VM2 is running stress-ng --cpu=$(nproc)
Note: Values are average of 10 runs and 20000 loops.
Process 10 groups 2.84 2.62
Process 20 groups 5.39 4.48
Process 30 groups 7.51 6.29
Process 40 groups 9.88 7.42
Process 50 groups 12.46 9.54
Process 60 groups 14.76 12.09
thread 10 groups 2.93 2.70
thread 20 groups 5.79 4.78
Process(Pipe) 10 groups 2.31 2.18
Process(Pipe) 20 groups 3.32 3.26
Process(Pipe) 30 groups 4.19 4.14
Process(Pipe) 40 groups 5.18 5.53
Process(Pipe) 50 groups 6.57 6.80
Process(Pipe) 60 groups 8.21 8.13
thread(Pipe) 10 groups 2.42 2.24
thread(Pipe) 20 groups 3.62 3.42
-----------
Notes:
Numbers might be very favorable since VM2 is constantly running and has some CPUs
marked as paravirt when there is steal time; the thresholds might also have played a role.
The plan is to run the same workloads, i.e. hackbench and schbench, on both VMs and observe the behavior.
VM1 has its CPUs distributed equally across NUMA nodes, while VM2 does not. Since CPUs are marked paravirt
based on core count, some nodes on VM2 would have been left unused, and that could have added a boost to
VM1 performance, especially for daytrader.
[1]: Daytrader is a real-life benchmark which simulates stock trading.
https://www.ibm.com/docs/en/linux-on-systems?topic=descriptions-daytrader-benchmark-application
https://cwiki.apache.org/confluence/display/GMOxDOC12/Daytrader
TODO: Get numbers with very high concurrency of hackbench/schbench.
+++++++++++++++++++++++++++++++
on x86_64 (Laptop running KVMs)
+++++++++++++++++++++++++++++++
Host: 8 CPUs.
Two VM. Each spawned with -smp 8.
-----------
Scenario 1:
Both VMs are running hackbench with 10 process groups and 10000 loops.
Values are the average of 3 runs. High steal of close to 50% was seen when
running upstream, so CPUs 4-7 were marked as paravirt by writing to the sysfs file.
Since the laptop has a lot of host tasks running, there will still be some steal time.
hackbench 10 groups upstream +series (4-7 marked as paravirt)
(seconds) 58 54.42
Note: Having 5 groups helps too. But when concurrency becomes very high (40 groups), it regresses.
-----------
Scenario 2:
Both VMs are running schbench. Values are the average of 2 runs.
"schbench -t 4 -r 30 -i 30" (latencies improve but rps is slightly less)
wakeup latencies upstream +series(4-7 marked as paravirt)
50.0th 25.5 13.5
90.0th 70.0 30.0
99.0th 2588.0 1992.0
99.9th 3844.0 6032.0
average rps: 338 326
schbench -t 8 -r 30 -i 30 (Major degradation of rps)
wakeup latencies upstream +series(4-7 marked as paravirt)
50.0th 15.0 11.5
90.0th 1630.0 2844.0
99.0th 4314.0 6624.0
99.9th 8572.0 10896.0
average rps: 393 240.5
Anything higher also regresses. Need to see why that might be. Maybe too many context
switches, since the number of threads is very high and the number of available CPUs is small.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (18 preceding siblings ...)
2025-11-27 10:44 ` Shrikanth Hegde
@ 2025-12-04 13:28 ` Ilya Leoshkevich
2025-12-05 5:30 ` Shrikanth Hegde
2025-12-08 4:47 ` K Prateek Nayak
20 siblings, 1 reply; 41+ messages in thread
From: Ilya Leoshkevich @ 2025-12-04 13:28 UTC (permalink / raw)
To: Shrikanth Hegde, linux-kernel, linuxppc-dev
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, srikar, gregkh, pbonzini, seanjc, kprateek.nayak, vschneid,
huschle, rostedt, dietmar.eggemann, christophe.leroy, linux-s390
On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices
> were
> discussed earlier[1].
>
> [1]:
> https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>
> This is likely the version which would be used for LPC2025 discussion
> on
> this topic. Feel free to provide your suggestion and hoping for a
> solution
> that works for different architectures and it's use cases.
>
> All the existing alternatives such as cpu hotplug, creating isolated
> partitions etc break the user affinity. Since number of CPUs to use
> change
> depending on the steal time, it is not driven by User. Hence it would
> be
> wrong to break the affinity. This series allows if the task is pinned
> only paravirt CPUs, it will continue running there.
>
> Changes compared v3[1]:
>
> - Introduced computation of steal time in powerpc code.
> - Derive number of CPUs to use and mark the remaining as paravirt
> based
> on steal values.
> - Provide debugfs knobs to alter how steal time values being used.
> - Removed static key check for paravirt CPUs (Yury)
> - Removed preempt_disable/enable while calling stopper (Prateek)
> - Made select_idle_sibling and friends aware of paravirt CPUs.
> - Removed 3 unused schedstat fields and introduced 2 related to
> paravirt
> handling.
> - Handled nohz_full case by enabling tick on it when there is CFS/RT
> on
> it.
> - Updated helper patch to override arch behaviour for easier
> debugging
> during development.
> - Kept
>
> Changes compared to v4[2]:
> - Last two patches were sent out separate instead of being with
> series.
> That created confusion. Those two patches are debug patches one can
> make use to check functionality across acrhitectures. Sorry about
> that.
> - Use DEVICE_ATTR_RW instead (greg)
> - Made it as PATCH since arch specific handling completes the
> functionality.
>
> [2]:
> https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
>
> TODO:
>
> - Get performance numbers on PowerPC, x86 and S390. Hopefully by next
> week. Didn't want to hold the series till then.
>
> - The CPUs to mark as paravirt is very simple and doesn't work when
> vCPUs aren't spread out uniformly across NUMA nodes. Ideal would be
> splice
> the numbers based on how many CPUs each NUMA node has. It is quite
> tricky to do specially since cpumask can be on stack too. Given
> NR_CPUS can be 8192 and nr_possible_nodes 32. Haven't got my head
> into
> solving it yet. Maybe there is easier way.
>
> - DLPAR Add/Remove needs to call init of EC/VP cores (powerpc
> specific)
>
> - Userspace tools awareness such as irqbalance.
>
> - Delve into design of hint from Hyeprvisor(HW Hint). i.e Host
> informs
> guest which/how many CPUs it has to use at this moment. This
> interface
> should work across archs with each arch doing its specific
> handling.
>
> - Determine the default values for steal time related knobs
> empirically and document them.
>
> - Need to check safety against CPU hotplug specially in
> process_steal.
>
>
> Applies cleanly on tip/master:
> commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b
>
>
> Thanks to srikar for providing the initial code around powerpc steal
> time handling code. Thanks to all who went through and provided
> reviews.
>
> PS: I haven't found a better name. Please suggest if you have any.
>
> Shrikanth Hegde (17):
> sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
> cpumask: Introduce cpu_paravirt_mask
> sched/core: Dont allow to use CPU marked as paravirt
> sched/debug: Remove unused schedstats
> sched/fair: Add paravirt movements for proc sched file
> sched/fair: Pass current cpu in select_idle_sibling
> sched/fair: Don't consider paravirt CPUs for wakeup and load
> balance
> sched/rt: Don't select paravirt CPU for wakeup and push/pull rt
> task
> sched/core: Add support for nohz_full CPUs
> sched/core: Push current task from paravirt CPU
> sysfs: Add paravirt CPU file
> powerpc: method to initialize ec and vp cores
> powerpc: enable/disable paravirt CPUs based on steal time
> powerpc: process steal values at fixed intervals
> powerpc: add debugfs file for controlling handling on steal values
> sysfs: Provide write method for paravirt
> sysfs: disable arch handling if paravirt file being written
>
> .../ABI/testing/sysfs-devices-system-cpu | 9 +
> Documentation/scheduler/sched-arch.rst | 37 +++
> arch/powerpc/include/asm/smp.h | 1 +
> arch/powerpc/kernel/smp.c | 1 +
> arch/powerpc/platforms/pseries/lpar.c | 223
> ++++++++++++++++++
> arch/powerpc/platforms/pseries/pseries.h | 1 +
> drivers/base/cpu.c | 59 +++++
> include/linux/cpumask.h | 20 ++
> include/linux/sched.h | 9 +-
> kernel/sched/core.c | 106 ++++++++-
> kernel/sched/debug.c | 5 +-
> kernel/sched/fair.c | 42 +++-
> kernel/sched/rt.c | 11 +-
> kernel/sched/sched.h | 9 +
> 14 files changed, 519 insertions(+), 14 deletions(-)
The capability to temporarily exclude CPUs from scheduling might be
beneficial for s390x, where users often run Linux using a proprietary
hypervisor called PR/SM and with high overcommit. In these
circumstances virtual CPUs may not be scheduled by a hypervisor for a
very long time.
Today we have an upstream feature called "Hiperdispatch", which
determines that this is about to happen and uses Capacity Aware
Scheduling to prevent processes from being placed on the affected CPUs.
However, at least when used for this purpose, Capacity Aware Scheduling
is best effort and fails to move tasks away from the affected CPUs
under high load.
Therefore I have decided to smoke test this series.
For the purposes of smoke testing, I set up a number of KVM virtual
machines and start the same benchmark inside each one. Then I collect
and compare the aggregate throughput numbers. I have not done testing
with PR/SM yet, but I plan to do this and report back. I also have not
tested this with VMs that are not 100% utilized yet.
Benchmark parameters:
$ sysbench cpu run --threads=$(nproc) --time=10
$ schbench -r 10 --json --no-locking
$ hackbench --groups 10 --process --loops 5000
$ pgbench -h $WORKDIR --client=$(nproc) --time=10
Figures:
s390x (16 host CPUs):
Benchmark #VMs #CPUs/VM ΔRPS (%)
----------- ------ ---------- ----------
hackbench 16 4 60.58%
pgbench 16 4 50.01%
hackbench 8 8 46.18%
hackbench 4 8 43.54%
hackbench 2 16 43.23%
hackbench 12 4 42.92%
hackbench 8 4 35.53%
hackbench 4 16 30.98%
pgbench 12 4 18.41%
hackbench 2 24 7.32%
pgbench 8 4 6.84%
pgbench 2 24 3.38%
pgbench 2 16 3.02%
pgbench 4 16 2.08%
hackbench 2 32 1.46%
pgbench 4 8 1.30%
schbench 2 16 0.72%
schbench 4 8 -0.09%
schbench 4 4 -0.20%
schbench 8 8 -0.41%
sysbench 8 4 -0.46%
sysbench 4 8 -0.53%
schbench 8 4 -0.65%
sysbench 2 16 -0.76%
schbench 2 8 -0.77%
sysbench 8 8 -1.72%
schbench 2 24 -1.98%
schbench 12 4 -2.03%
sysbench 12 4 -2.13%
pgbench 2 32 -3.15%
sysbench 16 4 -3.17%
schbench 16 4 -3.50%
sysbench 2 8 -4.01%
pgbench 8 8 -4.10%
schbench 4 16 -5.93%
sysbench 4 4 -5.94%
pgbench 2 4 -6.40%
hackbench 2 8 -10.04%
hackbench 4 4 -10.91%
pgbench 4 4 -11.05%
sysbench 2 24 -13.07%
sysbench 4 16 -13.59%
hackbench 2 4 -13.96%
pgbench 2 8 -16.16%
schbench 2 4 -24.14%
schbench 2 32 -24.25%
sysbench 2 4 -24.98%
sysbench 2 32 -32.84%
x86_64 (32 host CPUs):
Benchmark #VMs #CPUs/VM ΔRPS (%)
----------- ------ ---------- ----------
hackbench 4 32 87.02%
hackbench 8 16 48.45%
hackbench 4 24 47.95%
hackbench 2 8 42.74%
hackbench 2 32 34.90%
pgbench 16 8 27.87%
pgbench 12 8 25.17%
hackbench 8 8 24.92%
hackbench 16 8 22.41%
hackbench 16 4 20.83%
pgbench 8 16 20.40%
hackbench 12 8 20.37%
hackbench 4 16 20.36%
pgbench 16 4 16.60%
pgbench 8 8 14.92%
hackbench 12 4 14.49%
pgbench 4 32 9.49%
pgbench 2 32 7.26%
hackbench 2 24 6.54%
pgbench 4 4 4.67%
pgbench 8 4 3.24%
pgbench 12 4 2.66%
hackbench 4 8 2.53%
pgbench 4 8 1.96%
hackbench 2 16 1.93%
schbench 4 32 1.24%
pgbench 2 8 0.82%
schbench 4 4 0.69%
schbench 2 32 0.44%
schbench 2 16 0.25%
schbench 12 8 -0.02%
sysbench 2 4 -0.02%
schbench 4 24 -0.12%
sysbench 2 16 -0.17%
schbench 12 4 -0.18%
schbench 2 4 -0.19%
sysbench 4 8 -0.23%
schbench 8 4 -0.24%
sysbench 2 8 -0.24%
schbench 4 8 -0.28%
sysbench 8 4 -0.30%
schbench 4 16 -0.37%
schbench 2 24 -0.39%
schbench 8 16 -0.49%
schbench 2 8 -0.67%
pgbench 4 16 -0.68%
schbench 8 8 -0.83%
sysbench 4 4 -0.92%
schbench 16 4 -0.94%
sysbench 12 4 -0.98%
sysbench 8 16 -1.52%
sysbench 16 4 -1.57%
pgbench 2 4 -1.62%
sysbench 12 8 -1.69%
schbench 16 8 -1.97%
sysbench 8 8 -2.08%
hackbench 8 4 -2.11%
pgbench 4 24 -3.20%
pgbench 2 24 -3.35%
sysbench 2 24 -3.81%
pgbench 2 16 -4.55%
sysbench 4 16 -5.10%
sysbench 16 8 -6.56%
sysbench 2 32 -8.24%
sysbench 4 32 -13.54%
sysbench 4 24 -13.62%
hackbench 2 4 -15.40%
hackbench 4 4 -17.71%
There are some huge wins, especially for hackbench, which corresponds
to Shrikanth's findings. There are some significant degradations too,
which I plan to debug. This may simply have to do with the simplistic
heuristic I am using for testing [1].
sysbench, for example, is not supposed to benefit from this series,
because it is not affected by overcommit. However, it definitely should
not degrade by 30%. Interestingly enough, this happens only with
certain combinations of VM and CPU counts, and this is reproducible.
Initially I saw degradations as bad as -80% with schbench. It
turned out this was caused by the userspace per-CPU locking it implements;
turning it off made the degradation go away. To me this looks like
something synthetic and not something used by a real-world application,
but please correct me if I am wrong - then this will have to be
resolved.
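For illustration, by "userspace per-CPU locking" I mean a pattern roughly
like the one below (my own sketch, not schbench's actual code); once tasks
are pushed off their usual CPUs, several workers end up serializing on the
same lock slot:

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>

#define MAX_CPUS 512
static pthread_spinlock_t cpu_lock[MAX_CPUS];

static void __attribute__((constructor)) init_cpu_locks(void)
{
        for (int i = 0; i < MAX_CPUS; i++)
                pthread_spin_init(&cpu_lock[i], PTHREAD_PROCESS_PRIVATE);
}

/* Each worker grabs the lock belonging to the CPU it currently runs on */
static void run_critical(void (*work)(void))
{
        int cpu = sched_getcpu();

        pthread_spin_lock(&cpu_lock[cpu]);
        work();
        pthread_spin_unlock(&cpu_lock[cpu]);
}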
One note regarding the PARAVIRT Kconfig gating: s390x does not
select PARAVIRT today. For example, we determine steal time based on
CPU timers and clocks, not hypervisor hints. For now I had to add
dummy paravirt headers to test this series, but I would appreciate it if
the Kconfig gating was removed.
Others have already commented on the naming, and I would agree that
"paravirt" is really misleading. I cannot say that the previous "cpu-
avoid" one was perfect, but it was much better.
[1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
2025-12-04 13:28 ` Ilya Leoshkevich
@ 2025-12-05 5:30 ` Shrikanth Hegde
2025-12-15 17:39 ` Yury Norov
0 siblings, 1 reply; 41+ messages in thread
From: Shrikanth Hegde @ 2025-12-05 5:30 UTC (permalink / raw)
To: Ilya Leoshkevich, linux-kernel, linuxppc-dev
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, srikar, gregkh, pbonzini, seanjc, kprateek.nayak, vschneid,
huschle, rostedt, dietmar.eggemann, christophe.leroy, linux-s390
On 12/4/25 6:58 PM, Ilya Leoshkevich wrote:
> On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:
>> Detailed problem statement and some of the implementation choices
>> were
>> discussed earlier[1].
>>
>> [1]:
>> https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>
>> This is likely the version which would be used for LPC2025 discussion
>> on
>> this topic. Feel free to provide your suggestion and hoping for a
>> solution
>> that works for different architectures and it's use cases.
>>
>> All the existing alternatives such as cpu hotplug, creating isolated
>> partitions etc break the user affinity. Since number of CPUs to use
>> change
>> depending on the steal time, it is not driven by User. Hence it would
>> be
>> wrong to break the affinity. This series allows if the task is pinned
>> only paravirt CPUs, it will continue running there.
>>
>> Changes compared v3[1]:
>>
>> - Introduced computation of steal time in powerpc code.
>> - Derive number of CPUs to use and mark the remaining as paravirt
>> based
>> on steal values.
>> - Provide debugfs knobs to alter how steal time values being used.
>> - Removed static key check for paravirt CPUs (Yury)
>> - Removed preempt_disable/enable while calling stopper (Prateek)
>> - Made select_idle_sibling and friends aware of paravirt CPUs.
>> - Removed 3 unused schedstat fields and introduced 2 related to
>> paravirt
>> handling.
>> - Handled nohz_full case by enabling tick on it when there is CFS/RT
>> on
>> it.
>> - Updated helper patch to override arch behaviour for easier
>> debugging
>> during development.
>> - Kept
>>
>> Changes compared to v4[2]:
>> - Last two patches were sent out separate instead of being with
>> series.
>> That created confusion. Those two patches are debug patches one can
>> make use to check functionality across acrhitectures. Sorry about
>> that.
>> - Use DEVICE_ATTR_RW instead (greg)
>> - Made it as PATCH since arch specific handling completes the
>> functionality.
>>
>> [2]:
>> https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
>>
>> TODO:
>>
>> - Get performance numbers on PowerPC, x86 and S390. Hopefully by next
>> week. Didn't want to hold the series till then.
>>
>> - The CPUs to mark as paravirt is very simple and doesn't work when
>> vCPUs aren't spread out uniformly across NUMA nodes. Ideal would be
>> splice
>> the numbers based on how many CPUs each NUMA node has. It is quite
>> tricky to do specially since cpumask can be on stack too. Given
>> NR_CPUS can be 8192 and nr_possible_nodes 32. Haven't got my head
>> into
>> solving it yet. Maybe there is easier way.
>>
>> - DLPAR Add/Remove needs to call init of EC/VP cores (powerpc
>> specific)
>>
>> - Userspace tools awareness such as irqbalance.
>>
>> - Delve into design of hint from Hyeprvisor(HW Hint). i.e Host
>> informs
>> guest which/how many CPUs it has to use at this moment. This
>> interface
>> should work across archs with each arch doing its specific
>> handling.
>>
>> - Determine the default values for steal time related knobs
>> empirically and document them.
>>
>> - Need to check safety against CPU hotplug specially in
>> process_steal.
>>
>>
>> Applies cleanly on tip/master:
>> commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b
>>
>>
>> Thanks to srikar for providing the initial code around powerpc steal
>> time handling code. Thanks to all who went through and provided
>> reviews.
>>
>> PS: I haven't found a better name. Please suggest if you have any.
>>
>> Shrikanth Hegde (17):
>> sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
>> cpumask: Introduce cpu_paravirt_mask
>> sched/core: Dont allow to use CPU marked as paravirt
>> sched/debug: Remove unused schedstats
>> sched/fair: Add paravirt movements for proc sched file
>> sched/fair: Pass current cpu in select_idle_sibling
>> sched/fair: Don't consider paravirt CPUs for wakeup and load
>> balance
>> sched/rt: Don't select paravirt CPU for wakeup and push/pull rt
>> task
>> sched/core: Add support for nohz_full CPUs
>> sched/core: Push current task from paravirt CPU
>> sysfs: Add paravirt CPU file
>> powerpc: method to initialize ec and vp cores
>> powerpc: enable/disable paravirt CPUs based on steal time
>> powerpc: process steal values at fixed intervals
>> powerpc: add debugfs file for controlling handling on steal values
>> sysfs: Provide write method for paravirt
>> sysfs: disable arch handling if paravirt file being written
>>
>> .../ABI/testing/sysfs-devices-system-cpu | 9 +
>> Documentation/scheduler/sched-arch.rst | 37 +++
>> arch/powerpc/include/asm/smp.h | 1 +
>> arch/powerpc/kernel/smp.c | 1 +
>> arch/powerpc/platforms/pseries/lpar.c | 223
>> ++++++++++++++++++
>> arch/powerpc/platforms/pseries/pseries.h | 1 +
>> drivers/base/cpu.c | 59 +++++
>> include/linux/cpumask.h | 20 ++
>> include/linux/sched.h | 9 +-
>> kernel/sched/core.c | 106 ++++++++-
>> kernel/sched/debug.c | 5 +-
>> kernel/sched/fair.c | 42 +++-
>> kernel/sched/rt.c | 11 +-
>> kernel/sched/sched.h | 9 +
>> 14 files changed, 519 insertions(+), 14 deletions(-)
>
> The capability to temporarily exclude CPUs from scheduling might be
> beneficial for s390x, where users often run Linux using a proprietary
> hypervisor called PR/SM and with high overcommit. In these
> circumstances virtual CPUs may not be scheduled by a hypervisor for a
> very long time.
>
> Today we have an upstream feature called "Hiperdispatch", which
> determines that this is about to happen and uses Capacity Aware
> Scheduling to prevent processes from being placed on the affected CPUs.
> However, at least when used for this purpose, Capacity Aware Scheduling
> is best effort and fails to move tasks away from the affected CPUs
> under high load.
>
> Therefore I have decided to smoke test this series.
>
> For the purposes of smoke testing, I set up a number of KVM virtual
> machines and start the same benchmark inside each one. Then I collect
> and compare the aggregate throughput numbers. I have not done testing
> with PR/SM yet, but I plan to do this and report back. I also have not
> tested this with VMs that are not 100% utilized yet.
>
Best results would be when it works as a HW hint from the hypervisor.
> Benchmark parameters:
>
> $ sysbench cpu run --threads=$(nproc) --time=10
> $ schbench -r 10 --json --no-locking
> $ hackbench --groups 10 --process --loops 5000
> $ pgbench -h $WORKDIR --client=$(nproc) --time=10
>
> Figures:
>
> s390x (16 host CPUs):
>
> Benchmark #VMs #CPUs/VM ΔRPS (%)
> ----------- ------ ---------- ----------
> hackbench 16 4 60.58%
> pgbench 16 4 50.01%
> hackbench 8 8 46.18%
> hackbench 4 8 43.54%
> hackbench 2 16 43.23%
> hackbench 12 4 42.92%
> hackbench 8 4 35.53%
> hackbench 4 16 30.98%
> pgbench 12 4 18.41%
> hackbench 2 24 7.32%
> pgbench 8 4 6.84%
> pgbench 2 24 3.38%
> pgbench 2 16 3.02%
> pgbench 4 16 2.08%
> hackbench 2 32 1.46%
> pgbench 4 8 1.30%
> schbench 2 16 0.72%
> schbench 4 8 -0.09%
> schbench 4 4 -0.20%
> schbench 8 8 -0.41%
> sysbench 8 4 -0.46%
> sysbench 4 8 -0.53%
> schbench 8 4 -0.65%
> sysbench 2 16 -0.76%
> schbench 2 8 -0.77%
> sysbench 8 8 -1.72%
> schbench 2 24 -1.98%
> schbench 12 4 -2.03%
> sysbench 12 4 -2.13%
> pgbench 2 32 -3.15%
> sysbench 16 4 -3.17%
> schbench 16 4 -3.50%
> sysbench 2 8 -4.01%
> pgbench 8 8 -4.10%
> schbench 4 16 -5.93%
> sysbench 4 4 -5.94%
> pgbench 2 4 -6.40%
> hackbench 2 8 -10.04%
> hackbench 4 4 -10.91%
> pgbench 4 4 -11.05%
> sysbench 2 24 -13.07%
> sysbench 4 16 -13.59%
> hackbench 2 4 -13.96%
> pgbench 2 8 -16.16%
> schbench 2 4 -24.14%
> schbench 2 32 -24.25%
> sysbench 2 4 -24.98%
> sysbench 2 32 -32.84%
>
> x86_64 (32 host CPUs):
>
> Benchmark #VMs #CPUs/VM ΔRPS (%)
> ----------- ------ ---------- ----------
> hackbench 4 32 87.02%
> hackbench 8 16 48.45%
> hackbench 4 24 47.95%
> hackbench 2 8 42.74%
> hackbench 2 32 34.90%
> pgbench 16 8 27.87%
> pgbench 12 8 25.17%
> hackbench 8 8 24.92%
> hackbench 16 8 22.41%
> hackbench 16 4 20.83%
> pgbench 8 16 20.40%
> hackbench 12 8 20.37%
> hackbench 4 16 20.36%
> pgbench 16 4 16.60%
> pgbench 8 8 14.92%
> hackbench 12 4 14.49%
> pgbench 4 32 9.49%
> pgbench 2 32 7.26%
> hackbench 2 24 6.54%
> pgbench 4 4 4.67%
> pgbench 8 4 3.24%
> pgbench 12 4 2.66%
> hackbench 4 8 2.53%
> pgbench 4 8 1.96%
> hackbench 2 16 1.93%
> schbench 4 32 1.24%
> pgbench 2 8 0.82%
> schbench 4 4 0.69%
> schbench 2 32 0.44%
> schbench 2 16 0.25%
> schbench 12 8 -0.02%
> sysbench 2 4 -0.02%
> schbench 4 24 -0.12%
> sysbench 2 16 -0.17%
> schbench 12 4 -0.18%
> schbench 2 4 -0.19%
> sysbench 4 8 -0.23%
> schbench 8 4 -0.24%
> sysbench 2 8 -0.24%
> schbench 4 8 -0.28%
> sysbench 8 4 -0.30%
> schbench 4 16 -0.37%
> schbench 2 24 -0.39%
> schbench 8 16 -0.49%
> schbench 2 8 -0.67%
> pgbench 4 16 -0.68%
> schbench 8 8 -0.83%
> sysbench 4 4 -0.92%
> schbench 16 4 -0.94%
> sysbench 12 4 -0.98%
> sysbench 8 16 -1.52%
> sysbench 16 4 -1.57%
> pgbench 2 4 -1.62%
> sysbench 12 8 -1.69%
> schbench 16 8 -1.97%
> sysbench 8 8 -2.08%
> hackbench 8 4 -2.11%
> pgbench 4 24 -3.20%
> pgbench 2 24 -3.35%
> sysbench 2 24 -3.81%
> pgbench 2 16 -4.55%
> sysbench 4 16 -5.10%
> sysbench 16 8 -6.56%
> sysbench 2 32 -8.24%
> sysbench 4 32 -13.54%
> sysbench 4 24 -13.62%
> hackbench 2 4 -15.40%
> hackbench 4 4 -17.71%
>
> There are some huge wins, especially for hackbench, which corresponds
> to Shrikanth's findings. There are some significant degradations too,
> which I plan to debug. This may simply have to do with the simplistic
> heuristic I am using for testing [1].
>
Thank you very much for running these numbers!!
> sysbench, for example, is not supposed to benefit from this series,
> because it is not affected by overcommit. However, it definitely should
> not degrade by 30%. Interestingly enough, this happens only with
> certain combinations of VM and CPU counts, and this is reproducible.
>
Is the host bare metal? In those cases the cpufreq governor ramping up or down
might play a role (speculating).
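If that is a suspect, a quick check of the active governor/driver on the host
might help rule it in or out, e.g.:
  $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver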
> Initially I have seen degradations as bad as -80% with schbench. It
> turned out this was caused by userspace per-CPU locking it implements;
> turning it off caused the degradation to go away. To me this looks like
> something synthetic and not something used by real-world application,
> but please correct me if I am wrong - then this will have to be
> resolved.
>
That's nice to hear. I was concerned about the schbench RPS numbers; now I am a bit relieved.
Is this with the schbench -L option? I ran with it, and the regression I was seeing
earlier is gone now.
>
> One note regarding the PARAVIRT Kconfig gating: s390x does not
> select PARAVIRT today. For example, steal time we determine based on
> CPU timers and clocks, and not hypervisor hints. For now I had to add
> dummy paravirt headers to test this series. But I would appreciate if
> Kconfig gating was removed.
>
Keeping the PARAVIRT checks in place is probably the right thing. I will wait to see if
anyone objects.
> Others have already commented on the naming, and I would agree that
> "paravirt" is really misleading. I cannot say that the previous "cpu-
> avoid" one was perfect, but it was much better.
>
>
> [1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/
Will look into it. One thing to be careful about is the CPU numbering.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (19 preceding siblings ...)
2025-12-04 13:28 ` Ilya Leoshkevich
@ 2025-12-08 4:47 ` K Prateek Nayak
2025-12-08 9:57 ` Shrikanth Hegde
20 siblings, 1 reply; 41+ messages in thread
From: K Prateek Nayak @ 2025-12-08 4:47 UTC (permalink / raw)
To: Shrikanth Hegde, linux-kernel, linuxppc-dev
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, srikar, gregkh, pbonzini, seanjc, vschneid, iii, huschle,
rostedt, dietmar.eggemann, christophe.leroy
On 11/19/2025 6:14 PM, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices were
> discussed earlier[1].
>
> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>
> This is likely the version which would be used for LPC2025 discussion on
> this topic. Feel free to provide your suggestion and hoping for a solution
> that works for different architectures and it's use cases.
>
> All the existing alternatives such as cpu hotplug, creating isolated
> partitions etc break the user affinity. Since number of CPUs to use change
> depending on the steal time, it is not driven by User. Hence it would be
> wrong to break the affinity. This series allows if the task is pinned
> only paravirt CPUs, it will continue running there.
If maintaining task affinity is the only problem that cpusets don't
offer, attached below is a very naive prototype that seems to work in
my case without hitting any obvious splats so far.
Idea is to keep task affinity untouched, but remove the CPUs from
the sched domains.
That way, all the balancing, and wakeups will steer away from these
CPUs automatically but once the CPUs are put back, the balancing will
automatically move tasks back.
I tested this with a bunch of spinners and with partitions and both
seem to work as expected. For real world VM based testing, I pinned 2
6C/12C VMs to a 8C/16T LLC with 1:1 pinning - 2 virtual cores from
either VMs pin to same set of physical cores.
Running 8 groups of perf bench sched messaging on each VM at the same
time gives the following numbers for total runtime:
All CPUs available in the VM: 88.775s & 91.002s (2 cores overlap)
Only 4 cores available in the VM: 67.365s & 73.015s (No cores overlap)
Note: The unavailable mask didn't change in my runs. I've noticed a
bit of delay before the load balancer moves the tasks to the CPU
going from unavailable to available - your mileage may vary depending
on the frequency of mask updates.
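For reference, the knob the diff below adds is a root-only cpuset file on the cgroup v2
hierarchy; marking CPUs out and back in looks roughly like this (mount point and CPU list
are illustrative):
  echo "12-15" > /sys/fs/cgroup/cpuset.cpus.unavailable   # exclude CPUs 12-15 from the domains
  cat /sys/fs/cgroup/cpuset.cpus.unavailable              # 12-15
  echo ""      > /sys/fs/cgroup/cpuset.cpus.unavailable   # make all CPUs available again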
Following is the diff on top of tip/master:
(Very raw PoC; Only fair tasks are considered for now to push away)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2ddb256187b5..7c1cfdd7ffea 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,6 +174,10 @@ static inline void set_mems_allowed(nodemask_t nodemask)
}
extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+
+void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask);
+const struct cpumask *cpuset_unavailable_mask(void);
+bool cpuset_cpu_unavailable(int cpu);
#else /* !CONFIG_CPUSETS */
static inline bool cpusets_enabled(void) { return false; }
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index 337608f408ce..170aba16141e 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -59,6 +59,7 @@ typedef enum {
FILE_EXCLUSIVE_CPULIST,
FILE_EFFECTIVE_XCPULIST,
FILE_ISOLATED_CPULIST,
+ FILE_UNAVAILABLE_CPULIST,
FILE_CPU_EXCLUSIVE,
FILE_MEM_EXCLUSIVE,
FILE_MEM_HARDWALL,
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4aaad07b0bd1..22d38f2299c4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -87,6 +87,19 @@ static cpumask_var_t isolated_cpus;
static cpumask_var_t boot_hk_cpus;
static bool have_boot_isolcpus;
+/*
+ * CPUs that may be unavailable to run tasks as a result of physical
+ * constraints (vCPU being preempted, pCPU handling interrupt storm).
+ *
+ * Unlike isolated_cpus, the unavailable_cpus are simply excluded from
+ * HK_TYPE_DOMAIN but leave the tasks affinity untouched. These CPUs
+ * should be avoided unless the task has specifically asked to be run
+ * only on these CPUs.
+ */
+static cpumask_var_t unavailable_cpus;
+static cpumask_var_t available_tmp_mask; /* For intermediate operations. */
+static bool cpu_turned_unavailable;
+
/* List of remote partition root children */
static struct list_head remote_children;
@@ -844,6 +857,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
}
cpumask_and(doms[0], top_cpuset.effective_cpus,
housekeeping_cpumask(HK_TYPE_DOMAIN));
+ cpumask_andnot(doms[0], doms[0], unavailable_cpus);
goto done;
}
@@ -960,11 +974,13 @@ static int generate_sched_domains(cpumask_var_t **domains,
* The top cpuset may contain some boot time isolated
* CPUs that need to be excluded from the sched domain.
*/
- if (csa[i] == &top_cpuset)
+ if (csa[i] == &top_cpuset) {
cpumask_and(doms[i], csa[i]->effective_cpus,
housekeeping_cpumask(HK_TYPE_DOMAIN));
- else
- cpumask_copy(doms[i], csa[i]->effective_cpus);
+ cpumask_andnot(doms[i], doms[i], unavailable_cpus);
+ } else {
+ cpumask_andnot(doms[i], csa[i]->effective_cpus, unavailable_cpus);
+ }
if (dattr)
dattr[i] = SD_ATTR_INIT;
}
@@ -985,6 +1001,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
}
cpumask_or(dp, dp, csa[j]->effective_cpus);
cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
+ cpumask_andnot(dp, dp, unavailable_cpus);
if (dattr)
update_domain_attr_tree(dattr + nslot, csa[j]);
}
@@ -1418,6 +1435,17 @@ bool cpuset_cpu_is_isolated(int cpu)
}
EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
+/* Get the set of CPUs marked unavailable. */
+const struct cpumask *cpuset_unavailable_mask(void)
+{
+ return unavailable_cpus;
+}
+
+bool cpuset_cpu_unavailable(int cpu)
+{
+ return cpumask_test_cpu(cpu, unavailable_cpus);
+}
+
/**
* rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
* @parent: Parent cpuset containing all siblings
@@ -2612,6 +2640,53 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
return 0;
}
+/**
+ * update_exclusive_cpumask - update the exclusive_cpus mask of a cpuset
+ * @cs: the cpuset to consider
+ * @trialcs: trial cpuset
+ * @buf: buffer of cpu numbers written to this cpuset
+ *
+ * The tasks' cpumask will be updated if cs is a valid partition root.
+ */
+static int update_unavailable_cpumask(const char *buf)
+{
+ cpumask_var_t tmp;
+ int retval;
+
+ if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
+ return -ENOMEM;
+
+ retval = cpulist_parse(buf, tmp);
+ if (retval < 0)
+ goto out;
+
+ /* Nothing to do if the CPUs didn't change */
+ if (cpumask_equal(tmp, unavailable_cpus))
+ goto out;
+
+ /* Save the CPUs that went unavailable to push task out. */
+ if (cpumask_andnot(available_tmp_mask, tmp, unavailable_cpus))
+ cpu_turned_unavailable = true;
+
+ cpumask_copy(unavailable_cpus, tmp);
+ cpuset_force_rebuild();
+out:
+ free_cpumask_var(tmp);
+ return retval;
+}
+
+static void cpuset_notify_unavailable_cpus(void)
+{
+ /*
+ * Prevent being preempted by the stopper if the local CPU
+ * turned unavailable.
+ */
+ guard(preempt)();
+
+ sched_fair_notify_unavaialable_cpus(available_tmp_mask);
+ cpu_turned_unavailable = false;
+}
+
/*
* Migrate memory region from one set of nodes to another. This is
* performed asynchronously as it can be called from process migration path
@@ -3302,11 +3377,16 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct cpuset *cs = css_cs(of_css(of));
+ int file_type = of_cft(of)->private;
struct cpuset *trialcs;
int retval = -ENODEV;
- /* root is read-only */
- if (cs == &top_cpuset)
+ /* root is read-only; except for unavailable mask */
+ if (file_type != FILE_UNAVAILABLE_CPULIST && cs == &top_cpuset)
+ return -EACCES;
+
+ /* unavailable mask can be only set on root. */
+ if (file_type == FILE_UNAVAILABLE_CPULIST && cs != &top_cpuset)
return -EACCES;
buf = strstrip(buf);
@@ -3330,6 +3410,9 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
case FILE_MEMLIST:
retval = update_nodemask(cs, trialcs, buf);
break;
+ case FILE_UNAVAILABLE_CPULIST:
+ retval = update_unavailable_cpumask(buf);
+ break;
default:
retval = -EINVAL;
break;
@@ -3338,6 +3421,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
free_cpuset(trialcs);
if (force_sd_rebuild)
rebuild_sched_domains_locked();
+ if (cpu_turned_unavailable)
+ cpuset_notify_unavailable_cpus();
out_unlock:
cpuset_full_unlock();
if (of_cft(of)->private == FILE_MEMLIST)
@@ -3386,6 +3471,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
case FILE_ISOLATED_CPULIST:
seq_printf(sf, "%*pbl\n", cpumask_pr_args(isolated_cpus));
break;
+ case FILE_UNAVAILABLE_CPULIST:
+ seq_printf(sf, "%*pbl\n", cpumask_pr_args(unavailable_cpus));
+ break;
default:
ret = -EINVAL;
}
@@ -3524,6 +3612,15 @@ static struct cftype dfl_files[] = {
.flags = CFTYPE_ONLY_ON_ROOT,
},
+ {
+ .name = "cpus.unavailable",
+ .seq_show = cpuset_common_seq_show,
+ .write = cpuset_write_resmask,
+ .max_write_len = (100U + 6 * NR_CPUS),
+ .private = FILE_UNAVAILABLE_CPULIST,
+ .flags = CFTYPE_ONLY_ON_ROOT,
+ },
+
{ } /* terminate */
};
@@ -3814,6 +3911,8 @@ int __init cpuset_init(void)
BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
+ BUG_ON(!zalloc_cpumask_var(&unavailable_cpus, GFP_KERNEL));
+ BUG_ON(!zalloc_cpumask_var(&available_tmp_mask, GFP_KERNEL));
cpumask_setall(top_cpuset.cpus_allowed);
nodes_setall(top_cpuset.mems_allowed);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee7dfbf01792..13d0d9587aca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2396,7 +2396,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
/* Non kernel threads are not allowed during either online or offline. */
if (!(p->flags & PF_KTHREAD))
- return cpu_active(cpu);
+ return (cpu_active(cpu) && !cpuset_cpu_unavailable(cpu));
/* KTHREAD_IS_PER_CPU is always allowed. */
if (kthread_is_per_cpu(p))
@@ -3451,6 +3451,26 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
goto out;
}
+ /*
+ * Only user threads can be forced out of
+ * unavaialable CPUs.
+ */
+ if (p->flags & PF_KTHREAD)
+ goto rude;
+
+ /* Any unavailable CPUs that can run the task? */
+ for_each_cpu(dest_cpu, cpuset_unavailable_mask()) {
+ if (!task_allowed_on_cpu(p, dest_cpu))
+ continue;
+
+ /* Can we hoist this up to goto rude? */
+ if (is_migration_disabled(p))
+ continue;
+
+ if (cpu_active(dest_cpu))
+ goto out;
+ }
+rude:
/* No more Mr. Nice Guy. */
switch (state) {
case cpuset:
@@ -3766,7 +3786,7 @@ bool call_function_single_prep_ipi(int cpu)
* via sched_ttwu_wakeup() for activation so the wakee incurs the cost
* of the wakeup instead of the waker.
*/
-static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
+void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
struct rq *rq = cpu_rq(cpu);
@@ -5365,7 +5385,9 @@ void sched_exec(void)
int dest_cpu;
scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
- dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
+ int wake_flags = WF_EXEC;
+
+ dest_cpu = select_task_rq(p, task_cpu(p), &wake_flags);
if (dest_cpu == smp_processor_id())
return;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..e502cccdae64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12094,6 +12094,61 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
return ld_moved;
}
+static int unavailable_balance_cpu_stop(void *data)
+{
+ struct task_struct *p, *tmp;
+ struct rq *rq = data;
+ int this_cpu = cpu_of(rq);
+
+ guard(rq_lock_irq)(rq);
+
+ list_for_each_entry_safe(p, tmp, &rq->cfs_tasks, se.group_node) {
+ int target_cpu;
+
+ /*
+ * Bail out if a concurrent change to unavailable_mask turned
+ * this CPU available.
+ */
+ rq->unavailable_balance = cpumask_test_cpu(this_cpu, cpuset_unavailable_mask());
+ if (!rq->unavailable_balance)
+ break;
+
+ /* XXX: Does not deal with migration disabled tasks. */
+ target_cpu = cpumask_first_andnot(p->cpus_ptr, cpuset_unavailable_mask());
+ if ((unsigned int)target_cpu < nr_cpumask_bits) {
+ deactivate_task(rq, p, 0);
+ set_task_cpu(p, target_cpu);
+
+ /*
+ * Switch to move_queued_task() later.
+ * For PoC send an IPI and be done with it.
+ */
+ __ttwu_queue_wakelist(p, target_cpu, 0);
+ }
+ }
+
+ rq->unavailable_balance = 0;
+
+ return 0;
+}
+
+void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask)
+{
+ int cpu, this_cpu = smp_processor_id();
+
+ for_each_cpu_wrap(cpu, unavailable_mask, this_cpu + 1) {
+ struct rq *rq = cpu_rq(cpu);
+
+ /* Balance in progress. Tasks will be pushed out. */
+ if (rq->unavailable_balance)
+ return;
+
+ stop_one_cpu_nowait(cpu, unavailable_balance_cpu_stop,
+ rq, &rq->unavailable_balance_work);
+ rq->unavailable_balance = 1;
+ }
+}
+
static inline unsigned long
get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cb80666addec..c21ffb128734 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1221,6 +1221,10 @@ struct rq {
int push_cpu;
struct cpu_stop_work active_balance_work;
+ /* For pushing out taks from unavailable CPUs. */
+ struct cpu_stop_work unavailable_balance_work;
+ int unavailable_balance;
+
/* CPU of this runqueue: */
int cpu;
int online;
@@ -2413,6 +2417,8 @@ extern const u32 sched_prio_to_wmult[40];
#define RETRY_TASK ((void *)-1UL)
+void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
+
struct affinity_context {
const struct cpumask *new_mask;
struct cpumask *user_mask;
base-commit: 5e8f8a25efb277ac6f61f553f0c533ff1402bd7c
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
2025-12-08 4:47 ` K Prateek Nayak
@ 2025-12-08 9:57 ` Shrikanth Hegde
2025-12-08 17:58 ` K Prateek Nayak
0 siblings, 1 reply; 41+ messages in thread
From: Shrikanth Hegde @ 2025-12-08 9:57 UTC (permalink / raw)
To: K Prateek Nayak
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, srikar, gregkh, pbonzini, seanjc, vschneid, iii, huschle,
rostedt, dietmar.eggemann, christophe.leroy, linux-kernel,
linuxppc-dev
Hi Prateek,
Thank you very much for going through the series.
On 12/8/25 10:17 AM, K Prateek Nayak wrote:
> On 11/19/2025 6:14 PM, Shrikanth Hegde wrote:
>> Detailed problem statement and some of the implementation choices were
>> discussed earlier[1].
>>
>> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>
>> This is likely the version which would be used for LPC2025 discussion on
>> this topic. Feel free to provide your suggestion and hoping for a solution
>> that works for different architectures and it's use cases.
>>
>> All the existing alternatives such as cpu hotplug, creating isolated
>> partitions etc break the user affinity. Since number of CPUs to use change
>> depending on the steal time, it is not driven by User. Hence it would be
>> wrong to break the affinity. This series allows if the task is pinned
>> only paravirt CPUs, it will continue running there.
>
> If maintaining task affinity is the only problem that cpusets don't
> offer, attached below is a very naive prototype that seems to work in
> my case without hitting any obvious splats so far.
>
> Idea is to keep task affinity untouched, but remove the CPUs from
> the sched domains.
>
> That way, all the balancing, and wakeups will steer away from these
> CPUs automatically but once the CPUs are put back, the balancing will
> automatically move tasks back.
>
> I tested this with a bunch of spinners and with partitions and both
> seem to work as expected. For real world VM based testing, I pinned 2
> 6C/12C VMs to a 8C/16T LLC with 1:1 pinning - 2 virtual cores from
> either VMs pin to same set of physical cores.
>
> Running 8 groups of perf bench sched messaging on each VM at the same
> time gives the following numbers for total runtime:
>
> All CPUs available in the VM: 88.775s & 91.002s (2 cores overlap)
> Only 4 cores available in the VM: 67.365s & 73.015s (No cores overlap)
>
> Note: The unavailable mask didn't change in my runs. I've noticed a
> bit of delay before the load balancer moves the tasks to the CPU
> going from unavailable to available - your mileage may vary depending
Depends on the scale of the system. I have seen that unfolding is slower
than folding on large systems.
> on the frequency of mask updates.
>
What do you mean by "The unavailable mask didn't change in my runs"?
If it didn't change, how did it take effect?
> Following is the diff on top of tip/master:
>
> (Very raw PoC; Only fair tasks are considered for now to push away)
>
I skimmed through it. It is very close to the current approach.
Advantage:
- Tasks are pushed out immediately instead of waiting for the tick. The current
  approach can also move all the tasks at one tick; the concern there is the
  latency being high and races around the list.
Disadvantages:
- Causes a sched domain rebuild, which is known to be expensive on large systems.
  But since steal time changes are not very aggressive at this point, that
  overhead may be okay.
- Keeping the interface in cpuset may be tricky: there can be multiple cpusets,
  different cgroup versions, and cpusets can be nested. And all of this is not
  user driven - I think cpuset is inherently user driven.
- The implementation looks more complicated to me, at least at this point.
The current PoC needs to be enhanced to allow arch specific triggers. That is doable.
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 2ddb256187b5..7c1cfdd7ffea 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -174,6 +174,10 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> }
>
> extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
> +
> +void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask);
> +const struct cpumask *cpuset_unavailable_mask(void);
> +bool cpuset_cpu_unavailable(int cpu);
> #else /* !CONFIG_CPUSETS */
>
> static inline bool cpusets_enabled(void) { return false; }
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index 337608f408ce..170aba16141e 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -59,6 +59,7 @@ typedef enum {
> FILE_EXCLUSIVE_CPULIST,
> FILE_EFFECTIVE_XCPULIST,
> FILE_ISOLATED_CPULIST,
> + FILE_UNAVAILABLE_CPULIST,
> FILE_CPU_EXCLUSIVE,
> FILE_MEM_EXCLUSIVE,
> FILE_MEM_HARDWALL,
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 4aaad07b0bd1..22d38f2299c4 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -87,6 +87,19 @@ static cpumask_var_t isolated_cpus;
> static cpumask_var_t boot_hk_cpus;
> static bool have_boot_isolcpus;
>
> +/*
> + * CPUs that may be unavailable to run tasks as a result of physical
> + * constraints (vCPU being preempted, pCPU handling interrupt storm).
> + *
> + * Unlike isolated_cpus, the unavailable_cpus are simply excluded from
> + * HK_TYPE_DOMAIN but leave the tasks affinity untouched. These CPUs
> + * should be avoided unless the task has specifically asked to be run
> + * only on these CPUs.
> + */
> +static cpumask_var_t unavailable_cpus;
> +static cpumask_var_t available_tmp_mask; /* For intermediate operations. */
> +static bool cpu_turned_unavailable;
> +
This "unavailable" name is probably not right. When the system boots there is a set of
available CPUs, and that is fixed and not expected to change. It can confuse users.
> /* List of remote partition root children */
> static struct list_head remote_children;
>
> @@ -844,6 +857,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
> }
> cpumask_and(doms[0], top_cpuset.effective_cpus,
> housekeeping_cpumask(HK_TYPE_DOMAIN));
> + cpumask_andnot(doms[0], doms[0], unavailable_cpus);
>
> goto done;
> }
> @@ -960,11 +974,13 @@ static int generate_sched_domains(cpumask_var_t **domains,
> * The top cpuset may contain some boot time isolated
> * CPUs that need to be excluded from the sched domain.
> */
> - if (csa[i] == &top_cpuset)
> + if (csa[i] == &top_cpuset) {
> cpumask_and(doms[i], csa[i]->effective_cpus,
> housekeeping_cpumask(HK_TYPE_DOMAIN));
> - else
> - cpumask_copy(doms[i], csa[i]->effective_cpus);
> + cpumask_andnot(doms[i], doms[i], unavailable_cpus);
> + } else {
> + cpumask_andnot(doms[i], csa[i]->effective_cpus, unavailable_cpus);
> + }
> if (dattr)
> dattr[i] = SD_ATTR_INIT;
> }
> @@ -985,6 +1001,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
> }
> cpumask_or(dp, dp, csa[j]->effective_cpus);
> cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
> + cpumask_andnot(dp, dp, unavailable_cpus);
> if (dattr)
> update_domain_attr_tree(dattr + nslot, csa[j]);
> }
> @@ -1418,6 +1435,17 @@ bool cpuset_cpu_is_isolated(int cpu)
> }
> EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
>
> +/* Get the set of CPUs marked unavailable. */
> +const struct cpumask *cpuset_unavailable_mask(void)
> +{
> + return unavailable_cpus;
> +}
> +
> +bool cpuset_cpu_unavailable(int cpu)
> +{
> + return cpumask_test_cpu(cpu, unavailable_cpus);
> +}
> +
> /**
> * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
> * @parent: Parent cpuset containing all siblings
> @@ -2612,6 +2640,53 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
> return 0;
> }
>
> +/**
> + * update_exclusive_cpumask - update the exclusive_cpus mask of a cpuset
> + * @cs: the cpuset to consider
> + * @trialcs: trial cpuset
> + * @buf: buffer of cpu numbers written to this cpuset
> + *
> + * The tasks' cpumask will be updated if cs is a valid partition root.
> + */
> +static int update_unavailable_cpumask(const char *buf)
> +{
> + cpumask_var_t tmp;
> + int retval;
> +
> + if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
> + return -ENOMEM;
> +
> + retval = cpulist_parse(buf, tmp);
> + if (retval < 0)
> + goto out;
> +
> + /* Nothing to do if the CPUs didn't change */
> + if (cpumask_equal(tmp, unavailable_cpus))
> + goto out;
> +
> + /* Save the CPUs that went unavailable to push task out. */
> + if (cpumask_andnot(available_tmp_mask, tmp, unavailable_cpus))
> + cpu_turned_unavailable = true;
> +
> + cpumask_copy(unavailable_cpus, tmp);
> + cpuset_force_rebuild();
I think this rebuilding of sched domains could add quite a bit of overhead.
> +out:
> + free_cpumask_var(tmp);
> + return retval;
> +}
> +
> +static void cpuset_notify_unavailable_cpus(void)
> +{
> + /*
> + * Prevent being preempted by the stopper if the local CPU
> + * turned unavailable.
> + */
> + guard(preempt)();
> +
> + sched_fair_notify_unavaialable_cpus(available_tmp_mask);
> + cpu_turned_unavailable = false;
> +}
> +
> /*
> * Migrate memory region from one set of nodes to another. This is
> * performed asynchronously as it can be called from process migration path
> @@ -3302,11 +3377,16 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> char *buf, size_t nbytes, loff_t off)
> {
> struct cpuset *cs = css_cs(of_css(of));
> + int file_type = of_cft(of)->private;
> struct cpuset *trialcs;
> int retval = -ENODEV;
>
> - /* root is read-only */
> - if (cs == &top_cpuset)
> + /* root is read-only; except for unavailable mask */
> + if (file_type != FILE_UNAVAILABLE_CPULIST && cs == &top_cpuset)
> + return -EACCES;
> +
> + /* unavailable mask can be only set on root. */
> + if (file_type == FILE_UNAVAILABLE_CPULIST && cs != &top_cpuset)
> return -EACCES;
>
> buf = strstrip(buf);
> @@ -3330,6 +3410,9 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> case FILE_MEMLIST:
> retval = update_nodemask(cs, trialcs, buf);
> break;
> + case FILE_UNAVAILABLE_CPULIST:
> + retval = update_unavailable_cpumask(buf);
> + break;
> default:
> retval = -EINVAL;
> break;
> @@ -3338,6 +3421,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> free_cpuset(trialcs);
> if (force_sd_rebuild)
> rebuild_sched_domains_locked();
> + if (cpu_turned_unavailable)
> + cpuset_notify_unavailable_cpus();
> out_unlock:
> cpuset_full_unlock();
> if (of_cft(of)->private == FILE_MEMLIST)
> @@ -3386,6 +3471,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
> case FILE_ISOLATED_CPULIST:
> seq_printf(sf, "%*pbl\n", cpumask_pr_args(isolated_cpus));
> break;
> + case FILE_UNAVAILABLE_CPULIST:
> + seq_printf(sf, "%*pbl\n", cpumask_pr_args(unavailable_cpus));
> + break;
> default:
> ret = -EINVAL;
> }
> @@ -3524,6 +3612,15 @@ static struct cftype dfl_files[] = {
> .flags = CFTYPE_ONLY_ON_ROOT,
> },
>
> + {
> + .name = "cpus.unavailable",
> + .seq_show = cpuset_common_seq_show,
> + .write = cpuset_write_resmask,
> + .max_write_len = (100U + 6 * NR_CPUS),
> + .private = FILE_UNAVAILABLE_CPULIST,
> + .flags = CFTYPE_ONLY_ON_ROOT,
> + },
> +
> { } /* terminate */
> };
>
> @@ -3814,6 +3911,8 @@ int __init cpuset_init(void)
> BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
> BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
> BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
> + BUG_ON(!zalloc_cpumask_var(&unavailable_cpus, GFP_KERNEL));
> + BUG_ON(!zalloc_cpumask_var(&available_tmp_mask, GFP_KERNEL));
>
> cpumask_setall(top_cpuset.cpus_allowed);
> nodes_setall(top_cpuset.mems_allowed);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ee7dfbf01792..13d0d9587aca 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2396,7 +2396,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>
> /* Non kernel threads are not allowed during either online or offline. */
> if (!(p->flags & PF_KTHREAD))
> - return cpu_active(cpu);
> + return (cpu_active(cpu) && !cpuset_cpu_unavailable(cpu));
>
> /* KTHREAD_IS_PER_CPU is always allowed. */
> if (kthread_is_per_cpu(p))
> @@ -3451,6 +3451,26 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
> goto out;
> }
>
> + /*
> + * Only user threads can be forced out of
> + * unavaialable CPUs.
> + */
> + if (p->flags & PF_KTHREAD)
> + goto rude;
> +
> + /* Any unavailable CPUs that can run the task? */
> + for_each_cpu(dest_cpu, cpuset_unavailable_mask()) {
> + if (!task_allowed_on_cpu(p, dest_cpu))
> + continue;
> +
> + /* Can we hoist this up to goto rude? */
> + if (is_migration_disabled(p))
> + continue;
> +
> + if (cpu_active(dest_cpu))
> + goto out;
> + }
> +rude:
> /* No more Mr. Nice Guy. */
> switch (state) {
> case cpuset:
> @@ -3766,7 +3786,7 @@ bool call_function_single_prep_ipi(int cpu)
> * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
> * of the wakeup instead of the waker.
> */
> -static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> +void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> {
> struct rq *rq = cpu_rq(cpu);
>
> @@ -5365,7 +5385,9 @@ void sched_exec(void)
> int dest_cpu;
>
> scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
> - dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
> + int wake_flags = WF_EXEC;
> +
> + dest_cpu = select_task_rq(p, task_cpu(p), &wake_flags);
What's the logic here?
> if (dest_cpu == smp_processor_id())
> return;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da46c3164537..e502cccdae64 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12094,6 +12094,61 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> return ld_moved;
> }
>
> +static int unavailable_balance_cpu_stop(void *data)
> +{
> + struct task_struct *p, *tmp;
> + struct rq *rq = data;
> + int this_cpu = cpu_of(rq);
> +
> + guard(rq_lock_irq)(rq);
> +
> + list_for_each_entry_safe(p, tmp, &rq->cfs_tasks, se.group_node) {
> + int target_cpu;
> +
> + /*
> + * Bail out if a concurrent change to unavailable_mask turned
> + * this CPU available.
> + */
> + rq->unavailable_balance = cpumask_test_cpu(this_cpu, cpuset_unavailable_mask());
> + if (!rq->unavailable_balance)
> + break;
> +
> + /* XXX: Does not deal with migration disabled tasks. */
> + target_cpu = cpumask_first_andnot(p->cpus_ptr, cpuset_unavailable_mask());
This can cause tasks to always land on the first allowed CPU, with the load balancer
having to move them later. It should first look for a CPU on the node the current CPU
is on, to avoid NUMA costs.
> + if ((unsigned int)target_cpu < nr_cpumask_bits) {
> + deactivate_task(rq, p, 0);
> + set_task_cpu(p, target_cpu);
> +
> + /*
> + * Switch to move_queued_task() later.
> + * For PoC send an IPI and be done with it.
> + */
> + __ttwu_queue_wakelist(p, target_cpu, 0);
> + }
> + }
> +
> + rq->unavailable_balance = 0;
> +
> + return 0;
> +}
> +
> +void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask)
> +{
> + int cpu, this_cpu = smp_processor_id();
> +
> + for_each_cpu_wrap(cpu, unavailable_mask, this_cpu + 1) {
> + struct rq *rq = cpu_rq(cpu);
> +
> + /* Balance in progress. Tasks will be pushed out. */
> + if (rq->unavailable_balance)
> + return;
> +
The stopper needs to run only if there is an active current task; otherwise that work
can be done right here.
> + stop_one_cpu_nowait(cpu, unavailable_balance_cpu_stop,
> + rq, &rq->unavailable_balance_work);
> + rq->unavailable_balance = 1;
> + }
> +}
> +
> static inline unsigned long
> get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
> {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index cb80666addec..c21ffb128734 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1221,6 +1221,10 @@ struct rq {
> int push_cpu;
> struct cpu_stop_work active_balance_work;
>
> + /* For pushing out taks from unavailable CPUs. */
> + struct cpu_stop_work unavailable_balance_work;
> + int unavailable_balance;
> +
> /* CPU of this runqueue: */
> int cpu;
> int online;
> @@ -2413,6 +2417,8 @@ extern const u32 sched_prio_to_wmult[40];
>
> #define RETRY_TASK ((void *)-1UL)
>
> +void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
> +
> struct affinity_context {
> const struct cpumask *new_mask;
> struct cpumask *user_mask;
>
> base-commit: 5e8f8a25efb277ac6f61f553f0c533ff1402bd7c
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
2025-12-08 9:57 ` Shrikanth Hegde
@ 2025-12-08 17:58 ` K Prateek Nayak
0 siblings, 0 replies; 41+ messages in thread
From: K Prateek Nayak @ 2025-12-08 17:58 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, srikar, gregkh, pbonzini, seanjc, vschneid, iii, huschle,
rostedt, dietmar.eggemann, christophe.leroy, linux-kernel,
linuxppc-dev
Hello Shrikanth,
Thank you for taking a look at the PoC.
On 12/8/2025 3:27 PM, Shrikanth Hegde wrote:
> Hi Prateek.
>
> Thank you very much for going throguh the series.
>
> On 12/8/25 10:17 AM, K Prateek Nayak wrote:
>> On 11/19/2025 6:14 PM, Shrikanth Hegde wrote:
>>> Detailed problem statement and some of the implementation choices were
>>> discussed earlier[1].
>>>
>>> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>>
>>> This is likely the version which would be used for LPC2025 discussion on
>>> this topic. Feel free to provide your suggestion and hoping for a solution
>>> that works for different architectures and it's use cases.
>>>
>>> All the existing alternatives such as cpu hotplug, creating isolated
>>> partitions etc break the user affinity. Since number of CPUs to use change
>>> depending on the steal time, it is not driven by User. Hence it would be
>>> wrong to break the affinity. This series allows if the task is pinned
>>> only paravirt CPUs, it will continue running there.
>>
>> If maintaining task affinity is the only problem that cpusets don't
>> offer, attached below is a very naive prototype that seems to work in
>> my case without hitting any obvious splats so far.
>>
>> Idea is to keep task affinity untouched, but remove the CPUs from
>> the sched domains.
>>
>> That way, all the balancing, and wakeups will steer away from these
>> CPUs automatically but once the CPUs are put back, the balancing will
>> automatically move tasks back.
>>
>> I tested this with a bunch of spinners and with partitions and both
>> seem to work as expected. For real world VM based testing, I pinned 2
>> 6C/12C VMs to a 8C/16T LLC with 1:1 pinning - 2 virtual cores from
>> either VMs pin to same set of physical cores.
>>
>> Running 8 groups of perf bench sched messaging on each VM at the same
>> time gives the following numbers for total runtime:
>>
>> All CPUs available in the VM: 88.775s & 91.002s (2 cores overlap)
>> Only 4 cores available in the VM: 67.365s & 73.015s (No cores overlap)
>>
>> Note: The unavailable mask didn't change in my runs. I've noticed a
>> bit of delay before the load balancer moves the tasks to the CPU
>> going from unavailable to available - your mileage may vary depending
>
> Depends on the scale of systems. I have seen it unfolding is slower
> compared to folding on large systems.
>
>> on the frequency of mask updates.
>>
>
> What do you mean "The unavailable mask didn't change in my runs" ?
> If so, how did it take effect?
The unavailable mask was set to the last two cores so that there
is no overlap in the pCPU usage. The mask remained the same throughout
the runtime of the benchmarks - no dynamism in modifying the masks
within the VM.
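Concretely, it was a one-time write in each guest before starting the run, along the
lines of (CPU numbers illustrative; they cover the last two cores of the guest):
  echo "8-11" > /sys/fs/cgroup/cpuset.cpus.unavailable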
>
>> Following is the diff on top of tip/master:
>>
>> (Very raw PoC; Only fair tasks are considered for now to push away)
>>
>
> I skimmed through it. It is very close to the current approach.
>
> Advantage:
> Happens immediately instead of waiting for tick.
> Current approach too can move all the tasks at one tick.
> the concern could be latency being high and races around the list.
>
> Disadvantages:
>
> Causes a sched domain rebuild. Which is known to be expensive on large systems.
> But since steal time changes are not very aggressive at this point, this overhead
> maybe ok.
>
> Keeping the interface in cpuset maybe tricky. there could multiple cpusets, and different versions
> complications too. Specially you can have cpusets in nested fashion. And all of this is
> not user driven. i think cpuset is inherently user driven.
For that reason I kept this mask on the root cgroup only. Putting any
CPU in it is as good as removing it from all partitions.
>
> Impementation looks more complicated to me atleast at this point.
>
> Current poc needs to enhanced to make arch specific triggers. That is doable.
>
>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
>> index 2ddb256187b5..7c1cfdd7ffea 100644
>> --- a/include/linux/cpuset.h
>> +++ b/include/linux/cpuset.h
>> @@ -174,6 +174,10 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>> }
>> extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
>> +
>> +void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask);
>> +const struct cpumask *cpuset_unavailable_mask(void);
>> +bool cpuset_cpu_unavailable(int cpu);
>> #else /* !CONFIG_CPUSETS */
>> static inline bool cpusets_enabled(void) { return false; }
>> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
>> index 337608f408ce..170aba16141e 100644
>> --- a/kernel/cgroup/cpuset-internal.h
>> +++ b/kernel/cgroup/cpuset-internal.h
>> @@ -59,6 +59,7 @@ typedef enum {
>> FILE_EXCLUSIVE_CPULIST,
>> FILE_EFFECTIVE_XCPULIST,
>> FILE_ISOLATED_CPULIST,
>> + FILE_UNAVAILABLE_CPULIST,
>> FILE_CPU_EXCLUSIVE,
>> FILE_MEM_EXCLUSIVE,
>> FILE_MEM_HARDWALL,
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 4aaad07b0bd1..22d38f2299c4 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -87,6 +87,19 @@ static cpumask_var_t isolated_cpus;
>> static cpumask_var_t boot_hk_cpus;
>> static bool have_boot_isolcpus;
>> +/*
>> + * CPUs that may be unavailable to run tasks as a result of physical
>> + * constraints (vCPU being preempted, pCPU handling interrupt storm).
>> + *
>> + * Unlike isolated_cpus, the unavailable_cpus are simply excluded from
>> + * HK_TYPE_DOMAIN but leave the tasks affinity untouched. These CPUs
>> + * should be avoided unless the task has specifically asked to be run
>> + * only on these CPUs.
>> + */
>> +static cpumask_var_t unavailable_cpus;
>> +static cpumask_var_t available_tmp_mask; /* For intermediate operations. */
>> +static bool cpu_turned_unavailable;
>> +
>
> This unavailable name is not probably right. When system boots, there is available_cpu
> and that is fixed and not expected to change. It can confuse users.
Ack! Just a name that I thought was appropriate; not much
thought was put into it ;)
>
>> /* List of remote partition root children */
>> static struct list_head remote_children;
>> @@ -844,6 +857,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
>> }
>> cpumask_and(doms[0], top_cpuset.effective_cpus,
>> housekeeping_cpumask(HK_TYPE_DOMAIN));
>> + cpumask_andnot(doms[0], doms[0], unavailable_cpus);
>> goto done;
>> }
>> @@ -960,11 +974,13 @@ static int generate_sched_domains(cpumask_var_t **domains,
>> * The top cpuset may contain some boot time isolated
>> * CPUs that need to be excluded from the sched domain.
>> */
>> - if (csa[i] == &top_cpuset)
>> + if (csa[i] == &top_cpuset) {
>> cpumask_and(doms[i], csa[i]->effective_cpus,
>> housekeeping_cpumask(HK_TYPE_DOMAIN));
>> - else
>> - cpumask_copy(doms[i], csa[i]->effective_cpus);
>> + cpumask_andnot(doms[i], doms[i], unavailable_cpus);
>> + } else {
>> + cpumask_andnot(doms[i], csa[i]->effective_cpus, unavailable_cpus);
>> + }
>> if (dattr)
>> dattr[i] = SD_ATTR_INIT;
>> }
>> @@ -985,6 +1001,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
>> }
>> cpumask_or(dp, dp, csa[j]->effective_cpus);
>> cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
>> + cpumask_andnot(dp, dp, unavailable_cpus);
>> if (dattr)
>> update_domain_attr_tree(dattr + nslot, csa[j]);
>> }
>> @@ -1418,6 +1435,17 @@ bool cpuset_cpu_is_isolated(int cpu)
>> }
>> EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
>> +/* Get the set of CPUs marked unavailable. */
>> +const struct cpumask *cpuset_unavailable_mask(void)
>> +{
>> + return unavailable_cpus;
>> +}
>> +
>> +bool cpuset_cpu_unavailable(int cpu)
>> +{
>> + return cpumask_test_cpu(cpu, unavailable_cpus);
>> +}
>> +
>> /**
>> * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
>> * @parent: Parent cpuset containing all siblings
>> @@ -2612,6 +2640,53 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>> return 0;
>> }
>> +/**
>> + * update_exclusive_cpumask - update the exclusive_cpus mask of a cpuset
>> + * @cs: the cpuset to consider
>> + * @trialcs: trial cpuset
>> + * @buf: buffer of cpu numbers written to this cpuset
>> + *
>> + * The tasks' cpumask will be updated if cs is a valid partition root.
>> + */
>> +static int update_unavailable_cpumask(const char *buf)
>> +{
>> + cpumask_var_t tmp;
>> + int retval;
>> +
>> + if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
>> + return -ENOMEM;
>> +
>> + retval = cpulist_parse(buf, tmp);
>> + if (retval < 0)
>> + goto out;
>> +
>> + /* Nothing to do if the CPUs didn't change */
>> + if (cpumask_equal(tmp, unavailable_cpus))
>> + goto out;
>> +
>> + /* Save the CPUs that went unavailable to push task out. */
>> + if (cpumask_andnot(available_tmp_mask, tmp, unavailable_cpus))
>> + cpu_turned_unavailable = true;
>> +
>> + cpumask_copy(unavailable_cpus, tmp);
>> + cpuset_force_rebuild();
>
> I think this rebuilding sched domains could add quite overhead.
I agree! But I somewhat dislike putting a cpumask_and() in a
bunch of places where we deal with sched domains when we can
simply adjust the sched_domain to account for it - it is
definitely not performant, but IMO it is somewhat cleaner.
But if CPUs are transitioning in and out of the paravirt mask
at such a high rate, wouldn't you just end up pushing the
tasks away only to pull them back soon after?
What changes so suddenly in the hypervisor that a paravirt
CPU is now fully available after a second or two?
On a side note, we do have vcpu_is_preempted() - isn't that
sufficient to steer tasks away if we start being a bit more
aggressive about it? Do we need a mask?
>
>> +out:
>> + free_cpumask_var(tmp);
>> + return retval;
>> +}
>> +
>> +static void cpuset_notify_unavailable_cpus(void)
>> +{
>> + /*
>> + * Prevent being preempted by the stopper if the local CPU
>> + * turned unavailable.
>> + */
>> + guard(preempt)();
>> +
>> + sched_fair_notify_unavaialable_cpus(available_tmp_mask);
>> + cpu_turned_unavailable = false;
>> +}
>> +
>> /*
>> * Migrate memory region from one set of nodes to another. This is
>> * performed asynchronously as it can be called from process migration path
>> @@ -3302,11 +3377,16 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>> char *buf, size_t nbytes, loff_t off)
>> {
>> struct cpuset *cs = css_cs(of_css(of));
>> + int file_type = of_cft(of)->private;
>> struct cpuset *trialcs;
>> int retval = -ENODEV;
>> - /* root is read-only */
>> - if (cs == &top_cpuset)
>> + /* root is read-only; except for unavailable mask */
>> + if (file_type != FILE_UNAVAILABLE_CPULIST && cs == &top_cpuset)
>> + return -EACCES;
>> +
>> + /* unavailable mask can be only set on root. */
>> + if (file_type == FILE_UNAVAILABLE_CPULIST && cs != &top_cpuset)
>> return -EACCES;
>> buf = strstrip(buf);
>> @@ -3330,6 +3410,9 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>> case FILE_MEMLIST:
>> retval = update_nodemask(cs, trialcs, buf);
>> break;
>> + case FILE_UNAVAILABLE_CPULIST:
>> + retval = update_unavailable_cpumask(buf);
>> + break;
>> default:
>> retval = -EINVAL;
>> break;
>> @@ -3338,6 +3421,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>> free_cpuset(trialcs);
>> if (force_sd_rebuild)
>> rebuild_sched_domains_locked();
>> + if (cpu_turned_unavailable)
>> + cpuset_notify_unavailable_cpus();
>> out_unlock:
>> cpuset_full_unlock();
>> if (of_cft(of)->private == FILE_MEMLIST)
>> @@ -3386,6 +3471,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
>> case FILE_ISOLATED_CPULIST:
>> seq_printf(sf, "%*pbl\n", cpumask_pr_args(isolated_cpus));
>> break;
>> + case FILE_UNAVAILABLE_CPULIST:
>> + seq_printf(sf, "%*pbl\n", cpumask_pr_args(unavailable_cpus));
>> + break;
>> default:
>> ret = -EINVAL;
>> }
>> @@ -3524,6 +3612,15 @@ static struct cftype dfl_files[] = {
>> .flags = CFTYPE_ONLY_ON_ROOT,
>> },
>> + {
>> + .name = "cpus.unavailable",
>> + .seq_show = cpuset_common_seq_show,
>> + .write = cpuset_write_resmask,
>> + .max_write_len = (100U + 6 * NR_CPUS),
>> + .private = FILE_UNAVAILABLE_CPULIST,
>> + .flags = CFTYPE_ONLY_ON_ROOT,
>> + },
>> +
>> { } /* terminate */
>> };
>> @@ -3814,6 +3911,8 @@ int __init cpuset_init(void)
>> BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
>> BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
>> BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
>> + BUG_ON(!zalloc_cpumask_var(&unavailable_cpus, GFP_KERNEL));
>> + BUG_ON(!zalloc_cpumask_var(&available_tmp_mask, GFP_KERNEL));
>> cpumask_setall(top_cpuset.cpus_allowed);
>> nodes_setall(top_cpuset.mems_allowed);
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index ee7dfbf01792..13d0d9587aca 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2396,7 +2396,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>> /* Non kernel threads are not allowed during either online or offline. */
>> if (!(p->flags & PF_KTHREAD))
>> - return cpu_active(cpu);
>> + return (cpu_active(cpu) && !cpuset_cpu_unavailable(cpu));
>> /* KTHREAD_IS_PER_CPU is always allowed. */
>> if (kthread_is_per_cpu(p))
>> @@ -3451,6 +3451,26 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>> goto out;
>> }
>> + /*
>> + * Only user threads can be forced out of
>> + * unavaialable CPUs.
>> + */
>> + if (p->flags & PF_KTHREAD)
>> + goto rude;
>> +
>> + /* Any unavailable CPUs that can run the task? */
>> + for_each_cpu(dest_cpu, cpuset_unavailable_mask()) {
>> + if (!task_allowed_on_cpu(p, dest_cpu))
>> + continue;
>> +
>> + /* Can we hoist this up to goto rude? */
>> + if (is_migration_disabled(p))
>> + continue;
>> +
>> + if (cpu_active(dest_cpu))
>> + goto out;
>> + }
>> +rude:
>> /* No more Mr. Nice Guy. */
>> switch (state) {
>> case cpuset:
>> @@ -3766,7 +3786,7 @@ bool call_function_single_prep_ipi(int cpu)
>> * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
>> * of the wakeup instead of the waker.
>> */
>> -static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>> +void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>> {
>> struct rq *rq = cpu_rq(cpu);
>> @@ -5365,7 +5385,9 @@ void sched_exec(void)
>> int dest_cpu;
>> scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
>> - dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
>> + int wake_flags = WF_EXEC;
>> +
>> + dest_cpu = select_task_rq(p, task_cpu(p), &wake_flags);
>
> Whats this logic?
The WF_EXEC path would not care about the unavailable CPUs and won't hit
the select_fallback_rq() path if sched_class->select_task_rq() is
called directly.
>
>> if (dest_cpu == smp_processor_id())
>> return;
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index da46c3164537..e502cccdae64 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12094,6 +12094,61 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>> return ld_moved;
>> }
>> +static int unavailable_balance_cpu_stop(void *data)
>> +{
>> + struct task_struct *p, *tmp;
>> + struct rq *rq = data;
>> + int this_cpu = cpu_of(rq);
>> +
>> + guard(rq_lock_irq)(rq);
>> +
>> + list_for_each_entry_safe(p, tmp, &rq->cfs_tasks, se.group_node) {
>> + int target_cpu;
>> +
>> + /*
>> + * Bail out if a concurrent change to unavailable_mask turned
>> + * this CPU available.
>> + */
>> + rq->unavailable_balance = cpumask_test_cpu(this_cpu, cpuset_unavailable_mask());
>> + if (!rq->unavailable_balance)
>> + break;
>> +
>> + /* XXX: Does not deal with migration disabled tasks. */
>> + target_cpu = cpumask_first_andnot(p->cpus_ptr, cpuset_unavailable_mask());
>
> This can cause it to go first CPU always and then load balancer to move it later on.
> First should check the nodemask the current cpu is on to avoid NUMA costs.
Ack! I agree there is plenty of room for optimizations.
>
>> + if ((unsigned int)target_cpu < nr_cpumask_bits) {
>> + deactivate_task(rq, p, 0);
>> + set_task_cpu(p, target_cpu);
>> +
>> + /*
>> + * Switch to move_queued_task() later.
>> + * For PoC send an IPI and be done with it.
>> + */
>> + __ttwu_queue_wakelist(p, target_cpu, 0);
>> + }
>> + }
>> +
>> + rq->unavailable_balance = 0;
>> +
>> + return 0;
>> +}
>> +
>> +void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask)
>> +{
>> + int cpu, this_cpu = smp_processor_id();
>> +
>> + for_each_cpu_wrap(cpu, unavailable_mask, this_cpu + 1) {
>> + struct rq *rq = cpu_rq(cpu);
>> +
>> + /* Balance in progress. Tasks will be pushed out. */
>> + if (rq->unavailable_balance)
>> + return;
>> +
>
> Need to run stopper, if there is active current task. otherise that work
> can be done here itself.
Ack! My thinking was to not take the rq_lock early, and instead let the stopper
run and push all the queued fair tasks out with the rq_lock held.
>
>> + stop_one_cpu_nowait(cpu, unavailable_balance_cpu_stop,
>> + rq, &rq->unavailable_balance_work);
>> + rq->unavailable_balance = 1;
>> + }
>> +}
>> +
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
2025-12-05 5:30 ` Shrikanth Hegde
@ 2025-12-15 17:39 ` Yury Norov
2025-12-18 5:22 ` Shrikanth Hegde
0 siblings, 1 reply; 41+ messages in thread
From: Yury Norov @ 2025-12-15 17:39 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ilya Leoshkevich, linux-kernel, linuxppc-dev, mingo, peterz,
juri.lelli, vincent.guittot, tglx, maddy, srikar, gregkh,
pbonzini, seanjc, kprateek.nayak, vschneid, huschle, rostedt,
dietmar.eggemann, christophe.leroy, linux-s390
On Fri, Dec 05, 2025 at 11:00:18AM +0530, Shrikanth Hegde wrote:
>
>
> On 12/4/25 6:58 PM, Ilya Leoshkevich wrote:
> > On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:
...
> > Others have already commented on the naming, and I would agree that
> > "paravirt" is really misleading. I cannot say that the previous "cpu-
> > avoid" one was perfect, but it was much better.
It was my suggestion to switch names. cpu-avoid is definitely a
no-go, because it doesn't explain anything and only confuses.
I suggested 'paravirt' (notice - only suggested) because the patch
series is mainly discussing paravirtualized VMs. But now I'm not even
sure that the idea of the series is:
1. Applicable only to paravirtualized VMs; and
2. That preemption and rescheduling throttling requires another in-kernel
concept beyond nohz, isolcpus, cgroups and similar.
Shrikanth, can you please clarify the scope of the new feature? Would
it be useful for non-paravirtualized VMs, for example? Any other
task-cpu bonding problems?
On previous rounds you tried to implement the same with cgroups, as
far as I understood. Can you discuss that? What exactly can't be done
with the existing kernel APIs?
Thanks,
Yury
> > [1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/
>
> Will look into it. one thing to to be careful are CPU numbers.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
2025-12-15 17:39 ` Yury Norov
@ 2025-12-18 5:22 ` Shrikanth Hegde
0 siblings, 0 replies; 41+ messages in thread
From: Shrikanth Hegde @ 2025-12-18 5:22 UTC (permalink / raw)
To: Yury Norov, vincent.guittot
Cc: Ilya Leoshkevich, linux-kernel, linuxppc-dev, mingo, peterz,
juri.lelli, tglx, maddy, srikar, gregkh, pbonzini, seanjc,
kprateek.nayak, vschneid, huschle, rostedt, dietmar.eggemann,
christophe.leroy, linux-s390
Hi, sorry for the delay in responding; I just landed back from LPC yesterday.
>>> Others have already commented on the naming, and I would agree that
>>> "paravirt" is really misleading. I cannot say that the previous "cpu-
>>> avoid" one was perfect, but it was much better.
>
> It was my suggestion to switch names. cpu-avoid is definitely a
> no-go. Because it doesn't explain anything and only confuses.
>
> I suggested 'paravirt' (notice - only suggested) because the patch
> series is mainly discussing paravirtualized VMs. But now I'm not even
> sure that the idea of the series is:
>
> 1. Applicable only to paravirtualized VMs; and
> 2. Preemption and rescheduling throttling requires another in-kernel
> concept other than nohs, isolcpus, cgroups and similar.
>
> Shrikanth, can you please clarify the scope of the new feature? Would
> it be useful for non-paravirtualized VMs, for example? Any other
> task-cpu bonding problems?
The current scope of the feature is virtualized environments, where the idea is
to do co-operative folding in each VM based on a hint (either a HW hint or steal time).
Seen at a macro level, this is a framework which allows one to avoid some vCPUs (in the
guest) to achieve better throughput or latency. So one could come up with more use cases,
even in non-paravirtualized VMs. For example, one crazy idea would be to avoid using SMT
siblings when system utilization is low, to achieve a higher IPC (instructions per cycle).
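For comparison, the only existing knob for that is the system-wide one, which offlines
the sibling threads and therefore has the same affinity-breaking problem as vCPU hotplug:
  echo off > /sys/devices/system/cpu/smt/control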
>
> On previous rounds you tried to implement the same with cgroups, as
> far as I understood. Can you discuss that? What exactly can't be done
> with the existing kernel APIs?
>
> Thanks,
> Yury
>
We discussed this in Sched-MC this year.
https://youtu.be/zf-MBoUIz1Q?t=8581
Options explored so far:
1. CPU hotplug - slow. Some efforts are underway to speed it up.
2. Creating isolated cpusets - faster, but still involves sched domain rebuilds.
The reason why they both won't work here is that they break user affinities in the guest.
I.e. the guest can do "taskset -c <some_vcpus> <workload>"; when the last vCPU in that list
goes offline (guest vCPU hotplug), the affinity mask is reset, the workload can then run on
any online vCPU, and the mask is not set back to its earlier value when the vCPUs return.
That is okay for hotplug or isolated cpusets, since those are driven by the user in the
guest, so the user is aware of it. Whereas here the change is driven by the system rather
than the user in the guest, so it cannot break user-space affinities.
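As an illustration of the hotplug case (CPU numbers and workload are made up; this is the
stock kernel behaviour when a task's last allowed CPU goes offline):
  # inside the guest
  taskset -c 4-7 ./workload &
  for c in 4 5 6 7; do echo 0 > /sys/devices/system/cpu/cpu$c/online; done
  # the kernel now resets the task's affinity to all online vCPUs, and the
  # old 4-7 mask is not restored when those vCPUs come back online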
So we need a new interface to drive this. I think it is better if it is a
non-cgroup-based framework, since cgroups are usually user driven
(correct me if I am wrong).
PS:
There was some confusion around this affinity breaking. Note that it is the guest vCPUs
being marked and the guest vCPUs being hotplugged; the task-affined workload was running
in the guest. Host CPUs (pCPUs) are not hotplugged.
---
I had a hallway discussion with Vincent; the idea is to use the push framework bits, set
the CPU capacity to 1 (the lowest value, treated as a special value) and use a static key
check to do this only when the HW says so.
Something like this (keeping the name paravirt for now):
static inline bool cpu_paravirt(int cpu)
{
if (static_branch_unlikely(&cpu_paravirt_framework))
return arch_scale_cpu_capacity(cpu) == 1;
return false;
}
The rest of the bits remain the same. I found an issue with the current series where
setting the affinity goes wrong after a CPU is marked paravirt; I will fix it in the next
version. I will do some more testing and send the next version in 2026.
Happy Holidays!
^ permalink raw reply [flat|nested] 41+ messages in thread
end of thread
Thread overview: 41+ messages:
2025-11-19 12:44 [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 02/17] cpumask: Introduce cpu_paravirt_mask Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 03/17] sched/core: Dont allow to use CPU marked as paravirt Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 04/17] sched/debug: Remove unused schedstats Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 05/17] sched/fair: Add paravirt movements for proc sched file Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 06/17] sched/fair: Pass current cpu in select_idle_sibling Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 07/17] sched/fair: Don't consider paravirt CPUs for wakeup and load balance Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 08/17] sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 09/17] sched/core: Add support for nohz_full CPUs Shrikanth Hegde
2025-11-21 3:16 ` K Prateek Nayak
2025-11-21 4:40 ` Shrikanth Hegde
2025-11-24 4:36 ` K Prateek Nayak
2025-11-19 12:44 ` [PATCH 10/17] sched/core: Push current task from paravirt CPU Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 11/17] sysfs: Add paravirt CPU file Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 12/17] powerpc: method to initialize ec and vp cores Shrikanth Hegde
2025-11-21 8:29 ` kernel test robot
2025-11-21 10:14 ` kernel test robot
2025-11-19 12:44 ` [PATCH 13/17] powerpc: enable/disable paravirt CPUs based on steal time Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 14/17] powerpc: process steal values at fixed intervals Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 15/17] powerpc: add debugfs file for controlling handling on steal values Shrikanth Hegde
2025-11-19 12:44 ` [PATCH 16/17] sysfs: Provide write method for paravirt Shrikanth Hegde
2025-11-24 17:04 ` Greg KH
2025-11-24 17:24 ` Steven Rostedt
2025-11-25 2:49 ` Shrikanth Hegde
2025-11-25 15:52 ` Steven Rostedt
2025-11-25 16:02 ` Konstantin Ryabitsev
2025-11-25 16:08 ` Steven Rostedt
2025-11-19 12:44 ` [PATCH 17/17] sysfs: disable arch handling if paravirt file being written Shrikanth Hegde
2025-11-24 17:05 ` [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Greg KH
2025-11-25 2:39 ` Shrikanth Hegde
2025-11-25 7:48 ` Christophe Leroy (CS GROUP)
2025-11-25 8:48 ` Shrikanth Hegde
2025-11-27 10:44 ` Shrikanth Hegde
2025-12-04 13:28 ` Ilya Leoshkevich
2025-12-05 5:30 ` Shrikanth Hegde
2025-12-15 17:39 ` Yury Norov
2025-12-18 5:22 ` Shrikanth Hegde
2025-12-08 4:47 ` K Prateek Nayak
2025-12-08 9:57 ` Shrikanth Hegde
2025-12-08 17:58 ` K Prateek Nayak