LinuxPPC-Dev Archive on lore.kernel.org
* [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption
@ 2025-11-19  6:20 Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept Shrikanth Hegde
                   ` (17 more replies)
  0 siblings, 18 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

Detailed problem statement and some of the implementation choices were 
discussed earlier[1].

[1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/

This is likely the version that will be used for the LPC2025 discussion on
this topic. Suggestions are welcome; the hope is to arrive at a solution
that works for different architectures and their use cases.

All the existing alternatives, such as CPU hotplug or creating isolated
partitions, break user affinity. Since the number of CPUs to use changes
depending on the steal time, the change is not driven by the user. Hence it
would be wrong to break the affinity. With this series, a task pinned only
to paravirt CPUs will continue running there.
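
As a rough illustration of the affinity rule above, here is a minimal
userspace sketch. It is not kernel code: the 64-bit bitmask stands in for a
cpumask, and model_task_can_move() is a hypothetical helper name. The idea
is that a task is only a candidate for being moved when its affinity
contains at least one CPU that is not marked paravirt.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model, not kernel code: bit N set means CPU N.
 * A task can be moved off paravirt CPUs only if its affinity mask
 * contains at least one CPU that is not marked paravirt. Otherwise
 * it stays where it is, so user affinity is never broken.
 */
static bool model_task_can_move(uint64_t cpus_allowed, uint64_t paravirt_mask)
{
	return (cpus_allowed & ~paravirt_mask) != 0;
}
```

With CPUs 2 and 3 marked paravirt (mask 0xC), a task allowed on CPUs 0-3
can be moved, while a task pinned to CPUs 2-3 cannot and keeps running there.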

Changes compared to v3[1]:

- Introduced computation of steal time in powerpc code.
- Derive number of CPUs to use and mark the remaining as paravirt based
  on steal values. 
- Provide debugfs knobs to alter how steal time values are used.
- Removed static key check for paravirt CPUs (Yury)
- Removed preempt_disable/enable while calling stopper (Prateek)
- Made select_idle_sibling and friends aware of paravirt CPUs.
- Removed 3 unused schedstat fields and introduced 2 related to paravirt
  handling.
- Handled the nohz_full case by keeping the tick enabled on such a CPU
  while there is CFS/RT on it.
- Updated helper patch to override arch behaviour for easier debugging
  during development.

TODO: 

- Get performance numbers on PowerPC, x86 and S390. Hopefully by next
  week. Didn't want to hold the series till then.

- The selection of CPUs to mark as paravirt is very simple and doesn't work
  when vCPUs aren't spread out uniformly across NUMA nodes. Ideally the
  numbers would be split based on how many CPUs each NUMA node has. That is
  quite tricky to do, especially since a cpumask can be on the stack too,
  given that NR_CPUS can be 8192 and nr_possible_nodes 32. Haven't got my
  head into solving it yet. Maybe there is an easier way.

- DLPAR Add/Remove needs to call init of EC/VP cores (powerpc specific)

- Userspace tools awareness such as irqbalance. 

- Delve into the design of a hint from the Hypervisor (HW Hint), i.e. the
  host informs the guest which/how many CPUs it should use at this moment.
  This interface should work across archs with each arch doing its specific
  handling.

- Determine the default values for steal time related knobs
  empirically and document them.

- Need to check safety against CPU hotplug, especially in process_steal.


Applies cleanly on tip/master:
commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b


Thanks to srikar for providing the initial powerpc steal time handling
code. Thanks to all who went through the series and provided reviews.

PS: I haven't found a better name. Please suggest if you have any.

Shrikanth Hegde (17):
  sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
  cpumask: Introduce cpu_paravirt_mask
  sched/core: Dont allow to use CPU marked as paravirt
  sched/debug: Remove unused schedstats
  sched/fair: Add paravirt movements for proc sched file
  sched/fair: Pass current cpu in select_idle_sibling
  sched/fair: Don't consider paravirt CPUs for wakeup and load balance
  sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task
  sched/core: Add support for nohz_full CPUs
  sched/core: Push current task from paravirt CPU
  sysfs: Add paravirt CPU file
  powerpc: method to initialize ec and vp cores
  powerpc: enable/disable paravirt CPUs based on steal time
  powerpc: process steal values at fixed intervals
  powerpc: add debugfs file for controlling handling on steal values
  sysfs: Provide write method for paravirt
  helper: disable arch handling if paravirt file being written

 .../ABI/testing/sysfs-devices-system-cpu      |   9 +
 Documentation/scheduler/sched-arch.rst        |  37 +++
 arch/powerpc/include/asm/smp.h                |   1 +
 arch/powerpc/kernel/smp.c                     |   1 +
 arch/powerpc/platforms/pseries/lpar.c         | 223 ++++++++++++++++++
 arch/powerpc/platforms/pseries/pseries.h      |   1 +
 drivers/base/cpu.c                            |  60 ++++-
 include/linux/cpumask.h                       |  20 ++
 include/linux/sched.h                         |   9 +-
 kernel/sched/core.c                           | 106 ++++++++-
 kernel/sched/debug.c                          |   5 +-
 kernel/sched/fair.c                           |  42 +++-
 kernel/sched/rt.c                             |  11 +-
 kernel/sched/sched.h                          |   9 +
 14 files changed, 519 insertions(+), 15 deletions(-)

-- 
2.47.3




* [RFC PATCH v4 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 02/17] cpumask: Introduce cpu_paravirt_mask Shrikanth Hegde
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

Add documentation for the new cpumask called cpu_paravirt_mask. This should
help users understand what this mask is and the concept behind it.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/scheduler/sched-arch.rst | 37 ++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..6972c295013d 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,43 @@ Your cpu_idle routines need to obey the following rules:
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
 
+Paravirt CPUs
+=============
+
+In virtualised environments it is possible to overcommit CPU resources,
+i.e. the sum of virtual CPUs (vCPUs) of all VMs is greater than the number of
+physical CPUs (pCPUs). Under such conditions, when all or many VMs have high
+utilization, the hypervisor can't satisfy the CPU requirement and has to
+context switch within or across VMs, i.e. the hypervisor needs to preempt one
+vCPU to run another. This is called vCPU preemption. It is more expensive
+than a task context switch within a vCPU.
+
+In such cases it is better that VMs co-ordinate among themselves and ask for
+less CPU by not using some of the vCPUs. Such vCPUs, where workload can be
+avoided at the moment to reduce vCPU preemption, are called "Paravirt CPUs".
+Note that when the pCPU contention goes away, these vCPUs can be used again
+by the workload.
+
+Arch code needs to set/unset the specific vCPU in cpu_paravirt_mask. When set,
+the vCPU is avoided; when unset, it is used as usual.
+
+The scheduler will try to avoid paravirt vCPUs as much as it can.
+This is achieved by:
+1. Not selecting a paravirt CPU at wakeup.
+2. Pushing the task away from a paravirt CPU at tick.
+3. Not selecting a paravirt CPU at load balance.
+
+This works only for SCHED_RT and SCHED_NORMAL. SCHED_EXT and userspace can make
+choices accordingly using cpu_paravirt_mask.
+
+/sys/devices/system/cpu/paravirt prints the current cpu_paravirt_mask in
+cpulist format.
+
+Notes:
+1. A task pinned only on paravirt CPUs will continue to run there.
+2. This feature is available under CONFIG_PARAVIRT.
+3. Refer to PowerPC for the architecture implementation side.
+4. Tasks running on isolated CPUs are not pushed out.
 
 Possible arch/ problems
 =======================
-- 
2.47.3




* [RFC PATCH v4 02/17] cpumask: Introduce cpu_paravirt_mask
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 03/17] sched/core: Dont allow to use CPU marked as paravirt Shrikanth Hegde
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

This patch:
- Declares and defines cpu_paravirt_mask.
- Adds get/set helpers for it.

Values are set by arch code and consumed by the scheduler.
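
For readers who want to see the producer/consumer split in isolation, here
is a small userspace model. It is only a sketch: the 64-bit word stands in
for struct cpumask, and the model_* names are hypothetical mirrors of
set_cpu_paravirt() and cpu_paravirt(), not the kernel helpers themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU, 64 CPUs max. */
#define MODEL_NR_CPUS 64

static uint64_t model_paravirt_mask;	/* models __cpu_paravirt_mask */

/* Models set_cpu_paravirt(cpu, paravirt): arch code marks/unmarks a vCPU. */
static void model_set_cpu_paravirt(unsigned int cpu, bool paravirt)
{
	assert(cpu < MODEL_NR_CPUS);
	if (paravirt)
		model_paravirt_mask |= UINT64_C(1) << cpu;
	else
		model_paravirt_mask &= ~(UINT64_C(1) << cpu);
}

/* Models cpu_paravirt(cpu): the read-only query used by the scheduler. */
static bool model_cpu_paravirt(unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	return (model_paravirt_mask >> cpu) & 1;
}
```

Arch code would be the only caller of the setter, while scheduler paths
only ever query the mask.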

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/cpumask.h | 20 ++++++++++++++++++++
 kernel/sched/core.c     |  5 +++++
 2 files changed, 25 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index ff8f41ab7ce6..079903851341 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -1270,6 +1270,26 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 
 #endif /* NR_CPUS > 1 */
 
+/*
+ * All related wrappers kept together to avoid too many ifdefs
+ * See Documentation/scheduler/sched-arch.rst for details
+ */
+#ifdef CONFIG_PARAVIRT
+extern struct cpumask __cpu_paravirt_mask;
+#define cpu_paravirt_mask    ((const struct cpumask *)&__cpu_paravirt_mask)
+#define set_cpu_paravirt(cpu, paravirt) assign_cpu((cpu), &__cpu_paravirt_mask, (paravirt))
+
+static __always_inline bool cpu_paravirt(unsigned int cpu)
+{
+	return cpumask_test_cpu(cpu, cpu_paravirt_mask);
+}
+#else
+static __always_inline bool cpu_paravirt(unsigned int cpu)
+{
+	return false;
+}
+#endif
+
 #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
 
 #if NR_CPUS <= BITS_PER_LONG
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9f10cfbdc228..40db5e659994 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10852,3 +10852,8 @@ void sched_change_end(struct sched_change_ctx *ctx)
 		p->sched_class->prio_changed(rq, p, ctx->prio);
 	}
 }
+
+#ifdef CONFIG_PARAVIRT
+struct cpumask __cpu_paravirt_mask __read_mostly;
+EXPORT_SYMBOL(__cpu_paravirt_mask);
+#endif
-- 
2.47.3




* [RFC PATCH v4 03/17] sched/core: Dont allow to use CPU marked as paravirt
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 02/17] cpumask: Introduce cpu_paravirt_mask Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 04/17] sched/debug: Remove unused schedstats Shrikanth Hegde
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

Don't allow a paravirt CPU to be used while looking for a CPU to use.

The push task mechanism uses the stopper thread, which calls
select_fallback_rq; that path relies on this check to avoid picking a
paravirt CPU.
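
The resulting decision order can be sketched as a small userspace model.
This is an illustration only: struct cpu_state and struct task_kind are
hypothetical flattened inputs, and the sketch covers just the branches
touched by the hunk below, leaving out the earlier affinity and
migration-disabled checks of is_cpu_allowed().

```c
#include <stdbool.h>

/* Hypothetical, flattened view of the per-CPU state that is consulted. */
struct cpu_state {
	bool online;
	bool active;
	bool dying;
	bool paravirt;
};

/* Hypothetical, flattened view of the task properties that matter here. */
struct task_kind {
	bool kthread;		/* models p->flags & PF_KTHREAD */
	bool per_cpu_kthread;	/* models kthread_is_per_cpu(p) */
};

/* Mirrors only the branches visible in the hunk; earlier checks omitted. */
static bool model_is_cpu_allowed(const struct task_kind *t,
				 const struct cpu_state *c)
{
	/* User threads: never on a paravirt CPU, otherwise need an active CPU. */
	if (!t->kthread)
		return !c->paravirt && c->active;

	/* Per-CPU kthreads are always allowed on an online CPU. */
	if (t->per_cpu_kthread)
		return c->online;

	/* Other kthreads stay away from dying and paravirt CPUs. */
	if (c->dying || c->paravirt)
		return false;

	return c->online;
}
```

Note that per-CPU kthreads remain allowed on a paravirt CPU in this model,
matching the ordering of the checks in the hunk.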

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 40db5e659994..90fc04d84b74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2397,8 +2397,13 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 		return cpu_online(cpu);
 
 	/* Non kernel threads are not allowed during either online or offline. */
-	if (!(p->flags & PF_KTHREAD))
-		return cpu_active(cpu);
+	if (!(p->flags & PF_KTHREAD)) {
+		/* A user thread shouldn't be allowed on a paravirt cpu */
+		if (cpu_paravirt(cpu))
+			return false;
+		else
+			return cpu_active(cpu);
+	}
 
 	/* KTHREAD_IS_PER_CPU is always allowed. */
 	if (kthread_is_per_cpu(p))
@@ -2408,6 +2413,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (cpu_dying(cpu))
 		return false;
 
+	/* Non percpu kthreads should stay away from paravirt cpu*/
+	if (cpu_paravirt(cpu))
+		return false;
+
 	/* But are allowed during online. */
 	return cpu_online(cpu);
 }
-- 
2.47.3




* [RFC PATCH v4 04/17] sched/debug: Remove unused schedstats
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (2 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v4 03/17] sched/core: Dont allow to use CPU marked as paravirt Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v5 05/17] sched/fair: Add paravirt movements for proc sched file Shrikanth Hegde
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

nr_migrations_cold, nr_wakeups_passive and nr_wakeups_idle are not
being updated anywhere. So remove them.

This will help to add a couple more stats in the next patch without
bloating the size.

These are per-process stats, so updating the schedstats version isn't
necessary.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/sched.h | 3 ---
 kernel/sched/debug.c  | 3 ---
 2 files changed, 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb436ee1942d..f802bfd7120f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -545,7 +545,6 @@ struct sched_statistics {
 	s64				exec_max;
 	u64				slice_max;
 
-	u64				nr_migrations_cold;
 	u64				nr_failed_migrations_affine;
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
@@ -558,8 +557,6 @@ struct sched_statistics {
 	u64				nr_wakeups_remote;
 	u64				nr_wakeups_affine;
 	u64				nr_wakeups_affine_attempts;
-	u64				nr_wakeups_passive;
-	u64				nr_wakeups_idle;
 
 #ifdef CONFIG_SCHED_CORE
 	u64				core_forceidle_sum;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680..2cb3ffc653df 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1182,7 +1182,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(wait_count);
 		PN_SCHEDSTAT(iowait_sum);
 		P_SCHEDSTAT(iowait_count);
-		P_SCHEDSTAT(nr_migrations_cold);
 		P_SCHEDSTAT(nr_failed_migrations_affine);
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
@@ -1194,8 +1193,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_wakeups_remote);
 		P_SCHEDSTAT(nr_wakeups_affine);
 		P_SCHEDSTAT(nr_wakeups_affine_attempts);
-		P_SCHEDSTAT(nr_wakeups_passive);
-		P_SCHEDSTAT(nr_wakeups_idle);
 
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
-- 
2.47.3




* [RFC PATCH v5 05/17] sched/fair: Add paravirt movements for proc sched file
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (3 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v4 04/17] sched/debug: Remove unused schedstats Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 06/17] sched/fair: Pass current cpu in select_idle_sibling Shrikanth Hegde
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

Add a couple of new stats:
- nr_migrations_paravirt: number of migrations due to the current task being
  moved out of a paravirt CPU.

- nr_wakeups_paravirt: number of wakeups where the previous CPU was marked
  as paravirt and hence the task is woken up on the current CPU.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 include/linux/sched.h | 2 ++
 kernel/sched/debug.c  | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f802bfd7120f..3628edd1468b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -549,6 +549,7 @@ struct sched_statistics {
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;
+	u64				nr_migrations_paravirt;
 
 	u64				nr_wakeups;
 	u64				nr_wakeups_sync;
@@ -557,6 +558,7 @@ struct sched_statistics {
 	u64				nr_wakeups_remote;
 	u64				nr_wakeups_affine;
 	u64				nr_wakeups_affine_attempts;
+	u64				nr_wakeups_paravirt;
 
 #ifdef CONFIG_SCHED_CORE
 	u64				core_forceidle_sum;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2cb3ffc653df..0e7d08514148 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1186,6 +1186,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
 		P_SCHEDSTAT(nr_forced_migrations);
+		P_SCHEDSTAT(nr_migrations_paravirt);
 		P_SCHEDSTAT(nr_wakeups);
 		P_SCHEDSTAT(nr_wakeups_sync);
 		P_SCHEDSTAT(nr_wakeups_migrate);
@@ -1193,6 +1194,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_wakeups_remote);
 		P_SCHEDSTAT(nr_wakeups_affine);
 		P_SCHEDSTAT(nr_wakeups_affine_attempts);
+		P_SCHEDSTAT(nr_wakeups_paravirt);
 
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
-- 
2.47.3




* [RFC PATCH v4 06/17] sched/fair: Pass current cpu in select_idle_sibling
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (4 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v5 05/17] sched/fair: Add paravirt movements for proc sched file Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 07/17] sched/fair: Don't consider paravirt CPUs for wakeup and load balance Shrikanth Hegde
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

Pattern in select_task_rq_fair:

	cpu = smp_processor_id();
	new_cpu = prev_cpu;

	//May change new_cpu due to wake_affine, otherwise it remains prev_cpu

	new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

Because of this, often prev_cpu == new_cpu. If the task was sleeping when
prev_cpu was marked as paravirt, it would be beneficial to choose the current
CPU instead. If the current CPU is paravirt too, the wakeup will happen there
and the task will be moved out at the next tick.

So pass the current CPU to select_idle_sibling as well.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1855975b8248..015e00b370c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1048,7 +1048,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #include "pelt.h"
 
-static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static int select_idle_sibling(struct task_struct *p, int this_cpu, int prev, int target);
 static unsigned long task_h_load(struct task_struct *p);
 static unsigned long capacity_of(int cpu);
 
@@ -7770,7 +7770,7 @@ static inline bool asym_fits_cpu(unsigned long util,
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
+static int select_idle_sibling(struct task_struct *p, int this_cpu, int prev, int target)
 {
 	bool has_idle_core = false;
 	struct sched_domain *sd;
@@ -8578,7 +8578,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+		new_cpu = select_idle_sibling(p, cpu, prev_cpu, new_cpu);
 	}
 	rcu_read_unlock();
 
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH v4 07/17] sched/fair: Don't consider paravirt CPUs for wakeup and load balance
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (5 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v4 06/17] sched/fair: Pass current cpu in select_idle_sibling Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 08/17] sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task Shrikanth Hegde
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

For the CFS load balancer:
- Mask out paravirt CPUs from the list of CPUs to balance.
- This helps to restrict/expand the workload depending on the mask.

At wakeup:
- If prev_cpu is paravirt, see if recent_used_cpu can be chosen.
  If not, choose the current CPU.
- For EAS systems, warn if the wakeup happens on a paravirt CPU.
  At this point, no EAS system is expected to have an overcommit of
  CPUs.
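
The wakeup-side choice can be modelled in userspace as follows. This is a
sketch under stated assumptions: the 64-bit mask stands in for
cpu_paravirt_mask, the model_* names are hypothetical, and a return value
of -1 is used here only to mean "no paravirt CPU involved, continue with
the usual idle-sibling search".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical query on a 64-bit stand-in for cpu_paravirt_mask. */
static bool model_cpu_paravirt(uint64_t mask, int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	return (mask >> cpu) & 1;
}

/*
 * Returns the CPU chosen by the early paravirt check, or -1 when neither
 * prev nor target is paravirt and the normal search would proceed.
 */
static int model_wakeup_paravirt_check(uint64_t mask, int this_cpu, int prev,
				       int target, int recent_used_cpu)
{
	if (!model_cpu_paravirt(mask, prev) && !model_cpu_paravirt(mask, target))
		return -1;

	/* Prefer the recently used CPU when it is not paravirt. */
	if (!model_cpu_paravirt(mask, recent_used_cpu))
		return recent_used_cpu;

	/* Otherwise fall back to the waking CPU. */
	return this_cpu;
}
```

If the waking CPU is itself paravirt, the task still lands there in this
model and relies on the tick-time push to move it out.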

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 015e00b370c9..760813802cb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7358,6 +7358,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 {
 	int target = nr_cpumask_bits;
 
+	if (cpu_paravirt(prev_cpu))
+		return this_cpu;
+
 	if (sched_feat(WA_IDLE))
 		target = wake_affine_idle(this_cpu, prev_cpu, sync);
 
@@ -7441,6 +7444,11 @@ static inline int sched_balance_find_dst_cpu(struct sched_domain *sd, struct tas
 {
 	int new_cpu = cpu;
 
+	if (cpu_paravirt(prev_cpu)) {
+		schedstat_inc(p->stats.nr_wakeups_paravirt);
+		return cpu;
+	}
+
 	if (!cpumask_intersects(sched_domain_span(sd), p->cpus_ptr))
 		return prev_cpu;
 
@@ -7777,10 +7785,25 @@ static int select_idle_sibling(struct task_struct *p, int this_cpu, int prev, in
 	unsigned long task_util, util_min, util_max;
 	int i, recent_used_cpu, prev_aff = -1;
 
+	/* Likely prev,target belong to same LLC, it is better at wakeup
+	 * to move away from them. at best return recent_used_cpu if it
+	 * is usable
+	 */
+	if (cpu_paravirt(prev) || cpu_paravirt(target)) {
+		schedstat_inc(p->stats.nr_wakeups_paravirt);
+
+		recent_used_cpu = p->recent_used_cpu;
+		if (!cpu_paravirt(recent_used_cpu))
+			return recent_used_cpu;
+		else
+			return this_cpu;
+	}
+
 	/*
 	 * On asymmetric system, update task utilization because we will check
 	 * that the task fits with CPU's capacity.
 	 */
+
 	if (sched_asym_cpucap_active()) {
 		sync_entity_load_avg(&p->se);
 		task_util = task_util_est(p);
@@ -8539,8 +8562,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 
 		if (!is_rd_overutilized(this_rq()->rd)) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
-			if (new_cpu >= 0)
+
+			/* System supporting Energy model isn't expected
+			 * to have a CPU marked as paravirt
+			 */
+			if (new_cpu >= 0) {
+				WARN_ON_ONCE(cpu_paravirt(new_cpu));
 				return new_cpu;
+			}
 			new_cpu = prev_cpu;
 		}
 
@@ -11832,6 +11861,11 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
+#ifdef CONFIG_PARAVIRT
+	/* Don't spread load to paravirt CPUs */
+	cpumask_andnot(cpus, cpus, cpu_paravirt_mask);
+#endif
+
 	schedstat_inc(sd->lb_count[idle]);
 
 redo:
-- 
2.47.3




* [RFC PATCH v4 08/17] sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (6 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v4 07/17] sched/fair: Don't consider paravirt CPUs for wakeup and load balance Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 09/17] sched/core: Add support for nohz_full CPUs Shrikanth Hegde
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

For RT class,
- During wakeup don't select a paravirt CPU.
- Don't pull a task towards a paravirt CPU.
- Don't push a task to a paravirt CPU.
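
The push and pull rules above reduce to two small predicates, sketched here
as a userspace model. The names are hypothetical and the 64-bit mask stands
in for cpu_paravirt_mask; the candidate CPU is assumed to come from
something like find_lowest_rq(), which returns -1 when nothing is found.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the push-side filter: a candidate is rejected when
 * none was found (-1), when it is the source CPU, or when it is paravirt.
 */
static bool model_rt_push_target_ok(uint64_t paravirt_mask, int cpu, int rq_cpu)
{
	if (cpu == -1 || cpu == rq_cpu)
		return false;
	return !((paravirt_mask >> cpu) & 1);
}

/* Hypothetical model of the pull-side rule: a paravirt CPU never pulls. */
static bool model_rt_may_pull(uint64_t paravirt_mask, int this_cpu)
{
	return !((paravirt_mask >> this_cpu) & 1);
}
```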

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/rt.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f1867fe8e5c5..0b78c74dbbe3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1552,6 +1552,9 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 		if (!test && target != -1 && !rt_task_fits_capacity(p, target))
 			goto out_unlock;
 
+		/* Avoid moving to a paravirt CPU */
+		if (cpu_paravirt(target))
+			goto out_unlock;
 		/*
 		 * Don't bother moving it if the destination CPU is
 		 * not running a lower priority task.
@@ -1876,7 +1879,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
 		cpu = find_lowest_rq(task);
 
-		if ((cpu == -1) || (cpu == rq->cpu))
+		if ((cpu == -1) || (cpu == rq->cpu) || cpu_paravirt(cpu))
 			break;
 
 		lowest_rq = cpu_rq(cpu);
@@ -1974,7 +1977,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 			return 0;
 
 		cpu = find_lowest_rq(rq->curr);
-		if (cpu == -1 || cpu == rq->cpu)
+		if (cpu == -1 || cpu == rq->cpu || cpu_paravirt(cpu))
 			return 0;
 
 		/*
@@ -2237,6 +2240,10 @@ static void pull_rt_task(struct rq *this_rq)
 	if (likely(!rt_overload_count))
 		return;
 
+	/* There is no point in pulling the task towards a paravirt cpu */
+	if (cpu_paravirt(this_rq->cpu))
+		return;
+
 	/*
 	 * Match the barrier from rt_set_overloaded; this guarantees that if we
 	 * see overloaded we must also see the rto_mask bit.
-- 
2.47.3




* [RFC PATCH v4 09/17] sched/core: Add support for nohz_full CPUs
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (7 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v4 08/17] sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 10/17] sched/core: Push current task from paravirt CPU Shrikanth Hegde
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

Enable the tick on a nohz_full CPU when it is marked as paravirt.
If there is no CFS/RT task running there, disable the tick to save power.

In addition to this, arch specific code which enables the paravirt CPU
should call tick_nohz_dep_set_cpu with TICK_DEP_BIT_SCHED to move
the task out of the nohz_full CPU quickly.
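
The extra condition is small enough to model directly. The sketch below is
a userspace illustration only: struct rq_model is a hypothetical stand-in
for the runqueue fields used by the check, and the function covers just the
new early test, not the rest of sched_can_stop_tick().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the runqueue fields the new check reads. */
struct rq_model {
	bool paravirt;			/* models cpu_paravirt(rq->cpu) */
	unsigned int rt_nr_running;	/* models rq->rt.rt_nr_running */
	unsigned int cfs_h_nr_queued;	/* models rq->cfs.h_nr_queued */
};

/*
 * True when the tick must keep running: the CPU is paravirt and still has
 * RT or CFS tasks that need to be pushed out at the next tick.
 */
static bool model_paravirt_keeps_tick(const struct rq_model *rq)
{
	return rq->paravirt && (rq->rt_nr_running || rq->cfs_h_nr_queued);
}
```

Once both counts drop to zero the model no longer holds the tick, which
corresponds to letting the nohz_full CPU go quiet again.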

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90fc04d84b74..73d1d49a3c72 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1336,6 +1336,10 @@ bool sched_can_stop_tick(struct rq *rq)
 {
 	int fifo_nr_running;
 
+	/* Keep the tick running until both RT and CFS are pushed out*/
+	if (cpu_paravirt(rq->cpu) && (rq->rt.rt_nr_running || rq->cfs.h_nr_queued))
+		return false;
+
 	/* Deadline tasks, even if single, need the tick */
 	if (rq->dl.dl_nr_running)
 		return false;
-- 
2.47.3




* [RFC PATCH v4 10/17] sched/core: Push current task from paravirt CPU
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (8 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v4 09/17] sched/core: Add support for nohz_full CPUs Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 11/17] sysfs: Add paravirt CPU file Shrikanth Hegde
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

Actively push out RT/CFS tasks running on a paravirt CPU. Since the task is
running on the CPU, the CPU needs to be stopped to push the task out.
However, if the task is pinned only to paravirt CPUs, it will continue
running there.

Though the code is almost the same as __balance_push_cpu_stop and quite close
to push_cpu_stop, it provides a cleaner implementation w.r.t. the PARAVIRT
config.

Add push_task_work_done flag to protect the pv_push_task_work buffer.
This currently works only for FAIR and RT.
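
One way to picture how the flag protects the per-CPU work buffer is the
userspace model below. It is a sketch, not the patch logic: the names are
hypothetical, the affinity check for pinned tasks is left out, and the
"skip while a push is still pending" test is an assumption about how the
flag is consulted at tick time. Only the clearing of the flag in the
stopper callback is taken directly from the diff.

```c
#include <stdbool.h>

enum model_class { MODEL_FAIR, MODEL_RT, MODEL_OTHER };

/* Hypothetical stand-in for the runqueue state used at tick time. */
struct rq_model {
	bool paravirt;
	bool curr_is_idle;
	enum model_class curr_class;
	bool push_task_work_done;	/* set while a push is pending */
};

/* Returns true when a push would be queued on the stopper. */
static bool model_try_queue_push(struct rq_model *rq)
{
	/* Nothing to do on a normal CPU, and the idle task can't be pushed. */
	if (!rq->paravirt || rq->curr_is_idle)
		return false;

	/* Only FAIR and RT are handled for now. */
	if (rq->curr_class != MODEL_FAIR && rq->curr_class != MODEL_RT)
		return false;

	/* Assumption: the single work buffer must not be reused in flight. */
	if (rq->push_task_work_done)
		return false;

	rq->push_task_work_done = true;
	return true;
}

/* Models the stopper callback clearing the flag once it runs. */
static void model_stopper_done(struct rq_model *rq)
{
	rq->push_task_work_done = false;
}
```

In this model a second tick arriving before the stopper has run does not
queue the same buffer twice.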

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c  | 83 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  9 +++++
 2 files changed, 92 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 73d1d49a3c72..65c247c24191 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5521,6 +5521,10 @@ void sched_tick(void)
 	unsigned long hw_pressure;
 	u64 resched_latency;
 
+	/* push the current task out if a paravirt CPU */
+	if (cpu_paravirt(cpu))
+		push_current_from_paravirt_cpu(rq);
+
 	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
 		arch_scale_freq_tick();
 
@@ -10869,4 +10873,83 @@ void sched_change_end(struct sched_change_ctx *ctx)
 #ifdef CONFIG_PARAVIRT
 struct cpumask __cpu_paravirt_mask __read_mostly;
 EXPORT_SYMBOL(__cpu_paravirt_mask);
+
+static DEFINE_PER_CPU(struct cpu_stop_work, pv_push_task_work);
+
+static int paravirt_push_cpu_stop(void *arg)
+{
+	struct task_struct *p = arg;
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+	int cpu;
+
+	raw_spin_lock_irq(&p->pi_lock);
+	rq_lock(rq, &rf);
+	rq->push_task_work_done = 0;
+
+	update_rq_clock(rq);
+
+	if (task_rq(p) == rq && task_on_rq_queued(p)) {
+		cpu = select_fallback_rq(rq->cpu, p);
+		rq = __migrate_task(rq, &rf, p, cpu);
+	}
+
+	rq_unlock(rq, &rf);
+	raw_spin_unlock_irq(&p->pi_lock);
+	put_task_struct(p);
+
+	return 0;
+}
+
+/* A CPU is marked as Paravirt when there is contention for underlying
+ * physical CPU and using this CPU will lead to hypervisor preemptions.
+ * It is better not to use this CPU.
+ *
+ * In case any task is scheduled on such CPU, move it out. In
+ * select_fallback_rq a non paravirt CPU will be chosen and henceforth
+ * task shouldn't come back to this CPU
+ */
+void push_current_from_paravirt_cpu(struct rq *rq)
+{
+	struct task_struct *push_task = rq->curr;
+	unsigned long flags;
+	struct rq_flags rf;
+
+	if (!cpu_paravirt(rq->cpu))
+		return;
+
+	/* Idle task can't be pushed out */
+	if (rq->curr == rq->idle)
+		return;
+
+	/* Do for only SCHED_NORMAL AND RT for now */
+	if (push_task->sched_class != &fair_sched_class &&
+	    push_task->sched_class != &rt_sched_class)
+		return;
+
+	if (kthread_is_per_cpu(push_task) ||
+	    is_migration_disabled(push_task))
+		return;
+
+	/* Is it affine to only paravirt cpus? */
+	if (cpumask_subset(push_task->cpus_ptr, cpu_paravirt_mask))
+		return;
+
+	/* There is already a stopper thread for this. Don't race with it */
+	if (rq->push_task_work_done == 1)
+		return;
+
+	local_irq_save(flags);
+
+	get_task_struct(push_task);
+	schedstat_inc(push_task->stats.nr_migrations_paravirt);
+
+	rq_lock(rq, &rf);
+	rq->push_task_work_done = 1;
+	rq_unlock(rq, &rf);
+
+	stop_one_cpu_nowait(rq->cpu, paravirt_push_cpu_stop, push_task,
+			    this_cpu_ptr(&pv_push_task_work));
+	local_irq_restore(flags);
+}
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b419a4d98461..42984a65384c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1214,6 +1214,9 @@ struct rq {
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
 
+#ifdef CONFIG_PARAVIRT
+	bool			push_task_work_done;
+#endif
 	unsigned long		misfit_task_load;
 
 	/* For active balancing */
@@ -4017,6 +4020,12 @@ extern bool dequeue_task(struct rq *rq, struct task_struct *p, int flags);
 extern struct balance_callback *splice_balance_callbacks(struct rq *rq);
 extern void balance_callbacks(struct rq *rq, struct balance_callback *head);
 
+#ifdef CONFIG_PARAVIRT
+void push_current_from_paravirt_cpu(struct rq *rq);
+#else
+static inline void push_current_from_paravirt_cpu(struct rq *rq) { }
+#endif
+
 /*
  * The 'sched_change' pattern is the safe, easy and slow way of changing a
  * task's scheduling properties. It dequeues a task, such that the scheduler
-- 
2.47.3




* [RFC PATCH v4 11/17] sysfs: Add paravirt CPU file
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (9 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v4 10/17] sched/core: Push current task from paravirt CPU Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 12/17] powerpc: method to initialize ec and vp cores Shrikanth Hegde
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

Add a paravirt file in /sys/devices/system/cpu.

This offers the following:
- Users can quickly check which CPUs are marked as paravirt.
- Userspace algorithms, such as sched_ext schedulers or setups using
  isolcpus, could use the mask to make decisions.
- Daemons such as irqbalance could use this mask to avoid spreading
  IRQs onto paravirt CPUs.

For example:
cat /sys/devices/system/cpu/paravirt
600-719      <<< arch marked these as paravirt.

cat /sys/devices/system/cpu/paravirt
             <<< No paravirt CPUs at the moment.
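
A userspace consumer would need to parse the CPU list format shown
above. A minimal sketch of such a parser follows; it is not part of this
series, and parse_cpulist and MAX_CPUS are hypothetical names chosen for
illustration. It assumes the file content has already been read into a
string.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CPUS 4096	/* assumed upper bound for this sketch */

/*
 * Parse a CPU list such as "100-200,600-700\n" into a bool array.
 * Returns the number of CPUs set, or -1 on malformed input.
 * An empty string (or just a newline) means no CPU is set.
 */
static int parse_cpulist(const char *s, bool set[MAX_CPUS])
{
	int count = 0;

	memset(set, 0, MAX_CPUS * sizeof(bool));

	while (*s && *s != '\n') {
		char *end;
		long lo, hi;

		lo = strtol(s, &end, 10);
		if (end == s)
			return -1;	/* no digits where a CPU number was expected */
		hi = lo;
		if (*end == '-') {
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;	/* range without an upper bound */
		}
		if (lo < 0 || hi < lo || hi >= MAX_CPUS)
			return -1;
		for (long c = lo; c <= hi; c++) {
			if (!set[c]) {
				set[c] = true;
				count++;
			}
		}
		if (*end == ',')
			end++;
		else if (*end && *end != '\n')
			return -1;	/* unexpected separator */
		s = end;
	}
	return count;
}
```

For the first example output above, parse_cpulist("600-719\n", set)
marks 120 CPUs; for the empty output it marks none.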

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu |  9 +++++++++
 drivers/base/cpu.c                                 | 12 ++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 8aed6d94c4cd..1da77430b776 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -777,3 +777,12 @@ Date:		Nov 2022
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:
 		(RO) the list of CPUs that can be brought online.
+
+What:		/sys/devices/system/cpu/paravirt
+Date:		Sep 2025
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		(RO) the list of CPUs that are currently marked as paravirt CPUs.
+		These CPUs are not meant to be used at the moment due to
+		contention for the underlying physical CPU resource. The list
+		changes dynamically to reflect the current situation.
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index fa0a2eef93ac..59ceae217b22 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -374,6 +374,15 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
 }
 #endif
 
+#ifdef CONFIG_PARAVIRT
+static ssize_t print_paravirt_cpus(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_paravirt_mask));
+}
+static DEVICE_ATTR(paravirt, 0444, print_paravirt_cpus, NULL);
+#endif
+
 const struct bus_type cpu_subsys = {
 	.name = "cpu",
 	.dev_name = "cpu",
@@ -513,6 +522,9 @@ static struct attribute *cpu_root_attrs[] = {
 #endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
+#endif
+#ifdef CONFIG_PARAVIRT
+	&dev_attr_paravirt.attr,
 #endif
 	NULL
 };
-- 
2.47.3




* [RFC PATCH v4 12/17] powerpc: method to initialize ec and vp cores
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (10 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v4 11/17] sysfs: Add paravirt CPU file Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 13/17] powerpc: enable/disable paravirt CPUs based on steal time Shrikanth Hegde
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

During system init, capture the number of EC and VP cores on Shared
Processor LPARs (SPLPARs, aka VMs).

EC - Entitled Cores - The hypervisor (PowerVM) guarantees this many cores'
worth of cycles.
VP - Virtual Processor Cores - Total logical cores present in the LPAR.
In SPLPARs there is typically an overcommit of vCPUs, i.e. VP > EC.

These values will be used in subsequent patches to calculate the number of
cores to use when there is steal time.

Note: The DLPAR-specific path needs to call this again. That is yet to be
done.

Originally-by: Srikar Dronamraju <srikar@linux.ibm.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 arch/powerpc/include/asm/smp.h        |  1 +
 arch/powerpc/kernel/smp.c             |  1 +
 arch/powerpc/platforms/pseries/lpar.c | 30 +++++++++++++++++++++++++++
 3 files changed, 32 insertions(+)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index e41b9ea42122..5a52c6952195 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -266,6 +266,7 @@ extern char __secondary_hold;
 extern unsigned int booting_thread_hwid;
 
 extern void __early_start(void);
+void pseries_init_ec_vp_cores(void);
 #endif /* __ASSEMBLER__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 68edb66c2964..5a3b52dd625b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1732,6 +1732,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 	dump_numa_cpu_topology();
 	build_sched_topology();
+	pseries_init_ec_vp_cores();
 }
 
 /*
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 6a415febc53b..935fced6e127 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -2029,3 +2029,33 @@ static int __init vpa_debugfs_init(void)
 }
 machine_arch_initcall(pseries, vpa_debugfs_init);
 #endif /* CONFIG_DEBUG_FS */
+
+#ifdef CONFIG_PARAVIRT
+
+static unsigned int virtual_procs __read_mostly;
+static unsigned int entitled_cores __read_mostly;
+static unsigned int available_cores;
+
+void pseries_init_ec_vp_cores(void)
+{
+	unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
+	int ret;
+
+	if (available_cores && virtual_procs == num_present_cpus() / threads_per_core)
+		return;
+
+	/* Get EC values from hcall */
+	ret = plpar_hcall9(H_GET_PPP, retbuf);
+	WARN_ON_ONCE(ret != 0);
+	if (ret)
+		return;
+
+	entitled_cores = retbuf[0] / 100;
+	virtual_procs = num_present_cpus() / threads_per_core;
+
+	/* Initialize the available cores to all VP initially */
+	available_cores = max(entitled_cores, virtual_procs);
+}
+#else
+void pseries_init_ec_vp_cores(void) { return; }
+#endif
-- 
2.47.3




* [RFC PATCH v4 13/17] powerpc: enable/disable paravirt CPUs based on steal time
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (11 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v4 12/17] powerpc: method to initialize ec and vp cores Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 14/17] powerpc: process steal values at fixed intervals Shrikanth Hegde
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

available_cores - Number of cores the LPAR (VM) can use at this moment.
The remaining cores will have their CPUs marked as paravirt.

This follows a stepwise approach for reducing/increasing the number of
available_cores.

The logic is very simple:
	if (steal_time > high_threshold)
		available_cores--
	if (steal_time < low_threshold)
		available_cores++

It also checks the previous direction taken to avoid unnecessary
ping-pongs.

Note: It works well only when CPUs are spread in equal numbers across
NUMA nodes.
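
The stepwise logic with the direction check can be modelled in userspace
as below. This is only a sketch of the state machine: struct
entitlement_model and step are hypothetical names, the marking of CPUs as
paravirt is left out, and the bounds are checked with < and > where the
patch compares for equality.

```c
/* Hypothetical model of the EC/VP/AC state; not the kernel data layout. */
struct entitlement_model {
	unsigned int entitled_cores;	/* EC: lower bound */
	unsigned int virtual_procs;	/* VP: upper bound */
	unsigned int available_cores;	/* AC: EC <= AC <= VP */
	int prev_direction;		/* 1: last sample high, -1: low, 0: neither */
};

/* Feed one steal ratio sample; AC moves by at most one core per sample. */
static void step(struct entitlement_model *m, unsigned long steal_ratio,
		 unsigned long high, unsigned long low)
{
	if (steal_ratio >= high && m->prev_direction > 0) {
		/* Second consecutive high sample: give up one core. */
		if (m->available_cores > m->entitled_cores)
			m->available_cores--;
	} else if (steal_ratio <= low && m->prev_direction < 0) {
		/* Second consecutive low sample: take one core back. */
		if (m->available_cores < m->virtual_procs)
			m->available_cores++;
	}

	/* Remember what this sample looked like for the next one. */
	if (steal_ratio >= high)
		m->prev_direction = 1;
	else if (steal_ratio <= low)
		m->prev_direction = -1;
	else
		m->prev_direction = 0;
}
```

Because a change needs two consecutive samples on the same side of a
threshold, a single spike in steal time does not change available_cores.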

Originally-by: Srikar Dronamraju <srikar@linux.ibm.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 arch/powerpc/platforms/pseries/lpar.c    | 53 ++++++++++++++++++++++++
 arch/powerpc/platforms/pseries/pseries.h |  1 +
 2 files changed, 54 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 935fced6e127..825b5b4e2b43 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -43,6 +43,7 @@
 #include <asm/fadump.h>
 #include <asm/dtl.h>
 #include <asm/vphn.h>
+#include <linux/sched/isolation.h>
 
 #include "pseries.h"
 
@@ -2056,6 +2057,58 @@ void pseries_init_ec_vp_cores(void)
 	/* Initialize the available cores to all VP initially */
 	available_cores = max(entitled_cores, virtual_procs);
 }
+
+#define STEAL_RATIO_HIGH 400
+#define STEAL_RATIO_LOW  150
+
+void update_soft_entitlement(unsigned long steal_ratio)
+{
+	static int prev_direction;
+	int cpu;
+
+	if  (!entitled_cores)
+		return;
+
+	if (steal_ratio >= STEAL_RATIO_HIGH && prev_direction > 0) {
+		/*
+		 * System entitlement was reduced earlier but we continue to
+		 * see steal time. Reduce entitlement further.
+		 */
+		if (available_cores == entitled_cores)
+			return;
+
+		/* Mark them paravirt, enable tick if it is nohz_full */
+		for (cpu = (available_cores - 1) * threads_per_core;
+		     cpu < available_cores * threads_per_core; cpu++) {
+			set_cpu_paravirt(cpu, true);
+			if (tick_nohz_full_cpu(cpu))
+				tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
+		}
+		available_cores--;
+
+	} else if (steal_ratio <= STEAL_RATIO_LOW && prev_direction < 0) {
+		/*
+		 * System entitlement was increased but we continue to see
+		 * less steal time. Increase entitlement further.
+		 */
+		if (available_cores == virtual_procs)
+			return;
+
+		/* mark them available */
+		for (cpu = available_cores * threads_per_core;
+		     cpu < (available_cores + 1) * threads_per_core; cpu++)
+			set_cpu_paravirt(cpu, false);
+
+		available_cores++;
+	}
+	if (steal_ratio >= STEAL_RATIO_HIGH)
+		prev_direction = 1;
+	else if (steal_ratio <= STEAL_RATIO_LOW)
+		prev_direction = -1;
+	else
+		prev_direction = 0;
+}
 #else
 void pseries_init_ec_vp_cores(void) { return; }
+void update_soft_entitlement(unsigned long steal_ratio) { return; }
 #endif
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 3968a6970fa8..d1f9ec77ff57 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -115,6 +115,7 @@ int dlpar_workqueue_init(void);
 
 extern u32 pseries_security_flavor;
 void pseries_setup_security_mitigations(void);
+void update_soft_entitlement(unsigned long steal_ratio);
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
 void pseries_lpar_read_hblkrm_characteristics(void);
-- 
2.47.3




* [RFC PATCH v4 14/17] powerpc: process steal values at fixed intervals
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (12 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v4 13/17] powerpc: enable/disable paravirt CPUs based on steal time Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [RFC PATCH v4 15/17] powerpc: add debugfs file for controlling handling on steal values Shrikanth Hegde
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

Process steal time at regular intervals. The sum of steal time across the
online vCPUs is compared against the elapsed time to get the steal ratio.

Only the first online CPU does this work, which reduces racing issues.
This is done only on SPLPAR (non-KVM guest) and assumes PowerVM is the
hypervisor.
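
The arithmetic behind the steal ratio can be worked through in isolation
as below. This is a sketch, not the kernel function: steal_ratio here is
a hypothetical standalone helper, and the guards against a zero window
or zero CPUs are additions for the sketch. The result is in units of
0.01%, so 400 corresponds to 4% steal.

```c
#include <stdint.h>

#define STEAL_MULTIPLE 10000ULL

/*
 * steal_delta_ns: growth of the summed steal time over the window
 * window_ns:      length of the sampling window
 * ncpus:          number of online CPUs contributing to the sum
 *
 * Returns the per-CPU steal ratio scaled by STEAL_MULTIPLE.
 */
static uint64_t steal_ratio(uint64_t steal_delta_ns, uint64_t window_ns,
			    unsigned int ncpus)
{
	if (window_ns == 0)
		window_ns = 1;	/* avoid dividing by zero */
	if (ncpus == 0)
		return 0;

	return steal_delta_ns * STEAL_MULTIPLE / (window_ns * ncpus);
}
```

For example, with 8 online CPUs and a 1 second window, 320 ms of summed
steal time gives 320000000 * 10000 / (1000000000 * 8) = 400, i.e. 4%.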
 
Originally-by: Srikar Dronamraju <srikar@linux.ibm.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 arch/powerpc/platforms/pseries/lpar.c | 59 +++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 825b5b4e2b43..c16d97e1a1fe 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -660,10 +660,58 @@ static int __init vcpudispatch_stats_procfs_init(void)
 machine_device_initcall(pseries, vcpudispatch_stats_procfs_init);
 
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+
+#define STEAL_MULTIPLE 10000
+#define PURR_UPDATE_TB NSEC_PER_SEC
+
+static bool should_cpu_process_steal(int cpu)
+{
+	if (cpu == cpumask_first(cpu_online_mask))
+		return true;
+
+	return false;
+}
+
+static void process_steal(int cpu)
+{
+	static unsigned long next_tb_ns, prev_steal;
+	unsigned long steal_ratio, delta_tb;
+	unsigned long tb_ns = tb_to_ns(mftb());
+	unsigned long steal = 0;
+	unsigned int i;
+
+	if (!should_cpu_process_steal(cpu))
+		return;
+
+	if (tb_ns < next_tb_ns)
+		return;
+
+	for_each_online_cpu(i) {
+		struct lppaca *lppaca = &lppaca_of(i);
+
+		steal += be64_to_cpu(READ_ONCE(lppaca->ready_enqueue_tb));
+		steal += be64_to_cpu(READ_ONCE(lppaca->enqueue_dispatch_tb));
+	}
+
+	steal = tb_to_ns(steal);
+
+	if (next_tb_ns && prev_steal) {
+		delta_tb = max(tb_ns - (next_tb_ns - PURR_UPDATE_TB), 1);
+		steal_ratio = (steal - prev_steal) * STEAL_MULTIPLE;
+		steal_ratio /= (delta_tb * num_online_cpus());
+		update_soft_entitlement(steal_ratio);
+	}
+
+	next_tb_ns = tb_ns + PURR_UPDATE_TB;
+	prev_steal = steal;
+}
+
 u64 pseries_paravirt_steal_clock(int cpu)
 {
 	struct lppaca *lppaca = &lppaca_of(cpu);
 
+	if (is_shared_processor() && !is_kvm_guest())
+		process_steal(cpu);
 	/*
 	 * VPA steal time counters are reported at TB frequency. Hence do a
 	 * conversion to ns before returning
@@ -2061,6 +2109,17 @@ void pseries_init_ec_vp_cores(void)
 #define STEAL_RATIO_HIGH 400
 #define STEAL_RATIO_LOW  150
 
+/*
+ * [0]<----------->[EC]---->{AC}-->[VP]
+ * EC == Entitled Cores. Guaranteed number of cores by hypervisor.
+ * VP == Virtual Processors. Total number of cores. When there is overcommit
+ * this will be higher than EC.
+ * AC == Available Cores. Varies between EC <-> VP.
+ *
+ * If Steal time is high, then reduce Available Cores.
+ * If steal time is low, increase Available Cores
+ */
+
 void update_soft_entitlement(unsigned long steal_ratio)
 {
 	static int prev_direction;
-- 
2.47.3




* [RFC PATCH v4 15/17] powerpc: add debugfs file for controlling handling on steal values
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (13 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v4 14/17] powerpc: process steal values at fixed intervals Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  6:20 ` [HELPER PATCH 1] sysfs: Provide write method for paravirt Shrikanth Hegde
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

Since the low and high thresholds for steal time can vary from system to
system, make these values tunable.

Values are to be given as the expected percentage value * 100, i.e. if
one wants 8% of steal time to be considered high, then 800 should be
specified as the high threshold. The same computation holds for the low
threshold.

Provide one more tunable to control how often the steal time computation
is done. By default it is 1 second. If that is too aggressive, it can be
increased. The max value is 10 seconds, since one should act relatively
fast based on steal values.
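
The units and the accepted ranges of the three tunables can be
summarised as below. This is an illustrative sketch only: the helper
names percent_to_tunable, high_ok, low_ok and freq_ok are hypothetical,
and the range checks mirror the set handlers in this patch.

```c
#include <stdbool.h>

/* Convert a whole percentage into the value written to the debugfs file. */
static unsigned int percent_to_tunable(unsigned int percent)
{
	return percent * 100;
}

/* High threshold: at most 100% (10000) and not below the current low. */
static bool high_ok(unsigned long val, unsigned long cur_low)
{
	return val <= 10000 && val >= cur_low;
}

/* Low threshold: at least 1 and not above the current high. */
static bool low_ok(unsigned long val, unsigned long cur_high)
{
	return val >= 1 && val <= cur_high;
}

/* Check interval in seconds: between 1 and 10 inclusive. */
static bool freq_ok(unsigned long secs)
{
	return secs >= 1 && secs <= 10;
}
```

With the defaults of 400 and 150, writing 800 to steal_ratio_high is
accepted, while writing 100 is rejected because it is below the low
threshold.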

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 arch/powerpc/platforms/pseries/lpar.c | 94 ++++++++++++++++++++++++---
 1 file changed, 86 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index c16d97e1a1fe..090e5c48243b 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -662,7 +662,8 @@ machine_device_initcall(pseries, vcpudispatch_stats_procfs_init);
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 
 #define STEAL_MULTIPLE 10000
-#define PURR_UPDATE_TB NSEC_PER_SEC
+static int steal_check_freq = 1;
+#define PURR_UPDATE_TB (steal_check_freq * NSEC_PER_SEC)
 
 static bool should_cpu_process_steal(int cpu)
 {
@@ -2106,9 +2107,6 @@ void pseries_init_ec_vp_cores(void)
 	available_cores = max(entitled_cores, virtual_procs);
 }
 
-#define STEAL_RATIO_HIGH 400
-#define STEAL_RATIO_LOW  150
-
 /*
  * [0]<----------->[EC]---->{AC}-->[VP]
  * EC == Entitled Cores. Guaranteed number of cores by hypervisor.
@@ -2120,6 +2118,9 @@ void pseries_init_ec_vp_cores(void)
  * If steal time is low, increase Available Cores
  */
 
+static unsigned int steal_ratio_high = 400;
+static unsigned int steal_ratio_low = 150;
+
 void update_soft_entitlement(unsigned long steal_ratio)
 {
 	static int prev_direction;
@@ -2128,7 +2129,7 @@ void update_soft_entitlement(unsigned long steal_ratio)
 	if  (!entitled_cores)
 		return;
 
-	if (steal_ratio >= STEAL_RATIO_HIGH && prev_direction > 0) {
+	if (steal_ratio >= steal_ratio_high && prev_direction > 0) {
 		/*
 		 * System entitlement was reduced earlier but we continue to
 		 * see steal time. Reduce entitlement further.
@@ -2145,7 +2146,7 @@ void update_soft_entitlement(unsigned long steal_ratio)
 		}
 		available_cores--;
 
-	} else if (steal_ratio <= STEAL_RATIO_LOW && prev_direction < 0) {
+	} else if (steal_ratio <= steal_ratio_low && prev_direction < 0) {
 		/*
 		 * System entitlement was increased but we continue to see
 		 * less steal time. Increase entitlement further.
@@ -2160,13 +2161,90 @@ void update_soft_entitlement(unsigned long steal_ratio)
 
 		available_cores++;
 	}
-	if (steal_ratio >= STEAL_RATIO_HIGH)
+	if (steal_ratio >= steal_ratio_high)
 		prev_direction = 1;
-	else if (steal_ratio <= STEAL_RATIO_LOW)
+	else if (steal_ratio <= steal_ratio_low)
 		prev_direction = -1;
 	else
 		prev_direction = 0;
 }
+
+/*
+ * Any value above this set threshold will reduce the available cores
+ * Value can't be more than 100% and can't be less than low threshold value
+ * Specifying 500 means 5% steal time
+ */
+
+static int pv_steal_ratio_high_set(void *data, u64 val)
+{
+	if (val > 10000 || val < steal_ratio_low)
+		return -EINVAL;
+
+	steal_ratio_high = val;
+	return 0;
+}
+
+static int pv_steal_ratio_high_get(void *data, u64 *val)
+{
+	*val = steal_ratio_high;
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_steal_ratio_high, pv_steal_ratio_high_get,
+			pv_steal_ratio_high_set, "%llu\n");
+
+static int pv_steal_ratio_low_set(void *data, u64 val)
+{
+	if (val < 1 || val > steal_ratio_high)
+		return -EINVAL;
+
+	steal_ratio_low = val;
+	return 0;
+}
+
+static int pv_steal_ratio_low_get(void *data, u64 *val)
+{
+	*val = steal_ratio_low;
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_steal_ratio_low, pv_steal_ratio_low_get,
+			pv_steal_ratio_low_set, "%llu\n");
+
+static int pv_steal_check_freq_set(void *data, u64 val)
+{
+	if (val < 1 || val > 10)
+		return -EINVAL;
+
+	steal_check_freq = val;
+	return 0;
+}
+
+static int pv_steal_check_freq_get(void *data, u64 *val)
+{
+	*val = steal_check_freq;
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_steal_check_freq, pv_steal_check_freq_get,
+			pv_steal_check_freq_set, "%llu\n");
+
+static int __init steal_debugfs_init(void)
+{
+	if (!is_shared_processor() || is_kvm_guest())
+		return 0;
+
+	debugfs_create_file("steal_ratio_high", 0600, arch_debugfs_dir,
+			    NULL, &fops_pv_steal_ratio_high);
+	debugfs_create_file("steal_ratio_low", 0600, arch_debugfs_dir,
+			    NULL, &fops_pv_steal_ratio_low);
+	debugfs_create_file("steal_check_frequency", 0600, arch_debugfs_dir,
+			    NULL, &fops_pv_steal_check_freq);
+
+	return 0;
+}
+
+machine_arch_initcall(pseries, steal_debugfs_init);
 #else
 void pseries_init_ec_vp_cores(void) { return; }
 void update_soft_entitlement(unsigned long steal_ratio) { return; }
-- 
2.47.3




* [HELPER PATCH 1] sysfs: Provide write method for paravirt
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (14 preceding siblings ...)
  2025-11-19  6:20 ` [RFC PATCH v4 15/17] powerpc: add debugfs file for controlling handling on steal values Shrikanth Hegde
@ 2025-11-19  6:20 ` Shrikanth Hegde
  2025-11-19  7:42   ` Greg KH
  2025-11-19  6:21 ` [HELPER PATCH 2] helper: disable arch handling if paravirt file being written Shrikanth Hegde
  2025-11-19 12:53 ` [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
  17 siblings, 1 reply; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:20 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

This is a helper patch which could be used to mark a range of CPUs as
paravirt. One could make use of this for quick testing of this infra
instead of writing arch-specific code.

This is currently not meant to be merged, since the paravirt sysfs file is
meant to be read-only.

echo 100-200,600-700 >  /sys/devices/system/cpu/paravirt
cat /sys/devices/system/cpu/paravirt
100-200,600-700

echo > /sys/devices/system/cpu/paravirt
cat /sys/devices/system/cpu/paravirt

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 drivers/base/cpu.c | 48 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 59ceae217b22..043e4f4ce1a9 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -375,12 +375,57 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
 #endif
 
 #ifdef CONFIG_PARAVIRT
+static ssize_t store_paravirt_cpus(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t count)
+{
+	cpumask_var_t temp_mask;
+	int retval = 0;
+
+	if (!alloc_cpumask_var(&temp_mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	retval = cpulist_parse(buf, temp_mask);
+	if (retval)
+		goto free_mask;
+
+	/* ALL cpus can't be marked as paravirt */
+	if (cpumask_equal(temp_mask, cpu_online_mask)) {
+		retval = -EINVAL;
+		goto free_mask;
+	}
+	if (cpumask_weight(temp_mask) > num_online_cpus()) {
+		retval = -EINVAL;
+		goto free_mask;
+	}
+
+	/* No more paravirt cpus */
+	if (cpumask_empty(temp_mask)) {
+		cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
+	} else {
+		cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
+
+		/* Enable tick on nohz_full cpu */
+		int cpu;
+		for_each_cpu(cpu, temp_mask) {
+			if (tick_nohz_full_cpu(cpu))
+				tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
+		}
+	}
+
+	retval = count;
+
+free_mask:
+	free_cpumask_var(temp_mask);
+	return retval;
+}
+
 static ssize_t print_paravirt_cpus(struct device *dev,
 				   struct device_attribute *attr, char *buf)
 {
 	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_paravirt_mask));
 }
-static DEVICE_ATTR(paravirt, 0444, print_paravirt_cpus, NULL);
+static DEVICE_ATTR(paravirt, 0644, print_paravirt_cpus, store_paravirt_cpus);
 #endif
 
 const struct bus_type cpu_subsys = {
@@ -675,7 +720,6 @@ static void __init cpu_register_vulnerabilities(void)
 		put_device(dev);
 	}
 }
-
 #else
 static inline void cpu_register_vulnerabilities(void) { }
 #endif
-- 
2.47.3




* [HELPER PATCH 2] helper: disable arch handling if paravirt file being written
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (15 preceding siblings ...)
  2025-11-19  6:20 ` [HELPER PATCH 1] sysfs: Provide write method for paravirt Shrikanth Hegde
@ 2025-11-19  6:21 ` Shrikanth Hegde
  2025-11-19 12:53 ` [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  6:21 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: sshegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, iii, huschle, rostedt, dietmar.eggemann,
	christophe.leroy

Arch-specific code can update the mask based on the steal time. For
debugging, it is desirable to override the arch logic. Do that with this
helper patch.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 arch/powerpc/platforms/pseries/lpar.c | 3 +++
 drivers/base/cpu.c                    | 2 ++
 include/linux/sched.h                 | 4 ++++
 kernel/sched/core.c                   | 1 +
 4 files changed, 10 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 090e5c48243b..04bc75e22e7b 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -681,6 +681,9 @@ static void process_steal(int cpu)
 	unsigned long steal = 0;
 	unsigned int i;
 
+	if (static_branch_unlikely(&disable_arch_paravirt_handling))
+		return;
+
 	if (!should_cpu_process_steal(cpu))
 		return;
 
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 043e4f4ce1a9..fbaddbfe0b01 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -402,7 +402,9 @@ static ssize_t store_paravirt_cpus(struct device *dev,
 	/* No more paravirt cpus */
 	if (cpumask_empty(temp_mask)) {
 		cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
+		static_branch_disable(&disable_arch_paravirt_handling);
 	} else {
+		static_branch_enable(&disable_arch_paravirt_handling);
 		cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
 
 		/* Enable tick on nohz_full cpu */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3628edd1468b..1afa5dd5b0ae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2427,4 +2427,8 @@ extern void migrate_enable(void);
 
 DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
 
+#ifdef CONFIG_PARAVIRT
+DECLARE_STATIC_KEY_FALSE(disable_arch_paravirt_handling);
+#endif
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 65c247c24191..b65a9898c694 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10873,6 +10873,7 @@ void sched_change_end(struct sched_change_ctx *ctx)
 #ifdef CONFIG_PARAVIRT
 struct cpumask __cpu_paravirt_mask __read_mostly;
 EXPORT_SYMBOL(__cpu_paravirt_mask);
+DEFINE_STATIC_KEY_FALSE(disable_arch_paravirt_handling);
 
 static DEFINE_PER_CPU(struct cpu_stop_work, pv_push_task_work);
 
-- 
2.47.3




* Re: [HELPER PATCH 1] sysfs: Provide write method for paravirt
  2025-11-19  6:20 ` [HELPER PATCH 1] sysfs: Provide write method for paravirt Shrikanth Hegde
@ 2025-11-19  7:42   ` Greg KH
  2025-11-19  8:08     ` Shrikanth Hegde
  0 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2025-11-19  7:42 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
	vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
	seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
	dietmar.eggemann, christophe.leroy

On Wed, Nov 19, 2025 at 11:50:59AM +0530, Shrikanth Hegde wrote:
> This is helper patch which could be used to set the range of CPUs as
> paravirt. One could make use of this for quick testing of this infra
> instead of writing arch specific code.
> 
> This is currently not meant be merged, since paravirt sysfs file is meant
> to be Read-Only.
> 
> echo 100-200,600-700 >  /sys/devices/system/cpu/paravirt
> cat /sys/devices/system/cpu/paravirt
> 100-200,600-700
> 
> echo > /sys/devices/system/cpu/paravirt
> cat /sys/devices/system/cpu/paravirt
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  drivers/base/cpu.c | 48 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> index 59ceae217b22..043e4f4ce1a9 100644
> --- a/drivers/base/cpu.c
> +++ b/drivers/base/cpu.c
> @@ -375,12 +375,57 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
>  #endif
>  
>  #ifdef CONFIG_PARAVIRT
> +static ssize_t store_paravirt_cpus(struct device *dev,
> +				   struct device_attribute *attr,
> +				   const char *buf, size_t count)
> +{
> +	cpumask_var_t temp_mask;
> +	int retval = 0;
> +
> +	if (!alloc_cpumask_var(&temp_mask, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	retval = cpulist_parse(buf, temp_mask);
> +	if (retval)
> +		goto free_mask;
> +
> +	/* ALL cpus can't be marked as paravirt */
> +	if (cpumask_equal(temp_mask, cpu_online_mask)) {
> +		retval = -EINVAL;
> +		goto free_mask;
> +	}
> +	if (cpumask_weight(temp_mask) > num_online_cpus()) {
> +		retval = -EINVAL;
> +		goto free_mask;
> +	}
> +
> +	/* No more paravirt cpus */
> +	if (cpumask_empty(temp_mask)) {
> +		cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
> +	} else {
> +		cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
> +
> +		/* Enable tick on nohz_full cpu */
> +		int cpu;
> +		for_each_cpu(cpu, temp_mask) {
> +			if (tick_nohz_full_cpu(cpu))
> +				tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
> +		}
> +	}
> +
> +	retval = count;
> +
> +free_mask:
> +	free_cpumask_var(temp_mask);
> +	return retval;
> +}
> +
>  static ssize_t print_paravirt_cpus(struct device *dev,
>  				   struct device_attribute *attr, char *buf)
>  {
>  	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_paravirt_mask));
>  }
> -static DEVICE_ATTR(paravirt, 0444, print_paravirt_cpus, NULL);
> +static DEVICE_ATTR(paravirt, 0644, print_paravirt_cpus, store_paravirt_cpus);

DEVICE_ATTR_RW()?

And where is the documentation update for this sysfs file change?

>  #endif
>  
>  const struct bus_type cpu_subsys = {
> @@ -675,7 +720,6 @@ static void __init cpu_register_vulnerabilities(void)
>  		put_device(dev);
>  	}
>  }
> -
>  #else

Why is this change needed?

thanks,

greg k-h



* Re: [HELPER PATCH 1] sysfs: Provide write method for paravirt
  2025-11-19  7:42   ` Greg KH
@ 2025-11-19  8:08     ` Shrikanth Hegde
  2025-11-19  8:20       ` Christophe Leroy
  2025-11-19  8:23       ` Greg KH
  0 siblings, 2 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  8:08 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
	vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
	seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
	dietmar.eggemann, christophe.leroy

Hi Greg.

On 11/19/25 1:12 PM, Greg KH wrote:
> On Wed, Nov 19, 2025 at 11:50:59AM +0530, Shrikanth Hegde wrote:
>> This is helper patch which could be used to set the range of CPUs as
>> paravirt. One could make use of this for quick testing of this infra
>> instead of writing arch specific code.
>>
>> This is currently not meant be merged, since paravirt sysfs file is meant
>> to be Read-Only.
>>
>> echo 100-200,600-700 >  /sys/devices/system/cpu/paravirt
>> cat /sys/devices/system/cpu/paravirt
>> 100-200,600-700
>>
>> echo > /sys/devices/system/cpu/paravirt
>> cat /sys/devices/system/cpu/paravirt
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   drivers/base/cpu.c | 48 ++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 46 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
>> index 59ceae217b22..043e4f4ce1a9 100644
>> --- a/drivers/base/cpu.c
>> +++ b/drivers/base/cpu.c
>> @@ -375,12 +375,57 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
>>   #endif
>>   
>>   #ifdef CONFIG_PARAVIRT
>> +static ssize_t store_paravirt_cpus(struct device *dev,
>> +				   struct device_attribute *attr,
>> +				   const char *buf, size_t count)
>> +{
>> +	cpumask_var_t temp_mask;
>> +	int retval = 0;
>> +
>> +	if (!alloc_cpumask_var(&temp_mask, GFP_KERNEL))
>> +		return -ENOMEM;
>> +
>> +	retval = cpulist_parse(buf, temp_mask);
>> +	if (retval)
>> +		goto free_mask;
>> +
>> +	/* ALL cpus can't be marked as paravirt */
>> +	if (cpumask_equal(temp_mask, cpu_online_mask)) {
>> +		retval = -EINVAL;
>> +		goto free_mask;
>> +	}
>> +	if (cpumask_weight(temp_mask) > num_online_cpus()) {
>> +		retval = -EINVAL;
>> +		goto free_mask;
>> +	}
>> +
>> +	/* No more paravirt cpus */
>> +	if (cpumask_empty(temp_mask)) {
>> +		cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
>> +	} else {
>> +		cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
>> +
>> +		/* Enable tick on nohz_full cpu */
>> +		int cpu;
>> +		for_each_cpu(cpu, temp_mask) {
>> +			if (tick_nohz_full_cpu(cpu))
>> +				tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
>> +		}
>> +	}
>> +
>> +	retval = count;
>> +
>> +free_mask:
>> +	free_cpumask_var(temp_mask);
>> +	return retval;
>> +}
>> +
>>   static ssize_t print_paravirt_cpus(struct device *dev,
>>   				   struct device_attribute *attr, char *buf)
>>   {
>>   	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_paravirt_mask));
>>   }
>> -static DEVICE_ATTR(paravirt, 0444, print_paravirt_cpus, NULL);
>> +static DEVICE_ATTR(paravirt, 0644, print_paravirt_cpus, store_paravirt_cpus);
> 
> DEVICE_ATTR_RW()?

ok.
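
For context on what the switch involves: in the kernel, DEVICE_ATTR_RW(name)
takes only the attribute name, fixes the mode at 0644, and expects handlers
named name_show and name_store, so print_paravirt_cpus/store_paravirt_cpus
would need to become paravirt_show/paravirt_store. The userspace sketch below
only imitates that token-pasting convention. It is not the kernel definition:
struct attr_sketch, ATTR_RW_SKETCH and the simplified handler signatures are
hypothetical names made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct device_attribute (assumption). */
struct attr_sketch {
	const char *name;
	unsigned int mode;
	int (*show)(char *buf, size_t len);
	int (*store)(const char *buf, size_t len);
};

/*
 * Imitation of the naming convention: the macro receives only the attribute
 * name, pastes "_show" and "_store" onto it, and hard-codes mode 0644.
 */
#define ATTR_RW_SKETCH(_name)				\
	struct attr_sketch attr_##_name = {		\
		.name = #_name, .mode = 0644,		\
		.show = _name##_show,			\
		.store = _name##_store,			\
	}

/* Backing storage for the toy attribute. */
static char paravirt_value[64] = "";

/* Handler names must follow the <name>_show / <name>_store pattern. */
static int paravirt_show(char *buf, size_t len)
{
	return snprintf(buf, len, "%s\n", paravirt_value);
}

static int paravirt_store(const char *buf, size_t len)
{
	size_t n = len < sizeof(paravirt_value) - 1 ?
		   len : sizeof(paravirt_value) - 1;

	memcpy(paravirt_value, buf, n);
	paravirt_value[n] = '\0';
	/* Drop the trailing newline that echo appends. */
	paravirt_value[strcspn(paravirt_value, "\n")] = '\0';
	return (int)len;
}

/* Expands to: static struct attr_sketch attr_paravirt = { ... }; */
static ATTR_RW_SKETCH(paravirt);
```

With the explicit DEVICE_ATTR() form used in the patch, the handler names and
the mode are free-form; the _RW form trades that freedom for a fixed,
greppable convention.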

> 
> And where is the documentation update for this sysfs file change?
> 

[RFC PATCH v4 11/17] has the documentation of this sysfs file.
https://lore.kernel.org/all/20251119062100.1112520-12-sshegde@linux.ibm.com/

>>   #endif
>>   
>>   const struct bus_type cpu_subsys = {
>> @@ -675,7 +720,6 @@ static void __init cpu_register_vulnerabilities(void)
>>   		put_device(dev);
>>   	}
>>   }
>> -
>>   #else
> 
> Why is this change needed?
> 
> thanks,
> 
> greg k-h

This is a helper patch. It makes it possible to verify the behaviour with any
combination of CPUs marked as paravirt, which helped me test some corner cases.

It is also helpful until the arch-specific hint becomes better.

It is also useful for other archs which haven't implemented arch-specific handling
of steal time, but want to play around with the series for their use case (e.g. S390).

Once the arch-specific hint becomes better, we could decide to remove it or keep it
in a more appropriate place. It really is a debugfs-style knob for the infra which
says "I don't want to use these CPUs for now".
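
To try out which inputs the helper would accept without booting a patched
kernel, the checks in store_paravirt_cpus() can be approximated in userspace.
This is only a sketch under stated assumptions: parse_cpulist(),
paravirt_mask_ok() and popcount64() are hypothetical names, the mask is a
single 64-bit word, and the parser handles only plain numbers and "a-b" ranges
rather than everything the kernel's cpulist_parse() accepts.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_CPUS_SKETCH 64	/* assumption: mask fits one 64-bit word */

/*
 * Parse a list such as "1-3,8" into a bitmask.
 * Returns 0 on success, -1 on malformed input or an out-of-range CPU.
 * On failure *mask is left untouched. An empty string or a lone
 * newline (what "echo >" writes) yields an empty mask.
 */
static int parse_cpulist(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	while (*s && *s != '\n') {
		char *end;
		unsigned long lo, hi;

		if (*s < '0' || *s > '9')
			return -1;
		lo = strtoul(s, &end, 10);
		hi = lo;
		if (*end == '-') {
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= NR_CPUS_SKETCH)
			return -1;
		for (unsigned long c = lo; c <= hi; c++)
			m |= UINT64_C(1) << c;
		s = end;
		if (*s == ',')
			s++;
		else if (*s && *s != '\n')
			return -1;
	}
	*mask = m;
	return 0;
}

/* Count set bits (Kernighan's method). */
static int popcount64(uint64_t v)
{
	int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * Mirror the two checks in the helper patch: refuse a mask equal to the
 * online mask, and refuse one with more bits set than there are online CPUs.
 */
static bool paravirt_mask_ok(uint64_t req, uint64_t online)
{
	if (req == online)
		return false;
	if (popcount64(req) > popcount64(online))
		return false;
	return true;
}
```

For example, with CPUs 0-3 online (mask 0xF), a request for "1-2" passes,
a request for "0-3" is refused, and an empty write clears the mask.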



* Re: [HELPER PATCH 1] sysfs: Provide write method for paravirt
  2025-11-19  8:08     ` Shrikanth Hegde
@ 2025-11-19  8:20       ` Christophe Leroy
  2025-11-19 10:01         ` Shrikanth Hegde
  2025-11-19  8:23       ` Greg KH
  1 sibling, 1 reply; 25+ messages in thread
From: Christophe Leroy @ 2025-11-19  8:20 UTC (permalink / raw)
  To: Shrikanth Hegde, Greg KH
  Cc: linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
	vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
	seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
	dietmar.eggemann



Le 19/11/2025 à 09:08, Shrikanth Hegde a écrit :
> Hi Greg.
> 
> On 11/19/25 1:12 PM, Greg KH wrote:
>> On Wed, Nov 19, 2025 at 11:50:59AM +0530, Shrikanth Hegde wrote:
>>> This is helper patch which could be used to set the range of CPUs as
>>> paravirt. One could make use of this for quick testing of this infra
>>> instead of writing arch specific code.
>>>
>>> This is currently not meant be merged, since paravirt sysfs file is 
>>> meant
>>> to be Read-Only.
>>>
>>> echo 100-200,600-700 >  /sys/devices/system/cpu/paravirt
>>> cat /sys/devices/system/cpu/paravirt
>>> 100-200,600-700
>>>
>>> echo > /sys/devices/system/cpu/paravirt
>>> cat /sys/devices/system/cpu/paravirt
>>>
>>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>> ---
>>>   drivers/base/cpu.c | 48 ++++++++++++++++++++++++++++++++++++++++++++--
>>>   1 file changed, 46 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
>>> index 59ceae217b22..043e4f4ce1a9 100644
>>> --- a/drivers/base/cpu.c
>>> +++ b/drivers/base/cpu.c
>>> @@ -375,12 +375,57 @@ static int cpu_uevent(const struct device *dev, 
>>> struct kobj_uevent_env *env)
>>>   #endif
>>>   #ifdef CONFIG_PARAVIRT
>>> +static ssize_t store_paravirt_cpus(struct device *dev,
>>> +                   struct device_attribute *attr,
>>> +                   const char *buf, size_t count)
>>> +{
>>> +    cpumask_var_t temp_mask;
>>> +    int retval = 0;
>>> +
>>> +    if (!alloc_cpumask_var(&temp_mask, GFP_KERNEL))
>>> +        return -ENOMEM;
>>> +
>>> +    retval = cpulist_parse(buf, temp_mask);
>>> +    if (retval)
>>> +        goto free_mask;
>>> +
>>> +    /* ALL cpus can't be marked as paravirt */
>>> +    if (cpumask_equal(temp_mask, cpu_online_mask)) {
>>> +        retval = -EINVAL;
>>> +        goto free_mask;
>>> +    }
>>> +    if (cpumask_weight(temp_mask) > num_online_cpus()) {
>>> +        retval = -EINVAL;
>>> +        goto free_mask;
>>> +    }
>>> +
>>> +    /* No more paravirt cpus */
>>> +    if (cpumask_empty(temp_mask)) {
>>> +        cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, 
>>> temp_mask);
>>> +    } else {
>>> +        cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, 
>>> temp_mask);
>>> +
>>> +        /* Enable tick on nohz_full cpu */
>>> +        int cpu;
>>> +        for_each_cpu(cpu, temp_mask) {
>>> +            if (tick_nohz_full_cpu(cpu))
>>> +                tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
>>> +        }
>>> +    }
>>> +
>>> +    retval = count;
>>> +
>>> +free_mask:
>>> +    free_cpumask_var(temp_mask);
>>> +    return retval;
>>> +}
>>> +
>>>   static ssize_t print_paravirt_cpus(struct device *dev,
>>>                      struct device_attribute *attr, char *buf)
>>>   {
>>>       return sysfs_emit(buf, "%*pbl\n", 
>>> cpumask_pr_args(cpu_paravirt_mask));
>>>   }
>>> -static DEVICE_ATTR(paravirt, 0444, print_paravirt_cpus, NULL);
>>> +static DEVICE_ATTR(paravirt, 0644, print_paravirt_cpus, 
>>> store_paravirt_cpus);
>>
>> DEVICE_ATTR_RW()?
> 
> ok.
> 
>>
>> And where is the documentation update for this sysfs file change?
>>
> 
> [RFC PATCH v4 11/17] has the documentation of this sysfs file.

There is a problem in the way you sent this patch and the other helper 
patch. The cover letter of your series presents them as part of it, but 
in the end the series was sent with only 15 patches out of 17, and the 
last two patches appear as independent of the series:

Series at 
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=482680

Other patches are on their own: 
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=87866

> https://lore.kernel.org/all/20251119062100.1112520-12-sshegde@linux.ibm.com/
> 
>>>   #endif
>>>   const struct bus_type cpu_subsys = {
>>> @@ -675,7 +720,6 @@ static void __init 
>>> cpu_register_vulnerabilities(void)
>>>           put_device(dev);
>>>       }
>>>   }
>>> -
>>>   #else
>>
>> Why is this change needed?
>>
>> thanks,
>>
>> greg k-h
> 
> This is a helper patch. This helps to verify functionality of any 
> combination
> of CPUs being marked as paravirt which helped me to test some corner cases.
> 
> This is also helpful until the arch specific hint becomes better.
> 
> This is also useful for other archs which haven't implemented archs 
> specific handling of
> steal time, but want to play around with series for their usecase (ex: 
> S390)
> 
> Once arch specific hint becomes better, we could decide to remove it or 
> keep in more appropriate
> place. It really is debugfs for infra which says I don't want to use 
> these CPUs for now.




* Re: [HELPER PATCH 1] sysfs: Provide write method for paravirt
  2025-11-19  8:08     ` Shrikanth Hegde
  2025-11-19  8:20       ` Christophe Leroy
@ 2025-11-19  8:23       ` Greg KH
  2025-11-19  9:56         ` Shrikanth Hegde
  1 sibling, 1 reply; 25+ messages in thread
From: Greg KH @ 2025-11-19  8:23 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
	vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
	seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
	dietmar.eggemann, christophe.leroy

On Wed, Nov 19, 2025 at 01:38:24PM +0530, Shrikanth Hegde wrote:
> Hi Greg.
> 
> On 11/19/25 1:12 PM, Greg KH wrote:
> > On Wed, Nov 19, 2025 at 11:50:59AM +0530, Shrikanth Hegde wrote:
> > > This is helper patch which could be used to set the range of CPUs as
> > > paravirt. One could make use of this for quick testing of this infra
> > > instead of writing arch specific code.
> > > 
> > > This is currently not meant be merged, since paravirt sysfs file is meant
> > > to be Read-Only.
> > > 
> > > echo 100-200,600-700 >  /sys/devices/system/cpu/paravirt
> > > cat /sys/devices/system/cpu/paravirt
> > > 100-200,600-700
> > > 
> > > echo > /sys/devices/system/cpu/paravirt
> > > cat /sys/devices/system/cpu/paravirt
> > > 
> > > Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> > > ---
> > >   drivers/base/cpu.c | 48 ++++++++++++++++++++++++++++++++++++++++++++--
> > >   1 file changed, 46 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> > > index 59ceae217b22..043e4f4ce1a9 100644
> > > --- a/drivers/base/cpu.c
> > > +++ b/drivers/base/cpu.c
> > > @@ -375,12 +375,57 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
> > >   #endif
> > >   #ifdef CONFIG_PARAVIRT
> > > +static ssize_t store_paravirt_cpus(struct device *dev,
> > > +				   struct device_attribute *attr,
> > > +				   const char *buf, size_t count)
> > > +{
> > > +	cpumask_var_t temp_mask;
> > > +	int retval = 0;
> > > +
> > > +	if (!alloc_cpumask_var(&temp_mask, GFP_KERNEL))
> > > +		return -ENOMEM;
> > > +
> > > +	retval = cpulist_parse(buf, temp_mask);
> > > +	if (retval)
> > > +		goto free_mask;
> > > +
> > > +	/* ALL cpus can't be marked as paravirt */
> > > +	if (cpumask_equal(temp_mask, cpu_online_mask)) {
> > > +		retval = -EINVAL;
> > > +		goto free_mask;
> > > +	}
> > > +	if (cpumask_weight(temp_mask) > num_online_cpus()) {
> > > +		retval = -EINVAL;
> > > +		goto free_mask;
> > > +	}
> > > +
> > > +	/* No more paravirt cpus */
> > > +	if (cpumask_empty(temp_mask)) {
> > > +		cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
> > > +	} else {
> > > +		cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
> > > +
> > > +		/* Enable tick on nohz_full cpu */
> > > +		int cpu;
> > > +		for_each_cpu(cpu, temp_mask) {
> > > +			if (tick_nohz_full_cpu(cpu))
> > > +				tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
> > > +		}
> > > +	}
> > > +
> > > +	retval = count;
> > > +
> > > +free_mask:
> > > +	free_cpumask_var(temp_mask);
> > > +	return retval;
> > > +}
> > > +
> > >   static ssize_t print_paravirt_cpus(struct device *dev,
> > >   				   struct device_attribute *attr, char *buf)
> > >   {
> > >   	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_paravirt_mask));
> > >   }
> > > -static DEVICE_ATTR(paravirt, 0444, print_paravirt_cpus, NULL);
> > > +static DEVICE_ATTR(paravirt, 0644, print_paravirt_cpus, store_paravirt_cpus);
> > 
> > DEVICE_ATTR_RW()?
> 
> ok.
> 
> > 
> > And where is the documentation update for this sysfs file change?
> > 
> 
> [RFC PATCH v4 11/17] has the documentation of this sysfs file.
> https://lore.kernel.org/all/20251119062100.1112520-12-sshegde@linux.ibm.com/

So an RFC patch has the documentation for a change that you don't want to
have applied?  This is an odd series; how are we supposed to review
this?

> This is a helper patch. This helps to verify functionality of any combination
> of CPUs being marked as paravirt which helped me to test some corner cases.

I don't think I have ever seen a "helper patch" before, so I don't know
what to do with it :(

thanks,

greg k-h



* Re: [HELPER PATCH 1] sysfs: Provide write method for paravirt
  2025-11-19  8:23       ` Greg KH
@ 2025-11-19  9:56         ` Shrikanth Hegde
  0 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19  9:56 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
	vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
	seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
	dietmar.eggemann, christophe.leroy



On 11/19/25 1:53 PM, Greg KH wrote:
> On Wed, Nov 19, 2025 at 01:38:24PM +0530, Shrikanth Hegde wrote:
>> Hi Greg.
>>
>> On 11/19/25 1:12 PM, Greg KH wrote:
>>> On Wed, Nov 19, 2025 at 11:50:59AM +0530, Shrikanth Hegde wrote:
>>>> This is helper patch which could be used to set the range of CPUs as
>>>> paravirt. One could make use of this for quick testing of this infra
>>>> instead of writing arch specific code.
>>>>
>>>> This is currently not meant be merged, since paravirt sysfs file is meant
>>>> to be Read-Only.
>>>>
>>>> echo 100-200,600-700 >  /sys/devices/system/cpu/paravirt
>>>> cat /sys/devices/system/cpu/paravirt
>>>> 100-200,600-700
>>>>
>>>> echo > /sys/devices/system/cpu/paravirt
>>>> cat /sys/devices/system/cpu/paravirt
>>>>
>>>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>>> ---
>>>>    drivers/base/cpu.c | 48 ++++++++++++++++++++++++++++++++++++++++++++--
>>>>    1 file changed, 46 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
>>>> index 59ceae217b22..043e4f4ce1a9 100644
>>>> --- a/drivers/base/cpu.c
>>>> +++ b/drivers/base/cpu.c
>>>> @@ -375,12 +375,57 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
>>>>    #endif
>>>>    #ifdef CONFIG_PARAVIRT
>>>> +static ssize_t store_paravirt_cpus(struct device *dev,
>>>> +				   struct device_attribute *attr,
>>>> +				   const char *buf, size_t count)
>>>> +{
>>>> +	cpumask_var_t temp_mask;
>>>> +	int retval = 0;
>>>> +
>>>> +	if (!alloc_cpumask_var(&temp_mask, GFP_KERNEL))
>>>> +		return -ENOMEM;
>>>> +
>>>> +	retval = cpulist_parse(buf, temp_mask);
>>>> +	if (retval)
>>>> +		goto free_mask;
>>>> +
>>>> +	/* ALL cpus can't be marked as paravirt */
>>>> +	if (cpumask_equal(temp_mask, cpu_online_mask)) {
>>>> +		retval = -EINVAL;
>>>> +		goto free_mask;
>>>> +	}
>>>> +	if (cpumask_weight(temp_mask) > num_online_cpus()) {
>>>> +		retval = -EINVAL;
>>>> +		goto free_mask;
>>>> +	}
>>>> +
>>>> +	/* No more paravirt cpus */
>>>> +	if (cpumask_empty(temp_mask)) {
>>>> +		cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
>>>> +	} else {
>>>> +		cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
>>>> +
>>>> +		/* Enable tick on nohz_full cpu */
>>>> +		int cpu;
>>>> +		for_each_cpu(cpu, temp_mask) {
>>>> +			if (tick_nohz_full_cpu(cpu))
>>>> +				tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
>>>> +		}
>>>> +	}
>>>> +
>>>> +	retval = count;
>>>> +
>>>> +free_mask:
>>>> +	free_cpumask_var(temp_mask);
>>>> +	return retval;
>>>> +}
>>>> +
>>>>    static ssize_t print_paravirt_cpus(struct device *dev,
>>>>    				   struct device_attribute *attr, char *buf)
>>>>    {
>>>>    	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_paravirt_mask));
>>>>    }
>>>> -static DEVICE_ATTR(paravirt, 0444, print_paravirt_cpus, NULL);
>>>> +static DEVICE_ATTR(paravirt, 0644, print_paravirt_cpus, store_paravirt_cpus);
>>>
>>> DEVICE_ATTR_RW()?
>>
>> ok.
>>
>>>
>>> And where is the documentation update for this sysfs file change?
>>>
>>
>> [RFC PATCH v4 11/17] has the documentation of this sysfs file.
>> https://lore.kernel.org/all/20251119062100.1112520-12-sshegde@linux.ibm.com/
> 
> So a rfc patch has the documentation for a change that you don't want to
> have applied?  This is an odd series, how are we supposed to review
> this?

I documented the sysfs file as read-only, since that is what the series itself
provides. The last two patches are debug patches, so I didn't update the
documentation to say the file can also be written. I hope this clears up the doubts.

> 
>> This is a helper patch. This helps to verify functionality of any combination
>> of CPUs being marked as paravirt which helped me to test some corner cases.
> 
> I don't think I have ever seen a "helper patch" to know what to do with
> it :(
> 

Sorry for the confusion with the name.

All I wanted to say was that it is a debug patch one could use.
Would [RFC PATCH 16/17][DEBUG] have been a better name?




* Re: [HELPER PATCH 1] sysfs: Provide write method for paravirt
  2025-11-19  8:20       ` Christophe Leroy
@ 2025-11-19 10:01         ` Shrikanth Hegde
  0 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 10:01 UTC (permalink / raw)
  To: Christophe Leroy, Greg KH
  Cc: linux-kernel, linuxppc-dev, mingo, peterz, juri.lelli,
	vincent.guittot, tglx, yury.norov, maddy, srikar, pbonzini,
	seanjc, kprateek.nayak, vschneid, iii, huschle, rostedt,
	dietmar.eggemann



On 11/19/25 1:50 PM, Christophe Leroy wrote:
> 
> 
> Le 19/11/2025 à 09:08, Shrikanth Hegde a écrit :
>> Hi Greg.
>>
>> On 11/19/25 1:12 PM, Greg KH wrote:
>>> On Wed, Nov 19, 2025 at 11:50:59AM +0530, Shrikanth Hegde wrote:
>>>> This is helper patch which could be used to set the range of CPUs as
>>>> paravirt. One could make use of this for quick testing of this infra
>>>> instead of writing arch specific code.
>>>>
>>>> This is currently not meant be merged, since paravirt sysfs file is 
>>>> meant
>>>> to be Read-Only.
>>>>
>>>> echo 100-200,600-700 >  /sys/devices/system/cpu/paravirt
>>>> cat /sys/devices/system/cpu/paravirt
>>>> 100-200,600-700
>>>>
>>>> echo > /sys/devices/system/cpu/paravirt
>>>> cat /sys/devices/system/cpu/paravirt
>>>>
>>>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>>> ---
>>>>   drivers/base/cpu.c | 48 ++++++++++++++++++++++++++++++++++++++++++ 
>>>> ++--
>>>>   1 file changed, 46 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
>>>> index 59ceae217b22..043e4f4ce1a9 100644
>>>> --- a/drivers/base/cpu.c
>>>> +++ b/drivers/base/cpu.c
>>>> @@ -375,12 +375,57 @@ static int cpu_uevent(const struct device 
>>>> *dev, struct kobj_uevent_env *env)
>>>>   #endif
>>>>   #ifdef CONFIG_PARAVIRT
>>>> +static ssize_t store_paravirt_cpus(struct device *dev,
>>>> +                   struct device_attribute *attr,
>>>> +                   const char *buf, size_t count)
>>>> +{
>>>> +    cpumask_var_t temp_mask;
>>>> +    int retval = 0;
>>>> +
>>>> +    if (!alloc_cpumask_var(&temp_mask, GFP_KERNEL))
>>>> +        return -ENOMEM;
>>>> +
>>>> +    retval = cpulist_parse(buf, temp_mask);
>>>> +    if (retval)
>>>> +        goto free_mask;
>>>> +
>>>> +    /* ALL cpus can't be marked as paravirt */
>>>> +    if (cpumask_equal(temp_mask, cpu_online_mask)) {
>>>> +        retval = -EINVAL;
>>>> +        goto free_mask;
>>>> +    }
>>>> +    if (cpumask_weight(temp_mask) > num_online_cpus()) {
>>>> +        retval = -EINVAL;
>>>> +        goto free_mask;
>>>> +    }
>>>> +
>>>> +    /* No more paravirt cpus */
>>>> +    if (cpumask_empty(temp_mask)) {
>>>> +        cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, 
>>>> temp_mask);
>>>> +    } else {
>>>> +        cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, 
>>>> temp_mask);
>>>> +
>>>> +        /* Enable tick on nohz_full cpu */
>>>> +        int cpu;
>>>> +        for_each_cpu(cpu, temp_mask) {
>>>> +            if (tick_nohz_full_cpu(cpu))
>>>> +                tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    retval = count;
>>>> +
>>>> +free_mask:
>>>> +    free_cpumask_var(temp_mask);
>>>> +    return retval;
>>>> +}
>>>> +
>>>>   static ssize_t print_paravirt_cpus(struct device *dev,
>>>>                      struct device_attribute *attr, char *buf)
>>>>   {
>>>>       return sysfs_emit(buf, "%*pbl\n", 
>>>> cpumask_pr_args(cpu_paravirt_mask));
>>>>   }
>>>> -static DEVICE_ATTR(paravirt, 0444, print_paravirt_cpus, NULL);
>>>> +static DEVICE_ATTR(paravirt, 0644, print_paravirt_cpus, 
>>>> store_paravirt_cpus);
>>>
>>> DEVICE_ATTR_RW()?
>>
>> ok.
>>
>>>
>>> And where is the documentation update for this sysfs file change?
>>>
>>
>> [RFC PATCH v4 11/17] has the documentation of this sysfs file.
> 
> There is a problem in the way you sent this patch and the other helper 
> patch. They appear in the cover letter of your series are part of it but 
> at the end the series is only sent with 15 patches over 17, and the last 
> two patches appear as independent from the series:
> 
> Series at
> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=482680
> 
> Other patches are on their own:
> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=87866
> 

I edited the patch header before sending. I wanted to say they are debug patches
and thought "helper" might be a suitable name. My bad.

I didn't realize that they could be seen as separate patches :(
So sorry. I thought they would come up as part of the series thread, like they showed up in
https://lore.kernel.org/all/20251119062100.1112520-12-sshegde@linux.ibm.com/#r

Should I resend the series?



* Re: [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption
  2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
                   ` (16 preceding siblings ...)
  2025-11-19  6:21 ` [HELPER PATCH 2] helper: disable arch handling if paravirt file being written Shrikanth Hegde
@ 2025-11-19 12:53 ` Shrikanth Hegde
  17 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-19 12:53 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, gregkh, christophe.leroy
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy, srikar, pbonzini, seanjc, kprateek.nayak, vschneid, iii,
	huschle, rostedt, dietmar.eggemann



On 11/19/25 11:50 AM, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices were
> discussed earlier[1].
> 
> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
> 
> This is likely the version which would be used for LPC2025 discussion on
> this topic. Feel free to provide your suggestion and hoping for a solution
> that works for different architectures and it's use cases.
> 
> All the existing alternatives such as cpu hotplug, creating isolated
> partitions etc break the user affinity. Since number of CPUs to use change
> depending on the steal time, it is not driven by User. Hence it would be
> wrong to break the affinity. This series allows if the task is pinned
> only paravirt CPUs, it will continue running there.
> 
> Changes compared v3[1]:
> 
> - Introduced computation of steal time in powerpc code.
> - Derive number of CPUs to use and mark the remaining as paravirt based
>    on steal values.
> - Provide debugfs knobs to alter how steal time values being used.
> - Removed static key check for paravirt CPUs (Yury)
> - Removed preempt_disable/enable while calling stopper (Prateek)
> - Made select_idle_sibling and friends aware of paravirt CPUs.
> - Removed 3 unused schedstat fields and introduced 2 related to paravirt
>    handling.
> - Handled nohz_full case by enabling tick on it when there is CFS/RT on
>    it.
> - Updated helper patch to override arch behaviour for easier debugging
>    during development.

Sorry for creating confusion around the last two patches. I have sent out a new version:

https://lore.kernel.org/all/20251119124449.1149616-1-sshegde@linux.ibm.com/



Thread overview: 25+ messages
2025-11-19  6:20 [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 02/17] cpumask: Introduce cpu_paravirt_mask Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 03/17] sched/core: Dont allow to use CPU marked as paravirt Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 04/17] sched/debug: Remove unused schedstats Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v5 05/17] sched/fair: Add paravirt movements for proc sched file Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 06/17] sched/fair: Pass current cpu in select_idle_sibling Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 07/17] sched/fair: Don't consider paravirt CPUs for wakeup and load balance Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 08/17] sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 09/17] sched/core: Add support for nohz_full CPUs Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 10/17] sched/core: Push current task from paravirt CPU Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 11/17] sysfs: Add paravirt CPU file Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 12/17] powerpc: method to initialize ec and vp cores Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 13/17] powerpc: enable/disable paravirt CPUs based on steal time Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 14/17] powerpc: process steal values at fixed intervals Shrikanth Hegde
2025-11-19  6:20 ` [RFC PATCH v4 15/17] powerpc: add debugfs file for controlling handling on steal values Shrikanth Hegde
2025-11-19  6:20 ` [HELPER PATCH 1] sysfs: Provide write method for paravirt Shrikanth Hegde
2025-11-19  7:42   ` Greg KH
2025-11-19  8:08     ` Shrikanth Hegde
2025-11-19  8:20       ` Christophe Leroy
2025-11-19 10:01         ` Shrikanth Hegde
2025-11-19  8:23       ` Greg KH
2025-11-19  9:56         ` Shrikanth Hegde
2025-11-19  6:21 ` [HELPER PATCH 2] helper: disable arch handling if paravirt file being written Shrikanth Hegde
2025-11-19 12:53 ` [RFC PATCH v4 00/17] Paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
