* [RFC PATCH v3 01/10] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
2025-09-10 17:42 [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
@ 2025-09-10 17:42 ` Shrikanth Hegde
2025-09-10 17:42 ` [RFC PATCH v3 02/10] cpumask: Introduce cpu_paravirt_mask Shrikanth Hegde
` (9 subsequent siblings)
10 siblings, 0 replies; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-10 17:42 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh
Cc: sshegde, vschneid, iii, huschle, rostedt, dietmar.eggemann,
vineeth, jgross, pbonzini, seanjc
Add documentation for a new cpumask called cpu_paravirt_mask. This should
help users understand what this mask is and the concept behind it.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
Documentation/scheduler/sched-arch.rst | 37 ++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..e665d4a20e91 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,43 @@ Your cpu_idle routines need to obey the following rules:
arch/x86/kernel/process.c has examples of both polling and
sleeping idle functions.
+Paravirt CPUs
+=============
+
+In virtualised environments it is possible to overcommit CPU resources,
+i.e. the sum of virtual CPUs (vCPUs) across all VMs is greater than the
+number of physical CPUs (pCPUs). Under such conditions, when all or many
+VMs have high utilization, the hypervisor cannot satisfy the CPU demand
+and has to context switch within or across VMs, i.e. the hypervisor needs
+to preempt one vCPU to run another. This is called vCPU preemption and is
+more expensive than a task context switch within a vCPU.
+
+In such cases it is better for the VMs to co-ordinate among themselves and
+ask for less CPU by not using some of their vCPUs. Such vCPUs, which the
+workload should avoid at the moment to reduce vCPU preemption, are called
+"Paravirt CPUs". Note that when the pCPU contention goes away, these vCPUs
+can be used again by the workload.
+
+The architecture needs to set/clear the specific vCPU in cpu_paravirt_mask.
+When set, that vCPU is avoided; when clear, it is used as usual.
+
+The scheduler will try to avoid paravirt CPUs as much as it can.
+This is achieved by:
+1. Not selecting a paravirt CPU at wakeup.
+2. Pushing the current task away from a paravirt CPU at tick.
+3. Not selecting paravirt CPUs at load balance.
+
+This works only for SCHED_RT and SCHED_NORMAL. SCHED_EXT and userspace can
+make their own choices using cpu_paravirt_mask.
+
+/sys/devices/system/cpu/paravirt prints the current cpu_paravirt_mask in
+cpulist format.
+
+Notes:
+1. A task pinned only to paravirt CPUs will continue to run there.
+2. This feature is available only under CONFIG_PARAVIRT.
+3. Runtime checks are guarded by a static key for minimal overhead
+   when there are no paravirt CPUs.
Possible arch/ problems
=======================
--
2.47.3
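For illustration, a minimal userspace sketch of consuming the interface
documented above (the sysfs file itself is added later in this series by
patch 08; error handling kept minimal):

	#include <stdio.h>

	int main(void)
	{
		char buf[256];
		FILE *f = fopen("/sys/devices/system/cpu/paravirt", "r");

		if (!f)
			return 1;
		/* Prints the current paravirt CPUs in cpulist format, e.g. "600-710" */
		if (fgets(buf, sizeof(buf), f))
			printf("paravirt cpus: %s", buf);
		fclose(f);
		return 0;
	}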
* [RFC PATCH v3 02/10] cpumask: Introduce cpu_paravirt_mask
2025-09-10 17:42 [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
2025-09-10 17:42 ` [RFC PATCH v3 01/10] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept Shrikanth Hegde
@ 2025-09-10 17:42 ` Shrikanth Hegde
2025-09-10 17:42 ` [RFC PATCH v3 03/10] sched: Static key to check paravirt cpu push Shrikanth Hegde
` (8 subsequent siblings)
10 siblings, 0 replies; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-10 17:42 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh
Cc: sshegde, vschneid, iii, huschle, rostedt, dietmar.eggemann,
vineeth, jgross, pbonzini, seanjc
This patch does:
- Declare and define cpu_paravirt_mask.
- Add get/set helpers for it.
It is not declared next to the existing masks since that would cause too
many ifdefs. It is still kept in cpumask.h instead of sched.h
so that any interested user can see it when looking at the other
available cpumasks.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
include/linux/cpumask.h | 15 +++++++++++++++
kernel/sched/core.c | 5 +++++
2 files changed, 20 insertions(+)
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index ff8f41ab7ce6..afbc2ca5c1b7 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -1270,6 +1270,21 @@ static __always_inline bool cpu_dying(unsigned int cpu)
#endif /* NR_CPUS > 1 */
+/*
+ * All related wrappers kept together to avoid too many ifdefs
+ * See Documentation/scheduler/sched-arch.rst for details
+ */
+#ifdef CONFIG_PARAVIRT
+extern struct cpumask __cpu_paravirt_mask;
+#define cpu_paravirt_mask ((const struct cpumask *)&__cpu_paravirt_mask)
+#define set_cpu_paravirt(cpu, paravirt) assign_cpu((cpu), &__cpu_paravirt_mask, (paravirt))
+
+static __always_inline bool cpu_paravirt(unsigned int cpu)
+{
+ return cpumask_test_cpu(cpu, cpu_paravirt_mask);
+}
+#endif
+
#define cpu_is_offline(cpu) unlikely(!cpu_online(cpu))
#if NR_CPUS <= BITS_PER_LONG
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index feb750aae71b..0f1e36bb5779 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10963,3 +10963,8 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
set_next_task(rq, ctx->p);
}
#endif /* CONFIG_SCHED_CLASS_EXT */
+
+#ifdef CONFIG_PARAVIRT
+struct cpumask __cpu_paravirt_mask __read_mostly;
+EXPORT_SYMBOL(__cpu_paravirt_mask);
+#endif
--
2.47.3
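A minimal sketch of how a kernel-side consumer could use these helpers,
assuming CONFIG_PARAVIRT=y (the function below is hypothetical and only
for illustration):

	#include <linux/cpumask.h>

	/* Count CPUs that are online and currently usable, i.e. not paravirt */
	static unsigned int nr_usable_cpus(void)
	{
		unsigned int cpu, nr = 0;

		for_each_online_cpu(cpu)
			if (!cpu_paravirt(cpu))
				nr++;

		return nr;
	}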
* [RFC PATCH v3 03/10] sched: Static key to check paravirt cpu push
2025-09-10 17:42 [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
2025-09-10 17:42 ` [RFC PATCH v3 01/10] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept Shrikanth Hegde
2025-09-10 17:42 ` [RFC PATCH v3 02/10] cpumask: Introduce cpu_paravirt_mask Shrikanth Hegde
@ 2025-09-10 17:42 ` Shrikanth Hegde
2025-09-11 1:53 ` Yury Norov
2025-09-10 17:42 ` [RFC PATCH v3 04/10] sched/core: Dont allow to use CPU marked as paravirt Shrikanth Hegde
` (7 subsequent siblings)
10 siblings, 1 reply; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-10 17:42 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh
Cc: sshegde, vschneid, iii, huschle, rostedt, dietmar.eggemann,
vineeth, jgross, pbonzini, seanjc
CPUs are marked paravirt when there is contention for the underlying
physical CPU.
The push mechanism and the check for paravirt CPUs are in the sched tick
and wakeup paths. They should be close to a no-op when there is no need
for them. Achieve that using a static key.
The architecture needs to enable this key when it decides there are
paravirt CPUs, and disable it when there are none.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 17 +++++++++++++++++
2 files changed, 18 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f1e36bb5779..b8a84e4691c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10967,4 +10967,5 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
#ifdef CONFIG_PARAVIRT
struct cpumask __cpu_paravirt_mask __read_mostly;
EXPORT_SYMBOL(__cpu_paravirt_mask);
+DEFINE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b5367c514c14..8f9991453d36 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3880,4 +3880,21 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
#include "ext.h"
+#ifdef CONFIG_PARAVIRT
+DECLARE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
+
+static inline bool is_cpu_paravirt(int cpu)
+{
+ if (static_branch_unlikely(&cpu_paravirt_push_tasks))
+ return cpu_paravirt(cpu);
+
+ return false;
+}
+#else /* !CONFIG_PARAVIRT */
+static inline bool is_cpu_paravirt(int cpu)
+{
+ return false;
+}
+#endif /* !CONFIG_PARAVIRT */
+
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.47.3
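A rough sketch of the arch-side flow the changelog describes, i.e.
updating the mask and flipping the key together (the hook name here is
hypothetical, and the key would need to be made visible to arch code):

	/* Hypothetical arch callback driven by a hypervisor contention hint */
	static void arch_update_paravirt_cpus(const struct cpumask *contended)
	{
		int cpu;

		/* Mark contended vCPUs as paravirt, clear the rest */
		for_each_possible_cpu(cpu)
			set_cpu_paravirt(cpu, cpumask_test_cpu(cpu, contended));

		if (cpumask_empty(cpu_paravirt_mask))
			static_branch_disable(&cpu_paravirt_push_tasks);
		else
			static_branch_enable(&cpu_paravirt_push_tasks);
	}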
* Re: [RFC PATCH v3 03/10] sched: Static key to check paravirt cpu push
2025-09-10 17:42 ` [RFC PATCH v3 03/10] sched: Static key to check paravirt cpu push Shrikanth Hegde
@ 2025-09-11 1:53 ` Yury Norov
2025-09-11 14:37 ` Shrikanth Hegde
0 siblings, 1 reply; 33+ messages in thread
From: Yury Norov @ 2025-09-11 1:53 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy,
linux-kernel, linuxppc-dev, gregkh, vschneid, iii, huschle,
rostedt, dietmar.eggemann, vineeth, jgross, pbonzini, seanjc
On Wed, Sep 10, 2025 at 11:12:03PM +0530, Shrikanth Hegde wrote:
> CPUs are marked paravirt when there is contention for underlying
> physical CPU.
>
> The push mechanism and check for paravirt CPUs are in sched tick
> and wakeup. It should be close to no-op when there is no need for it.
> Achieve that using static key.
>
> Architecture needs to enable this key when it decides there are
> paravirt CPUs. Disable it when there are no paravirt CPUs.
Testing a bit is quite close to a no-op, isn't it? Have you measured
the performance impact that would advocate the static key? Please
share some numbers then. I believe I asked you about it on the previous
round.
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> kernel/sched/core.c | 1 +
> kernel/sched/sched.h | 17 +++++++++++++++++
> 2 files changed, 18 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0f1e36bb5779..b8a84e4691c8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10967,4 +10967,5 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
> #ifdef CONFIG_PARAVIRT
> struct cpumask __cpu_paravirt_mask __read_mostly;
> EXPORT_SYMBOL(__cpu_paravirt_mask);
> +DEFINE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
> #endif
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b5367c514c14..8f9991453d36 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3880,4 +3880,21 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
>
> #include "ext.h"
>
> +#ifdef CONFIG_PARAVIRT
> +DECLARE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
> +
> +static inline bool is_cpu_paravirt(int cpu)
> +{
> + if (static_branch_unlikely(&cpu_paravirt_push_tasks))
> + return cpu_paravirt(cpu);
> +
> + return false;
> +}
So is_cpu_paravirt and cpu_paravirt are exactly the same in terms of
functionality. If you really believe that static branch benefits the
performance, it should go straight to the cpu_paravirt().
> +#else /* !CONFIG_PARAVIRT */
> +static inline bool is_cpu_paravirt(int cpu)
> +{
> + return false;
> +}
> +#endif /* !CONFIG_PARAVIRT */
> +
> #endif /* _KERNEL_SCHED_SCHED_H */
> --
> 2.47.3
* Re: [RFC PATCH v3 03/10] sched: Static key to check paravirt cpu push
2025-09-11 1:53 ` Yury Norov
@ 2025-09-11 14:37 ` Shrikanth Hegde
2025-09-11 15:29 ` Yury Norov
0 siblings, 1 reply; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-11 14:37 UTC (permalink / raw)
To: Yury Norov
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy,
linux-kernel, linuxppc-dev, gregkh, vschneid, iii, huschle,
rostedt, dietmar.eggemann, vineeth, jgross, pbonzini, seanjc
On 9/11/25 7:23 AM, Yury Norov wrote:
> On Wed, Sep 10, 2025 at 11:12:03PM +0530, Shrikanth Hegde wrote:
>> CPUs are marked paravirt when there is contention for underlying
>> physical CPU.
>>
>> The push mechanism and check for paravirt CPUs are in sched tick
>> and wakeup. It should be close to no-op when there is no need for it.
>> Achieve that using static key.
>>
>> Architecture needs to enable this key when it decides there are
>> paravirt CPUs. Disable it when there are no paravirt CPUs.
>
Hi Yury, Thanks for looking into this series.
> Testing a bit is quite close to a no-op, isn't it? Have you measured
> the performance impact that would advocate the static key? Please
> share some numbers then. I believe I asked you about it on the previous
> round.
The reasons I thought to keep it are:
1. In load balance there is a cpumask_and which does a loop.
Might be better to avoid it when it is not necessary.
2. Since __cpu_paravirt_mask is going to live on one of the memory nodes in large NUMA systems
(likely the boot CPU's node), access to it from other nodes might take time and be costly when
it is not in cache. One could say the same for the static key too, but the cpumask can be large
when NR_CPUS=8192 or so.
In most of the cases hackbench and schbench didn't show much difference.
>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>> kernel/sched/core.c | 1 +
>> kernel/sched/sched.h | 17 +++++++++++++++++
>> 2 files changed, 18 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0f1e36bb5779..b8a84e4691c8 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -10967,4 +10967,5 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
>> #ifdef CONFIG_PARAVIRT
>> struct cpumask __cpu_paravirt_mask __read_mostly;
>> EXPORT_SYMBOL(__cpu_paravirt_mask);
>> +DEFINE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
>> #endif
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index b5367c514c14..8f9991453d36 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -3880,4 +3880,21 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
>>
>> #include "ext.h"
>>
>> +#ifdef CONFIG_PARAVIRT
>> +DECLARE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
>> +
>> +static inline bool is_cpu_paravirt(int cpu)
>> +{
>> + if (static_branch_unlikely(&cpu_paravirt_push_tasks))
>> + return cpu_paravirt(cpu);
>> +
>> + return false;
>> +}
>
> So is_cpu_paravirt and cpu_paravirt are exactly the same in terms of
> functionality. If you really believe that static branch benefits the
> performance, it should go straight to the cpu_paravirt().
>
>> +#else /* !CONFIG_PARAVIRT */
>> +static inline bool is_cpu_paravirt(int cpu)
>> +{
>> + return false;
>> +}
>> +#endif /* !CONFIG_PARAVIRT */
>> +
>> #endif /* _KERNEL_SCHED_SCHED_H */
>> --
>> 2.47.3
* Re: [RFC PATCH v3 03/10] sched: Static key to check paravirt cpu push
2025-09-11 14:37 ` Shrikanth Hegde
@ 2025-09-11 15:29 ` Yury Norov
0 siblings, 0 replies; 33+ messages in thread
From: Yury Norov @ 2025-09-11 15:29 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy,
linux-kernel, linuxppc-dev, gregkh, vschneid, iii, huschle,
rostedt, dietmar.eggemann, vineeth, jgross, pbonzini, seanjc
On Thu, Sep 11, 2025 at 08:07:46PM +0530, Shrikanth Hegde wrote:
>
>
> On 9/11/25 7:23 AM, Yury Norov wrote:
> > On Wed, Sep 10, 2025 at 11:12:03PM +0530, Shrikanth Hegde wrote:
> > > CPUs are marked paravirt when there is contention for underlying
> > > physical CPU.
> > >
> > > The push mechanism and check for paravirt CPUs are in sched tick
> > > and wakeup. It should be close to no-op when there is no need for it.
> > > Achieve that using static key.
> > >
> > > Architecture needs to enable this key when it decides there are
> > > paravirt CPUs. Disable it when there are no paravirt CPUs.
> >
>
> Hi Yury, Thanks for looking into this series.
>
> > Testing a bit is quite close to a no-op, isn't it? Have you measured
> > the performance impact that would advocate the static key? Please
> > share some numbers then. I believe I asked you about it on the previous
> > round.
>
> The reasons I thought to keep are:
>
> 1. In load balance there is cpumask_and which does a loop.
> Might be better to avoid it when it is not necessary.
>
> 2. Since __cpu_paravirt_mask is going to in one of the memory node in large NUMA systems
> (likely on boot cpu node), access to it from other nodes might take time and costly when
> it is not in cache. one could say same for static key too. but cpumask can be large when
> NR_CPUS=8192 or so.
>
>
> In most of the cases hackbench,schbench didn't show much difference.
So, you're adding a complication for no clear benefit. Just drop it.
* [RFC PATCH v3 04/10] sched/core: Dont allow to use CPU marked as paravirt
2025-09-10 17:42 [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (2 preceding siblings ...)
2025-09-10 17:42 ` [RFC PATCH v3 03/10] sched: Static key to check paravirt cpu push Shrikanth Hegde
@ 2025-09-10 17:42 ` Shrikanth Hegde
2025-09-11 5:16 ` K Prateek Nayak
2025-09-10 17:42 ` [RFC PATCH v3 05/10] sched/fair: Don't consider paravirt CPUs for wakeup and load balance Shrikanth Hegde
` (6 subsequent siblings)
10 siblings, 1 reply; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-10 17:42 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh
Cc: sshegde, vschneid, iii, huschle, rostedt, dietmar.eggemann,
vineeth, jgross, pbonzini, seanjc
Don't allow a paravirt CPU to be picked while looking for a CPU to run on.
The push task mechanism uses a stopper thread which is going to call
select_fallback_rq, and it uses this check to avoid picking a paravirt CPU.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/core.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8a84e4691c8..279b0dd72b5e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2462,8 +2462,13 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
return cpu_online(cpu);
/* Non kernel threads are not allowed during either online or offline. */
- if (!(p->flags & PF_KTHREAD))
- return cpu_active(cpu);
+ if (!(p->flags & PF_KTHREAD)) {
+ /* A user thread shouldn't be allowed on a paravirt cpu */
+ if (is_cpu_paravirt(cpu))
+ return false;
+ else
+ return cpu_active(cpu);
+ }
/* KTHREAD_IS_PER_CPU is always allowed. */
if (kthread_is_per_cpu(p))
@@ -2473,6 +2478,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
if (cpu_dying(cpu))
return false;
+ /* Non-percpu kthreads should stay away from paravirt CPUs */
+ if (is_cpu_paravirt(cpu))
+ return false;
+
/* But are allowed during online. */
return cpu_online(cpu);
}
--
2.47.3
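For context, select_fallback_rq() scans the task's allowed CPUs through
is_cpu_allowed(), so (roughly, this is not the exact kernel loop) a
paravirt CPU now simply gets skipped when picking a fallback:

	for_each_cpu(dest_cpu, p->cpus_ptr) {
		/* Now also returns false for paravirt CPUs */
		if (!is_cpu_allowed(p, dest_cpu))
			continue;
		goto out;	/* usable fallback CPU found */
	}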
* Re: [RFC PATCH v3 04/10] sched/core: Dont allow to use CPU marked as paravirt
2025-09-10 17:42 ` [RFC PATCH v3 04/10] sched/core: Dont allow to use CPU marked as paravirt Shrikanth Hegde
@ 2025-09-11 5:16 ` K Prateek Nayak
2025-09-11 14:44 ` Shrikanth Hegde
0 siblings, 1 reply; 33+ messages in thread
From: K Prateek Nayak @ 2025-09-11 5:16 UTC (permalink / raw)
To: Shrikanth Hegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, linux-kernel, linuxppc-dev, gregkh
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc
Hello Shrikanth,
On 9/10/2025 11:12 PM, Shrikanth Hegde wrote:
> @@ -2462,8 +2462,13 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
> return cpu_online(cpu);
>
> /* Non kernel threads are not allowed during either online or offline. */
> - if (!(p->flags & PF_KTHREAD))
> - return cpu_active(cpu);
> + if (!(p->flags & PF_KTHREAD)) {
> + /* A user thread shouldn't be allowed on a paravirt cpu */
> + if (is_cpu_paravirt(cpu))
> + return false;
> + else
nit. redundant "else". I think this can be simplified as:
return !is_cpu_paravirt(cpu) && cpu_active(cpu);
> + return cpu_active(cpu);
> + }
--
Thanks and Regards,
Prateek
* Re: [RFC PATCH v3 04/10] sched/core: Dont allow to use CPU marked as paravirt
2025-09-11 5:16 ` K Prateek Nayak
@ 2025-09-11 14:44 ` Shrikanth Hegde
0 siblings, 0 replies; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-11 14:44 UTC (permalink / raw)
To: K Prateek Nayak
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, linux-kernel,
linuxppc-dev, gregkh
On 9/11/25 10:46 AM, K Prateek Nayak wrote:
> Hello Shrikanth,
>
Hi Prateek, Thanks for looking into this.
> On 9/10/2025 11:12 PM, Shrikanth Hegde wrote:
>> @@ -2462,8 +2462,13 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>> return cpu_online(cpu);
>>
>> /* Non kernel threads are not allowed during either online or offline. */
>> - if (!(p->flags & PF_KTHREAD))
>> - return cpu_active(cpu);
>> + if (!(p->flags & PF_KTHREAD)) {
>> + /* A user thread shouldn't be allowed on a paravirt cpu */
>> + if (is_cpu_paravirt(cpu))
>> + return false;
>> + else
>
> nit. redundant "else". I think this can be simplified as:
>
alright.
> return !is_cpu_paravirt(cpu) && cpu_active(cpu);
>
>> + return cpu_active(cpu);
>> + }
>
* [RFC PATCH v3 05/10] sched/fair: Don't consider paravirt CPUs for wakeup and load balance
2025-09-10 17:42 [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (3 preceding siblings ...)
2025-09-10 17:42 ` [RFC PATCH v3 04/10] sched/core: Dont allow to use CPU marked as paravirt Shrikanth Hegde
@ 2025-09-10 17:42 ` Shrikanth Hegde
2025-09-11 5:23 ` K Prateek Nayak
2025-09-10 17:42 ` [RFC PATCH v3 06/10] sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task Shrikanth Hegde
` (5 subsequent siblings)
10 siblings, 1 reply; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-10 17:42 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh
Cc: sshegde, vschneid, iii, huschle, rostedt, dietmar.eggemann,
vineeth, jgross, pbonzini, seanjc
The load balancer for the fair class looks at the sched domain and active
CPUs when considering where to spread the load. Mask out the paravirt CPUs
so that tasks don't spread to those.
At wakeup, don't select a paravirt CPU.
Expect minimal impact when the feature is disabled.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/fair.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc389af8e..3dc76525b32c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8563,7 +8563,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if (!is_rd_overutilized(this_rq()->rd)) {
new_cpu = find_energy_efficient_cpu(p, prev_cpu);
if (new_cpu >= 0)
- return new_cpu;
+ goto check_new_cpu;
new_cpu = prev_cpu;
}
@@ -8605,7 +8605,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
}
rcu_read_unlock();
- return new_cpu;
+ /* If newly found or prev_cpu is a paravirt cpu, use current cpu */
+check_new_cpu:
+ if (is_cpu_paravirt(new_cpu))
+ return cpu;
+ else
+ return new_cpu;
}
/*
@@ -11734,6 +11739,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+#ifdef CONFIG_PARAVIRT
+ /* Don't spread load to paravirt CPUs */
+ if (static_branch_unlikely(&cpu_paravirt_push_tasks))
+ cpumask_andnot(cpus, cpus, cpu_paravirt_mask);
+#endif
+
schedstat_inc(sd->lb_count[idle]);
redo:
--
2.47.3
* Re: [RFC PATCH v3 05/10] sched/fair: Don't consider paravirt CPUs for wakeup and load balance
2025-09-10 17:42 ` [RFC PATCH v3 05/10] sched/fair: Don't consider paravirt CPUs for wakeup and load balance Shrikanth Hegde
@ 2025-09-11 5:23 ` K Prateek Nayak
2025-09-11 15:56 ` Shrikanth Hegde
2025-11-08 12:04 ` Shrikanth Hegde
0 siblings, 2 replies; 33+ messages in thread
From: K Prateek Nayak @ 2025-09-11 5:23 UTC (permalink / raw)
To: Shrikanth Hegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, linux-kernel, linuxppc-dev, gregkh
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc
Hello Shrikanth,
On 9/10/2025 11:12 PM, Shrikanth Hegde wrote:
> @@ -8563,7 +8563,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> if (!is_rd_overutilized(this_rq()->rd)) {
> new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> if (new_cpu >= 0)
> - return new_cpu;
> + goto check_new_cpu;
Should this fall back to the overutilized path if the most energy
efficient CPU is found to be paravirtualized or should
find_energy_efficient_cpu() be made aware of it?
> new_cpu = prev_cpu;
> }
>
> @@ -8605,7 +8605,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> }
> rcu_read_unlock();
>
> - return new_cpu;
> + /* If newly found or prev_cpu is a paravirt cpu, use current cpu */
> +check_new_cpu:
> + if (is_cpu_paravirt(new_cpu))
> + return cpu;
> + else
nit. redundant else.
> + return new_cpu;
> }
>
> /*
> @@ -11734,6 +11739,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>
> cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>
> +#ifdef CONFIG_PARAVIRT
> + /* Don't spread load to paravirt CPUs */
> + if (static_branch_unlikely(&cpu_paravirt_push_tasks))
> + cpumask_andnot(cpus, cpus, cpu_paravirt_mask);
> +#endif
Can something similar also be done in select_idle_sibling() and
sched_balance_find_dst_cpu() for wakeup path?
> +
> schedstat_inc(sd->lb_count[idle]);
>
> redo:
--
Thanks and Regards,
Prateek
* Re: [RFC PATCH v3 05/10] sched/fair: Don't consider paravirt CPUs for wakeup and load balance
2025-09-11 5:23 ` K Prateek Nayak
@ 2025-09-11 15:56 ` Shrikanth Hegde
2025-09-11 16:55 ` K Prateek Nayak
2025-11-08 12:04 ` Shrikanth Hegde
1 sibling, 1 reply; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-11 15:56 UTC (permalink / raw)
To: K Prateek Nayak
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, linux-kernel,
linuxppc-dev, gregkh
On 9/11/25 10:53 AM, K Prateek Nayak wrote:
> Hello Shrikanth,
>
> On 9/10/2025 11:12 PM, Shrikanth Hegde wrote:
>> @@ -8563,7 +8563,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>> if (!is_rd_overutilized(this_rq()->rd)) {
>> new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>> if (new_cpu >= 0)
>> - return new_cpu;
>> + goto check_new_cpu;
>
> Should this fallback to the overutilized path if the most energy
> efficient CPU is found to be paravirtualized or should
> find_energy_efficient_cpu() be made aware of it?
>
>> new_cpu = prev_cpu;
>> }
>>
>> @@ -8605,7 +8605,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>> }
>> rcu_read_unlock();
>>
>> - return new_cpu;
>> + /* If newly found or prev_cpu is a paravirt cpu, use current cpu */
>> +check_new_cpu:
>> + if (is_cpu_paravirt(new_cpu))
>> + return cpu;
>> + else
>
> nit. redundant else.
>
Do you mean "is_cpu_paravirt(new_cpu) ? cpu : new_cpu"?
This needs to return cpu instead of true/false. Maybe I am not seeing the obvious.
>> + return new_cpu;
>> }
>>
>> /*
>> @@ -11734,6 +11739,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>
>> cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>>
>> +#ifdef CONFIG_PARAVIRT
>> + /* Don't spread load to paravirt CPUs */
>> + if (static_branch_unlikely(&cpu_paravirt_push_tasks))
>> + cpumask_andnot(cpus, cpus, cpu_paravirt_mask);
>> +#endif
>
> Can something similar be also be done in select_idle_sibling() and
> sched_balance_find_dst_cpu() for wakeup path?
That's a good suggestion: don't make a choice which is a paravirt CPU.
Will explore.
>
>> +
>> schedstat_inc(sd->lb_count[idle]);
>>
>> redo:
* Re: [RFC PATCH v3 05/10] sched/fair: Don't consider paravirt CPUs for wakeup and load balance
2025-09-11 15:56 ` Shrikanth Hegde
@ 2025-09-11 16:55 ` K Prateek Nayak
0 siblings, 0 replies; 33+ messages in thread
From: K Prateek Nayak @ 2025-09-11 16:55 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, linux-kernel,
linuxppc-dev, gregkh
Hello Shrikanth,
On 9/11/2025 9:26 PM, Shrikanth Hegde wrote:
>>> +check_new_cpu:
>>> + if (is_cpu_paravirt(new_cpu))
>>> + return cpu;
>>> + else
>>
>> nit. redundant else.
>>
>
> Do you mean "is_cpu_paravirt(new_cpu) ? cpu; new_cpu"
Sorry for the confusion! I meant we can have:
if (is_cpu_paravirt(new_cpu))
return cpu;
return new_cpu;
Since we return from the if clause, we don't need to specify else.
--
Thanks and Regards,
Prateek
* Re: [RFC PATCH v3 05/10] sched/fair: Don't consider paravirt CPUs for wakeup and load balance
2025-09-11 5:23 ` K Prateek Nayak
2025-09-11 15:56 ` Shrikanth Hegde
@ 2025-11-08 12:04 ` Shrikanth Hegde
1 sibling, 0 replies; 33+ messages in thread
From: Shrikanth Hegde @ 2025-11-08 12:04 UTC (permalink / raw)
To: K Prateek Nayak
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, linux-kernel,
linuxppc-dev, gregkh
On 9/11/25 10:53 AM, K Prateek Nayak wrote:
> Hello Shrikanth,
>
> On 9/10/2025 11:12 PM, Shrikanth Hegde wrote:
>> @@ -8563,7 +8563,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>> if (!is_rd_overutilized(this_rq()->rd)) {
>> new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>> if (new_cpu >= 0)
>> - return new_cpu;
>> + goto check_new_cpu;
>
> Should this fallback to the overutilized path if the most energy
> efficient CPU is found to be paravirtualized or should
> find_energy_efficient_cpu() be made aware of it?
While thinking about this, are there any such systems which have vCPUs,
overcommit, and still have an energy model backing them?
Highly unlikely. So, I am planning to put a warning there and see if any
such usage exists.
>
>> new_cpu = prev_cpu;
>> }
>>
>> @@ -8605,7 +8605,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>> }
>> rcu_read_unlock();
>>
>> - return new_cpu;
>> + /* If newly found or prev_cpu is a paravirt cpu, use current cpu */
>> +check_new_cpu:
>> + if (is_cpu_paravirt(new_cpu))
>> + return cpu;
>> + else
>
> nit. redundant else.
>
>> + return new_cpu;
>> }
>>
>> /*
>> @@ -11734,6 +11739,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>
>> cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>>
>> +#ifdef CONFIG_PARAVIRT
>> + /* Don't spread load to paravirt CPUs */
>> + if (static_branch_unlikely(&cpu_paravirt_push_tasks))
>> + cpumask_andnot(cpus, cpus, cpu_paravirt_mask);
>> +#endif
>
> Can something similar be also be done in select_idle_sibling() and
> sched_balance_find_dst_cpu() for wakeup path?
>
We have this pattern in select_task_rq_fair:
cpu = smp_processor_id();
new_cpu = prev_cpu;
The task is waking up after a while, so prev_cpu is likely marked as paravirt,
and in such cases we should return the current CPU. If the current CPU is paravirt (unlikely)
and prev_cpu is also paravirt, then we should still return the current CPU.
At the next sched tick it will be pushed out.
select_idle_sibling(p, prev_cpu, new_cpu); - (new_cpu will remain prev_cpu if wake_affine doesn't change it)
Will have to change the prototype to pass the current CPU as well.
* [RFC PATCH v3 06/10] sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task
2025-09-10 17:42 [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (4 preceding siblings ...)
2025-09-10 17:42 ` [RFC PATCH v3 05/10] sched/fair: Don't consider paravirt CPUs for wakeup and load balance Shrikanth Hegde
@ 2025-09-10 17:42 ` Shrikanth Hegde
2025-09-10 17:42 ` [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU Shrikanth Hegde
` (4 subsequent siblings)
10 siblings, 0 replies; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-10 17:42 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh
Cc: sshegde, vschneid, iii, huschle, rostedt, dietmar.eggemann,
vineeth, jgross, pbonzini, seanjc
For the RT class:
- During wakeup don't select a paravirt CPU.
- Don't pull a task towards a paravirt CPU.
- Don't push a task to a paravirt CPU.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/rt.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7936d4333731..54bfac66624b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1552,6 +1552,9 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
if (!test && target != -1 && !rt_task_fits_capacity(p, target))
goto out_unlock;
+ /* Avoid moving to a paravirt CPU */
+ if (is_cpu_paravirt(target))
+ goto out_unlock;
/*
* Don't bother moving it if the destination CPU is
* not running a lower priority task.
@@ -1876,7 +1879,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
for (tries = 0; tries < RT_MAX_TRIES; tries++) {
cpu = find_lowest_rq(task);
- if ((cpu == -1) || (cpu == rq->cpu))
+ if ((cpu == -1) || (cpu == rq->cpu) || is_cpu_paravirt(cpu))
break;
lowest_rq = cpu_rq(cpu);
@@ -1974,7 +1977,7 @@ static int push_rt_task(struct rq *rq, bool pull)
return 0;
cpu = find_lowest_rq(rq->curr);
- if (cpu == -1 || cpu == rq->cpu)
+ if (cpu == -1 || cpu == rq->cpu || is_cpu_paravirt(cpu))
return 0;
/*
@@ -2237,6 +2240,10 @@ static void pull_rt_task(struct rq *this_rq)
if (likely(!rt_overload_count))
return;
+ /* There is no point in pulling the task towards a paravirt cpu */
+ if (is_cpu_paravirt(this_rq->cpu))
+ return;
+
/*
* Match the barrier from rt_set_overloaded; this guarantees that if we
* see overloaded we must also see the rto_mask bit.
--
2.47.3
* [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
2025-09-10 17:42 [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (5 preceding siblings ...)
2025-09-10 17:42 ` [RFC PATCH v3 06/10] sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task Shrikanth Hegde
@ 2025-09-10 17:42 ` Shrikanth Hegde
2025-09-11 5:40 ` K Prateek Nayak
2025-09-10 17:42 ` [RFC PATCH v3 08/10] sysfs: Add paravirt CPU file Shrikanth Hegde
` (3 subsequent siblings)
10 siblings, 1 reply; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-10 17:42 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh
Cc: sshegde, vschneid, iii, huschle, rostedt, dietmar.eggemann,
vineeth, jgross, pbonzini, seanjc
Actively push out any task running on a paravirt CPU. Since the task is
running on the CPU, a stopper thread needs to be spawned to push the task out.
If a task is sleeping, it is expected to move out when it wakes up. In
case it still chooses a paravirt CPU, the next tick will move it out.
However, if the task is pinned only to paravirt CPUs, it will continue
running there.
Though the code is almost the same as __balance_push_cpu_stop and quite
close to push_cpu_stop, it provides a cleaner implementation w.r.t. the
PARAVIRT config.
Add a push_task_work_done flag to protect the pv_push_task_work buffer. It
has been placed in the empty slot available considering a 64/128 byte
cacheline.
This currently works only for FAIR and RT.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/core.c | 84 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 9 ++++-
2 files changed, 92 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 279b0dd72b5e..1f9df5b8a3a2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5629,6 +5629,10 @@ void sched_tick(void)
sched_clock_tick();
+ /* push the current task out if this is a paravirt CPU */
+ if (is_cpu_paravirt(cpu))
+ push_current_from_paravirt_cpu(rq);
+
rq_lock(rq, &rf);
donor = rq->donor;
@@ -10977,4 +10981,84 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
struct cpumask __cpu_paravirt_mask __read_mostly;
EXPORT_SYMBOL(__cpu_paravirt_mask);
DEFINE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
+
+static DEFINE_PER_CPU(struct cpu_stop_work, pv_push_task_work);
+
+static int paravirt_push_cpu_stop(void *arg)
+{
+ struct task_struct *p = arg;
+ struct rq *rq = this_rq();
+ struct rq_flags rf;
+ int cpu;
+
+ raw_spin_lock_irq(&p->pi_lock);
+ rq_lock(rq, &rf);
+ rq->push_task_work_done = 0;
+
+ update_rq_clock(rq);
+
+ if (task_rq(p) == rq && task_on_rq_queued(p)) {
+ cpu = select_fallback_rq(rq->cpu, p);
+ rq = __migrate_task(rq, &rf, p, cpu);
+ }
+
+ rq_unlock(rq, &rf);
+ raw_spin_unlock_irq(&p->pi_lock);
+ put_task_struct(p);
+
+ return 0;
+}
+
+/* A CPU is marked as paravirt when there is contention for the underlying
+ * physical CPU and using this CPU will lead to hypervisor preemptions.
+ * It is better not to use this CPU.
+ *
+ * In case any task is scheduled on such a CPU, move it out. In
+ * select_fallback_rq a non-paravirt CPU will be chosen and henceforth the
+ * task shouldn't come back to this CPU.
+ */
+void push_current_from_paravirt_cpu(struct rq *rq)
+{
+ struct task_struct *push_task = rq->curr;
+ unsigned long flags;
+ struct rq_flags rf;
+
+ if (!is_cpu_paravirt(rq->cpu))
+ return;
+
+ /* Idle task can't be pushed out */
+ if (rq->curr == rq->idle)
+ return;
+
+ /* Do this only for SCHED_NORMAL and RT for now */
+ if (push_task->sched_class != &fair_sched_class &&
+ push_task->sched_class != &rt_sched_class)
+ return;
+
+ if (kthread_is_per_cpu(push_task) ||
+ is_migration_disabled(push_task))
+ return;
+
+ /* Is it affine to only paravirt cpus? */
+ if (cpumask_subset(push_task->cpus_ptr, cpu_paravirt_mask))
+ return;
+
+ /* There is already a stopper thread for this. Don't race with it */
+ if (rq->push_task_work_done == 1)
+ return;
+
+ local_irq_save(flags);
+ preempt_disable();
+
+ get_task_struct(push_task);
+
+ rq_lock(rq, &rf);
+ rq->push_task_work_done = 1;
+ rq_unlock(rq, &rf);
+
+ stop_one_cpu_nowait(rq->cpu, paravirt_push_cpu_stop, push_task,
+ this_cpu_ptr(&pv_push_task_work));
+ preempt_enable();
+ local_irq_restore(flags);
+}
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8f9991453d36..5077a32593da 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1187,7 +1187,9 @@ struct rq {
unsigned char nohz_idle_balance;
unsigned char idle_balance;
-
+#ifdef CONFIG_PARAVIRT
+ bool push_task_work_done;
+#endif
unsigned long misfit_task_load;
/* For active balancing */
@@ -3890,11 +3892,16 @@ static inline bool is_cpu_paravirt(int cpu)
return false;
}
+
+void push_current_from_paravirt_cpu(struct rq *rq);
+
#else /* !CONFIG_PARAVIRT */
static inline bool is_cpu_paravirt(int cpu)
{
return false;
}
+
+static inline void push_current_from_paravirt_cpu(struct rq *rq) { }
#endif /* !CONFIG_PARAVIRT */
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.47.3
* Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
2025-09-10 17:42 ` [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU Shrikanth Hegde
@ 2025-09-11 5:40 ` K Prateek Nayak
2025-09-11 16:52 ` Shrikanth Hegde
2025-11-10 4:54 ` Shrikanth Hegde
0 siblings, 2 replies; 33+ messages in thread
From: K Prateek Nayak @ 2025-09-11 5:40 UTC (permalink / raw)
To: Shrikanth Hegde, mingo, peterz, juri.lelli, vincent.guittot, tglx,
yury.norov, maddy, linux-kernel, linuxppc-dev, gregkh
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc
Hello Shrikanth,
On 9/10/2025 11:12 PM, Shrikanth Hegde wrote:
> Actively push out any task running on a paravirt CPU. Since the task is
> running on the CPU need to spawn a stopper thread and push the task out.
>
> If task is sleeping, when it wakes up it is expected to move out. In
> case it still chooses a paravirt CPU, next tick will move it out.
> However, if the task in pinned only to paravirt CPUs, it will continue
> running there.
>
> Though code is almost same as __balance_push_cpu_stop and quite close to
> push_cpu_stop, it provides a cleaner implementation w.r.t to PARAVIRT
> config.
>
> Add push_task_work_done flag to protect pv_push_task_work buffer. This has
> been placed at the empty slot available considering 64/128 byte
> cacheline.
>
> This currently works only FAIR and RT.
EXT can perhaps use the ops->cpu_{release,acquire}() if they are
interested in this.
>
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> kernel/sched/core.c | 84 ++++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 9 ++++-
> 2 files changed, 92 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 279b0dd72b5e..1f9df5b8a3a2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5629,6 +5629,10 @@ void sched_tick(void)
>
> sched_clock_tick();
>
> + /* push the current task out if a paravirt CPU */
> + if (is_cpu_paravirt(cpu))
> + push_current_from_paravirt_cpu(rq);
Does this mean paravirt CPU is capable of handling an interrupt but may
not be continuously available to run a task? Or is the VMM expected to set
the CPU on the paravirt mask and give the vCPU sufficient time to move the
task before yanking it away from the pCPU?
> +
> rq_lock(rq, &rf);
> donor = rq->donor;
>
> @@ -10977,4 +10981,84 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
> struct cpumask __cpu_paravirt_mask __read_mostly;
> EXPORT_SYMBOL(__cpu_paravirt_mask);
> DEFINE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
> +
> +static DEFINE_PER_CPU(struct cpu_stop_work, pv_push_task_work);
> +
> +static int paravirt_push_cpu_stop(void *arg)
> +{
> + struct task_struct *p = arg;
Can we move all pushable tasks at once instead of just the rq->curr at
the time of the tick? It can also avoid keeping the reference to "p"
and only selectively pushing it. Thoughts?
> + struct rq *rq = this_rq();
> + struct rq_flags rf;
> + int cpu;
> +
> + raw_spin_lock_irq(&p->pi_lock);
> + rq_lock(rq, &rf);
> + rq->push_task_work_done = 0;
> +
> + update_rq_clock(rq);
> +
> + if (task_rq(p) == rq && task_on_rq_queued(p)) {
> + cpu = select_fallback_rq(rq->cpu, p);
> + rq = __migrate_task(rq, &rf, p, cpu);
> + }
> +
> + rq_unlock(rq, &rf);
> + raw_spin_unlock_irq(&p->pi_lock);
> + put_task_struct(p);
> +
> + return 0;
> +}
> +
> +/* A CPU is marked as Paravirt when there is contention for underlying
> + * physical CPU and using this CPU will lead to hypervisor preemptions.
> + * It is better not to use this CPU.
> + *
> + * In case any task is scheduled on such CPU, move it out. In
> + * select_fallback_rq a non paravirt CPU will be chosen and henceforth
> + * task shouldn't come back to this CPU
> + */
> +void push_current_from_paravirt_cpu(struct rq *rq)
> +{
> + struct task_struct *push_task = rq->curr;
> + unsigned long flags;
> + struct rq_flags rf;
> +
> + if (!is_cpu_paravirt(rq->cpu))
> + return;
> +
> + /* Idle task can't be pused out */
> + if (rq->curr == rq->idle)
> + return;
> +
> + /* Do for only SCHED_NORMAL AND RT for now */
> + if (push_task->sched_class != &fair_sched_class &&
> + push_task->sched_class != &rt_sched_class)
> + return;
> +
> + if (kthread_is_per_cpu(push_task) ||
> + is_migration_disabled(push_task))
> + return;
> +
> + /* Is it affine to only paravirt cpus? */
> + if (cpumask_subset(push_task->cpus_ptr, cpu_paravirt_mask))
> + return;
> +
> + /* There is already a stopper thread for this. Dont race with it */
> + if (rq->push_task_work_done == 1)
> + return;
> +
> + local_irq_save(flags);
> + preempt_disable();
Disabling IRQs implies preemption is disabled.
> +
> + get_task_struct(push_task);
> +
> + rq_lock(rq, &rf);
> + rq->push_task_work_done = 1;
> + rq_unlock(rq, &rf);
> +
> + stop_one_cpu_nowait(rq->cpu, paravirt_push_cpu_stop, push_task,
> + this_cpu_ptr(&pv_push_task_work));
> + preempt_enable();
> + local_irq_restore(flags);
> +}
> #endif
--
Thanks and Regards,
Prateek
* Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
2025-09-11 5:40 ` K Prateek Nayak
@ 2025-09-11 16:52 ` Shrikanth Hegde
2025-09-11 17:06 ` K Prateek Nayak
2025-11-10 4:54 ` Shrikanth Hegde
1 sibling, 1 reply; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-11 16:52 UTC (permalink / raw)
To: K Prateek Nayak
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, linux-kernel,
linuxppc-dev, gregkh
On 9/11/25 11:10 AM, K Prateek Nayak wrote:
> Hello Shrikanth,
>
> On 9/10/2025 11:12 PM, Shrikanth Hegde wrote:
>> Actively push out any task running on a paravirt CPU. Since the task is
>> running on the CPU need to spawn a stopper thread and push the task out.
>>
>> If task is sleeping, when it wakes up it is expected to move out. In
>> case it still chooses a paravirt CPU, next tick will move it out.
>> However, if the task in pinned only to paravirt CPUs, it will continue
>> running there.
>>
>> Though code is almost same as __balance_push_cpu_stop and quite close to
>> push_cpu_stop, it provides a cleaner implementation w.r.t to PARAVIRT
>> config.
>>
>> Add push_task_work_done flag to protect pv_push_task_work buffer. This has
>> been placed at the empty slot available considering 64/128 byte
>> cacheline.
>>
>> This currently works only FAIR and RT.
>
> EXT can perhaps use the ops->cpu_{release,acquire}() if they are
> interested in this.
>
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>> kernel/sched/core.c | 84 ++++++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/sched.h | 9 ++++-
>> 2 files changed, 92 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 279b0dd72b5e..1f9df5b8a3a2 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5629,6 +5629,10 @@ void sched_tick(void)
>>
>> sched_clock_tick();
>>
>> + /* push the current task out if a paravirt CPU */
>> + if (is_cpu_paravirt(cpu))
>> + push_current_from_paravirt_cpu(rq);
>
> Does this mean paravirt CPU is capable of handling an interrupt but may
> not be continuously available to run a task?
When I run hackbench, which involves a fair bit of IRQ stuff, it moves out.
For example,
echo 600-710 > /sys/devices/system/cpu/paravirt
11:31:54 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:31:57 AM 598 2.04 0.00 77.55 0.00 18.37 0.00 1.02 0.00 0.00 1.02
11:31:57 AM 599 1.01 0.00 79.80 0.00 17.17 0.00 1.01 0.00 0.00 1.01
11:31:57 AM 600 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00 99.01
11:31:57 AM 601 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
11:31:57 AM 602 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
There could be some workloads whose IRQs don't move out, which would need an irqbalance change.
Looking into it.
> Or is the VMM expected to set
> the CPU on the paravirt mask and give the vCPU sufficient time to move the
> task before yanking it away from the pCPU?
>
If the vCPU is running something, it is going to run on a pCPU at some point.
The hypervisor will give the cycles to this vCPU by preempting some other vCPU.
The idea is that, using this infra, there should be nothing on that paravirt vCPU.
That way the VMM collectively gets only a limited request for pCPUs, which it can satisfy
without vCPU preemption.
>> +
>> rq_lock(rq, &rf);
>> donor = rq->donor;
>>
>> @@ -10977,4 +10981,84 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
>> struct cpumask __cpu_paravirt_mask __read_mostly;
>> EXPORT_SYMBOL(__cpu_paravirt_mask);
>> DEFINE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
>> +
>> +static DEFINE_PER_CPU(struct cpu_stop_work, pv_push_task_work);
>> +
>> +static int paravirt_push_cpu_stop(void *arg)
>> +{
>> + struct task_struct *p = arg;
>
> Can we move all pushable tasks at once instead of just the rq->curr at
> the time of the tick? It can also avoid keeping the reference to "p"
> and only selectively pushing it. Thoughts?
>
I think that is doable.
Need to pass the rq as the arg and go through all the tasks in the rq in the stopper thread.
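A minimal sketch of that direction, assuming the stopper callback is
reworked to take the rq and walk its tasks itself (hypothetical; the real
locking would need the same care as push_cpu_stop):

	/* Pass the rq itself; no per-task reference needs to be pinned */
	stop_one_cpu_nowait(rq->cpu, paravirt_push_cpu_stop, rq,
			    this_cpu_ptr(&pv_push_task_work));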
>> + struct rq *rq = this_rq();
>> + struct rq_flags rf;
>> + int cpu;
>> +
>> + raw_spin_lock_irq(&p->pi_lock);
>> + rq_lock(rq, &rf);
>> + rq->push_task_work_done = 0;
>> +
>> + update_rq_clock(rq);
>> +
>> + if (task_rq(p) == rq && task_on_rq_queued(p)) {
>> + cpu = select_fallback_rq(rq->cpu, p);
>> + rq = __migrate_task(rq, &rf, p, cpu);
>> + }
>> +
>> + rq_unlock(rq, &rf);
>> + raw_spin_unlock_irq(&p->pi_lock);
>> + put_task_struct(p);
>> +
>> + return 0;
>> +}
>> +
>> +/* A CPU is marked as Paravirt when there is contention for underlying
>> + * physical CPU and using this CPU will lead to hypervisor preemptions.
>> + * It is better not to use this CPU.
>> + *
>> + * In case any task is scheduled on such CPU, move it out. In
>> + * select_fallback_rq a non paravirt CPU will be chosen and henceforth
>> + * task shouldn't come back to this CPU
>> + */
>> +void push_current_from_paravirt_cpu(struct rq *rq)
>> +{
>> + struct task_struct *push_task = rq->curr;
>> + unsigned long flags;
>> + struct rq_flags rf;
>> +
>> + if (!is_cpu_paravirt(rq->cpu))
>> + return;
>> +
>> + /* Idle task can't be pused out */
>> + if (rq->curr == rq->idle)
>> + return;
>> +
>> + /* Do for only SCHED_NORMAL AND RT for now */
>> + if (push_task->sched_class != &fair_sched_class &&
>> + push_task->sched_class != &rt_sched_class)
>> + return;
>> +
>> + if (kthread_is_per_cpu(push_task) ||
>> + is_migration_disabled(push_task))
>> + return;
>> +
>> + /* Is it affine to only paravirt cpus? */
>> + if (cpumask_subset(push_task->cpus_ptr, cpu_paravirt_mask))
>> + return;
>> +
>> + /* There is already a stopper thread for this. Dont race with it */
>> + if (rq->push_task_work_done == 1)
>> + return;
>> +
>> + local_irq_save(flags);
>> + preempt_disable();
>
> Disabling IRQs implies preemption is disabled.
>
In most of the places stop_one_cpu_nowait is called with preemption & IRQs disabled.
The stopper runs at the next possible opportunity.
stop_one_cpu_nowait
->queues the task into stopper list
-> wake_up_process(stopper)
-> set need_resched
-> stopper runs as early as possible.
>> +
>> + get_task_struct(push_task);
>> +
>> + rq_lock(rq, &rf);
>> + rq->push_task_work_done = 1;
>> + rq_unlock(rq, &rf);
>> +
>> + stop_one_cpu_nowait(rq->cpu, paravirt_push_cpu_stop, push_task,
>> + this_cpu_ptr(&pv_push_task_work));
>> + preempt_enable();
>> + local_irq_restore(flags);
>> +}
>> #endif
* Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
2025-09-11 16:52 ` Shrikanth Hegde
@ 2025-09-11 17:06 ` K Prateek Nayak
2025-09-12 5:22 ` Shrikanth Hegde
0 siblings, 1 reply; 33+ messages in thread
From: K Prateek Nayak @ 2025-09-11 17:06 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, linux-kernel,
linuxppc-dev, gregkh
Hello Shrikanth,
On 9/11/2025 10:22 PM, Shrikanth Hegde wrote:
>>> + if (is_cpu_paravirt(cpu))
>>> + push_current_from_paravirt_cpu(rq);
>>
>> Does this mean paravirt CPU is capable of handling an interrupt but may
>> not be continuously available to run a task?
>
> When i run hackbench which involves fair bit of IRQ stuff, it moves out.
>
> For example,
>
> echo 600-710 > /sys/devices/system/cpu/paravirt
>
> 11:31:54 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 11:31:57 AM 598 2.04 0.00 77.55 0.00 18.37 0.00 1.02 0.00 0.00 1.02
> 11:31:57 AM 599 1.01 0.00 79.80 0.00 17.17 0.00 1.01 0.00 0.00 1.01
> 11:31:57 AM 600 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00 99.01
> 11:31:57 AM 601 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> 11:31:57 AM 602 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>
>
> There could some workloads which doesn't move irq's out, for which needs irqbalance change.
> Looking into it.
>
> Or is the VMM expected to set
>> the CPU on the paravirt mask and give the vCPU sufficient time to move the
>> task before yanking it away from the pCPU?
>>
>
> If the vCPU is running something, it is going to run at some point on pCPU.
> hypervisor will give the cycles to this vCPU by preempting some other vCPU.
>
> It is that using this infra, there is should be nothing on that paravirt vCPU.
> That way collectively VMM gets only limited request for pCPU which it can satify
> without vCPU preemption.
Ack! Just wanted to understand the usage.
P.S. I remember discussions during last LPC where we could communicate
this unavailability via CPU capacity. Was that problematic for some
reason? Sorry if I didn't follow this discussion earlier.
[..snip..]
>>> + local_irq_save(flags);
>>> + preempt_disable();
>>
>> Disabling IRQs implies preemption is disabled.
>>
>
> Most of the places stop_one_cpu_nowait called with preemption & irq disabled.
> stopper runs at the next possible opportunity.
But is there any reason to do both local_irq_save() and
preempt_disable()? include/linux/preempt.h defines preemptible() as:
#define preemptible() (preempt_count() == 0 && !irqs_disabled())
so disabling IRQs should be sufficient right or am I missing something?
>
> stop_one_cpu_nowait
> ->queues the task into stopper list
> -> wake_up_process(stopper)
> -> set need_resched
> -> stopper runs as early as possible.
>
--
Thanks and Regards,
Prateek
* Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
2025-09-11 17:06 ` K Prateek Nayak
@ 2025-09-12 5:22 ` Shrikanth Hegde
2025-09-12 8:48 ` K Prateek Nayak
0 siblings, 1 reply; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-12 5:22 UTC (permalink / raw)
To: K Prateek Nayak
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, linux-kernel,
linuxppc-dev, gregkh
On 9/11/25 10:36 PM, K Prateek Nayak wrote:
> Hello Shrikanth,
>
> On 9/11/2025 10:22 PM, Shrikanth Hegde wrote:
>>>> + if (is_cpu_paravirt(cpu))
>>>> + push_current_from_paravirt_cpu(rq);
>>>
>>> Does this mean paravirt CPU is capable of handling an interrupt but may
>>> not be continuously available to run a task?
>>
>> When i run hackbench which involves fair bit of IRQ stuff, it moves out.
>>
>> For example,
>>
>> echo 600-710 > /sys/devices/system/cpu/paravirt
>>
>> 11:31:54 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> 11:31:57 AM 598 2.04 0.00 77.55 0.00 18.37 0.00 1.02 0.00 0.00 1.02
>> 11:31:57 AM 599 1.01 0.00 79.80 0.00 17.17 0.00 1.01 0.00 0.00 1.01
>> 11:31:57 AM 600 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00 99.01
>> 11:31:57 AM 601 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>> 11:31:57 AM 602 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>>
>>
>> There could some workloads which doesn't move irq's out, for which needs irqbalance change.
>> Looking into it.
>>
>> Or is the VMM expected to set
>>> the CPU on the paravirt mask and give the vCPU sufficient time to move the
>>> task before yanking it away from the pCPU?
>>>
>>
>> If the vCPU is running something, it is going to run at some point on pCPU.
>> hypervisor will give the cycles to this vCPU by preempting some other vCPU.
>>
>> It is that using this infra, there is should be nothing on that paravirt vCPU.
>> That way collectively VMM gets only limited request for pCPU which it can satify
>> without vCPU preemption.
>
> Ack! Just wanted to understand the usage.
>
> P.S. I remember discussions during last LPC where we could communicate
> this unavailability via CPU capacity. Was that problematic for some
> reason? Sorry if I didn't follow this discussion earlier.
>
Thanks for that question. It gives an opportunity to retrospect.
Yes, that's where we started, but it has a lot of implementation challenges.
Still an option though.
History upto current state:
1. At LPC24 presented the problem statement, and why existing approaches such as hotplug,
cpuset cgroup or taskset are not viable solution. Hotplug would have come handy if the cost was low.
The overhead of sched domain rebuild and serial nature of hotplug makes it not viable option.
One of the possible approach was CPU Capacity.
1. Issues with CPU Capacity approach:
a. Need to make group_misfit_task as the highest priority. That alone will break big.LITTLE
since it relies on group misfit and group_overload should have higher priority there.
b. At high concurrency tasks still moved those CPUs with CAPACITY=1.
c. A lot of scheduler stats would need to be aware of change in CAPACITY specially load balance/wakeup.
d. in update_group_misfit - need to set the misfit load based on capacity. the current code sets to 0,
because of task_fits_cpu stuff
e. More challenges in RT.
That's when Tobias had introduced a new group type called group_parked.
https://lore.kernel.org/all/20241204112149.25872-2-huschle@linux.ibm.com/
It has relatively cleaner implementation compared to CPU CAPACITY.
It had a few disadvantages too:
1. It use to take around 8-10 seconds for tasks to move out of those CPUs. That was the main
concern.
2. Needs a few stats based changes in update_sg_lb_stats. might be tricky in all scenarios.
That's when we were exploring how the tasks move out when the cpu goes offline. It happens quite fast too.
So tried a similar mechanism and this is where we are right now.
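For reference, a minimal sketch of what the tick-side push hook could look like,
pieced together from the description in this thread (the exact body of
push_current_from_paravirt_cpu() is an assumption here; paravirt_push_cpu_stop()
and pv_push_task_work are quoted from patch 07 further down):

static void push_current_from_paravirt_cpu(struct rq *rq)
{
	struct task_struct *p = rq->curr;
	unsigned long flags;

	/* Leave idle, per-CPU kthreads and migration-disabled tasks alone. */
	if (p == rq->idle || kthread_is_per_cpu(p) || is_migration_disabled(p))
		return;

	local_irq_save(flags);
	get_task_struct(p);	/* reference dropped by paravirt_push_cpu_stop() */
	stop_one_cpu_nowait(cpu_of(rq), paravirt_push_cpu_stop, p,
			    this_cpu_ptr(&pv_push_task_work));
	local_irq_restore(flags);
}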
> [..snip..]
>>>> + local_irq_save(flags);
>>>> + preempt_disable();
>>>
>>> Disabling IRQs implies preemption is disabled.
>>>
>>
>> Most of the places stop_one_cpu_nowait called with preemption & irq disabled.
>> stopper runs at the next possible opportunity.
>
> But is there any reason to do both local_irq_save() and
> preempt_disable()? include/linux/preempt.h defines preemptible() as:
>
> #define preemptible() (preempt_count() == 0 && !irqs_disabled())
>
> so disabling IRQs should be sufficient right or am I missing something?
>
f0498d2a54e79 (Peter Zijlstra) "sched: Fix stop_one_cpu_nowait() vs hotplug"
could be the answer you are looking for.
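For readers following along, the pattern that commit established looks roughly
like this (a sketch; push_cpu_stop() and rq->push_work are the mainline names):

	/*
	 * From f0498d2a54e79: keep preemption disabled across the unlock
	 * (which re-enables IRQs) and the queueing, so CPU hotplug's
	 * stop_machine() cannot complete and park the stopper thread
	 * in between.
	 */
	preempt_disable();
	task_rq_unlock(rq, p, &rf);	/* re-enables interrupts */
	stop_one_cpu_nowait(cpu_of(rq), push_cpu_stop, p, &rq->push_work);
	preempt_enable();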
>>
>> stop_one_cpu_nowait
>> ->queues the task into stopper list
>> -> wake_up_process(stopper)
>> -> set need_resched
>> -> stopper runs as early as possible.
>>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
2025-09-12 5:22 ` Shrikanth Hegde
@ 2025-09-12 8:48 ` K Prateek Nayak
2025-09-12 12:49 ` Shrikanth Hegde
0 siblings, 1 reply; 33+ messages in thread
From: K Prateek Nayak @ 2025-09-12 8:48 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, linux-kernel,
linuxppc-dev, gregkh
Hello Shrikanth,
On 9/12/2025 10:52 AM, Shrikanth Hegde wrote:
>
>
> On 9/11/25 10:36 PM, K Prateek Nayak wrote:
>> Hello Shrikanth,
>>
>> On 9/11/2025 10:22 PM, Shrikanth Hegde wrote:
>>>>> + if (is_cpu_paravirt(cpu))
>>>>> + push_current_from_paravirt_cpu(rq);
>>>>
>>>> Does this mean paravirt CPU is capable of handling an interrupt but may
>>>> not be continuously available to run a task?
>>>
>>> When i run hackbench which involves fair bit of IRQ stuff, it moves out.
>>>
>>> For example,
>>>
>>> echo 600-710 > /sys/devices/system/cpu/paravirt
>>>
>>> 11:31:54 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>>> 11:31:57 AM 598 2.04 0.00 77.55 0.00 18.37 0.00 1.02 0.00 0.00 1.02
>>> 11:31:57 AM 599 1.01 0.00 79.80 0.00 17.17 0.00 1.01 0.00 0.00 1.01
>>> 11:31:57 AM 600 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00 99.01
>>> 11:31:57 AM 601 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>>> 11:31:57 AM 602 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>>>
>>>
>>> There could some workloads which doesn't move irq's out, for which needs irqbalance change.
>>> Looking into it.
>>>
>>> Or is the VMM expected to set
>>>> the CPU on the paravirt mask and give the vCPU sufficient time to move the
>>>> task before yanking it away from the pCPU?
>>>>
>>>
>>> If the vCPU is running something, it is going to run at some point on pCPU.
>>> hypervisor will give the cycles to this vCPU by preempting some other vCPU.
>>>
>>> It is that using this infra, there is should be nothing on that paravirt vCPU.
>>> That way collectively VMM gets only limited request for pCPU which it can satify
>>> without vCPU preemption.
>>
>> Ack! Just wanted to understand the usage.
>>
>> P.S. I remember discussions during last LPC where we could communicate
>> this unavailability via CPU capacity. Was that problematic for some
>> reason? Sorry if I didn't follow this discussion earlier.
>>
>
> Thanks for that questions. Gives a opportunity to retrospect.
>
> Yes. That's where we started. but that has a lot of implementation challenges.
> Still an option though.
>
> History upto current state:
>
> 1. At LPC24 presented the problem statement, and why existing approaches such as hotplug,
> cpuset cgroup or taskset are not viable solution. Hotplug would have come handy if the cost was low.
> The overhead of sched domain rebuild and serial nature of hotplug makes it not viable option.
> One of the possible approach was CPU Capacity.
Ack. Is creating an isolated partition on the fly too expensive as well?
I don't think creation of that partition is serialized, and it should
achieve a similar result with a single sched-domain rebuild. I'm
hoping the VMM doesn't change the paravirt mask at an alarming rate.
P.S. Some quick benchmarking on a 256-CPU machine:
mkdir /sys/fs/cgroup/isol/
echo isolated > /sys/fs/cgroup/isol/cpuset.cpus.partition
time for i in {1..1000}; do \
echo "8-15" > /sys/fs/cgroup/isol/cpuset.cpus.exclusive; \
echo "16-23" > /sys/fs/cgroup/isol/cpuset.cpus.exclusive; \
done
real 2m50.016s
user 0m0.198s
sys 1m47.708s
So that is about (170sec / 2000) ~ 85ms per cpuset operation.
Definitely more expensive than setting the paravirt mask, but compare that to:
for i in {8..15}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done; \
for i in {8..15}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done; \
for i in {16..23}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done; \
for i in {16..23}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done
real 0m5.046s
user 0m0.014s
sys 0m0.110s
Definitely less expensive than a full hotplug.
>
> 1. Issues with CPU Capacity approach:
> a. Need to make group_misfit_task as the highest priority. That alone will break big.LITTLE
> since it relies on group misfit and group_overload should have higher priority there.
> b. At high concurrency tasks still moved those CPUs with CAPACITY=1.
> c. A lot of scheduler stats would need to be aware of change in CAPACITY specially load balance/wakeup.
Ack. Thinking out loud: can capacity go to 0 via the H/W pressure interface?
Maybe we can toggle the "sched_asym_cpucapacity" static branch without
actually having SD_ASYM_CPUCAPACITY in these special cases, to let
asym_fits_cpu() steer away from these 0-capacity CPUs.
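A minimal sketch of that thought (illustrative only; in mainline the
sched_asym_cpucapacity key is normally flipped by the topology code):

	/*
	 * Hypothetical: turn on asym-capacity wakeup filtering while any
	 * CPU is marked paravirt, so asym_fits_cpu() steers tasks away
	 * from CPUs whose capacity was dropped to 0.
	 */
	if (!cpumask_empty(cpu_paravirt_mask))
		static_branch_enable(&sched_asym_cpucapacity);
	else
		static_branch_disable(&sched_asym_cpucapacity);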
> d. in update_group_misfit - need to set the misfit load based on capacity. the current code sets to 0,
> because of task_fits_cpu stuff
> e. More challenges in RT.
>
> That's when Tobias had introduced a new group type called group_parked.
> https://lore.kernel.org/all/20241204112149.25872-2-huschle@linux.ibm.com/
> It has relatively cleaner implementation compared to CPU CAPACITY.
>
> It had a few disadvantages too:
> 1. It use to take around 8-10 seconds for tasks to move out of those CPUs. That was the main
> concern.
> 2. Needs a few stats based changes in update_sg_lb_stats. might be tricky in all scenarios.
>
> That's when we were exploring how the tasks move out when the cpu goes offline. It happens quite fast too.
> So tried a similar mechanism and this is where we are right now.
I agree push is great from that perspective.
>
>> [..snip..]
>>>>> + local_irq_save(flags);
>>>>> + preempt_disable();
>>>>
>>>> Disabling IRQs implies preemption is disabled.
>>>>
>>>
>>> Most of the places stop_one_cpu_nowait called with preemption & irq disabled.
>>> stopper runs at the next possible opportunity.
>>
>> But is there any reason to do both local_irq_save() and
>> preempt_disable()? include/linux/preempt.h defines preemptible() as:
>>
>> #define preemptible() (preempt_count() == 0 && !irqs_disabled())
>>
>> so disabling IRQs should be sufficient right or am I missing something?
>>
>
> f0498d2a54e79 (Peter Zijlstra) "sched: Fix stop_one_cpu_nowait() vs hotplug"
> could be the answer you are looking for.
I think in all the cases covered by that commit, task_rq_unlock(...) would
have enabled interrupts, which is what required that specific pattern. Here we
have preempt_disable() within a local_irq_save() section, which might not be
necessary.
>
>>>
>>> stop_one_cpu_nowait
>>> ->queues the task into stopper list
>>> -> wake_up_process(stopper)
>>> -> set need_resched
>>> -> stopper runs as early as possible.
>>>
>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
2025-09-12 8:48 ` K Prateek Nayak
@ 2025-09-12 12:49 ` Shrikanth Hegde
0 siblings, 0 replies; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-12 12:49 UTC (permalink / raw)
To: K Prateek Nayak
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, linux-kernel,
linuxppc-dev, gregkh
On 9/12/25 2:18 PM, K Prateek Nayak wrote:
> Hello Shrikanth,
>
> On 9/12/2025 10:52 AM, Shrikanth Hegde wrote:
>>
>>
>> On 9/11/25 10:36 PM, K Prateek Nayak wrote:
>>> Hello Shrikanth,
>>>
>>> On 9/11/2025 10:22 PM, Shrikanth Hegde wrote:
>>>>>> + if (is_cpu_paravirt(cpu))
>>>>>> + push_current_from_paravirt_cpu(rq);
>>>>>
>>>>> Does this mean paravirt CPU is capable of handling an interrupt but may
>>>>> not be continuously available to run a task?
>>>>
>>>> When i run hackbench which involves fair bit of IRQ stuff, it moves out.
>>>>
>>>> For example,
>>>>
>>>> echo 600-710 > /sys/devices/system/cpu/paravirt
>>>>
>>>> 11:31:54 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>>>> 11:31:57 AM 598 2.04 0.00 77.55 0.00 18.37 0.00 1.02 0.00 0.00 1.02
>>>> 11:31:57 AM 599 1.01 0.00 79.80 0.00 17.17 0.00 1.01 0.00 0.00 1.01
>>>> 11:31:57 AM 600 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00 99.01
>>>> 11:31:57 AM 601 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>>>> 11:31:57 AM 602 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>>>>
>>>>
>>>> There could some workloads which doesn't move irq's out, for which needs irqbalance change.
>>>> Looking into it.
>>>>
>>>> Or is the VMM expected to set
>>>>> the CPU on the paravirt mask and give the vCPU sufficient time to move the
>>>>> task before yanking it away from the pCPU?
>>>>>
>>>>
>>>> If the vCPU is running something, it is going to run at some point on pCPU.
>>>> hypervisor will give the cycles to this vCPU by preempting some other vCPU.
>>>>
>>>> It is that using this infra, there is should be nothing on that paravirt vCPU.
>>>> That way collectively VMM gets only limited request for pCPU which it can satify
>>>> without vCPU preemption.
>>>
>>> Ack! Just wanted to understand the usage.
>>>
>>> P.S. I remember discussions during last LPC where we could communicate
>>> this unavailability via CPU capacity. Was that problematic for some
>>> reason? Sorry if I didn't follow this discussion earlier.
>>>
>>
>> Thanks for that questions. Gives a opportunity to retrospect.
>>
>> Yes. That's where we started. but that has a lot of implementation challenges.
>> Still an option though.
>>
>> History upto current state:
>>
>> 1. At LPC24 presented the problem statement, and why existing approaches such as hotplug,
>> cpuset cgroup or taskset are not viable solution. Hotplug would have come handy if the cost was low.
>> The overhead of sched domain rebuild and serial nature of hotplug makes it not viable option.
>> One of the possible approach was CPU Capacity.
>
> Ack. Is creating an isolated partition on the fly too expensive too?
> I don't think creation of that partition is serialized and it should
> achieve a similar result with a single sched-domain rebuild and I'm
> hoping VMM doesn't change the paravirt mask at an alarming rate.
>
That is a good idea too.
The main issue is when the workload uses taskset.
For example:
taskset -c 650-700 stress-ng --cpu=100 -t 10
echo isolated > cpuset.cpus.partition
echo 600-710 > cpuset.cpus.exclusive
Tasks move out and their CPU affinity is reset to all CPUs, similar to hotplug.
But both hotplug and the write to exclusive are triggered by the user, and hence the user
is aware of it.
I don't think it is a good idea to reset a user's CPU affinity without an action from them.
Looking at the code (call chain, from a perf profile):
cpuset_write_resmask
  update_exclusive_cpumask
    update_parent_effective_cpumask
      cpuset_update_tasks_cpumask (6.16% in the profile)
        set_cpus_allowed_ptr
          __set_cpus_allowed_ptr
            affine_move_task
affine_move_task would call migration_cpu_stop, which moves one task at a time.
We do the same/similar thing in the paravirt infra, but we don't touch/reset the task's CPU affinity.
Affined tasks continue to run if they are affined only to paravirt CPUs. If there is at least one
non-paravirt CPU in their cpus_ptr, they will move there (sketched below).
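In pseudo-form, the fallback choice described above could look like this
(hypothetical helper, heavily simplified from what select_fallback_rq() does):

	/* Sketch: prefer an active, non-paravirt CPU from the task's own
	 * mask; if the task is affined only to paravirt CPUs, keep it
	 * where it is instead of resetting its affinity. */
	static int pick_non_paravirt_cpu(struct task_struct *p)
	{
		int cpu;

		for_each_cpu_and(cpu, p->cpus_ptr, cpu_active_mask) {
			if (!is_cpu_paravirt(cpu))
				return cpu;
		}
		return task_cpu(p);
	}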
> P.S. Some stupid benchmarking on a 256CPU machine:
>
> mkdir /sys/fs/cgroup/isol/
> echo isolated > /sys/fs/cgroup/isol/cpuset.cpus.partition
>
> time for i in {1..1000}; do \
> echo "8-15" > /sys/fs/cgroup/isol/cpuset.cpus.exclusive; \
> echo "16-23" > /sys/fs/cgroup/isol/cpuset.cpus.exclusive; \
> done
>
> real 2m50.016s
> user 0m0.198s
> sys 1m47.708s
>
> So that is about (170sec / 2000) ~ 85ms per cpuset operation.
That cost would be okay. The VMM isn't expected to change the mask at a very high rate.
> Definitely more expensive than setting the paravirt but compare that to:
>
> for i in {8..15}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done; \
> for i in {8..15}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done; \
> for i in {16..23}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done; \
> for i in {16..23}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done;'
>
> real 0m5.046s
> user 0m0.014s
> sys 0m0.110s
>
> Definitely less expensive than a full hotplug.
This happens mainly due to the synchronize_rcu there.
>
>>
>> 1. Issues with CPU Capacity approach:
>> a. Need to make group_misfit_task as the highest priority. That alone will break big.LITTLE
>> since it relies on group misfit and group_overload should have higher priority there.
>> b. At high concurrency tasks still moved those CPUs with CAPACITY=1.
>> c. A lot of scheduler stats would need to be aware of change in CAPACITY specially load balance/wakeup.
>
> Ack. Thinking out loud: Can capacity go to 0 via H/W pressure interface?
> Maybe we can toggle the "sched_asym_cpucapacity" static branch without
> actually having SD_ASYM_CAPACITY in these special case to enable
> asym_fits_cpu() steer away from these 0 capacity CPUs.
The bigger concern is around group_misfit_task, IMO.
>
>> d. in update_group_misfit - need to set the misfit load based on capacity. the current code sets to 0,
>> because of task_fits_cpu stuff
>> e. More challenges in RT.
>>
>> That's when Tobias had introduced a new group type called group_parked.
>> https://lore.kernel.org/all/20241204112149.25872-2-huschle@linux.ibm.com/
>> It has relatively cleaner implementation compared to CPU CAPACITY.
>>
>> It had a few disadvantages too:
>> 1. It use to take around 8-10 seconds for tasks to move out of those CPUs. That was the main
>> concern.
>> 2. Needs a few stats based changes in update_sg_lb_stats. might be tricky in all scenarios.
>>
>> That's when we were exploring how the tasks move out when the cpu goes offline. It happens quite fast too.
>> So tried a similar mechanism and this is where we are right now.
>
> I agree push is great from that perspective.
>
Yes. It is the same at the moment.
>>
>>> [..snip..]
>>>>>> + local_irq_save(flags);
>>>>>> + preempt_disable();
>>>>>
>>>>> Disabling IRQs implies preemption is disabled.
>>>>>
>>>>
>>>> Most of the places stop_one_cpu_nowait called with preemption & irq disabled.
>>>> stopper runs at the next possible opportunity.
>>>
>>> But is there any reason to do both local_irq_save() and
>>> preempt_disable()? include/linux/preempt.h defines preemptible() as:
>>>
>>> #define preemptible() (preempt_count() == 0 && !irqs_disabled())
>>>
>>> so disabling IRQs should be sufficient right or am I missing something?
>>>
>>
>> f0498d2a54e79 (Peter Zijlstra) "sched: Fix stop_one_cpu_nowait() vs hotplug"
>> could be the answer you are looking for.
>
> I think in all the cases covered by that commit, "task_rq_unlock(...)" would
> have enabled interrupts which required that specified pattern but here we
> have preempt_disable() within a local_irq_save() section which might not be
> necessary.
>
>>
>>>>
>>>> stop_one_cpu_nowait
>>>> ->queues the task into stopper list
>>>> -> wake_up_process(stopper)
>>>> -> set need_resched
>>>> -> stopper runs as early as possible.
>>>>
>>
>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
2025-09-11 5:40 ` K Prateek Nayak
2025-09-11 16:52 ` Shrikanth Hegde
@ 2025-11-10 4:54 ` Shrikanth Hegde
1 sibling, 0 replies; 33+ messages in thread
From: Shrikanth Hegde @ 2025-11-10 4:54 UTC (permalink / raw)
To: K Prateek Nayak
Cc: vschneid, iii, huschle, rostedt, dietmar.eggemann, vineeth,
jgross, pbonzini, seanjc, mingo, peterz, juri.lelli,
vincent.guittot, tglx, yury.norov, maddy, linux-kernel,
linuxppc-dev, gregkh
>> +
>> +static DEFINE_PER_CPU(struct cpu_stop_work, pv_push_task_work);
>> +
>> +static int paravirt_push_cpu_stop(void *arg)
>> +{
>> + struct task_struct *p = arg;
>
> Can we move all pushable tasks at once instead of just the rq->curr at
> the time of the tick? It can also avoid keeping the reference to "p"
> and only selectively pushing it. Thoughts?
>
>> + struct rq *rq = this_rq();
>> + struct rq_flags rf;
>> + int cpu;
>> +
>> + raw_spin_lock_irq(&p->pi_lock);
>> + rq_lock(rq, &rf);
>> + rq->push_task_work_done = 0;
>> +
>> + update_rq_clock(rq);
>> +
>> + if (task_rq(p) == rq && task_on_rq_queued(p)) {
>> + cpu = select_fallback_rq(rq->cpu, p);
>> + rq = __migrate_task(rq, &rf, p, cpu);
>> + }
>> +
>> + rq_unlock(rq, &rf);
>> + raw_spin_unlock_irq(&p->pi_lock);
>> + put_task_struct(p);
>> +
>> + return 0;
>> +}
>> +
Got it to work by using rt.pushable_tasks (RT) and rq->cfs_tasks (CFS).
I don't see any significant benefit from doing this. There is a slight improvement in the time
it takes to move the tasks out. This could help when there are way too many tasks on the rq.
But these days most systems run with HZ=1000, i.e. a 1ms tick, so it shouldn't take
very long to push the current task out. Also, the rq lock likely needs to be held across
the loop to ensure the list doesn't get altered by an IRQ etc.
Given the complexity, I prefer the method of pushing the current task out.
---
/* Push the RT tasks out first. */
plist_for_each_entry_safe(p, tmp_p, &orig_rq->rt.pushable_tasks, pushable_tasks) {
	rq = orig_rq;
	/* Per-CPU kthreads and migration-disabled tasks must stay put. */
	if (kthread_is_per_cpu(p) || is_migration_disabled(p))
		continue;
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	rq_lock(rq, &rf);
	update_rq_clock(rq);
	/* Recheck under the locks before migrating the task away. */
	if (task_rq(p) == rq && task_on_rq_queued(p)) {
		cpu = select_fallback_rq(rq->cpu, p);
		rq = __migrate_task(rq, &rf, p, cpu);
	}
	rq_unlock(rq, &rf);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
}
^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC PATCH v3 08/10] sysfs: Add paravirt CPU file
2025-09-10 17:42 [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (6 preceding siblings ...)
2025-09-10 17:42 ` [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU Shrikanth Hegde
@ 2025-09-10 17:42 ` Shrikanth Hegde
2025-09-10 17:42 ` [RFC PATCH v3 09/10] powerpc: Add debug file for set/unset paravirt CPUs Shrikanth Hegde
` (2 subsequent siblings)
10 siblings, 0 replies; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-10 17:42 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh
Cc: sshegde, vschneid, iii, huschle, rostedt, dietmar.eggemann,
vineeth, jgross, pbonzini, seanjc
Add a paravirt file in /sys/devices/system/cpu.
This offers the following:
- Users can quickly check which CPUs are marked as paravirt.
- Userspace schedulers such as sched_ext, or setups using isolcpus, could
use the mask to make decisions.
- Daemons such as irqbalance could use this mask to avoid spreading
IRQs onto paravirt CPUs. (A sketch of such a consumer is below.)
For example:
cat /sys/devices/system/cpu/paravirt
600-719 <<< arch marked these as paravirt.
cat /sys/devices/system/cpu/paravirt
<<< No paravirt CPUs at the moment.
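To illustrate the consumer side, a daemon could read the mask like this
(plain C sketch, not part of the patch):

	#include <stdio.h>

	int main(void)
	{
		char buf[4096];
		FILE *f = fopen("/sys/devices/system/cpu/paravirt", "r");

		if (!f)
			return 1;	/* no such file without CONFIG_PARAVIRT */
		if (fgets(buf, sizeof(buf), f))
			printf("paravirt CPUs: %s",
			       buf[0] == '\n' ? "(none)\n" : buf);
		fclose(f);
		return 0;
	}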
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
Documentation/ABI/testing/sysfs-devices-system-cpu | 9 +++++++++
drivers/base/cpu.c | 12 ++++++++++++
2 files changed, 21 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index ab8cd337f43a..6701e97d3f8d 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -776,3 +776,12 @@ Date: Nov 2022
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description:
(RO) the list of CPUs that can be brought online.
+
+What: /sys/devices/system/cpu/paravirt
+Date: Sep 2025
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+ (RO) the list of CPUs that are currently marked as paravirt CPUs.
+ These CPUs are not meant to be used at the moment due to
+ contention on the underlying physical CPU resource. This list
+ changes dynamically to reflect the current situation.
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index efc575a00edd..902747ff4988 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -374,6 +374,15 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
}
#endif
+#ifdef CONFIG_PARAVIRT
+static ssize_t print_paravirt_cpus(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_paravirt_mask));
+}
+static DEVICE_ATTR(paravirt, 0444, print_paravirt_cpus, NULL);
+#endif
+
const struct bus_type cpu_subsys = {
.name = "cpu",
.dev_name = "cpu",
@@ -513,6 +522,9 @@ static struct attribute *cpu_root_attrs[] = {
#endif
#ifdef CONFIG_GENERIC_CPU_AUTOPROBE
&dev_attr_modalias.attr,
+#endif
+#ifdef CONFIG_PARAVIRT
+ &dev_attr_paravirt.attr,
#endif
NULL
};
--
2.47.3
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH v3 09/10] powerpc: Add debug file for set/unset paravirt CPUs
2025-09-10 17:42 [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (7 preceding siblings ...)
2025-09-10 17:42 ` [RFC PATCH v3 08/10] sysfs: Add paravirt CPU file Shrikanth Hegde
@ 2025-09-10 17:42 ` Shrikanth Hegde
2025-09-10 17:42 ` [HELPER PATCH] sysfs: Provide write method for paravirt Shrikanth Hegde
2025-10-20 14:32 ` [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Sean Christopherson
10 siblings, 0 replies; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-10 17:42 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh
Cc: sshegde, vschneid, iii, huschle, rostedt, dietmar.eggemann,
vineeth, jgross, pbonzini, seanjc
PowerPC systems can be deployed as shared processor Logical Partitions (SPLPAR),
aka shared VMs. These configurations allow overcommit of CPU resources,
i.e. more virtual CPUs than physical CPUs.
When there is contention for physical CPUs in such cases, the arch needs
a mechanism to mark the CPUs as paravirt. It also needs to clear
them when the contention goes away.
Ideally the hint would come from the hypervisor. It would be more accurate,
since the hypervisor has knowledge of all SPLPARs deployed in the system.
Till the hint from the underlying hypervisor arrives, another idea is to
approximate the hint from steal time. There is some work ongoing, but
it is not there yet due to challenges revolving around limits and
convergence.
Till that happens, there is a need for a debugfs file which can be used to
set/unset the hint. The interface currently takes a number starting from which
CPUs will be marked as paravirt. It could be changed to one that takes a
cpumask (list of CPUs) in the future.
============== Usage Example ============
Let's say a 720 CPU system is observing 20% steal time. It is evident
that one should probably use only 576 CPUs (80% of 720). Do:
echo 576 > /sys/kernel/debug/powerpc/vp_manual_hint
cat /sys/devices/system/cpu/paravirt
576-719
This marks CPUs 576-719 as paravirt and moves the tasks out of these
CPUs. To unset, echo the total number of CPUs (720) or a higher value.
echo 720 > /sys/kernel/debug/powerpc/vp_manual_hint
cat /sys/devices/system/cpu/paravirt
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
arch/powerpc/include/asm/paravirt.h | 1 +
arch/powerpc/kernel/smp.c | 58 +++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
index b78b82d66057..8854da8e532c 100644
--- a/arch/powerpc/include/asm/paravirt.h
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -16,6 +16,7 @@
#include <asm/cputhreads.h>
DECLARE_STATIC_KEY_FALSE(shared_processor);
+DECLARE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
static inline bool is_shared_processor(void)
{
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 68edb66c2964..1c0d59d353bd 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -64,6 +64,7 @@
#include <asm/systemcfg.h>
#include <trace/events/ipi.h>
+#include <linux/debugfs.h>
#ifdef DEBUG
#include <asm/udbg.h>
@@ -82,6 +83,7 @@ bool has_big_cores __ro_after_init;
bool coregroup_enabled __ro_after_init;
bool thread_group_shares_l2 __ro_after_init;
bool thread_group_shares_l3 __ro_after_init;
+static int vp_manual_hint = NR_CPUS;
DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -1717,6 +1719,7 @@ static void __init build_sched_topology(void)
BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
set_sched_topology(powerpc_topology);
+ vp_manual_hint = num_present_cpus();
}
void __init smp_cpus_done(unsigned int max_cpus)
@@ -1797,4 +1800,59 @@ void __noreturn arch_cpu_idle_dead(void)
start_secondary_resume();
}
+#ifdef CONFIG_PARAVIRT
+/*
+ * debugfs hint to mark CPUs as paravirt. This helps in restricting
+ * the workload to a specified number of CPUs.
+ * For example, on a 720 CPU system, echo 576 > vp_manual_hint means the
+ * workload will run on CPUs 0-575. Tasks will move out of CPUs 576-719.
+ */
+
+static int pv_vp_manual_hint_set(void *data, u64 val)
+{
+ int cpu;
+ int online_cpus = num_online_cpus();
+
+ if (val == vp_manual_hint)
+ return 0;
+
+ if (val == 0 || val > online_cpus)
+ val = online_cpus;
+
+ vp_manual_hint = val;
+
+ if (vp_manual_hint < online_cpus)
+ static_branch_enable(&cpu_paravirt_push_tasks);
+ else
+ static_branch_disable(&cpu_paravirt_push_tasks);
+
+ for_each_online_cpu(cpu) {
+ if (cpu >= vp_manual_hint)
+ set_cpu_paravirt(cpu, true);
+ else
+ set_cpu_paravirt(cpu, false);
+ }
+ return 0;
+}
+
+static int pv_vp_manual_hint_get(void *data, u64 *val)
+{
+ *val = vp_manual_hint;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_vp_manual_hint, pv_vp_manual_hint_get,
+ pv_vp_manual_hint_set, "%llu\n");
+
+static __init int paravirt_debugfs_init(void)
+{
+ if (is_shared_processor())
+ debugfs_create_file("vp_manual_hint", 0600, arch_debugfs_dir,
+ NULL, &fops_pv_vp_manual_hint);
+ return 0;
+}
+
+device_initcall(paravirt_debugfs_init);
+#endif
+
#endif
--
2.47.3
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [HELPER PATCH] sysfs: Provide write method for paravirt
2025-09-10 17:42 [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (8 preceding siblings ...)
2025-09-10 17:42 ` [RFC PATCH v3 09/10] powerpc: Add debug file for set/unset paravirt CPUs Shrikanth Hegde
@ 2025-09-10 17:42 ` Shrikanth Hegde
2025-10-20 14:32 ` [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Sean Christopherson
10 siblings, 0 replies; 33+ messages in thread
From: Shrikanth Hegde @ 2025-09-10 17:42 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh
Cc: sshegde, vschneid, iii, huschle, rostedt, dietmar.eggemann,
vineeth, jgross, pbonzini, seanjc
This is a helper patch which can be used to set a range of CPUs as
paravirt. One could make use of this for quick testing of this infra
instead of writing arch-specific code.
This is currently not meant to be merged, since the paravirt sysfs file is meant
to be read-only.
echo 100-200,600-700 > /sys/devices/system/cpu/paravirt
cat /sys/devices/system/cpu/paravirt
100-200,600-700
echo > /sys/devices/system/cpu/paravirt
cat /sys/devices/system/cpu/paravirt
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
The idea was borrowed from Ilya's patch, shared with me internally.
It is up for debate whether to have something like this or the powerpc approach.
drivers/base/base.h | 4 ++++
drivers/base/cpu.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/drivers/base/base.h b/drivers/base/base.h
index 123031a757d9..bd93b2895b24 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -264,3 +264,7 @@ static inline int devtmpfs_delete_node(struct device *dev) { return 0; }
void software_node_notify(struct device *dev);
void software_node_notify_remove(struct device *dev);
+
+#ifdef CONFIG_PARAVIRT
+DECLARE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
+#endif
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 902747ff4988..d66cbd0c3060 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -375,12 +375,53 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
#endif
#ifdef CONFIG_PARAVIRT
+static ssize_t store_paravirt_cpus(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ cpumask_var_t temp_mask;
+ int retval = 0;
+
+ if (!alloc_cpumask_var(&temp_mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ retval = cpulist_parse(buf, temp_mask);
+ if (retval)
+ goto free_mask;
+
+ /* ALL cpus can't be marked as paravirt */
+ if (cpumask_equal(temp_mask, cpu_online_mask)) {
+ retval = -EINVAL;
+ goto free_mask;
+ }
+ if (cpumask_weight(temp_mask) > num_online_cpus()) {
+ retval = -EINVAL;
+ goto free_mask;
+ }
+
+ /* No more paravirt cpus */
+ if (cpumask_empty(temp_mask)) {
+ static_branch_disable(&cpu_paravirt_push_tasks);
+ cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
+
+ } else {
+ static_branch_enable(&cpu_paravirt_push_tasks);
+ cpumask_copy((struct cpumask *)&__cpu_paravirt_mask, temp_mask);
+ }
+
+ retval = count;
+
+free_mask:
+ free_cpumask_var(temp_mask);
+ return retval;
+}
+
static ssize_t print_paravirt_cpus(struct device *dev,
struct device_attribute *attr, char *buf)
{
return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_paravirt_mask));
}
-static DEVICE_ATTR(paravirt, 0444, print_paravirt_cpus, NULL);
+static DEVICE_ATTR(paravirt, 0644, print_paravirt_cpus, store_paravirt_cpus);
#endif
const struct bus_type cpu_subsys = {
--
2.47.3
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
2025-09-10 17:42 [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Shrikanth Hegde
` (9 preceding siblings ...)
2025-09-10 17:42 ` [HELPER PATCH] sysfs: Provide write method for paravirt Shrikanth Hegde
@ 2025-10-20 14:32 ` Sean Christopherson
2025-10-20 15:05 ` Paolo Bonzini
2025-10-21 6:10 ` Shrikanth Hegde
10 siblings, 2 replies; 33+ messages in thread
From: Sean Christopherson @ 2025-10-20 14:32 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh, vschneid, iii, huschle,
rostedt, dietmar.eggemann, vineeth, jgross, pbonzini
On Wed, Sep 10, 2025, Shrikanth Hegde wrote:
> tl;dr
>
> This is follow up of [1] with few fixes and addressing review comments.
> Upgraded it to RFC PATCH from RFC.
> Please review.
>
> [1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
>
> v2 -> v3:
> - Renamed to paravirt CPUs
There are myriad uses of "paravirt" throughout Linux and related environments,
and none of them mean "oversubscribed" or "contended". I assume Hillf's comments
triggered the rename from "avoid CPUs", but IMO "avoid" is at least somewhat
accurate; "paravirt" is wildly misleading.
> - Folded the changes under CONFIG_PARAVIRT.
> - Fixed the crash due work_buf corruption while using
> stop_one_cpu_nowait.
> - Added sysfs documentation.
> - Copy most of __balance_push_cpu_stop to new one, this helps it move
> the code out of CONFIG_HOTPLUG_CPU.
> - Some of the code movement suggested.
>
> -----------------
> ::Detailed info::
> -----------------
> Problem statement
>
> vCPU - Virtual CPUs - CPU in VM world.
> pCPU - Physical CPUs - CPU in baremetal world.
>
> A hypervisor does scheduling of vCPUs on a pCPUs. It has to give each
> vCPU some cycles and be fair. When there are more vCPU requests than
> the pCPUs, hypervsior has to preempt some vCPUs in order to run others.
> This is called as vCPU preemption.
>
> If we take two VM's, When hypervisor preempts vCPU from VM1 to run vCPU from
> VM2, it has to do save/restore VM context.Instead if VM's can co-ordinate among
> each other and request for limited vCPUs, it avoids the above overhead and
> there is context switching within vCPU(less expensive). Even if hypervisor
> is preempting one vCPU to run another within the same VM, it is still more
> expensive than the task preemption within the vCPU. So basic aim to avoid
> vCPU preemption.
>
> So to achieve this, introduce "Paravirt CPU" concept, where it is better if
> workload avoids these vCPUs at this moment. (vCPUs stays online, don't want
> the overhead of sched domain rebuild and hotplug takes a lot of time too).
>
> When there is contention, don't use paravirt CPUs.
> When there is no contention, use all vCPUs.
...
> ------------
> Open issues:
>
> - Derivation of hint from steal time is still a challenge. Some work is
> underway to address it.
>
> - Consider kvm and other hypervsiors and how they could derive the hint.
> Need inputs from community.
Bluntly, this series is never going to land, at least not in a form that's remotely
close to what is proposed here. This is an incredibly simplistic way of handling
overcommit, and AFAICT there's no line of sight to supporting more complex scenarios.
I.e. I don't see a path to resolving all these "todos" in the changelog from the
last patch:
: Ideal would be get the hint from hypervisor. It would be more accurate
: since it has knowledge of all SPLPARs deployed in the system.
:
: Till the hint from underlying hypervisor arrives, another idea is to
: approximate the hint from steal time. There are some works ongoing, but
: not there yet due to challenges revolving around limits and
: convergence.
:
: Till that happens, there is a need for debugfs file which could be used to
: set/unset the hint. The interface currently is number starting from which
: CPUs will marked as paravirt. It could be changed to one the takes a
: cpumask(list of CPUs) in future.
I see Vineeth and Steven are on the Cc. Argh, and you even commented on their
first RFC[1], where it was made quite clear that sprinkling one-off "hints"
throughout the kernel wasn't a viable approach.
I don't know the current status of the ChromeOS work, but there was agreement in
principle that the bulk of paravirt scheduling should not need to touch the kernel
(host or guest)[2].
[1] https://lore.kernel.org/all/20231214024727.3503870-1-vineeth@bitbyteword.org
[2] https://lore.kernel.org/all/ZjJf27yn-vkdB32X@google.com
^ permalink raw reply [flat|nested] 33+ messages in thread* Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
2025-10-20 14:32 ` [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Sean Christopherson
@ 2025-10-20 15:05 ` Paolo Bonzini
2025-10-23 4:03 ` Shrikanth Hegde
2025-10-21 6:10 ` Shrikanth Hegde
1 sibling, 1 reply; 33+ messages in thread
From: Paolo Bonzini @ 2025-10-20 15:05 UTC (permalink / raw)
To: Sean Christopherson, Shrikanth Hegde
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh, vschneid, iii, huschle,
rostedt, dietmar.eggemann, vineeth, jgross
On 10/20/25 16:32, Sean Christopherson wrote:
> : Till the hint from underlying hypervisor arrives, another idea is to
> : approximate the hint from steal time.
I think this is the first thing to look at.
Perhaps single_task_running() can be exposed in the x86 steal time data
structure, and in fact even in the rseq data for non-VM use cases? This
is not specific to VMs, and I'd like the steal time implementation to
follow in the footsteps of rseq rather than the opposite.
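For concreteness, the x86 structure that would carry such a bit is shown below
(the KVM_VCPU_PCPU_CONTENDED comment marks the hypothetical addition; the
layout otherwise follows arch/x86/include/uapi/asm/kvm_para.h):

	struct kvm_steal_time {
		__u64 steal;
		__u32 version;
		__u32 flags;	/* a hypothetical KVM_VCPU_PCPU_CONTENDED
				 * bit could be surfaced here */
		__u8 preempted;
		__u8 u8_pad[3];
		__u32 pad[11];
	};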
Paolo
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
2025-10-20 15:05 ` Paolo Bonzini
@ 2025-10-23 4:03 ` Shrikanth Hegde
0 siblings, 0 replies; 33+ messages in thread
From: Shrikanth Hegde @ 2025-10-23 4:03 UTC (permalink / raw)
To: Paolo Bonzini, Sean Christopherson
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh, vschneid, iii, huschle,
rostedt, dietmar.eggemann, vineeth, jgross
Hi Paolo. Thanks for looking into this series.
On 10/20/25 8:35 PM, Paolo Bonzini wrote:
> On 10/20/25 16:32, Sean Christopherson wrote:
>> : Till the hint from underlying hypervisor arrives, another idea is to
>> : approximate the hint from steal time.
>
> I think this is the first thing to look at.
>
The current code I have does the below. All of this happens in the guest;
there is no change in the host. (The host is running PowerVM, a non-Linux hypervisor.)
Every 1 second (configurable):
1. Low and high steal time thresholds are defined (configurable).
2. Gather steal time from all CPUs.
3. If it is higher than the high threshold, reduce the core (SMT8) usage by 1.
4. If it is lower than the low threshold, increase core usage by 1.
5. Avoid ping-pong as much as possible.
It's initial code to try out whether this works when plumbed into the push-current-task
framework given in the series; a rough sketch of the loop is below.
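A rough sketch of that loop (every name, threshold and helper below is an
assumption, not the actual code):

	#define STEAL_HIGH_PCT	15
	#define STEAL_LOW_PCT	5

	static unsigned int usable_cores;	/* starts at total_cores */

	static void steal_governor_tick(void)
	{
		/* read_steal_pct(): hypothetical helper averaging steal
		 * time across online CPUs over the last interval. */
		unsigned int steal = read_steal_pct();

		if (steal > STEAL_HIGH_PCT && usable_cores > 1)
			usable_cores--;		/* shed one SMT8 core */
		else if (steal < STEAL_LOW_PCT && usable_cores < total_cores)
			usable_cores++;		/* grow back one core */

		/* mark_paravirt_above(): hypothetical, marks every CPU at
		 * or above the given index as paravirt. */
		mark_paravirt_above(usable_cores * 8);
	}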
> Perhaps single_task_running() can be exposed in the x86 steal time data
> structure, and in fact even in the rseq data for non-VM usecases? This
> is not specific to VMs and I'd like the steal time implementation to
> follow the footsteps of rseq rather than the opposite.
>
> Paolo
>
Sorry, I didn't follow. Do you mean KVM use cases?
I don't know much about rseq (it's on my todo list). Is there any specific implementation
done via rseq that you are referring to, which I could look at?
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
2025-10-20 14:32 ` [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption Sean Christopherson
2025-10-20 15:05 ` Paolo Bonzini
@ 2025-10-21 6:10 ` Shrikanth Hegde
2025-10-22 18:46 ` Sean Christopherson
1 sibling, 1 reply; 33+ messages in thread
From: Shrikanth Hegde @ 2025-10-21 6:10 UTC (permalink / raw)
To: Sean Christopherson
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh, vschneid, iii, huschle,
rostedt, dietmar.eggemann, vineeth, jgross, pbonzini
Hi Sean.
Thanks for taking time and going through the series.
On 10/20/25 8:02 PM, Sean Christopherson wrote:
> On Wed, Sep 10, 2025, Shrikanth Hegde wrote:
>> tl;dr
>>
>> This is follow up of [1] with few fixes and addressing review comments.
>> Upgraded it to RFC PATCH from RFC.
>> Please review.
>>
>> [1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
>>
>> v2 -> v3:
>> - Renamed to paravirt CPUs
>
> There are myriad uses of "paravirt" throughout Linux and related environments,
> and none of them mean "oversubscribed" or "contended". I assume Hillf's comments
> triggered the rename from "avoid CPUs", but IMO "avoid" is at least somewhat
> accurate; "paravirt" is wildly misleading.
The name has been tricky. We want a positive-sounding name while conveying
that these CPUs are not to be used for now due to contention;
they may be used again when the contention has gone.
>
>> - Folded the changes under CONFIG_PARAVIRT.
>> - Fixed the crash due work_buf corruption while using
>> stop_one_cpu_nowait.
>> - Added sysfs documentation.
>> - Copy most of __balance_push_cpu_stop to new one, this helps it move
>> the code out of CONFIG_HOTPLUG_CPU.
>> - Some of the code movement suggested.
>>
>> -----------------
>> ::Detailed info::
>> -----------------
>> Problem statement
>>
>> vCPU - Virtual CPUs - CPU in VM world.
>> pCPU - Physical CPUs - CPU in baremetal world.
>>
>> A hypervisor does scheduling of vCPUs on a pCPUs. It has to give each
>> vCPU some cycles and be fair. When there are more vCPU requests than
>> the pCPUs, hypervsior has to preempt some vCPUs in order to run others.
>> This is called as vCPU preemption.
>>
>> If we take two VM's, When hypervisor preempts vCPU from VM1 to run vCPU from
>> VM2, it has to do save/restore VM context.Instead if VM's can co-ordinate among
>> each other and request for limited vCPUs, it avoids the above overhead and
>> there is context switching within vCPU(less expensive). Even if hypervisor
>> is preempting one vCPU to run another within the same VM, it is still more
>> expensive than the task preemption within the vCPU. So basic aim to avoid
>> vCPU preemption.
>>
>> So to achieve this, introduce "Paravirt CPU" concept, where it is better if
>> workload avoids these vCPUs at this moment. (vCPUs stays online, don't want
>> the overhead of sched domain rebuild and hotplug takes a lot of time too).
>>
>> When there is contention, don't use paravirt CPUs.
>> When there is no contention, use all vCPUs.
>
> ...
>
>> ------------
>> Open issues:
>>
>> - Derivation of hint from steal time is still a challenge. Some work is
>> underway to address it.
>>
>> - Consider kvm and other hypervsiors and how they could derive the hint.
>> Need inputs from community.
>
> Bluntly, this series is never going to land, at least not in a form that's remotely
> close to what is proposed here. This is an incredibly simplistic way of handling
> overcommit, and AFAICT there's no line of sight to supporting more complex scenarios.
>
Could you describe these complex scenarios?
The current use case has been on two archs: powerpc and s390.
IIUC, both have a non-Linux hypervisor running on the host, and Linux guests.
Currently the s390 hypervisor has a way of marking a vCPU as Vertical High,
Vertical Medium or Vertical Low. So when there is steal time, the arch could easily
mark the Vertical Lows as "paravirt" CPUs, e.g. as sketched below.
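For illustration, the s390 glue could be as simple as this sketch
(smp_cpu_get_polarization() and POLARIZATION_VL exist in arch/s390; the
function itself is assumed):

	static void s390_update_paravirt_mask(bool contended)
	{
		int cpu;

		/* Mark vertical-low vCPUs paravirt while steal time is
		 * observed; clear them all once contention goes away. */
		for_each_online_cpu(cpu)
			set_cpu_paravirt(cpu, contended &&
					 smp_cpu_get_polarization(cpu) == POLARIZATION_VL);
	}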
> I.e. I don't see a path to resolving all these "todos" in the changelog from the
> last patch:
>
> : Ideal would be get the hint from hypervisor. It would be more accurate
> : since it has knowledge of all SPLPARs deployed in the system.
> :
> : Till the hint from underlying hypervisor arrives, another idea is to
> : approximate the hint from steal time. There are some works ongoing, but
> : not there yet due to challenges revolving around limits and
> : convergence.
> :
> : Till that happens, there is a need for debugfs file which could be used to
> : set/unset the hint. The interface currently is number starting from which
> : CPUs will marked as paravirt. It could be changed to one the takes a
> : cpumask(list of CPUs) in future.
>
> I see Vineeth and Steven are on the Cc. Argh, and you even commented on their
> first RFC[1], where it was made quite clear that sprinkling one-off "hints"
> throughoug the kernel wasn't a viable approach.
IIRC, it was in the other direction: the guest was asking the host to mark some vCPU as
an RT task to have it boosted in the host.
>
> I don't know the current status of the ChromeOS work, but there was agreement in
> principle that the bulk of paravirt scheduling should not need to touch the kernel
> (host or guest)[2].
>
If, based on some event, all the tasks on a CPU have to move out, then the scheduler needs
to be involved, no? To move the tasks out, and to not schedule anything new on that CPU.
The current mechanisms such as CPU hotplug and isolated partitions all break task affinity,
so a new mechanism is needed.
Note: the host is not running a Linux kernel. We are requesting the host to provide this info
through an HCALL or the VPA area.
> [1] https://lore.kernel.org/all/20231214024727.3503870-1-vineeth@bitbyteword.org
> [2] https://lore.kernel.org/all/ZjJf27yn-vkdB32X@google.com
Vineeth,
what's the latest on the vCPU boost framework? AFAIR both guest and host were running Linux there.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
2025-10-21 6:10 ` Shrikanth Hegde
@ 2025-10-22 18:46 ` Sean Christopherson
2025-10-30 17:43 ` Shrikanth Hegde
0 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2025-10-22 18:46 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh, vschneid, iii, huschle,
rostedt, dietmar.eggemann, vineeth, jgross, pbonzini
On Tue, Oct 21, 2025, Shrikanth Hegde wrote:
>
> Hi Sean.
> Thanks for taking time and going through the series.
>
> On 10/20/25 8:02 PM, Sean Christopherson wrote:
> > On Wed, Sep 10, 2025, Shrikanth Hegde wrote:
> > > tl;dr
> > >
> > > This is follow up of [1] with few fixes and addressing review comments.
> > > Upgraded it to RFC PATCH from RFC.
> > > Please review.
> > >
> > > [1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
> > >
> > > v2 -> v3:
> > > - Renamed to paravirt CPUs
> >
> > There are myriad uses of "paravirt" throughout Linux and related environments,
> > and none of them mean "oversubscribed" or "contended". I assume Hillf's comments
> > triggered the rename from "avoid CPUs", but IMO "avoid" is at least somewhat
> > accurate; "paravirt" is wildly misleading.
>
> Name has been tricky. We want to have a positive sounding name while
> conveying that these CPUs are not be used for now due to contention,
> they may be used again when the contention has gone.
I suspect part of the problem with naming is the all-or-nothing approach itself.
There's a _lot_ of policy baked into that seemingly simple decision, and thus
it's hard to describe with a human-friendly name.
> > > Open issues:
> > >
> > > - Derivation of hint from steal time is still a challenge. Some work is
> > > underway to address it.
> > >
> > > - Consider kvm and other hypervsiors and how they could derive the hint.
> > > Need inputs from community.
> >
> > Bluntly, this series is never going to land, at least not in a form that's remotely
> > close to what is proposed here. This is an incredibly simplistic way of handling
> > overcommit, and AFAICT there's no line of sight to supporting more complex scenarios.
> >
>
> Could you describe these complex scenarios?
Any setup where "don't use this CPU" isn't a viable option, e.g. because all cores
could be overcommitted at any given time, or is far, far too coarse-grained. Very
few use cases can distill vCPU scheduling needs and policies into single flag.
E.g. if all CPUs in a system are being used to vCPU tasks, all vCPUs are actively
running, and the host has a non-vCPU task that _must_ run, then the host will need
to preempt a vCPU task. Ideally, a paravirtualized scheduling system would allow
the host to make an informed decision when choosing which vCPU to preempt, e.g. to
minimize disruption to the guest(s).
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
2025-10-22 18:46 ` Sean Christopherson
@ 2025-10-30 17:43 ` Shrikanth Hegde
0 siblings, 0 replies; 33+ messages in thread
From: Shrikanth Hegde @ 2025-10-30 17:43 UTC (permalink / raw)
To: Sean Christopherson
Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
maddy, linux-kernel, linuxppc-dev, gregkh, vschneid, iii, huschle,
rostedt, dietmar.eggemann, vineeth, jgross, pbonzini
Hi Sean.
On 10/23/25 12:16 AM, Sean Christopherson wrote:
> On Tue, Oct 21, 2025, Shrikanth Hegde wrote:
>>
>> Hi Sean.
>> Thanks for taking time and going through the series.
>>
>> On 10/20/25 8:02 PM, Sean Christopherson wrote:
>>> On Wed, Sep 10, 2025, Shrikanth Hegde wrote:
>>>> tl;dr
>>>>
>>>> This is follow up of [1] with few fixes and addressing review comments.
>>>> Upgraded it to RFC PATCH from RFC.
>>>> Please review.
>>>>
>>>> [1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
>>>>
>>>> v2 -> v3:
>>>> - Renamed to paravirt CPUs
>>>
>>> There are myriad uses of "paravirt" throughout Linux and related environments,
>>> and none of them mean "oversubscribed" or "contended". I assume Hillf's comments
>>> triggered the rename from "avoid CPUs", but IMO "avoid" is at least somewhat
>>> accurate; "paravirt" is wildly misleading.
>>
>> Name has been tricky. We want to have a positive sounding name while
>> conveying that these CPUs are not be used for now due to contention,
>> they may be used again when the contention has gone.
>
> I suspect part of the problem with naming is the all-or-nothing approach itself.
> There's a _lot_ of policy baked into that seemingly simple decision, and thus
> it's hard to describe with a human-friendly name.
>
Open for suggestions :)
>>>> Open issues:
>>>>
>>>> - Derivation of hint from steal time is still a challenge. Some work is
>>>> underway to address it.
>>>>
>>>> - Consider kvm and other hypervsiors and how they could derive the hint.
>>>> Need inputs from community.
>>>
>>> Bluntly, this series is never going to land, at least not in a form that's remotely
>>> close to what is proposed here. This is an incredibly simplistic way of handling
>>> overcommit, and AFAICT there's no line of sight to supporting more complex scenarios.
>>>
>>
>> Could you describe these complex scenarios?
>
> Any setup where "don't use this CPU" isn't a viable option, e.g. because all cores
> could be overcommitted at any given time, or is far, far too coarse-grained. Very
> few use cases can distill vCPU scheduling needs and policies into single flag.
>
Okay, let me explain what the current thought process is.
S390 and pseries are the current main use cases.
On s390, the Z hypervisor provides a distinction among vCPUs: vCPUs are marked as Vertical High,
Vertical Medium or Vertical Low. When there is steal time, it is recommended
to use the Vertical Highs and avoid using the Vertical Lows. In such cases, using this infra, one
can avoid scheduling anything on these Vertical Low vCPUs. A performance benefit is
observed, since there is less contention and the CPU cycles come mainly from the Vertical Highs.
The PowerVM hypervisor dispatches a full core at a time: all SMT=8 siblings are always dispatched
to the same core. That means it is beneficial to schedule vCPU siblings together at the core level.
When there is contention for pCPUs, the full core is preempted, i.e. all vCPUs belonging to that
core are preempted. In such cases, depending on the overcommit configuration and on the steal time,
one could limit core usage by using a limited set of vCPUs. When done that way, we see better
latency numbers and an increase in throughput compared to out-of-the-box. The cover letter has those numbers.
Now, let's come to KVM with Linux running as the hypervisor. Correct me if I am wrong:
each vCPU in KVM is a process in the host. When a vCPU is running, that process is in the
running state as well. When there is overcommit and all vCPUs are running, there are more
processes than physical CPUs, and the host has to context switch and will preempt one vCPU
to run another. It can also preempt a vCPU to run some host process.
If we restrict the number of vCPUs where the workload is currently running, then the
number of runnable processes in the host also reduces, and there is less chance of host context switches.
Since this avoids the overhead of KVM context save/restore, the workload is likely to benefit.
I guess it is possible to distinguish between a host process and a vCPU running as a process.
If so, the host can decide how many threads it can optimally run and give a signal to each guest
depending on the configuration.
Currently this is kept arch dependent, since IMHO each hypervisor is in the right place to
make that decision. Not sure a one-size-fits-all approach works here; a sketch of the arch glue is below.
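As a sketch, mirroring the powerpc vp_manual_hint handler from patch 09
(the function name is assumed; the body follows that patch):

	static void arch_apply_paravirt_hint(unsigned int usable_cpus)
	{
		int cpu;

		/* Enable the push-task machinery only while some CPUs are
		 * actually marked paravirt. */
		if (usable_cpus < num_online_cpus())
			static_branch_enable(&cpu_paravirt_push_tasks);
		else
			static_branch_disable(&cpu_paravirt_push_tasks);

		for_each_online_cpu(cpu)
			set_cpu_paravirt(cpu, cpu >= usable_cpus);
	}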
Another tricky point is what form this signal takes. It could be an hcall, the VPA area,
some shared memory region, or a BPF method similar to the vCPU boosting patch series.
There too, I think it is best to leave it to the arch to specify how, the reason being that
the BPF method will not work for PowerVM hypervisors.
> E.g. if all CPUs in a system are being used to vCPU tasks, all vCPUs are actively
> running, and the host has a non-vCPU task that _must_ run, then the host will need
> to preempt a vCPU task. Ideally, a paravirtualized scheduling system would allow
The host/hypervisor need not mark a vCPU as "don't use" every single time it preempts.
It needs to do so only when there are more vCPU processes than physical CPUs and
preemption is happening between vCPU processes.
There will be corner cases, such as when there is only one physical CPU and two
KVM guests each with one vCPU; then nothing much can be done.
> the host to make an informed decision when choosing which vCPU to preempt, e.g. to
> minimize disruption to the guest(s).
^ permalink raw reply [flat|nested] 33+ messages in thread