public inbox for linux-s390@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
       [not found] <20251119124449.1149616-1-sshegde@linux.ibm.com>
@ 2025-12-04 13:28 ` Ilya Leoshkevich
  2025-12-05  5:30   ` Shrikanth Hegde
  0 siblings, 1 reply; 4+ messages in thread
From: Ilya Leoshkevich @ 2025-12-04 13:28 UTC (permalink / raw)
  To: Shrikanth Hegde, linux-kernel, linuxppc-dev
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy, srikar, gregkh, pbonzini, seanjc, kprateek.nayak, vschneid,
	huschle, rostedt, dietmar.eggemann, christophe.leroy, linux-s390

On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices
> were 
> discussed earlier[1].
> 
> [1]:
> https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
> 
> This is likely the version which would be used for the LPC2025
> discussion on this topic. Feel free to provide your suggestions;
> hoping for a solution that works for different architectures and
> their use cases.
> 
> All the existing alternatives, such as CPU hotplug or creating
> isolated partitions, break the user affinity. Since the number of
> CPUs to use changes depending on the steal time, it is not driven by
> the user. Hence it would be wrong to break the affinity. This series
> ensures that if a task is pinned only to paravirt CPUs, it will
> continue running there.
> 
> Changes compared to v3[1]:
> 
> - Introduced computation of steal time in powerpc code.
> - Derive the number of CPUs to use and mark the remaining as paravirt
>   based on steal values.
> - Provide debugfs knobs to alter how steal time values are used.
> - Removed static key check for paravirt CPUs (Yury)
> - Removed preempt_disable/enable while calling stopper (Prateek)
> - Made select_idle_sibling and friends aware of paravirt CPUs.
> - Removed 3 unused schedstat fields and introduced 2 related to
>   paravirt handling.
> - Handled nohz_full case by enabling tick on it when there is CFS/RT
>   on it.
> - Updated helper patch to override arch behaviour for easier
>   debugging during development.
> - Kept 
> 
> Changes compared to v4[2]:
> - Last two patches were sent out separately instead of being with the
>   series. That created confusion. Those two patches are debug patches
>   one can use to check functionality across architectures. Sorry
>   about that.
> - Use DEVICE_ATTR_RW instead (greg)
> - Made it a PATCH since arch-specific handling completes the
>   functionality.
> 
> [2]:
> https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
> 
> TODO: 
> 
> - Get performance numbers on PowerPC, x86 and S390. Hopefully by next
>   week. Didn't want to hold the series till then.
> 
> - The choice of CPUs to mark as paravirt is very simple and doesn't
>   work when vCPUs aren't spread out uniformly across NUMA nodes.
>   Ideally one would split the numbers based on how many CPUs each
>   NUMA node has. It is quite tricky to do, especially since the
>   cpumask can be on the stack too, given NR_CPUS can be 8192 and
>   nr_possible_nodes 32. Haven't got my head into solving it yet.
>   Maybe there is an easier way.
> 
> - DLPAR Add/Remove needs to call init of EC/VP cores (powerpc
>   specific)
> 
> - Userspace tools awareness such as irqbalance. 
> 
> - Delve into the design of a hint from the hypervisor (HW hint), i.e.
>   the host informs the guest which/how many CPUs it has to use at
>   this moment. This interface should work across archs, with each
>   arch doing its specific handling.
> 
> - Determine the default values for steal time related knobs
>   empirically and document them.
> 
> - Need to check safety against CPU hotplug, especially in
>   process_steal.
> 
> 
> Applies cleanly on tip/master:
> commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b
> 
> 
> Thanks to Srikar for providing the initial powerpc steal time
> handling code. Thanks to all who went through the series and provided
> reviews.
> 
> PS: I haven't found a better name. Please suggest if you have any.
> 
> Shrikanth Hegde (17):
>   sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
>   cpumask: Introduce cpu_paravirt_mask
>   sched/core: Dont allow to use CPU marked as paravirt
>   sched/debug: Remove unused schedstats
>   sched/fair: Add paravirt movements for proc sched file
>   sched/fair: Pass current cpu in select_idle_sibling
>   sched/fair: Don't consider paravirt CPUs for wakeup and load
> balance
>   sched/rt: Don't select paravirt CPU for wakeup and push/pull rt
> task
>   sched/core: Add support for nohz_full CPUs
>   sched/core: Push current task from paravirt CPU
>   sysfs: Add paravirt CPU file
>   powerpc: method to initialize ec and vp cores
>   powerpc: enable/disable paravirt CPUs based on steal time
>   powerpc: process steal values at fixed intervals
>   powerpc: add debugfs file for controlling handling on steal values
>   sysfs: Provide write method for paravirt
>   sysfs: disable arch handling if paravirt file being written
> 
>  .../ABI/testing/sysfs-devices-system-cpu      |   9 +
>  Documentation/scheduler/sched-arch.rst        |  37 +++
>  arch/powerpc/include/asm/smp.h                |   1 +
>  arch/powerpc/kernel/smp.c                     |   1 +
>  arch/powerpc/platforms/pseries/lpar.c         | 223
> ++++++++++++++++++
>  arch/powerpc/platforms/pseries/pseries.h      |   1 +
>  drivers/base/cpu.c                            |  59 +++++
>  include/linux/cpumask.h                       |  20 ++
>  include/linux/sched.h                         |   9 +-
>  kernel/sched/core.c                           | 106 ++++++++-
>  kernel/sched/debug.c                          |   5 +-
>  kernel/sched/fair.c                           |  42 +++-
>  kernel/sched/rt.c                             |  11 +-
>  kernel/sched/sched.h                          |   9 +
>  14 files changed, 519 insertions(+), 14 deletions(-)

The capability to temporarily exclude CPUs from scheduling might be
beneficial for s390x, where users often run Linux under a proprietary
hypervisor called PR/SM, with high overcommit. In these circumstances
virtual CPUs may not be scheduled by the hypervisor for a very long
time.

Today we have an upstream feature called "Hiperdispatch", which
determines that this is about to happen and uses Capacity Aware
Scheduling to prevent processes from being placed on the affected CPUs.
However, at least when used for this purpose, Capacity Aware Scheduling
is best effort and fails to move tasks away from the affected CPUs
under high load.

Therefore I have decided to smoke test this series.

For the purposes of smoke testing, I set up a number of KVM virtual
machines and start the same benchmark inside each one. Then I collect
and compare the aggregate throughput numbers. I have not done testing
with PR/SM yet, but I plan to do this and report back. I also have not
tested this with VMs that are not 100% utilized yet.

Benchmark parameters:

$ sysbench cpu run --threads=$(nproc) --time=10
$ schbench -r 10 --json --no-locking 
$ hackbench --groups 10 --process --loops 5000
$ pgbench -h $WORKDIR --client=$(nproc) --time=10
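For context, the aggregate comparison can be sketched as follows (the helper name and the numbers are illustrative, not from the actual test harness): per-VM throughput is summed for the baseline and patched runs, and the relative change is reported.

```c
/* Illustrative only: sum per-VM throughput for baseline and patched
 * runs and return the relative change in percent. */
double delta_rps_pct(const double *base, const double *patched, int nvms)
{
	double b = 0.0, p = 0.0;

	for (int i = 0; i < nvms; i++) {
		b += base[i];
		p += patched[i];
	}
	return (p - b) / b * 100.0;
}
```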

Figures:

s390x (16 host CPUs):

Benchmark      #VMs    #CPUs/VM  ΔRPS (%)
-----------  ------  ----------  ----------
hackbench        16           4  60.58%
pgbench          16           4  50.01%
hackbench         8           8  46.18%
hackbench         4           8  43.54%
hackbench         2          16  43.23%
hackbench        12           4  42.92%
hackbench         8           4  35.53%
hackbench         4          16  30.98%
pgbench          12           4  18.41%
hackbench         2          24  7.32%
pgbench           8           4  6.84%
pgbench           2          24  3.38%
pgbench           2          16  3.02%
pgbench           4          16  2.08%
hackbench         2          32  1.46%
pgbench           4           8  1.30%
schbench          2          16  0.72%
schbench          4           8  -0.09%
schbench          4           4  -0.20%
schbench          8           8  -0.41%
sysbench          8           4  -0.46%
sysbench          4           8  -0.53%
schbench          8           4  -0.65%
sysbench          2          16  -0.76%
schbench          2           8  -0.77%
sysbench          8           8  -1.72%
schbench          2          24  -1.98%
schbench         12           4  -2.03%
sysbench         12           4  -2.13%
pgbench           2          32  -3.15%
sysbench         16           4  -3.17%
schbench         16           4  -3.50%
sysbench          2           8  -4.01%
pgbench           8           8  -4.10%
schbench          4          16  -5.93%
sysbench          4           4  -5.94%
pgbench           2           4  -6.40%
hackbench         2           8  -10.04%
hackbench         4           4  -10.91%
pgbench           4           4  -11.05%
sysbench          2          24  -13.07%
sysbench          4          16  -13.59%
hackbench         2           4  -13.96%
pgbench           2           8  -16.16%
schbench          2           4  -24.14%
schbench          2          32  -24.25%
sysbench          2           4  -24.98%
sysbench          2          32  -32.84%

x86_64 (32 host CPUs):

Benchmark      #VMs    #CPUs/VM  ΔRPS (%)
-----------  ------  ----------  ----------
hackbench         4          32  87.02%
hackbench         8          16  48.45%
hackbench         4          24  47.95%
hackbench         2           8  42.74%
hackbench         2          32  34.90%
pgbench          16           8  27.87%
pgbench          12           8  25.17%
hackbench         8           8  24.92%
hackbench        16           8  22.41%
hackbench        16           4  20.83%
pgbench           8          16  20.40%
hackbench        12           8  20.37%
hackbench         4          16  20.36%
pgbench          16           4  16.60%
pgbench           8           8  14.92%
hackbench        12           4  14.49%
pgbench           4          32  9.49%
pgbench           2          32  7.26%
hackbench         2          24  6.54%
pgbench           4           4  4.67%
pgbench           8           4  3.24%
pgbench          12           4  2.66%
hackbench         4           8  2.53%
pgbench           4           8  1.96%
hackbench         2          16  1.93%
schbench          4          32  1.24%
pgbench           2           8  0.82%
schbench          4           4  0.69%
schbench          2          32  0.44%
schbench          2          16  0.25%
schbench         12           8  -0.02%
sysbench          2           4  -0.02%
schbench          4          24  -0.12%
sysbench          2          16  -0.17%
schbench         12           4  -0.18%
schbench          2           4  -0.19%
sysbench          4           8  -0.23%
schbench          8           4  -0.24%
sysbench          2           8  -0.24%
schbench          4           8  -0.28%
sysbench          8           4  -0.30%
schbench          4          16  -0.37%
schbench          2          24  -0.39%
schbench          8          16  -0.49%
schbench          2           8  -0.67%
pgbench           4          16  -0.68%
schbench          8           8  -0.83%
sysbench          4           4  -0.92%
schbench         16           4  -0.94%
sysbench         12           4  -0.98%
sysbench          8          16  -1.52%
sysbench         16           4  -1.57%
pgbench           2           4  -1.62%
sysbench         12           8  -1.69%
schbench         16           8  -1.97%
sysbench          8           8  -2.08%
hackbench         8           4  -2.11%
pgbench           4          24  -3.20%
pgbench           2          24  -3.35%
sysbench          2          24  -3.81%
pgbench           2          16  -4.55%
sysbench          4          16  -5.10%
sysbench         16           8  -6.56%
sysbench          2          32  -8.24%
sysbench          4          32  -13.54%
sysbench          4          24  -13.62%
hackbench         2           4  -15.40%
hackbench         4           4  -17.71%

There are some huge wins, especially for hackbench, which corresponds
to Shrikanth's findings. There are some significant degradations too,
which I plan to debug. This may simply have to do with the simplistic
heuristic I am using for testing [1].

sysbench, for example, is not supposed to benefit from this series,
because it is not affected by overcommit. However, it definitely should
not degrade by 30%. Interestingly enough, this happens only with
certain combinations of VM and CPU counts, and this is reproducible.

Initially I saw degradations as bad as -80% with schbench. It turned
out this was caused by the userspace per-CPU locking it implements;
turning it off made the degradation go away. To me this looks like
something synthetic and not something used by real-world applications,
but please correct me if I am wrong - then this will have to be
resolved.
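For reference, the userspace per-CPU locking pattern in question looks roughly like this (an illustrative sketch, not schbench's actual code; all names are made up): each thread hashes the CPU it currently runs on to a spinlock slot. Once tasks are pushed off paravirt CPUs, many threads hash to the same few slots and contend, and a spinner whose lock holder's vCPU is preempted burns its whole timeslice.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>

#define MAX_CPUS 64

/* One "lock" per CPU; a thread picks the slot of the CPU it happens
 * to run on at that instant. */
atomic_int percpu_lock[MAX_CPUS];

int lock_index(void)
{
	int cpu = sched_getcpu();	/* may be stale immediately after */

	return (cpu < 0 ? 0 : cpu) % MAX_CPUS;
}

/* Returns 1 if the lock was taken, 0 if it is already held. */
int percpu_trylock(int idx)
{
	return !atomic_exchange_explicit(&percpu_lock[idx], 1,
					 memory_order_acquire);
}

void percpu_lock_acquire(int idx)
{
	while (!percpu_trylock(idx))
		;	/* spin; wasteful if the holder's vCPU is preempted */
}

void percpu_unlock(int idx)
{
	atomic_store_explicit(&percpu_lock[idx], 0, memory_order_release);
}
```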


One note regarding the PARAVIRT Kconfig gating: s390x does not select
PARAVIRT today. For example, we determine steal time based on CPU
timers and clocks, not hypervisor hints. For now I had to add dummy
paravirt headers to test this series, but I would appreciate it if the
Kconfig gating was removed.
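The kind of dummy fallback I mean is roughly the following (a sketch of what a no-PARAVIRT stub could look like, not the actual headers I used; in the kernel the helper is `u64 paravirt_steal_clock(int cpu)`, sketched here with a standard type for a standalone build):

```c
/* Hypothetical no-op stub for builds without CONFIG_PARAVIRT:
 * report zero steal time so callers compile unchanged and the
 * steal-based heuristics simply never trigger. */
#ifndef CONFIG_PARAVIRT
static inline unsigned long long paravirt_steal_clock(int cpu)
{
	(void)cpu;	/* no per-CPU steal accounting available */
	return 0;
}
#endif
```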

Others have already commented on the naming, and I would agree that
"paravirt" is really misleading. I cannot say that the previous
"cpu-avoid" one was perfect, but it was much better.


[1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
  2025-12-04 13:28 ` [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Ilya Leoshkevich
@ 2025-12-05  5:30   ` Shrikanth Hegde
  2025-12-15 17:39     ` Yury Norov
  0 siblings, 1 reply; 4+ messages in thread
From: Shrikanth Hegde @ 2025-12-05  5:30 UTC (permalink / raw)
  To: Ilya Leoshkevich, linux-kernel, linuxppc-dev
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy, srikar, gregkh, pbonzini, seanjc, kprateek.nayak, vschneid,
	huschle, rostedt, dietmar.eggemann, christophe.leroy, linux-s390



On 12/4/25 6:58 PM, Ilya Leoshkevich wrote:
> On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:

...

> 
> The capability to temporarily exclude CPUs from scheduling might be
> beneficial for s390x, where users often run Linux using a proprietary
> hypervisor called PR/SM and with high overcommit. In these
> circumstances virtual CPUs may not be scheduled by a hypervisor for a
> very long time.
> 
> Today we have an upstream feature called "Hiperdispatch", which
> determines that this is about to happen and uses Capacity Aware
> Scheduling to prevent processes from being placed on the affected CPUs.
> However, at least when used for this purpose, Capacity Aware Scheduling
> is best effort and fails to move tasks away from the affected CPUs
> under high load.
> 
> Therefore I have decided to smoke test this series.
> 
> For the purposes of smoke testing, I set up a number of KVM virtual
> machines and start the same benchmark inside each one. Then I collect
> and compare the aggregate throughput numbers. I have not done testing
> with PR/SM yet, but I plan to do this and report back. I also have not
> tested this with VMs that are not 100% utilized yet.
> 

Best results would be when it works as a HW hint from the hypervisor.

...

> 
> There are some huge wins, especially for hackbench, which corresponds
> to Shrikanth's findings. There are some significant degradations too,
> which I plan to debug. This may simply have to do with the simplistic
> heuristic I am using for testing [1].
> 

Thank you very much for running these numbers!

> sysbench, for example, is not supposed to benefit from this series,
> because it is not affected by overcommit. However, it definitely should
> not degrade by 30%. Interestingly enough, this happens only with
> certain combinations of VM and CPU counts, and this is reproducible.
> 

Is the host bare metal? In those cases cpufreq governor ramp-up or
ramp-down might play a role. (Speculating.)

> Initially I saw degradations as bad as -80% with schbench. It turned
> out this was caused by the userspace per-CPU locking it implements;
> turning it off made the degradation go away. To me this looks like
> something synthetic and not something used by real-world applications,
> but please correct me if I am wrong - then this will have to be
> resolved.
> 

That's nice to hear. I was concerned with the schbench RPS. Now I am a bit relieved.


Is this with the schbench -L option?
I ran with it, and the regression I was seeing earlier is gone now.

> 
> One note regarding the PARAVIRT Kconfig gating: s390x does not select
> PARAVIRT today. For example, we determine steal time based on CPU
> timers and clocks, not hypervisor hints. For now I had to add dummy
> paravirt headers to test this series, but I would appreciate it if the
> Kconfig gating was removed.
> 

Keeping the PARAVIRT checks is probably the right thing. I will wait
to see if anyone objects.

> Others have already commented on the naming, and I would agree that
> "paravirt" is really misleading. I cannot say that the previous
> "cpu-avoid" one was perfect, but it was much better.
> 
> 
> [1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/

Will look into it. One thing to be careful about is the CPU numbers.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
  2025-12-05  5:30   ` Shrikanth Hegde
@ 2025-12-15 17:39     ` Yury Norov
  2025-12-18  5:22       ` Shrikanth Hegde
  0 siblings, 1 reply; 4+ messages in thread
From: Yury Norov @ 2025-12-15 17:39 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Ilya Leoshkevich, linux-kernel, linuxppc-dev, mingo, peterz,
	juri.lelli, vincent.guittot, tglx, maddy, srikar, gregkh,
	pbonzini, seanjc, kprateek.nayak, vschneid, huschle, rostedt,
	dietmar.eggemann, christophe.leroy, linux-s390

On Fri, Dec 05, 2025 at 11:00:18AM +0530, Shrikanth Hegde wrote:
> 
> 
> On 12/4/25 6:58 PM, Ilya Leoshkevich wrote:
> > On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:

...

> > Others have already commented on the naming, and I would agree that
> > "paravirt" is really misleading. I cannot say that the previous
> > "cpu-avoid" one was perfect, but it was much better.
 
It was my suggestion to switch names. cpu-avoid is definitely a
no-go, because it doesn't explain anything and only confuses.

I suggested 'paravirt' (notice - only suggested) because the patch
series is mainly discussing paravirtualized VMs. But now I'm not even
sure that the idea of the series is:

1. Applicable only to paravirtualized VMs; and 
2. Preemption and rescheduling throttling requires another in-kernel
   concept other than nohz, isolcpus, cgroups and similar.

Shrikanth, can you please clarify the scope of the new feature? Would
it be useful for non-paravirtualized VMs, for example? Any other
task-cpu bonding problems?

On previous rounds you tried to implement the same with cgroups, as
far as I understood. Can you discuss that? What exactly can't be done
with the existing kernel APIs?

Thanks,
Yury

> > [1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/
> 
> Will look into it. one thing to to be careful are CPU numbers.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
  2025-12-15 17:39     ` Yury Norov
@ 2025-12-18  5:22       ` Shrikanth Hegde
  0 siblings, 0 replies; 4+ messages in thread
From: Shrikanth Hegde @ 2025-12-18  5:22 UTC (permalink / raw)
  To: Yury Norov, vincent.guittot
  Cc: Ilya Leoshkevich, linux-kernel, linuxppc-dev, mingo, peterz,
	juri.lelli, tglx, maddy, srikar, gregkh, pbonzini, seanjc,
	kprateek.nayak, vschneid, huschle, rostedt, dietmar.eggemann,
	christophe.leroy, linux-s390

Hi, sorry for the delay in response. Just landed yesterday from LPC.

>>> Others have already commented on the naming, and I would agree that
>>> "paravirt" is really misleading. I cannot say that the previous
>>> "cpu-avoid" one was perfect, but it was much better.
>   
> It was my suggestion to switch names. cpu-avoid is definitely a
> no-go. Because it doesn't explain anything and only confuses.
> 
> I suggested 'paravirt' (notice - only suggested) because the patch
> series is mainly discussing paravirtualized VMs. But now I'm not even
> sure that the idea of the series is:
> 
> 1. Applicable only to paravirtualized VMs; and
> 2. Preemption and rescheduling throttling requires another in-kernel
>     concept other than nohs, isolcpus, cgroups and similar.
> 
> Shrikanth, can you please clarify the scope of the new feature? Would
> it be useful for non-paravirtualized VMs, for example? Any other
> task-cpu bonding problems?

The current scope of the feature is virtualized environments, where the idea is
to do co-operative folding in each VM based on a hint (either a HW hint or steal time).

If you look at it from a macro level, this is a framework which allows one to avoid
some vCPUs (in the guest) to achieve better throughput or latency. So one could come
up with more use cases even in non-paravirtualized VMs. For example, one crazy idea
is to avoid using SMT siblings when the system utilization is low to achieve a higher
IPC (instructions per cycle) value.
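To make the folding idea concrete, a minimal sketch of the hint-to-CPU-count step (the name and the linear mapping are mine, not from the series; the actual powerpc patches work from steal values with debugfs-tunable handling):

```c
/* Map an averaged steal-time percentage to the number of vCPUs the
 * guest should keep using; the remainder would be marked paravirt.
 * Purely illustrative. */
int ncpus_to_use(int total_cpus, int steal_pct)
{
	int n;

	if (steal_pct < 0)
		steal_pct = 0;
	if (steal_pct > 100)
		steal_pct = 100;

	n = total_cpus * (100 - steal_pct) / 100;
	return n > 0 ? n : 1;	/* never fold away the last CPU */
}
```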

> 
> On previous rounds you tried to implement the same with cgroups, as
> far as I understood. Can you discuss that? What exactly can't be done
> with the existing kernel APIs?
> 
> Thanks,
> Yury
> 

We discussed this in Sched-MC this year.
https://youtu.be/zf-MBoUIz1Q?t=8581


Currently explored options:

1. CPU hotplug - slow. Some efforts are underway to speed it up.
2. Creating isolated cpusets - faster, but still involves sched domain rebuilds.

The reason why they both won't work is that they break user affinities in the guest.
I.e., the guest can do "taskset -c <some_vcpus> <workload>"; when the last vCPU in
that list goes offline (guest vCPU hotplug), the affinity mask is reset, the workload
can run on any online vCPU, and the mask does not get set back to its earlier value.
That is okay for hotplug or isolated cpusets, since it is driven by the user in the
guest, so the user is aware of it.

Whereas here, the change is driven by the system rather than the user in the guest,
so it cannot break user-space affinities. Hence we need a new interface to drive
this. I think it is better if it is a non-cgroup-based framework, since cgroups are
usually user driven. (Correct me if I am wrong.)
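The affinity rule argued for above can be boiled down to the following toy model (plain 64-bit masks standing in for cpumasks; nothing here is the series' actual code):

```c
#include <stdint.h>

/* Prefer CPUs that are allowed and not marked paravirt, but if the
 * task is pinned only to paravirt CPUs, keep its mask intact rather
 * than breaking the user's affinity. */
uint64_t effective_cpus(uint64_t task_affinity, uint64_t paravirt_mask)
{
	uint64_t preferred = task_affinity & ~paravirt_mask;

	return preferred ? preferred : task_affinity;
}
```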

PS:
There was some confusion around this affinity breaking. Note that it is the guest
vCPU being marked and the guest vCPU being hotplugged. The task-affinitized workload
was running in the guest. Host CPUs (pCPUs) are not hotplugged.

---

I had a discussion with Vincent in the hallway; the idea is to use the push framework
bits, set the CPU capacity to 1 (the lowest value, treated as a special value), and
use a static key check to do this stuff only when the HW says to do so.
Such as (considering the name paravirt):

static inline bool cpu_paravirt(int cpu)
{
	if (static_branch_unlikely(&cpu_paravirt_framework))
		return arch_scale_cpu_capacity(cpu) == 1;

	return false;
}

The rest of the bits remain the same. I found an issue with the current series where
setting affinity goes wrong after a CPU is marked paravirt; I will fix it in the next
version. Will do some more testing and send the next version in 2026.

Happy Holidays!

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-12-18  5:23 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20251119124449.1149616-1-sshegde@linux.ibm.com>
2025-12-04 13:28 ` [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption Ilya Leoshkevich
2025-12-05  5:30   ` Shrikanth Hegde
2025-12-15 17:39     ` Yury Norov
2025-12-18  5:22       ` Shrikanth Hegde

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox