public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff
@ 2026-04-07 19:19 Shrikanth Hegde
  2026-04-07 19:19 ` [PATCH v2 01/17] sched/debug: Remove unused schedstats Shrikanth Hegde
                   ` (18 more replies)
  0 siblings, 19 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2026-04-07 19:19 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot, tglx,
	yury.norov, gregkh
  Cc: sshegde, pbonzini, seanjc, kprateek.nayak, vschneid, iii, huschle,
	rostedt, dietmar.eggemann, mgorman, bsegall, maddy, srikar,
	hdanton, chleroy, vineeth, joelagnelf

In virtualized environments there is often vCPU overcommit, i.e. the sum
of CPUs across all guests (virtual CPUs, aka vCPUs) exceeds the number of
underlying physical CPUs (managed by the host, aka pCPUs).

When many guests ask for CPU at the same time, the host/hypervisor
cannot satisfy all of them and has to preempt one vCPU to run another.
If the guests coordinate and ask for less CPU overall, that reduces the
number of runnable vCPU threads on the host, and vCPU preemption goes
down.

Steal time is an indication of the underlying contention. If the guests
reduce their vCPU demand proportionally to the steal time they observe,
the desired outcome is achieved.

An added advantage is reduced lock-holder preemption. A vCPU may be
holding a spinlock and still get preempted by the host. Such cases
become rarer since there is less vCPU preemption, and the lock holder
runs to completion since it has disabled preemption in the guest.
A workload can run with time-slice extension to reduce lock-holder
preemption for userspace locks; this series could help reduce
lock-holder preemption due to vCPU preemption even for kernel-space
locks.

Currently there is no infrastructure in the scheduler to move tasks away
from some CPUs without breaking userspace affinities. CPU hotplug and
isolated cpusets can move tasks off some CPUs at runtime, but if a task
is affined to specific CPUs, taking those CPUs away resets its affinity
list. That breaks the user-set affinities; since the change is driven by
the scheduler rather than the user, that is not acceptable. So new
infrastructure is needed, preferably lightweight.

Core idea is:
- Maintain a set of CPUs which can be used by the workload, denoted as
  cpu_preferred_mask.
- Periodically compute the steal time. Depending on whether it is above
  or below the thresholds, reduce or increase the preferred CPUs.
- If a CPU is marked as non-preferred, push the task running on it away
  if possible.
- Use this CPU state in wakeup and load balancing to ensure tasks run
  within the preferred CPUs.

For the host kernel there is no steal time, so its preferred CPUs never
change. The series therefore affects only guest kernels.

The current series implements a simple steal time monitor, which
reduces/increases the number of cores by 1 depending on the steal time.
It also implements a very simple method to avoid oscillations. If more
complex mechanisms are needed, implementing them as steal time governors
may be an idea. The STEAL_MONITOR sched feature must be enabled for the
steal time values to be processed and the preferred CPUs to be set. On
most systems, where there is no steal time, the preferred CPUs will be
the same as the online CPUs.

I will attach the irqbalance patch which detects changes in this mask
and re-adjusts the irq affinities. The series doesn't address the case
where irqbalance=n, assuming many distros have irqbalance=y by default.

Discussion at LPC 2025:
https://www.youtube.com/watch?v=sZKpHVUUy1g

*** Please provide your suggestions and comments ***

=====================================================================
Patch Layout:
PATCH    01: Remove stale schedstats. Independent of the series.
PATCH 02-04: Introduce cpu_preferred_mask.
PATCH 05-09: Make scheduler aware of this mask.
PATCH    10: Push the current task in sched_tick if cpu is non-preferred.
PATCH    11: Add a new schedstat.
PATCH    12: Add a new sched feature: STEAL_MONITOR
PATCH 13-17: Periodically calculating steal time and take appropriate
             action.

======================================================================
Performance Numbers:
baseline: tip/master at 8a5f70eb7e4f (Merge branch into tip/master: 'x86/tdx')

on PowerPC: powerVM hypervisor:
+++++++++
Daytrader
+++++++++ 
It is a database workload which simulates live stock trading.
There are two VMs, and the same workload is run in both at the same
time. VM1 is bigger than VM2.

Note: VM1 sees 20% steal time, and VM2 sees 10% steal time with
baseline.


(with series: STEAL_MONITOR=y and Default debug steal_mon values)
On VM1:
			baseline		with_series
Throughput		1x			1.3x 
On VM2:
                        baseline                with_series
Throughput              1x                      1.1x


(with series: STEAL_MONITOR=y and Period 100, High 200, Low 100)
On VM1:
                        baseline                with_series
Throughput:             1x                      1.45x
On VM2:
                        baseline                with_series
Throughput:             1x                      1.13x

Verdict: Shows good improvement with default values, and even better
when the debug knobs are tuned.

+++++++++
Hackbench 
+++++++++
(with series: STEAL_MONITOR=y and Period 100, High 200, Low 100)
On VM1:
			baseline		with_series
10 groups		10.3			 8.5
30 groups		40.8			25.5
60 groups		77.2			47.8

on VM2:
			baseline		with_series
10 groups		 8.4			 7.5
30 groups		25.3			19.8
60 groups		41.7			36.3

Verdict: With tuned values, shows very good improvement.

==========================================================================
Since v1:
- A new name - Preferred CPUs and cpu_preferred_mask.
  I had initially used the name "Usable CPUs", but this seemed better.
  I considered pv_preferred too, but dropped it as it could be too long.

- Arch-independent code. Everything happens in the scheduler. Steal time
  is a generic construct, and this helps avoid each architecture doing
  more or less the same thing. Dropped the powerpc code.

- Removed hacks around wakeups. Made it part of available_idle_cpu,
  which takes care of many of the wakeup decisions. Same for the rt
  code.

- Implement a work function to calculate the steal times and enforce the
  policy decisions. This ensures sched_tick doesn't suffer any major
  latency.

- Steal time computation is gated by the sched feature STEAL_MONITOR to
  avoid any overhead on systems which don't have vCPU overcommit.
  The feature is disabled by default.

- CPU_CAPACITY=1 was not considered since one needs the state of all
  CPUs which have this special value. Computing that in the hotpath is
  not ideal.

- Using cpuset was not considered since it is quite tricky, given there
  are different versions and cgroups is natively user driven.

v1: https://lore.kernel.org/all/20251119124449.1149616-1-sshegde@linux.ibm.com/#t
earlier versions: https://lore.kernel.org/all/236f4925-dd3c-41ef-be04-47708c9ce129@linux.ibm.com/

TODO:
- Splicing of CPUs across NUMA nodes when CPUs aren't split equally.
- irq affinity when irqbalance=n. Not sure if this is worth it.
- Avoid running any unbound housekeeping work on non-preferred CPUs,
  such as in find_new_ilb. Tried, but it showed a small regression in
  the no-noise case, so it was not included.
- This currently works only for kernels built with CONFIG_SCHED_SMT.
  Didn't want to sprinkle too many ifdefs there. Not sure if there is
  any system which needs this feature but !SMT; if so, let me know.
  Seeing those ifdefs makes me wonder: maybe we could clean up
  CONFIG_SCHED_SMT with cpumask_of(cpu) in the !SMT case?
- Performance numbers with KVM on x86 and s390.

Sorry for sending it this late. This is the series meant for discussion
at OSPM 2026.


Shrikanth Hegde (17):
  sched/debug: Remove unused schedstats
  sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  cpumask: Introduce cpu_preferred_mask
  sysfs: Add preferred CPU file
  sched/core: allow only preferred CPUs in is_cpu_allowed
  sched/fair: Select preferred CPU at wakeup when possible
  sched/fair: load balance only among preferred CPUs
  sched/rt: Select a preferred CPU for wakeup and pulling rt task
  sched/core: Keep tick on non-preferred CPUs until tasks are out
  sched/core: Push current task from non preferred CPU
  sched/debug: Add migration stats due to non preferred CPUs
  sched/feature: Add STEAL_MONITOR feature
  sched/core: Introduce a simple steal monitor
  sched/core: Compute steal values at regular intervals
  sched/core: Handle steal values and mark CPUs as preferred
  sched/core: Mark the direction of steal values to avoid oscillations
  sched/debug: Add debug knobs for steal monitor

 .../ABI/testing/sysfs-devices-system-cpu      |  11 +
 Documentation/scheduler/sched-arch.rst        |  48 ++++
 Documentation/scheduler/sched-debug.rst       |  27 +++
 drivers/base/cpu.c                            |  12 +
 include/linux/cpumask.h                       |  22 ++
 include/linux/sched.h                         |   4 +-
 kernel/cpu.c                                  |   6 +
 kernel/sched/core.c                           | 219 +++++++++++++++++-
 kernel/sched/cpupri.c                         |   4 +
 kernel/sched/debug.c                          |  10 +-
 kernel/sched/fair.c                           |   8 +-
 kernel/sched/features.h                       |   3 +
 kernel/sched/rt.c                             |   4 +
 kernel/sched/sched.h                          |  41 ++++
 14 files changed, 409 insertions(+), 10 deletions(-)

-- 
2.47.3




Thread overview: 25+ messages
2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 01/17] sched/debug: Remove unused schedstats Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 02/17] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 03/17] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
2026-04-07 20:27   ` Yury Norov
2026-04-08  9:16     ` Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 04/17] sysfs: Add preferred CPU file Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 05/17] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
2026-04-08  1:05   ` Yury Norov
2026-04-08 12:56     ` Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 06/17] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 07/17] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 08/17] sched/rt: Select a preferred CPU for wakeup and pulling rt task Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 09/17] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 10/17] sched/core: Push current task from non preferred CPU Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 11/17] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 12/17] sched/feature: Add STEAL_MONITOR feature Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 13/17] sched/core: Introduce a simple steal monitor Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 14/17] sched/core: Compute steal values at regular intervals Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 15/17] sched/core: Handle steal values and mark CPUs as preferred Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 16/17] sched/core: Mark the direction of steal values to avoid oscillations Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 17/17] sched/debug: Add debug knobs for steal monitor Shrikanth Hegde
2026-04-07 19:50 ` [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-04-08 10:14 ` Hillf Danton
2026-04-08 13:49   ` Shrikanth Hegde
