From: Shrikanth Hegde <sshegde@linux.ibm.com>
To: linux-kernel@vger.kernel.org, mingo@kernel.org,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, tglx@linutronix.de,
yury.norov@gmail.com, gregkh@linuxfoundation.org
Cc: sshegde@linux.ibm.com, pbonzini@redhat.com, seanjc@google.com,
kprateek.nayak@amd.com, vschneid@redhat.com, iii@linux.ibm.com,
huschle@linux.ibm.com, rostedt@goodmis.org,
dietmar.eggemann@arm.com, mgorman@suse.de, bsegall@google.com,
maddy@linux.ibm.com, srikar@linux.ibm.com, hdanton@sina.com,
chleroy@kernel.org, vineeth@bitbyteword.org,
joelagnelf@nvidia.com
Subject: [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff
Date: Wed, 8 Apr 2026 00:49:33 +0530
Message-ID: <20260407191950.643549-1-sshegde@linux.ibm.com>
In virtualized environments, vCPU overcommit is common, i.e. the sum of
CPUs across all guests (virtual CPUs, aka vCPUs) exceeds the number of
underlying physical CPUs (managed by the host, aka pCPUs).
When many guests ask for CPU at the same time, the host/hypervisor
cannot satisfy all of them and has to preempt one vCPU to run another.
If the guests coordinate and ask for less CPU overall, the number of
runnable vCPU threads on the host drops, and vCPU preemption goes down.
Steal time is an indication of this underlying contention. If the
guests reduce their vCPU demand proportionally to it, the desired
outcome is achieved.
An added advantage is reduced lock-holder preemption. A vCPU may be
holding a spinlock and still get preempted by the host. Such cases
become rarer with less vCPU preemption, and the lock holder runs to
completion since it has disabled preemption in the guest.
A workload can already run with time-slice extension to reduce
lock-holder preemption for userspace locks; this series could help
reduce lock-holder preemption due to vCPU preemption even for
kernel-space locks.
Currently there is no infrastructure in the scheduler to move tasks
away from some CPUs without breaking userspace affinities. CPU hotplug
or isolated cpusets can move tasks off CPUs at runtime, but if a task
is affined to specific CPUs, taking those CPUs away resets its affinity
list. That breaks user-set affinities, and since this is driven by the
scheduler rather than the user, that is not acceptable. So a new
infrastructure is needed, preferably a lightweight one.
Core idea is:
- Maintain a set of CPUs which can be used by the workload, denoted
cpu_preferred_mask.
- Periodically compute the steal time. If steal time is above/below the
thresholds, reduce/increase the preferred CPUs accordingly.
- If a CPU is marked as non-preferred, push the task running on it away
if possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
within preferred CPUs.
For the host kernel, there is no steal time, so its preferred CPUs
never change. Hence the series affects only guest kernels.
The current series implements a simple steal time monitor, which
reduces/increases the number of cores by 1 depending on the steal time.
It also implements a very simple method to avoid oscillations. If there
is a need for more complex mechanisms, doing them via steal time
governors may be an idea. One needs to enable the STEAL_MONITOR feature
to see the steal time values being processed and the preferred CPUs
being set accordingly. On most systems, where there is no steal time,
the preferred CPUs will be the same as the online CPUs.
I will attach an irqbalance patch which detects changes in this mask
and re-adjusts the IRQ affinities. The series doesn't address the
irqbalance=n case, assuming many distros enable irqbalance by default.
Discussion at LPC 2025:
https://www.youtube.com/watch?v=sZKpHVUUy1g
*** Please provide your suggestions and comments ***
=====================================================================
Patch Layout:
PATCH 01: Remove stale schedstats. Independent of the series.
PATCH 02-04: Introduce cpu_preferred_mask.
PATCH 05-09: Make scheduler aware of this mask.
PATCH 10: Push the current task in sched_tick if cpu is non-preferred.
PATCH 11: Add a new schedstat.
PATCH 12: Add a new sched feature: STEAL_MONITOR
PATCH 13-17: Periodically calculate steal time and take appropriate
action.
======================================================================
Performance Numbers:
baseline: tip/master at 8a5f70eb7e4f (Merge branch into tip/master: 'x86/tdx')
On PowerPC (PowerVM hypervisor):
+++++++++
Daytrader
+++++++++
Daytrader is a database workload which simulates live stock trading.
The same workload is run in two VMs at the same time; VM1 is bigger
than VM2.
Note: VM1 sees 20% steal time, and VM2 sees 10% steal time with
baseline.
(with series: STEAL_MONITOR=y and Default debug steal_mon values)
On VM1:
baseline with_series
Throughput 1x 1.3x
On VM2:
baseline with_series
Throughput 1x 1.1x
(with series: STEAL_MONITOR=y and Period 100, High 200, Low 100)
On VM1:
baseline with_series
Throughput: 1x 1.45x
On VM2:
baseline with_series
Throughput: 1x 1.13x
Verdict: Shows good improvement with the default values, and even
better when the debug knobs are tuned.
+++++++++
Hackbench
+++++++++
(with series: STEAL_MONITOR=y and Period 100, High 200, Low 100)
(numbers are runtime, lower is better)
On VM1:
baseline with_series
10 groups 10.3 8.5
30 groups 40.8 25.5
60 groups 77.2 47.8
on VM2:
baseline with_series
10 groups 8.4 7.5
30 groups 25.3 19.8
60 groups 41.7 36.3
Verdict: With tuned values, shows very good improvement.
==========================================================================
Since v1:
- A new name: Preferred CPUs and cpu_preferred_mask.
  I had initially used the name "Usable CPUs", but this one seemed
  better. I considered pv_preferred too, but dropped it as it could be
  too long.
- Arch-independent code. Everything happens in the scheduler. Steal
  time is a generic construct, and this avoids each architecture doing
  more or less the same thing. Dropped the powerpc code.
- Removed hacks around wakeups. Made it part of available_idle_cpu,
  which takes care of many of the wakeup decisions. Same for the RT
  code.
- Implemented a work function to calculate the steal times and enforce
  the policy decisions. This ensures sched_tick doesn't suffer any
  major latency.
- Steal time computation is gated behind the sched feature
  STEAL_MONITOR to avoid any overhead on systems which don't have vCPU
  overcommit. The feature is disabled by default.
- CPU_CAPACITY=1 was not considered, since one would need the state of
  all CPUs which have this special value; computing that in a hot path
  is not ideal.
- Using cpuset was not considered since it is quite tricky, given there
  are different versions and cgroups is natively user-driven.
v1: https://lore.kernel.org/all/20251119124449.1149616-1-sshegde@linux.ibm.com/#t
earlier versions: https://lore.kernel.org/all/236f4925-dd3c-41ef-be04-47708c9ce129@linux.ibm.com/
TODO:
- Splicing of CPUs across NUMA nodes when CPUs aren't split equally.
- IRQ affinity when irqbalance=n. Not sure if this is worthwhile.
- Avoid running any unbound housekeeping work on non-preferred CPUs,
  such as in find_new_ilb. Tried it, but it showed a small regression
  in the no-noise case, so it was dropped.
- This currently works only for kernels built with CONFIG_SCHED_SMT.
  Didn't want to sprinkle too many ifdefs there. Not sure if any system
  needs this feature but !SMT; if so, let me know. Seeing those ifdefs
  makes me wonder: maybe we could clean up CONFIG_SCHED_SMT with
  cpumask_of(cpu) in the !SMT case?
- Performance numbers on KVM with x86 and s390.
Sorry for sending this so late. This series is meant for discussion at
OSPM 2026.
Shrikanth Hegde (17):
sched/debug: Remove unused schedstats
sched/docs: Document cpu_preferred_mask and Preferred CPU concept
cpumask: Introduce cpu_preferred_mask
sysfs: Add preferred CPU file
sched/core: allow only preferred CPUs in is_cpu_allowed
sched/fair: Select preferred CPU at wakeup when possible
sched/fair: load balance only among preferred CPUs
sched/rt: Select a preferred CPU for wakeup and pulling rt task
sched/core: Keep tick on non-preferred CPUs until tasks are out
sched/core: Push current task from non preferred CPU
sched/debug: Add migration stats due to non preferred CPUs
sched/feature: Add STEAL_MONITOR feature
sched/core: Introduce a simple steal monitor
sched/core: Compute steal values at regular intervals
sched/core: Handle steal values and mark CPUs as preferred
sched/core: Mark the direction of steal values to avoid oscillations
sched/debug: Add debug knobs for steal monitor
.../ABI/testing/sysfs-devices-system-cpu | 11 +
Documentation/scheduler/sched-arch.rst | 48 ++++
Documentation/scheduler/sched-debug.rst | 27 +++
drivers/base/cpu.c | 12 +
include/linux/cpumask.h | 22 ++
include/linux/sched.h | 4 +-
kernel/cpu.c | 6 +
kernel/sched/core.c | 219 +++++++++++++++++-
kernel/sched/cpupri.c | 4 +
kernel/sched/debug.c | 10 +-
kernel/sched/fair.c | 8 +-
kernel/sched/features.h | 3 +
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 41 ++++
14 files changed, 409 insertions(+), 10 deletions(-)
--
2.47.3