From: Shrikanth Hegde <sshegde@linux.ibm.com>
To: linux-kernel@vger.kernel.org, mingo@kernel.org,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, yury.norov@gmail.com,
kprateek.nayak@amd.com, iii@linux.ibm.com, corbet@lwn.net
Cc: sshegde@linux.ibm.com, tglx@kernel.org,
gregkh@linuxfoundation.org, pbonzini@redhat.com,
seanjc@google.com, vschneid@redhat.com, huschle@linux.ibm.com,
rostedt@goodmis.org, dietmar.eggemann@arm.com,
maddy@linux.ibm.com, srikar@linux.ibm.com, hdanton@sina.com,
chleroy@kernel.org, vineeth@bitbyteword.org, frederic@kernel.org,
arighi@nvidia.com, pauld@redhat.com, christian.loehle@arm.com,
tj@kernel.org, tommaso.cucinotta@gmail.com, maz@kernel.org,
rafael@kernel.org, rdunlap@infradead.org, kernellwp@gmail.com,
linux-doc@vger.kernel.org
Subject: [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff
Date: Thu, 25 Jun 2026 18:16:24 +0530
Message-ID: <20260625124648.802832-1-sshegde@linux.ibm.com>
Very briefly:
- Maintain a set of CPUs which can be used by the workload. It is
denoted as cpu_preferred_mask.
- Periodically compute the steal time. If steal time is above the high
threshold or below the low threshold, reduce or increase the preferred
CPUs respectively. This is handled in a new driver called steal_monitor.
- If a CPU is marked as non-preferred, push the task running on it if
possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
within preferred CPUs.
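As a rough illustration of the threshold policy summarized above, here is
a minimal userspace sketch. It is not the driver code: the struct, the
function names, the 64-CPU bitmask model, and the choice of which CPU to
drop or add are all assumptions made only for this example.

```c
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
struct sm_model {
	uint64_t active;        /* stands in for cpu_active_mask */
	uint64_t preferred;     /* stands in for cpu_preferred_mask */
	unsigned int high_pct;  /* assumed knob: shrink when steal is above this */
	unsigned int low_pct;   /* assumed knob: grow when steal is below this */
};

/* Count the set bits in a mask. */
static int sm_weight(uint64_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * One periodic step. Returns -1 if a preferred CPU was dropped,
 * +1 if one was added, 0 if nothing changed.
 * Assumes low_pct <= high_pct.
 */
static int sm_model_step(struct sm_model *m, unsigned int steal_pct)
{
	uint64_t candidates;
	int bit;

	if (steal_pct > m->high_pct && sm_weight(m->preferred) > 1) {
		/* High steal: drop the highest-numbered preferred CPU. */
		for (bit = 63; bit >= 0; bit--) {
			if (m->preferred & (1ULL << bit)) {
				m->preferred &= ~(1ULL << bit);
				return -1;
			}
		}
	}

	if (steal_pct < m->low_pct) {
		/* Low steal: add the lowest active CPU that is not preferred. */
		candidates = m->active & ~m->preferred;
		if (candidates) {
			m->preferred |= candidates & (~candidates + 1);
			return 1;
		}
	}

	return 0;
}
```

The model never lets the preferred set become empty and only ever adds
CPUs taken from the active set, which mirrors the two design rules listed
in the changelog.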
For more details on the idea, problem statement and performance numbers,
please refer to the cover letter of v2[2] and the OSPM talk[1].
*** Please review and provide your feedback!! ***
[1] OSPM talk: https://youtu.be/adxUKFPlOp0
[2] v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@linux.ibm.com/#t
[3] v4: https://lore.kernel.org/all/20260617174139.155540-1-sshegde@linux.ibm.com/#t
Thank you very much for the feedback so far. It has helped the code
evolve towards clear abstraction layers and get simplified (hopefully).
Apologies in advance if I have missed any comment.
base commit:
tip/sched/core at c095741713d1 ("sched/fair: Fix newidle vs core-sched")
v4->v5:
- Move the computation of steal time and the decision on preferred CPU
state to a driver. Drop those changes in core scheduler. (Yury Norov, K Prateek Nayak)
- A new driver called steal_monitor is added in drivers/virt/ (K Prateek Nayak)
(Please let me know if there is a better place for it. I can move it
there)
- New driver does periodic computation of steal time and
increments/decrements the preferred CPUs.
- Debug knobs can be changed via module parameters. (Yury Norov)
- Default implementations are weak symbols. Archs may override them by
providing strong symbols in a new arch-specific file.
- Everything is centered around CONFIG_PREFERRED_CPU. No new config
for new driver. Driver gets added to kernel, but not loaded by
default.
- Load the driver to enable steal_monitor functionality. Unload to
remove the same.
- Make CONFIG_PREFERRED_CPU depend on PARAVIRT && SMP (Yury Norov)
- Move set_cpu_preferred to a macro. (Yury Norov)
With CONFIG_PREFERRED_CPU=n it will just act on active CPUs, so it
shouldn't alter any functionality.
- Do a simple encoding for has_preferred_cpu_state, which aims to avoid
repeated cpumask_intersects calls in is_cpu_allowed.
(Please let me know if new variable based approach to is_cpu_allowed
should be done instead).
- Move select_fallback_rq above the rq_lock. (sashiko)
- Few documentation nitpicks (Randy Dunlap, sashiko)
- Avoid any decision for is_cpu_allowed for other classes (sashiko)
- Don't pull load towards non-preferred CPUs in idle and newidle
balance. (Inferred from sashiko's comments)
- Fix leaking of task_struct in push_work_done (K Prateek Nayak)
- Module parameters aren't checked for sane values. One should know
what one is writing to them. If one writes 0 for interval_ms, it is
reset to the default value to avoid a workqueue lockup.
- Added a few design construct related checks in the periodic work
to ensure any future arch specific implementations follow it.
1. preferred is subset of active.
2. preferred cannot be empty.
- Added documentation of steal_monitor in Documentation/driver-api/
(Let me know if there is a better place for it)
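The interval_ms fallback and the two design checks from the changelog can
be sketched in userspace as below. This is only an illustration under
stated assumptions: the helper names, the 64-bit mask model, and the
default interval value are hypothetical and are not taken from the
series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed default; the real default in the series is not stated here. */
#define SM_DEFAULT_INTERVAL_MS 1000U

/*
 * A zero interval would make the periodic work re-arm with no delay,
 * so fall back to the default instead of using it as-is.
 */
static unsigned int sm_sane_interval(unsigned int interval_ms)
{
	return interval_ms ? interval_ms : SM_DEFAULT_INTERVAL_MS;
}

/*
 * Check the two design rules on a one-bit-per-CPU model:
 *   1. preferred is a subset of active
 *   2. preferred is not empty
 */
static bool sm_masks_valid(uint64_t active, uint64_t preferred)
{
	if (!preferred)
		return false;		/* rule 2 violated */
	if (preferred & ~active)
		return false;		/* rule 1 violated */
	return true;
}
```

Running such a check from the periodic work, rather than only at load
time, is what would let it catch a future arch-specific override that
breaks either rule.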
Performance numbers are expected to be the same as or slightly better
than v2. With the driver, one major overhead in sched_tick has been
removed, i.e., finding the first housekeeping CPU, which was O(N).
Apologies in advance if any critical information regarding the new
driver is missing, such as policy, documentation or implementation.
Please let me know, and I can make those changes.
I have ensured checkpatch --strict is happy.
Also, I think there should be a MAINTAINERS file entry for the new
driver. I don't see a drivers/virt/* entry. It could be either a new
entry for the driver or a few files in the SCHEDULER entry.
Let me know if and what I should add. I am a bit cautious about such a
change. I am willing to maintain this driver; other than that,
I don't know what else is going to be necessary for it. I don't have
any maintainer experience either :)
PS: Sorry for the long CC list. Please unicast to me if you want to
be dropped from the CC list.
Shrikanth Hegde (24):
sched/debug: Remove unused schedstats
sched/docs: Document cpu_preferred_mask and Preferred CPU concept
kconfig: Provide PREFERRED_CPU option
cpumask: Introduce cpu_preferred_mask
sysfs: Add preferred CPU file
sched/core: allow only preferred CPUs in is_cpu_allowed
sched/fair: Select preferred CPU at wakeup when possible
sched/fair: load balance only among preferred CPUs
sched/fair: Pull the load on preferred CPU
sched/core: Keep tick on non-preferred CPUs until tasks are out
sched/core: Push current task from non preferred CPU
sched/debug: Add migration stats due to non preferred CPUs
virt/steal_monitor: Add documentation
virt: Introduce steal monitor driver
virt/steal_monitor: Restore to active on module disable
virt/steal_monitor: Define steal_monitor structure
virt/steal_monitor: Add control knobs for handling steal values
virt/steal_monitor: Compute work at regular intervals
virt/steal_monitor: Provide default method to get systemwide steal
time
virt/steal_monitor: Provide default method to inc/dec preferred CPUs
virt/steal_monitor: Provide default method to get num of CPUs for
steal ratio
virt/steal_monitor: Act on steal values at regular intervals
virt/steal_monitor: Add direction control
virt/steal_monitor: Add design check of preferred subset of active
.../ABI/testing/sysfs-devices-system-cpu | 11 ++
Documentation/driver-api/index.rst | 1 +
Documentation/driver-api/steal-monitor.rst | 93 ++++++++++++
Documentation/scheduler/sched-arch.rst | 50 +++++++
drivers/base/cpu.c | 8 ++
drivers/virt/Makefile | 1 +
drivers/virt/steal_monitor/Makefile | 14 ++
drivers/virt/steal_monitor/defaults.c | 105 ++++++++++++++
drivers/virt/steal_monitor/sm_core.c | 124 ++++++++++++++++
drivers/virt/steal_monitor/sm_core.h | 32 +++++
include/linux/cpumask.h | 21 ++-
include/linux/sched.h | 5 +-
kernel/Kconfig.preempt | 14 ++
kernel/cpu.c | 6 +
kernel/sched/core.c | 133 +++++++++++++++++-
kernel/sched/debug.c | 4 +-
kernel/sched/fair.c | 11 +-
kernel/sched/sched.h | 36 +++++
18 files changed, 659 insertions(+), 10 deletions(-)
create mode 100644 Documentation/driver-api/steal-monitor.rst
create mode 100644 drivers/virt/steal_monitor/Makefile
create mode 100644 drivers/virt/steal_monitor/defaults.c
create mode 100644 drivers/virt/steal_monitor/sm_core.c
create mode 100644 drivers/virt/steal_monitor/sm_core.h
--
2.47.3