Linux cgroups development
From: Yuri Andriaccio <yurand2000@gmail.com>
To: "Ingo Molnar" <mingo@redhat.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Juri Lelli" <juri.lelli@redhat.com>,
	"Vincent Guittot" <vincent.guittot@linaro.org>,
	"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
	"Valentin Schneider" <vschneid@redhat.com>,
	"Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	Luca Abeni <luca.abeni@santannapisa.it>,
	Yuri Andriaccio <yuri.andriaccio@santannapisa.it>
Subject: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server
Date: Mon,  8 Jun 2026 14:15:19 +0200
Message-ID: <20260608121546.69910-1-yurand2000@gmail.com>

Hello,

This is v6 of the Hierarchical Constant Bandwidth Server (HCBS) patchset, which
aims to replace the current RT_GROUP_SCHED mechanism with something more robust
and theoretically sound. The patchset has been presented at OSPM25 and OSPM26
(https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
be found at https://lwn.net/Articles/1021332/ . Links to the previous versions
of this patchset are at the bottom of this message; version 1 in particular
describes in more detail what this patchset is about and how it is
implemented.

This version addresses the reviewers' comments and introduces the following
significant changes:
- Update to kernel version 7.1.
- Refactorings and general cleanups.
- Removal of substantial duplicated code.
- Express more locking constraints in code.
- New cpu.rt.max interface.
- Refactoring of migration code to reduce code duplication.
  The new migration code now reuses the existing push/pull and similar functions
  and specializes where needed, substantially reducing the footprint of group
  migration code from previous versions.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
New cgroup-v2 interface:
After extensive discussions with the kernel maintainers, we have built a new
interface to support HCBS scheduling. Since this will be a cgroup-v2 only
feature (the fate of the old cgroup-v1 RT_GROUP_SCHED has yet to be decided), it
was possible to drop the original v1 interface entirely and create a completely
new one that is similar to the existing cgroup-v2 interfaces.

Every cgroup now has two new files:
- cpu.rt.max (similar to the cpu.max file)
- cpu.rt.internal (read-only, not available in the root cgroup, it may be
                   removed if deemed unnecessary, see later for details)

In this new interface, an HCBS cgroup can either be set to use deadline
servers, thus reserving a specified amount of bandwidth much like the previous
system, or delegate the scheduling of its FIFO/RR tasks to the nearest
configured ancestor (the default on group creation). If the nearest configured
ancestor is the root cgroup, the tasks effectively run on the root runqueue even
though their cgroup is not the root task group.

This means that subtrees are allowed to retain the original non-RT_GROUP_SCHED
behaviour, scheduling on root, while the feature is nonetheless active. At the
same time other subtrees may use HCBS, and the whole hierarchy can coexist
without issues.
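
As an illustration of the delegation rule described above, here is a minimal
user-space sketch in Python, not the kernel implementation. The Group class,
the serving_group() helper and all numeric values are hypothetical; the only
thing taken from this cover letter is that a runtime of 'max' means "use the
nearest configured ancestor". The sketch also assumes that the walk always
stops at the root.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Group:
    """Hypothetical model of a cgroup; not a kernel data structure."""
    name: str
    rt_max: str                      # contents of cpu.rt.max (illustrative values)
    parent: Optional["Group"] = None


def is_delegating(group: Group) -> bool:
    # A runtime of 'max' delegates FIFO/RR scheduling to an ancestor.
    return group.rt_max.split()[0] == "max"


def serving_group(group: Group) -> Group:
    """Return the group whose runqueue would serve this group's FIFO/RR tasks.

    Walk up while the current group delegates. The root (no parent) always
    terminates the walk, which is an assumption of this sketch.
    """
    node = group
    while node.parent is not None and is_delegating(node):
        node = node.parent
    return node


root = Group("root", "1000000 1000000")
a = Group("a", "max 100000", root)       # delegates to root
b = Group("b", "max 100000", a)          # delegates through a, up to root
c = Group("c", "20000 100000", root)     # own reservation (HCBS)
d = Group("d", "max 100000", c)          # delegates to c
```

In this model serving_group(b) is the root group, so b's tasks run on the root
runqueue, while serving_group(d) is c, so d's tasks run inside c's reservation.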

This behaviour is specified in the cpu.rt.max file, which accepts the string
"<runtime | 'max'> <period>":
- A zero runtime disables FIFO/RR scheduling for tasks in that group.
- A non-zero runtime creates a reservation and uses HCBS.
- A runtime of 'max' tells the scheduler to use the nearest configured ancestor
  for FIFO/RR task scheduling.
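
To make the three cases concrete, the following Python sketch classifies a
cpu.rt.max string. It is only a model of the format described above: the
RtMax type, the mode names and the validation rules (two fields, positive
period, non-negative runtime) are assumptions of this sketch and may differ
from what the kernel actually accepts. No time unit is implied by the numbers.

```python
from typing import NamedTuple, Optional


class RtMax(NamedTuple):
    """Hypothetical parsed form of a cpu.rt.max value."""
    mode: str                  # "disabled", "reserved" or "delegated"
    runtime: Optional[int]     # None when the runtime is 'max'
    period: int


def parse_rt_max(text: str) -> RtMax:
    fields = text.split()
    if len(fields) != 2:
        raise ValueError("expected \"<runtime | 'max'> <period>\"")
    runtime_str, period_str = fields

    period = int(period_str)
    if period <= 0:
        raise ValueError("period must be positive")

    if runtime_str == "max":
        # Delegate FIFO/RR scheduling to the nearest configured ancestor.
        return RtMax("delegated", None, period)

    runtime = int(runtime_str)
    if runtime < 0:
        raise ValueError("runtime must not be negative")
    if runtime == 0:
        # FIFO/RR scheduling is disabled for tasks in this group.
        return RtMax("disabled", 0, period)
    # A non-zero runtime creates a reservation served by HCBS.
    return RtMax("reserved", runtime, period)
```

For example, parse_rt_max("max 100000") yields the "delegated" mode, while
parse_rt_max("20000 100000") yields a reservation of 20000 over 100000.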

The admission test no longer checks only the immediate children of a cgroup for
schedulability (recall that a group's bandwidth must always be greater than or
equal to the total bandwidth of its children); it has to check the whole
subtree. If a child delegates its tasks to its parent (runtime = 'max'), then
this child's own children (the grandchildren) are effectively treated as
immediate children that compete for the bandwidth of their grandparent, and so
on down the hierarchy.
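
The subtree check can be sketched as follows. This is a simplified Python
model, not the code in the patches: the Group class and both helpers are
hypothetical, bandwidth is taken as runtime / period, a zero runtime is treated
as a reservation with zero bandwidth, and per-CPU or capacity details are
ignored. Delegating groups (runtime = 'max', modelled as None) are looked
through so that their configured descendants count against the nearest
configured ancestor.

```python
from dataclasses import dataclass, field
from fractions import Fraction
from typing import List, Optional


@dataclass
class Group:
    """Hypothetical model of a cgroup for the admission test."""
    name: str
    runtime: Optional[int]           # None models a runtime of 'max'
    period: int
    children: List["Group"] = field(default_factory=list)

    def bandwidth(self) -> Fraction:
        return Fraction(self.runtime, self.period)


def effective_children(group: Group) -> List[Group]:
    """Nearest configured descendants, looking through delegating groups."""
    result = []
    for child in group.children:
        if child.runtime is None:
            # The child delegates: its own children compete one level up.
            result.extend(effective_children(child))
        else:
            result.append(child)
    return result


def admissible(group: Group) -> bool:
    """Check a configured (non-delegating) group and its whole subtree."""
    kids = effective_children(group)
    total = sum((k.bandwidth() for k in kids), Fraction(0))
    if total > group.bandwidth():
        return False
    return all(admissible(k) for k in kids)


# root reserves 100/100; "a" delegates, so a1 and a2 compete with b on root.
a1 = Group("a1", 30, 100)
a2 = Group("a2", 30, 100)
a = Group("a", None, 100, [a1, a2])
b = Group("b", 50, 100)
root = Group("root", 100, 100, [a, b])
```

With these illustrative values the effective children of root are a1, a2 and
b, whose bandwidths sum to 11/10, so the hierarchy is rejected; lowering b's
runtime to 40 brings the total to exactly 1 and the check passes.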

To support both threaded and domain cgroups, the original test that only allowed
tasks to run in leaf cgroups has been removed: this restriction is already
enforced for domain cgroups by existing code, and it must not apply to threaded
cgroups.

Since groups in the middle of the hierarchy can now also run tasks, their
dl_servers must be configured properly: the dl_servers of a parent cgroup can
only use the bandwidth assigned to the cgroup minus the total bandwidth of its
children. The cpu.rt.internal file reports exactly this "remainder" bandwidth.
Since dl_servers must have both a runtime and a period assigned, the period is
taken from the user-configured cpu.rt.max file and the runtime is computed from
the remainder bandwidth. This runtime and period are the values shown by
cpu.rt.internal.
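
A worked example of the remainder computation, as a Python sketch. The
internal_runtime() helper is hypothetical, and rounding the result down to an
integer is an assumption of this sketch; the kernel uses its own fixed-point
bandwidth representation and may round differently.

```python
from fractions import Fraction
from typing import Iterable, Tuple


def internal_runtime(runtime: int, period: int,
                     children: Iterable[Tuple[int, int]]) -> int:
    """Runtime left for the group's own dl_servers over its own period.

    `children` holds the (runtime, period) pairs of the groups that compete
    for this group's bandwidth.
    """
    own = Fraction(runtime, period)
    used = sum((Fraction(r, p) for r, p in children), Fraction(0))
    remainder = own - used
    if remainder < 0:
        raise ValueError("children exceed the parent's bandwidth")
    # Convert the remainder bandwidth back to a runtime, rounding down.
    return int(remainder * period)
```

For a parent configured with "50 100" and two children configured with
"10 100" and "20 200", the remainder bandwidth is 1/2 - 1/10 - 1/10 = 3/10, so
in this model cpu.rt.internal would show a runtime of 30 with a period of 100.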

Supporting both threaded and domain cgroups also made it possible to drop all
the extra code related to active and 'live' cgroups mentioned in previous RFCs.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Summary of the patches:
   1-2) Commits already included in sched/tip (not yet in mainline).
   3-8) Preparation patches, so that the RT classes' code can be used both
        for normal and cgroup scheduling.
  9-19) Implementation of HCBS, no migration.
        The old RT_GROUP_SCHED code is removed.
        16) Remove support for cgroup-v1.
        17) Implement cgroup-v2 cpu.rt.max interface.
 20-24) Add support for tasks migration.
    25) Documentation for HCBS.

Updates from v5:
- Rebase to latest master.
- General rebasing/cleanup.
- More locking constraints expressed in code.
- New cpu.rt.max interface.
- Refactoring of migration code.

Updates from v4:
- Rebase to latest tip/master.
- General rebasing/cleanup.
- Update default sysctl_sched_rt_runtime to 1s, same as the period.
- Fix non-deferred deadline server replenishment logic.
- Add missing RCU read sections.
- Account HCBS servers along with their tasks when the servers are active.
- Release bandwidth resources early in unregister_rt_sched_group.
- Drop server_try_pull_task as it is now redundant.
- Remove dl_server_stop call in dequeue_task_rt.
- Update to reuse __checkparam_dl for deadline servers.

Updates from v3:
- Rebase to latest tip/master.
- General rebasing/cleanup.
- Add Documentation.
- Define **live** and **active** groups.
- Introduce server_try_pull_task in place of the removed server_has_task.
- Introduce RELEASE_LOCK helper macro for guard-based locking.
- Update inc/dec_dl_tasks to account for served runqueues regardless of the
  server type.
- Fix computing of new bandwidth values in dl_init_tg.
- Fix check in dl_check_tg to use capacity scaling.
- Fix wakeup_preempt_rt to check if curr is a DEADLINE task.

Updates from v2:
- Rebase to latest tip/master.
- Remove fair-servers' bw reclaiming.
- Fix a check which prevented execution of wakeup_preempt code.
- Fix a priority check in group_pull_rt_task between tasks of different groups.
- Rework allocation/deallocation code for rt-cgroups.
- Update signatures for some group related migration functions.
- Add documentation for wakeup_preempt preemption rules.

Updates from v1:
- Rebase to latest tip/master.
- Add migration code.
- Split big patches for more readability.
- Refactor code to use guarded locks where applicable.
- Remove unnecessary patches from v1 which have been addressed differently by
  mainline updates.
- Remove unnecessary checks and general code cleanup.

Notes:

Patches 1-2 have already been merged in sched/tip, but not yet merged in master.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing v6:

The patchset has been tested with a suite of tests tailored to stress all the
implemented functionalities.
The tests are available at https://github.com/Yurand2000/HCBS-Test-Suite .
Refer to the README of the repository for more details.

Follow these steps to test HCBS v6:
- Get the HCBS patch up and running. Any kernel/distro should work effortlessly.
- Get, compile and _install_ the tests.
- Run the `go_rt.sh` script to set the frequency of the CPUs to a fixed value
  and disable hyperthreading and power saving features.
- Run the `run_tests.sh full` script, to run the whole test suite.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Future Work:

We think the current patchset is stable enough. Our current test suite
demonstrates, on our limited hardware, that the kernel does not throw warnings
and that it is actually possible to guarantee time reservations and isolation
among tenants.

Comments on the new cpu.rt.max interface are to be expected, but hopefully these
new ideas solve some of the issues mentioned in the past, such as not being
able to use the cpu controller because standard FIFO/RR tasks had to be migrated
to the root cgroup first. How to integrate this interface with the cpuset
controller, and with the multiCPU feature presented at OSPM26, still needs to
be investigated.

Additional future work:
 - unprivileged FIFO/RR in cgroups.
 - capacity aware bandwidth reservation.
 - hotplug/hotunplug management.

Have a nice day,
Yuri

v1: https://lore.kernel.org/all/20250605071412.139240-1-yurand2000@gmail.com/
v2: https://lore.kernel.org/all/20250731105543.40832-1-yurand2000@gmail.com/
v3: https://lore.kernel.org/all/20250929092221.10947-1-yurand2000@gmail.com/
v4: https://lore.kernel.org/all/20251201124205.11169-1-yurand2000@gmail.com/
v5: https://lore.kernel.org/all/20260430213835.62217-1-yurand2000@gmail.com/

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Yuri Andriaccio (14):
  sched/deadline: Fix replenishment logic for non-deferred servers
  sched/rt: Update default bandwidth for real-time tasks to ONE
  sched/rt: Disable RT_GROUP_SCHED
  sched/rt: Remove unnecessary runqueue pointer in struct rt_rq
  sched/rt: Add {alloc/unregister/free}_rt_sched_group
  sched/rt: Implement dl-server operations for rt-cgroups.
  sched/rt: Update task event callbacks for HCBS scheduling
  sched/rt: Remove support for cgroups-v1
  sched/rt: Update task's RT runqueue when switching scheduling class
  sched/rt: Add HCBS migration code to related functions
  sched/rt: Hook HCBS migration functions
  sched/rt: Try pull task on empty server pick.
  sched/core: Execute enqueued balance callbacks after
    migrate_disable_switch
  Documentation: Update documentation for real-time cgroups

luca abeni (11):
  sched/deadline: Do not access dl_se->rq directly
  sched/deadline: Distinguish between dl_rq and my_q
  sched/rt: Pass an rt_rq instead of an rq where needed
  sched/rt: Move functions from rt.c to sched.h
  sched/rt: Introduce HCBS specific structs in task_group
  sched/core: Initialize HCBS specific structures.
  sched/deadline: Add dl_init_tg
  sched/deadline: Account rt-cgroups bandwidth in deadline tasks
    schedulability tests.
  sched/rt: Update rt-cgroup schedulability checks
  sched/rt: Remove old RT_GROUP_SCHED data structures
  sched/core: Execute enqueued balance callbacks when changing allowed
    CPUs

 Documentation/scheduler/sched-rt-group.rst |  470 ++++-
 include/linux/rcupdate.h                   |    1 +
 include/linux/sched.h                      |   12 +-
 kernel/sched/autogroup.c                   |    4 +-
 kernel/sched/core.c                        |  143 +-
 kernel/sched/deadline.c                    |  221 ++-
 kernel/sched/debug.c                       |    6 -
 kernel/sched/ext.c                         |    4 +-
 kernel/sched/fair.c                        |    4 +-
 kernel/sched/rt.c                          | 2046 +++++++++-----------
 kernel/sched/sched.h                       |  214 +-
 kernel/sched/syscalls.c                    |   11 +-
 12 files changed, 1761 insertions(+), 1375 deletions(-)


base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
--
2.54.0

