* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
[not found] ` <20260430213835.62217-21-yurand2000@gmail.com>
@ 2026-05-05 15:15 ` Peter Zijlstra
2026-05-05 19:56 ` Tejun Heo
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2026-05-05 15:15 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio, tj, hannes, mkoutny,
cgroups
On Thu, Apr 30, 2026 at 11:38:24PM +0200, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
>
> Allow for cgroup hierarchies with more than two levels.
>
> Introduce the concept of live and active groups:
> - A group is live if it is a leaf group or if all its children have zero
> runtime.
> - A live group with non-zero runtime can be used to schedule tasks.
> - An active cgroup is a live group with running tasks.
> - A non-live group cannot be used to run tasks, but it is only used for
> bandwidth accounting, i.e. the sum of its children bandwidth must be
> less than or equal to the bandwidth of the parent. This change allows
> cgroups to be used for bandwidth management for different users.
> - While the root cgroup specifies the total allocatable bandwidth of rt
> cgroups, a further accounting is performed to keep track of the live
> bandwidth, i.e. the sum of the bandwidth of live groups. The hierarchy
> invariant states that the live bandwidth must always be less than or
> equal to the total allocatable bw.
>
> Add is_live_sched_group() and sched_group_has_live_siblings() in
> deadline.c. These utility functions are used by dl_init_tg to perform
> updates only when necessary:
> - Only live groups may update the active dl bandwidth of dl entities
> (call to dl_rq_change_utilization), while non-live groups must not use
> servers, and thus must not change the active dl bandwidth.
> - The total bandwidth accounting must be changed to follow the
> live/non-live rules:
> - When disabling (runtime zero) the last child of a group, the parent
> becomes a live group, and so the parent's bw must be accounted back.
> - When enabling (runtime non-zero) the first child, the parent becomes a
> non-live group, and so the parent's bandwidth must be removed.
>
> Update tg_set_rt_bandwidth() to change the runtime of a group to a
> non-zero value only if its parent is inactive, thus forcing it to become
> non-live if it was previously live (it would've already been non-live if a
> sibling cgroup was live). An exception is made for groups which have the
> root cgroup as parent.
>
> Update sched_rt_can_attach() to allow attaching only on live groups.
>
> Update dl_init_tg() to take a task_group pointer and a cpu's id rather
> than passing directly the pointer to the cpu's deadline server. The
> task_group pointer is necessary to check and update the live bandwidth
> accounting.
>
> Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
This probably wants to have the cgroup folks on Cc (added now) to make
sure the semantics are in line with cgroup-v2 expectations.
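
For illustration only, the live/non-live rule from the quoted changelog can be sketched as below. This is not the patch's code: the `Group` class, its fields, and `live_bandwidth()` are hypothetical, bandwidth is modelled as runtime/period, and the root group is treated as the cap rather than as part of the sum.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class Group:
    """Hypothetical stand-in for an RT task group; not a kernel structure."""
    name: str
    runtime: int = 0      # 0 means the group is disabled
    period: int = 100
    children: list = field(default_factory=list)

    def bandwidth(self) -> Fraction:
        return Fraction(self.runtime, self.period)


def is_live(group: Group) -> bool:
    # Live: a leaf group, or a group whose children all have zero runtime.
    return all(child.runtime == 0 for child in group.children)


def descendants(group: Group):
    # Walk every group below `group`, depth first.
    for child in group.children:
        yield child
        yield from descendants(child)


def live_bandwidth(root: Group) -> Fraction:
    # Sum the bandwidth of the live groups below the root.
    return sum((g.bandwidth() for g in descendants(root) if is_live(g)),
               Fraction(0))


def invariant_holds(root: Group) -> bool:
    # Hierarchy invariant: live bandwidth never exceeds the root allocation.
    return live_bandwidth(root) <= root.bandwidth()
```

In this model, setting the runtime of a group's last enabled child to zero makes the parent live again, so the parent's bandwidth re-enters the sum, which mirrors the "accounted back" rule in the changelog.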
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-05 15:15 ` [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups Peter Zijlstra
@ 2026-05-05 19:56 ` Tejun Heo
2026-05-07 10:53 ` Peter Zijlstra
2026-05-07 14:30 ` luca abeni
0 siblings, 2 replies; 19+ messages in thread
From: Tejun Heo @ 2026-05-05 19:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hello,
Some high level comments:
- Please align it with existing cgroup2 interface files. See cpu.max. This
can be e.g. cpu.rt.max with about the same semantics.
- cgroup2 enforces that internal cgroups w/ controllers enabled cannot have
threads in them. No need to enforce that separately.
- However, the cpu controller is a threaded controller which means that it
can have threaded sub-hierarchy where the no-internal-process rule doesn't
apply. This was created explicitly for cpu controller. The proposed change
blocks it effectively forcing cpu controller into regular domain
controller behavior subject to no-internal-process rule. Note these are
enforced at controller granularity and this means that users who use the
threaded mode will be forced to pick between the two.
- This has the same problem with cgroup1's rt cgroup sched support where
there is no way to have a permissive default configuration, which means
that users who don't really care about distributing rt shares
hierarchically would get blocked from running rt processes by default,
which basically forces distros to disable rt cgroup sched support. This is
not new but it'd be a shame to put in all the work and the end result is
that most people don't even have access to the feature.
Here's my suggestion if there is desire for this to become something most
people have easy access to:
- Don't make it impossible to use in conjunction with other resource control
mechanisms especially not CPU controller itself. Don't force people to
choose between threaded mode and rt control. Allow them to co-exist in a
reasonable manner.
- The same in the wider scope. Don't let it get in the way of people who
don't care about it. Compromising on interface / failure mode is better
than people not being able to use it in most cases.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-05 19:56 ` Tejun Heo
@ 2026-05-07 10:53 ` Peter Zijlstra
2026-05-07 15:03 ` Juri Lelli
` (3 more replies)
2026-05-07 14:30 ` luca abeni
1 sibling, 4 replies; 19+ messages in thread
From: Peter Zijlstra @ 2026-05-07 10:53 UTC (permalink / raw)
To: Tejun Heo
Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
> Hello,
>
> Some high level comments:
>
> - Please align it with existing cgroup2 interface files. See cpu.max. This
> can be e.g. cpu.rt.max with about the same semantics.
>
> - cgroup2 enforces that internal cgroups w/ controllers enabled cannot have
> threads in them. No need to enforce that separately.
Looking at cpu_period_quota_parse() this thing takes two u64 values for:
{runtime, period} but allows runtime to be the string "max".
I think we'd want an optional extension to that and allow 3 values for:
{runtime, period, deadline}, where if the deadline is not given, it will
be the same as period.
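
A minimal sketch of that parsing rule, in Python rather than kernel C. `parse_rt_max` and `RtMax` are hypothetical names, the field order follows {runtime, period, deadline} as written above, and for simplicity the sketch requires the period to be present:

```python
from typing import NamedTuple, Optional


class RtMax(NamedTuple):
    runtime: Optional[int]   # None stands for the string "max"
    period: int
    deadline: int


def parse_rt_max(buf: str) -> RtMax:
    """Parse "runtime period [deadline]"; deadline defaults to period."""
    fields = buf.split()
    if len(fields) not in (2, 3):
        raise ValueError("expected: runtime period [deadline]")
    runtime = None if fields[0] == "max" else int(fields[0])
    period = int(fields[1])
    # The optional third value; fall back to the period when it is absent.
    deadline = int(fields[2]) if len(fields) == 3 else period
    if period <= 0 or deadline <= 0:
        raise ValueError("period and deadline must be positive")
    if runtime is not None and runtime < 0:
        raise ValueError("runtime must not be negative")
    return RtMax(runtime, period, deadline)
```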
In previous versions there was also an option to specify a cpumask,
getting rid of that is one of the reasons I suggested making this thing
a cgroup-v2 thing, then we can use the cpuset controller's effective
mask.
> - However, the cpu controller is a threaded controller which means that it
> can have threaded sub-hierarchy where the no-internal-process rule doesn't
> apply. This was created explicitly for cpu controller. The proposed change
> blocks it effectively forcing cpu controller into regular domain
> controller behavior subject to no-internal-process rule. Note these are
> enforced at controller granularity and this means that users who use the
> threaded mode will be forced to pick between the two.
Right... this then means we need two controls, one to do hierarchical
bandwidth distribution, and one to assign bandwidth to the internal
group -- which is then subject to its own bandwidth distribution
constraint.
This might be a little confusing, but there is no way around that
AFAICT.
> - This has the same problem with cgroup1's rt cgroup sched support where
> there is no way to have a permissive default configuration, which means
> that users who don't really care about distributing rt shares
> hierarchically would get blocked from running rt processes by default,
> which basically forces distros to disable rt cgroup sched support. This is
> not new but it'd be a shame to put in all the work and the end result is
> that most people don't even have access to the feature.
Right, but cgroup-v2 allows enabling/disabling specific controllers for
a (sub)-hierarchy, right? So if the controller is not enabled (by
default), it will fall back to putting the tasks in whatever parent does
have it on, and by default the root group would have and would accept
tasks.
Additionally, I think we want a flag to allow non-priv tasks to use RT
inside the controller -- after all, these tasks would be subject to
strict bandwidth controls and cannot burn the system like unbounded/root
FIFO tasks can.
Does that all sound workable?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-05 19:56 ` Tejun Heo
2026-05-07 10:53 ` Peter Zijlstra
@ 2026-05-07 14:30 ` luca abeni
2026-05-11 18:28 ` Tejun Heo
1 sibling, 1 reply; 19+ messages in thread
From: luca abeni @ 2026-05-07 14:30 UTC (permalink / raw)
To: Tejun Heo
Cc: Peter Zijlstra, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hi Tejun,
first of all, thanks for your comments! I think this is the kind of
discussion that we need to have...
Right now we have something that works "well enough" for real-time, but
we want to make it useful in general, so that distributions will not
disable it by default.
I need to better study your suggestions (I do not know cgroup v2
much...), but I have some questions to better understand possible
solutions:
On Tue, 5 May 2026 09:56:58 -1000
Tejun Heo <tj@kernel.org> wrote:
[...]
> - cgroup2 enforces that internal cgroups w/ controllers enabled
> cannot have threads in them. No need to enforce that separately.
>
> - However, the cpu controller is a threaded controller which means
> that it can have threaded sub-hierarchy where the no-internal-process
> rule doesn't apply. This was created explicitly for cpu controller.
> The proposed change blocks it effectively forcing cpu controller into
> regular domain controller behavior subject to no-internal-process
> rule. Note these are enforced at controller granularity and this
> means that users who use the threaded mode will be forced to pick
> between the two.
Just to better understand: would it make sense to allow non-{FIFO,RR}
tasks to be in non-leaf cgroups (as allowed by the threaded CPU
controller), while enforcing that FIFO/RR tasks can only be in leaf
cgroups? Or would this be a hack that compromises the rt-CPU controller
usefulness?
> - This has the same problem with cgroup1's rt cgroup sched support
> where there is no way to have a permissive default configuration,
> which means that users who don't really care about distributing rt
> shares hierarchically would get blocked from running rt processes by
> default, which basically forces distros to disable rt cgroup sched
> support. This is not new but it'd be a shame to put in all the work
> and the end result is that most people don't even have access to the
> feature.
Yes, we have a bad default here.
Would a default like "allow running FIFO/RR tasks without runtime
enforcement" (this is what happens to FIFO/RR tasks running in the root
control group) be acceptable?
Thanks,
Luca
>
> Here's my suggestion if there is desire for this to become something
> most people have easy access to:
>
> - Don't make it impossible to use in conjunction with other resource
> control mechanisms especially not CPU controller itself. Don't force
> people to choose between threaded mode and rt control. Allow them to
> co-exist in a reasonable manner.
>
> - The same in the wider scope. Don't let it get in the way of people
> who don't care about it. Compromising on interface / failure mode is
> better than people not being able to use it in most cases.
>
> Thanks.
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 10:53 ` Peter Zijlstra
@ 2026-05-07 15:03 ` Juri Lelli
2026-05-07 15:05 ` Peter Zijlstra
2026-05-07 16:39 ` luca abeni
2026-05-07 16:44 ` luca abeni
` (2 subsequent siblings)
3 siblings, 2 replies; 19+ messages in thread
From: Juri Lelli @ 2026-05-07 15:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
On 07/05/26 12:53, Peter Zijlstra wrote:
> On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
...
> > - However, the cpu controller is a threaded controller which means that it
> > can have threaded sub-hierarchy where the no-internal-process rule doesn't
> > apply. This was created explicitly for cpu controller. The proposed change
> > blocks it effectively forcing cpu controller into regular domain
> > controller behavior subject to no-internal-process rule. Note these are
> > enforced at controller granularity and this means that users who use the
> > threaded mode will be forced to pick between the two.
>
> Right... this then means we need two controls, one to do hierarchical
> bandwidth distribution, and one to assign bandwidth to the internal
> group -- which is then subject to its own bandwidth distribution
> constraint.
>
> This might be a little confusing, but there is no way around that
> AFAICT.
Just to check if I'm following, you are thinking something like below?
groupA/
cpu.rt.max = "50 50 100" <- 0.5 from root
cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at this
level
+ threadA <
+ threadB <
+- group1/
cpu.rt.max = "30 30 100" <- 0.3 from groupA
+ threadC
And we still keep it flat, so 2 dl-entities (per CPU), one handles
threads at groupA level and the other threads inside group1?
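
The arithmetic in the layout above can be sketched as a check that groupA's internal share plus its children's shares fits within groupA's own allocation. This is only an illustration: `fits_in_parent` is a hypothetical helper, and it works on the utilizations annotated in the diagram (0.5, 0.2, 0.3) rather than on the raw file values.

```python
from fractions import Fraction


def fits_in_parent(parent_max, internal, children):
    """Return True if internal + sum(children) stays within parent_max.

    All arguments are utilizations (runtime / period) as Fractions.
    """
    return internal + sum(children, Fraction(0)) <= parent_max


# groupA: cpu.rt.max -> 0.5, cpu.rt.internal -> 0.2, group1 -> 0.3
group_a_ok = fits_in_parent(Fraction(1, 2), Fraction(1, 5), [Fraction(3, 10)])
```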
Thanks,
Juri
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 15:03 ` Juri Lelli
@ 2026-05-07 15:05 ` Peter Zijlstra
2026-05-07 16:39 ` luca abeni
1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2026-05-07 15:05 UTC (permalink / raw)
To: Juri Lelli
Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
On Thu, May 07, 2026 at 05:03:41PM +0200, Juri Lelli wrote:
> On 07/05/26 12:53, Peter Zijlstra wrote:
> > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
>
> ...
>
> > > - However, the cpu controller is a threaded controller which means that it
> > > can have threaded sub-hierarchy where the no-internal-process rule doesn't
> > > apply. This was created explicitly for cpu controller. The proposed change
> > > blocks it effectively forcing cpu controller into regular domain
> > > controller behavior subject to no-internal-process rule. Note these are
> > > enforced at controller granularity and this means that users who use the
> > > threaded mode will be forced to pick between the two.
> >
> > Right... this then means we need two controls, one to do hierarchical
> > bandwidth distribution, and one to assign bandwidth to the internal
> > group -- which is then subject to its own bandwidth distribution
> > constraint.
> >
> > This might be a little confusing, but there is no way around that
> > AFAICT.
>
> Just to check if I'm following, you are thinking something like below?
>
> groupA/
> cpu.rt.max = "50 50 100" <- 0.5 from root
> cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at this
> level
> + threadA <
> + threadB <
> +- group1/
> cpu.rt.max = "30 30 100" <- 0.3 from groupA
> + threadC
>
> And we still keep it flat, so 2 dl-entities (per CPU), one handles
> threads at groupA level and the other threads inside group1?
Exactly!
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 15:03 ` Juri Lelli
2026-05-07 15:05 ` Peter Zijlstra
@ 2026-05-07 16:39 ` luca abeni
2026-05-11 9:29 ` Juri Lelli
1 sibling, 1 reply; 19+ messages in thread
From: luca abeni @ 2026-05-07 16:39 UTC (permalink / raw)
To: Juri Lelli
Cc: Peter Zijlstra, Tejun Heo, Yuri Andriaccio, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hi,
On Thu, 7 May 2026 17:03:41 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
> On 07/05/26 12:53, Peter Zijlstra wrote:
> > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
>
> ...
>
> > > - However, the cpu controller is a threaded controller which
> > > means that it can have threaded sub-hierarchy where the
> > > no-internal-process rule doesn't apply. This was created
> > > explicitly for cpu controller. The proposed change blocks it
> > > effectively forcing cpu controller into regular domain controller
> > > behavior subject to no-internal-process rule. Note these are
> > > enforced at controller granularity and this means that users who
> > > use the threaded mode will be forced to pick between the two.
> >
> > Right... this then means we need two controls, one to do
> > hierarchical bandwidth distribution, and one to assign bandwidth to
> > the internal group -- which is then subject to its own bandwidth
> > distribution constraint.
> >
> > This might be a little confusing, but there is no way around that
> > AFAICT.
>
> Just to check if I'm following, you are thinking something like below?
>
> groupA/
> cpu.rt.max = "50 50 100" <- 0.5 from root
> cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at
> this level
> + threadA <
> + threadB <
> +- group1/
> cpu.rt.max = "30 30 100" <- 0.3 from groupA
> + threadC
>
> And we still keep it flat, so 2 dl-entities (per CPU), one handles
> threads at groupA level and the other threads inside group1?
An alternative idea I was thinking about: we create 2 dl entities (one
for "groupA" and one for "group1"); we set cpu.rt.max for groupA, and
we subtract group1's utilization from it (so, if groupA's cpu.rt.max is
"50 100" and group1's cpu.rt.max is "30 100", groupA is served by a dl
entity (50-30,100)=(20,100) while group1 is served by a dl entity
(30,100)).
Basically, with this idea the "internal" reservation is automatically
computed based on rt.max and on the children cgroups. A possible issue
is that if the children consume all of groupA's utilization, groupA's
RT tasks are left with 0 runtime (and never execute).
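
As a rough sketch of this automatic computation (not code from the patchset): `implicit_internal` is a hypothetical helper, and it assumes the parent and all children share the same period, as in the (50,100) and (30,100) example. Mixed periods would need a utilization-based calculation instead.

```python
def implicit_internal(parent, children):
    """Compute the parent's internal reservation as (runtime, period).

    `parent` and each entry of `children` are (runtime, period) pairs.
    """
    runtime, period = parent
    if any(child_period != period for _, child_period in children):
        raise ValueError("sketch assumes a common period")
    left = runtime - sum(child_runtime for child_runtime, _ in children)
    if left < 0:
        raise ValueError("children exceed the parent's rt.max")
    return (left, period)
```

With children at (30,100) and (20,100) under a (50,100) parent, the result is (0,100), which is the starvation case mentioned above.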
Luca
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 10:53 ` Peter Zijlstra
2026-05-07 15:03 ` Juri Lelli
@ 2026-05-07 16:44 ` luca abeni
2026-05-11 9:40 ` luca abeni
2026-05-11 17:37 ` Tejun Heo
3 siblings, 0 replies; 19+ messages in thread
From: luca abeni @ 2026-05-07 16:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hi,
On Thu, 7 May 2026 12:53:31 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > - This has the same problem with cgroup1's rt cgroup sched support
> > where there is no way to have a permissive default configuration,
> > which means that users who don't really care about distributing rt
> > shares hierarchically would get blocked from running rt processes
> > by default, which basically forces distros to disable rt cgroup
> > sched support. This is not new but it'd be a shame to put in all
> > the work and the end result is that most people don't even have
> > access to the feature.
>
> Right, but cgroup-v2 allows enabling/disabling specific controllers
> for a (sub)-hierarchy, right? So if the controller is not enabled (by
> default), it will fall back to putting the tasks in whatever parent
> does have it on, and by default the root group would have and would
> accept tasks.
If I understand correctly, this is similar to what I was thinking about:
having a default that allows creating FIFO/RR tasks (and executing them
without runtime control, i.e. without being served by a dl server).
> Additionally, I think we want a flag to allow non-priv tasks to use RT
> inside the controller -- after all, these tasks would be subject to
> strict bandwidth controls and cannot burn the system like
> unbounded/root FIFO tasks can.
This is something Yuri and I wanted to propose as a follow-up patch,
once there is an agreement on the patchset (should be a pretty simple
change :)
Luca
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 16:39 ` luca abeni
@ 2026-05-11 9:29 ` Juri Lelli
2026-05-11 17:52 ` Tejun Heo
0 siblings, 1 reply; 19+ messages in thread
From: Juri Lelli @ 2026-05-11 9:29 UTC (permalink / raw)
To: luca abeni
Cc: Peter Zijlstra, Tejun Heo, Yuri Andriaccio, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
On 07/05/26 18:39, luca abeni wrote:
> Hi,
>
> On Thu, 7 May 2026 17:03:41 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
>
> > On 07/05/26 12:53, Peter Zijlstra wrote:
> > > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
> >
> > ...
> >
> > > > - However, the cpu controller is a threaded controller which
> > > > means that it can have threaded sub-hierarchy where the
> > > > no-internal-process rule doesn't apply. This was created
> > > > explicitly for cpu controller. The proposed change blocks it
> > > > effectively forcing cpu controller into regular domain controller
> > > > behavior subject to no-internal-process rule. Note these are
> > > > enforced at controller granularity and this means that users who
> > > > use the threaded mode will be forced to pick between the two.
> > >
> > > Right... this then means we need two controls, one to do
> > > hierarchical bandwidth distribution, and one to assign bandwidth to
> > > the internal group -- which is then subject to its own bandwidth
> > > distribution constraint.
> > >
> > > This might be a little confusing, but there is no way around that
> > > AFAICT.
> >
> > Just to check if I'm following, you are thinking something like below?
> >
> > groupA/
> > cpu.rt.max = "50 50 100" <- 0.5 from root
> > cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at
> > this level
> > + threadA <
> > + threadB <
> > +- group1/
> > cpu.rt.max = "30 30 100" <- 0.3 from groupA
> > + threadC
> >
> > And we still keep it flat, so 2 dl-entities (per CPU), one handles
> > threads at groupA level and the other threads inside group1?
>
> An alternative idea I was thinking about: we create 2 dl entities (one
> for "groupA" and one for "group1"); we set cpu.rt.max for groupA, and
> we subtract group1's utilization from it (so, if groupA's cpu.rt.max is
> "50 100" and group1's cpu.rt.max is "30 100", groupA is served by a dl
> entity (50-30,100)=(20,100) while group1 is served by a dl entity
> (30,100)).
>
> Basically, with this idea the "internal" reservation is automatically
> computed based on rt.max and on the children cgroups. A possible issue
> is that if the children consume all the groupA's utilization the groupA
> RT tasks remain with 0 runtime (and never execute).
While I like the automatic approach, I also fear that it might be more
difficult to maintain/use from a systemd admin perspective, e.g. I
cannot make a subgroup reservation bigger because there are threads
running in the parent group which consume all the remaining (internal)
bandwidth. If we make it explicit it seems easier to see where bandwidth
is allocated at all levels.
Peter? Tejun? What do we want to do with this interface?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 10:53 ` Peter Zijlstra
2026-05-07 15:03 ` Juri Lelli
2026-05-07 16:44 ` luca abeni
@ 2026-05-11 9:40 ` luca abeni
2026-05-11 18:15 ` Tejun Heo
2026-05-11 17:37 ` Tejun Heo
3 siblings, 1 reply; 19+ messages in thread
From: luca abeni @ 2026-05-11 9:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hi all,
On Thu, 7 May 2026 12:53:31 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > - This has the same problem with cgroup1's rt cgroup sched support
> > where there is no way to have a permissive default configuration,
> > which means that users who don't really care about distributing rt
> > shares hierarchically would get blocked from running rt processes
> > by default, which basically forces distros to disable rt cgroup
> > sched support. This is not new but it'd be a shame to put in all
> > the work and the end result is that most people don't even have
> > access to the feature.
>
> Right, but cgroup-v2 allows enabling/disabling specific controllers
> for a (sub)-hierarchy, right? So if the controller is not enabled (by
> default), it will fall back to putting the tasks in whatever parent
> does have it on, and by default the root group would have and would
> accept tasks.
We are discussing this issue with Yuri, and we have a doubt: if we
disable the RT-CPU controller for a cgroup, would it be possible to
enable it for its children?
(In other words: if we want the RT-CPU controller to be enabled for
some "leaf" cgroups, we need to enable it for their parents, right?)
Thanks,
Luca
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 10:53 ` Peter Zijlstra
` (2 preceding siblings ...)
2026-05-11 9:40 ` luca abeni
@ 2026-05-11 17:37 ` Tejun Heo
3 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2026-05-11 17:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hello, Peter.
On Thu, May 07, 2026 at 12:53:31PM +0200, Peter Zijlstra wrote:
...
> Looking at cpu_period_quota_parse() this thing takes two u64 values for:
> {runtime, period} but allows runtime to be the string "max".
>
> I think we'd want an optional extension to that and allow 3 values for:
> {runtime, period, deadline}, where if the deadline is not given, it will
> be the same as period.
Yeah, I don't know what's needed here but extending the interface as
necessary is completely fine.
> Right... this then means we need two controls, one to do hierarchical
> bandwidth distribution, and one to assign bandwidth to the internal
> group -- which is then subject to its own bandwidth distribution
> constraint.
>
> This might be a little confusing, but there is no way around that
> AFAICT.
Separating out the rt as a separate controller is one way and if the
configuration wants to stick to strict allocation model where nothing is
available by default unless explicitly allocated, this would be the only
way. Interface-wise, I think this is going to be fine but I suspect this
would likely complicate the internal implementation quite a bit as now rt can't
piggyback on existing sched core cgroup infra - no task_group or
synchronization built around them - and has to build everything on its own.
It's not the end of the world but not ideal either.
> > - This has the same problem with cgroup1's rt cgroup sched support where
> > there is no way to have a permissive default configuration, which means
> > that users who don't really care about distributing rt shares
> > hierarchically would get blocked from running rt processes by default,
> > which basically forces distros to disable rt cgroup sched support. This is
> > not new but it'd be a shame to put in all the work and the end result is
> > that most people don't even have access to the feature.
>
> Right, but cgroup-v2 allows enabling/disabling specific controllers for
> a (sub)-hierarchy, right? So if the controller is not enabled (by
> default), it will fall back to putting the tasks in whatever parent does
> have it on, and by default the root group would have and would accept
> tasks.
>
> Additionally, I think we want a flag to allow non-priv tasks to use RT
> inside the controller -- after all, these tasks would be subject to
> strict bandwidth controls and cannot burn the system like unbounded/root
> FIFO tasks can.
>
> Does that all sound workable?
Yeah, if rt becomes its own controller, I don't see any fundamental
roadblocks. It'd involve a bunch of churn which may add to maintenance
overhead but it should work. An alternative would be coming up with some way
to express the default no-enforcement state through the config knobs. I'm
sure this would be doable too and if folks can figure out a reasonable
interface, it should be able to obtain basically the same functionality with
a lot less code.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-11 9:29 ` Juri Lelli
@ 2026-05-11 17:52 ` Tejun Heo
0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2026-05-11 17:52 UTC (permalink / raw)
To: Juri Lelli
Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hello,
On Mon, May 11, 2026 at 11:29:47AM +0200, Juri Lelli wrote:
...
> While I like the automatic approach, I also fear that it might be more
> difficult to maintain/use from a systemd admin perspective, e.g. I
> cannot make a subgroup reservation bigger because there are threads
> running in the parent group which consume all the remaining (internal)
> bandwidth. If we make it explicit it seems easier to see where bandwidth
> is allocated at all levels.
>
> Peter? Tejun? What do we want to do with this interface?
blkcg on cgroup1 did something similar for a while. It had a separate subdir
for knobs that apply to "internal threads". Effectively, this becomes
creating a separate controller group for every cgroup as a sibling to its
children. It does work obviously but it is pretty ugly and unintuitive, both
in interface and implementation, and I'm skeptical this was actually useful
in any meaningful way. Nobody complained when we ripped it out.
If rt were to become its own cgroup controller, maybe one can just side-step
this by not supporting threaded mode at least at the beginning. If people
ask for it, hopefully we'll be able to develop better understanding of their
usecases and drive design that way. In practice, I don't think threaded mode
gets used all that much because usually only application processes
themselves know about their own threads, are not in the business of creating
their own cgroups (delegation to each application isn't common), and have
other ways of controlling their own threads. So, there's some chance that
this may not actually come up.
If rt stays as a part of cpu controller, my preference would be keeping the
config implicit for threaded mode at least at the beginning. ie. Don't get
in the way of people using threaded mode by blocking it but having some
reasonable and clear default (e.g. internal tasks have priority as suggested
or internal tasks get whatever is left over which may make more sense in the
allocation model) may be sufficient. If not, like in the other case, we can
make specific design decisions based on concrete use cases later.
Thanks.
--
tejun
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-11 9:40 ` luca abeni
@ 2026-05-11 18:15 ` Tejun Heo
0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2026-05-11 18:15 UTC (permalink / raw)
To: luca abeni
Cc: Peter Zijlstra, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
On Mon, May 11, 2026 at 11:40:04AM +0200, luca abeni wrote:
> We are discussing this issue with Yuri, and we have a doubt: if we
> disable the RT-CPU controller for a cgroup, would it be possible to
> enable it for its children?
> (In other words: if we want the RT-CPU controller to be enabled for
> some "leaf" cgroups, we need to enable it for their parents, right?)
Yeah, a cgroup has a controller available to it iff its parent enables that
controller, so all ancestors would have to enable it.
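A rough way to picture that rule, as a toy Python model rather than kernel
code. The `subtree_control` dict is a hypothetical stand-in for each cgroup's
cgroup.subtree_control file, and the function name is made up for this sketch:

```python
from typing import Dict, Set


def controller_available(path: str, controller: str,
                         subtree_control: Dict[str, Set[str]]) -> bool:
    """Return True if `controller` is available in the cgroup at `path`.

    `subtree_control` maps a cgroup path to the set of controllers that
    cgroup enables for its children (hypothetical model, not a kernel API).
    The root cgroup ("/") is treated as having every controller available.
    """
    parts = [p for p in path.split("/") if p]
    ancestor = "/"
    for part in parts:
        # Every ancestor on the way down must enable the controller.
        if controller not in subtree_control.get(ancestor, set()):
            return False
        ancestor = ancestor.rstrip("/") + "/" + part
    return True


tree = {"/": {"cpu"}, "/a": {"cpu"}, "/b": set()}
print(controller_available("/a/leaf", "cpu", tree))
print(controller_available("/b/leaf", "cpu", tree))
```

In this model a leaf under "/b" cannot get the controller because "/b" does
not enable it, even though the root does.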
Thanks.
--
tejun
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 14:30 ` luca abeni
@ 2026-05-11 18:28 ` Tejun Heo
2026-05-12 17:38 ` Yuri Andriaccio
0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2026-05-11 18:28 UTC (permalink / raw)
To: luca abeni
Cc: Peter Zijlstra, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hello,
On Thu, May 07, 2026 at 04:30:58PM +0200, luca abeni wrote:
...
> Just to better understand: would it make sense to allow non-{FIFO,RR}
> tasks to be in non-leaf cgroups (as allowed by the threaded CPU
> controller), while enforcing that FIFO/RR tasks can only be in leaf
> cgroups? Or would this be a hack that compromises the rt-CPU controller
> usefulness?
Code-wise, sure, but I don't think an interface like that would be a good
one. From user's pov, this amounts to adding restrictions on both whether a
controller can be enabled and whether tasks can be moved into some cgroups.
UNIX error reporting being what it is, this would come down to getting
-EINVAL or -EBUSY or whatever out of those operations. I don't think it's a
good idea to add subtle failure modes to these already pretty complex (but
currently w/ clearly-defined shared rules) operations. To users, this would
look like random arbitrary failures that are nearly impossible to decode
without tracing code.
If you want to enforce no-internal-threads, separating it out to its own
controller that doesn't support threaded mode would be the right direction.
Note that the only hard requirement here is that you don't want to get in
the way for people who are NOT interested in threaded rt control. If you
block enabling CPU control for e.g. cpu.max or block thread migration into a
cgroup, you'd be in the way; however, if all you say is "I don't support
sub-allocation in threaded mode" and e.g. just fail writes to the knobs in
threaded cgroups, that does not get in the way. So, it's not like you *have*
to support full threaded mode. You just need to avoid hindering non-rt
operations.
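A minimal sketch of the "fail the knob, don't block anything else" behaviour,
as a toy Python model. The class, the method names, and the choice of
EOPNOTSUPP as the error are all assumptions made for illustration; nothing
here is an existing kernel interface:

```python
import errno


class ToyCgroup:
    """Toy cgroup that refuses rt sub-allocation when threaded."""

    def __init__(self, threaded: bool = False):
        self.threaded = threaded
        self.rt_knob = None      # hypothetical rt allocation knob
        self.tasks = set()

    def write_rt_knob(self, value: str) -> None:
        # Only the rt knob itself fails in threaded cgroups.
        if self.threaded:
            raise OSError(errno.EOPNOTSUPP,
                          "rt sub-allocation not supported in threaded cgroups")
        self.rt_knob = value

    def attach_task(self, tid: int) -> None:
        # Thread migration is never blocked by the rt configuration.
        self.tasks.add(tid)


cg = ToyCgroup(threaded=True)
cg.attach_task(1234)             # still works
try:
    cg.write_rt_knob("100000 1000000")
except OSError as err:
    print("knob write failed:", err.strerror)
```

The point of the split is that users who never touch the rt knob see no new
failure modes in controller enabling or task migration.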
> Yes, we have a bad default here.
> Would a default like "allow running FIFO/RR tasks without runtime
> enforcement" (this is what happens to FIFO/RR tasks running in the root
> control group) be acceptable?
Yes, if you can express that in a reasonable way in the config knobs, that'd
likely be an easier way. I don't know how to transition from
allowed-by-default to explicitly-allocated in such interface tho. Making
that reasonable and smooth would be the key factor in whether such approach
can be taken.
Thanks.
--
tejun
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-11 18:28 ` Tejun Heo
@ 2026-05-12 17:38 ` Yuri Andriaccio
2026-05-12 18:19 ` Tejun Heo
0 siblings, 1 reply; 19+ messages in thread
From: Yuri Andriaccio @ 2026-05-12 17:38 UTC (permalink / raw)
To: Tejun Heo
Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel, hannes,
mkoutny, cgroups
Hello,
I've been thinking and experimenting with some of the ideas for the rt
controller, and I've come up with the following interface, keeping
everything in the standard cpu controller:
- cpu.rt.max <runtime_us> <period_us>
  Sets the bandwidth reserved to the hierarchy that has that specific
  cgroup as root, but does not set any deadline servers.
  The default value for this file is '0 0'.
- cpu.rt.min <runtime_us | 'root'> <period_us>
  If the runtime part is equal to 'root', the tasks are scheduled on the
  root runqueue.
  If the runtime is equal to zero, no FIFO/RR tasks can be scheduled.
  If the runtime is > zero, FIFO/RR tasks are scheduled under
  reservation/HCBS.
  This file is not available in the root cgroup, as it does not make use
  of dl-servers, rather only reserves the total bandwidth for the
  hierarchy.
  The default value for this file is 'root 0', meaning that tasks in this
  cgroup are by default scheduled on the root runqueue.
Of course you can imagine that all the admission tests have been updated
accordingly: for example, a cgroup's rt.max bw must be >= the sum of the
rt.max bws of its children + its own rt.min bw. I'm also skipping some
details which are only meaningful if we decide to adopt this solution.
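To make the admission rule concrete, here is a small Python sketch of the
check (a toy model, not the patch's code). The `Group` class and helper names
are hypothetical, and two points are assumptions of this sketch: a '0 0'
setting counts as zero bandwidth, and rt.min = 'root' (modelled as None)
reserves nothing:

```python
from dataclasses import dataclass, field
from fractions import Fraction
from typing import List, Optional, Tuple


def bandwidth(runtime_us: int, period_us: int) -> Fraction:
    """Bandwidth as runtime/period; a zero period is treated as zero bw."""
    if period_us == 0:
        return Fraction(0)
    return Fraction(runtime_us, period_us)


@dataclass
class Group:
    rt_max: Tuple[int, int] = (0, 0)            # <runtime_us> <period_us>
    rt_min: Optional[Tuple[int, int]] = None    # None models 'root'
    children: List["Group"] = field(default_factory=list)


def admissible(g: Group) -> bool:
    """rt.max bw >= sum(children rt.max bw) + own rt.min bw, recursively."""
    own = bandwidth(*g.rt_min) if g.rt_min is not None else Fraction(0)
    needed = own + sum((bandwidth(*c.rt_max) for c in g.children), Fraction(0))
    return bandwidth(*g.rt_max) >= needed and all(admissible(c) for c in g.children)


parent = Group(rt_max=(500_000, 1_000_000), rt_min=(100_000, 1_000_000),
               children=[Group(rt_max=(200_000, 1_000_000)),
                         Group(rt_max=(200_000, 1_000_000))])
print(admissible(parent))   # 0.5 >= 0.2 + 0.2 + 0.1
```

Adding a third child with any non-zero rt.max would push the sum over the
parent's 0.5 and the check would fail.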
What do you think of this interface?
Thanks,
Yuri
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-12 17:38 ` Yuri Andriaccio
@ 2026-05-12 18:19 ` Tejun Heo
2026-05-12 18:20 ` Tejun Heo
0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2026-05-12 18:19 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel, hannes,
mkoutny, cgroups
Hello,
How is a delegated subtree prevented from setting cpu.rt.min = 'root' and
escaping its ancestors' cpu.rt.max budget?
Thanks.
--
tejun
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-12 18:19 ` Tejun Heo
@ 2026-05-12 18:20 ` Tejun Heo
2026-05-13 12:08 ` Yuri Andriaccio
0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2026-05-12 18:20 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel, hannes,
mkoutny, cgroups
On Tue, May 12, 2026 at 08:19:02AM -1000, Tejun Heo wrote:
> How is a delegated subtree prevented from setting cpu.rt.min = 'root' and
> escaping its ancestors' cpu.rt.max budget?
Hmm.. I guess the same problem exists w/ separate rt controller too. If the
users on the system already started using rt, how do you enable the
controller from the top down with budgets already being used down in the
hierarchy?
Thanks.
--
tejun
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-12 18:20 ` Tejun Heo
@ 2026-05-13 12:08 ` Yuri Andriaccio
2026-05-13 19:10 ` Tejun Heo
0 siblings, 1 reply; 19+ messages in thread
From: Yuri Andriaccio @ 2026-05-13 12:08 UTC (permalink / raw)
To: Tejun Heo
Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel, hannes,
mkoutny, cgroups
Hello,
> How is a delegated subtree prevented from setting cpu.rt.min = 'root' and
> escaping its ancestors' cpu.rt.max budget?
Is it strictly required that a child cgroup must have 'less runtime'
than its parent? To be more precise I mean scheduling tasks on the root
runqueue instead of using dl-servers. Small note: given that HCBS
cgroups use dl-servers, and thus run at higher priority than FIFO/RR
tasks scheduled on the root runqueue, a cgroup whose rt.min is 'root'
would indeed escape its ancestors' budget, but it may also get starved
because of the priority levels.
If we require that child cgroups cannot escape their parent's bandwidth,
even when using 'root', then the cpu.rt.max file must be disallowed in
the root cgroup (removing the possibility to reserve bandwidth for HCBS,
and so doing the admission test similarly to when SCHED_DEADLINE tasks
are executed), and cpu.rt.max would use either 'root' if the whole
subtree must be scheduled onto the root runqueue or a <runtime> <period>
combination to reserve bandwidth for the whole subtree. The cpu.rt.min
would then only be used to reserve internal bandwidth for the cgroup
itself. This also means that a whole subtree either uses HCBS everywhere
or the root runqueue everywhere.
> If the users on the system already started using rt, how do you enable
> the controller from the top down with budgets already being used down
> in the hierarchy?
In my original idea rt tasks would only interfere with their own cgroup
configuration, but not with the subtree or their parents. When
cpu.rt.min = 'root', you are free to change cpu.rt.max values to
whatever you like in any place of the hierarchy, and tasks inside the
rt.min = 'root' cgroup would not be affected as they are run in the root
runqueue.
If you want to switch a cgroup from/to 'root' and HCBS, you'd have to
either move all the RT tasks out of the cgroup, set rt.min, and then
move them back in, or change temporarily their scheduling policy to
non-rt (SCHED_OTHER, SCHED_DEADLINE, whatever) and then back.
Hopefully I've answered your questions. Which solution do you think
makes the most sense?
Yuri
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-13 12:08 ` Yuri Andriaccio
@ 2026-05-13 19:10 ` Tejun Heo
0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2026-05-13 19:10 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel, hannes,
mkoutny, cgroups
Hello,
On Wed, May 13, 2026 at 02:08:52PM +0200, Yuri Andriaccio wrote:
> > How is a delegated subtree prevented from setting cpu.rt.min = 'root' and
> > escaping its ancestors' cpu.rt.max budget?
>
> Is it strictly required that a child cgroup must have 'less runtime' than
> its parent? To be more precise I mean scheduling tasks on the root runqueue
> instead of using dl-servers. Small note: given that HCBS cgroups use
> dl-servers, and thus run at higher priority than FIFO/RR scheduled on the
> root runqueue, if a cgroup rt.min is 'root' would yes escape its ancestor
> budget but it may also possibly get starved because of the priority levels.
The high-level invariant that we must maintain is that any given cgroup has
control over resource usages in its subtree. If that doesn't work, the whole
thing is not very useful.
e.g. There are multiple containers in the system and each wants to manage
its own internal resource distribution, which is a relatively common
scenario in server deployments. This is implemented by putting each
container in a cgroup and delegating the sub-tree to the nested container
manager. At the host level, you don't know or have control over what's going
on in each container but you can control how much each container consumes in
total so that each gets what it's allotted and doesn't get in the way of
others.
While the delegation scenario is a clear example, even in regular usages, it
gets really confusing if hierarchical resource distribution isn't actually
hierarchical. If you let a child escape to root at its own discretion, might
as well just not have all the complexities with hierarchical resource
control.
> If we require that child cgroups cannot escape their parent's bandwidth,
> even when using 'root', then the cpu.rt.max file must be disallowed in the
> root cgroup (removing the possibility to reserve bandwidth for HCBS, and so
> doing the admission test similarly to when SCHED_DEADLINE tasks are
> executed), and cpu.rt.max would use either 'root' if the whole subtree must
> be scheduled onto the root runqueue or a <runtime> <period> combination to
> reserve bandwidth for the whole subtree. The cpu.rt.min would then only be
> used to reserve internal bandwidth for the cgroup itself. This also means
> that a whole subtree either uses HCBS everywhere or the root runqueue
> everywhere.
>
> > If the users on the system already started using rt, how do you enable the
> > controller from the top down with budgets already being used down in the
> > hierarchy?
>
> In my original idea rt tasks would only interfere with their own cgroup
> configuration, but not with the subtree or their parents. When cpu.rt.min =
> 'root', you are free to change cpu.rt.max values to whatever you like in any
> place of the hierarchy, and tasks inside the rt.min = 'root' cgroup would
> not be affected as they are run in the root runqueue.
>
> If you want to switch a cgroup from/to 'root' and HCBS, you'd have to either
> move all the RT tasks out of the cgroup, set rt.min, and then move them back
> in, or change temporarily their scheduling policy to non-rt (SCHED_OTHER,
> SCHED_DEADLINE, whatever) and then back.
>
> Hopefully I've answered your questions. Which solution do you think makes
> the most sense?
I'm not sure either makes sense. There's not much point in having
hierarchical controller in the first one (just require direct system-level
distribution) and I don't think the second one is very useable. I mean, try
to imagine being a user. You have to hunt down all rt tasks and twiddle
every one one way or another to change some config and then have to worry
about racing forks and class changes. At that point, you might as well just
control it centrally without the hierarchical stuff. You'd have to be really
dedicated or desperate, which means that not many are going to use it which
then brings up the question why are we doing this at all?
I wonder whether this can just be a regular max interface - ie. limit
maximum reservation in the subtree rather than exact reservation allocation.
Then cgroup can report total reservations in the subtree and admission
control can just reject anything going over.
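One way to read that idea, sketched as a toy Python model (not kernel code).
The `Cgroup` class, `rt_max`, `subtree_reserved`, and `try_reserve` are
hypothetical names; the sketch assumes that a reservation is charged to the
cgroup it is made in and to every ancestor, and that a cgroup with no limit
set imposes none:

```python
from fractions import Fraction
from typing import Optional


class Cgroup:
    def __init__(self, name: str, parent: Optional["Cgroup"] = None,
                 rt_max: Optional[Fraction] = None):
        self.name = name
        self.parent = parent
        self.rt_max = rt_max                  # None: no limit at this level
        self.subtree_reserved = Fraction(0)   # reported total for the subtree

    def try_reserve(self, bw: Fraction) -> bool:
        """Admit `bw` only if no ancestor's max would be exceeded."""
        node = self
        while node is not None:
            if node.rt_max is not None and node.subtree_reserved + bw > node.rt_max:
                return False                  # admission control rejects
            node = node.parent
        node = self
        while node is not None:               # charge the whole ancestor chain
            node.subtree_reserved += bw
            node = node.parent
        return True


host = Cgroup("host", rt_max=Fraction(1, 2))
container = Cgroup("container", host, rt_max=Fraction(3, 10))
inner = Cgroup("inner", container)            # delegated, no own limit

print(inner.try_reserve(Fraction(1, 5)))      # fits under 3/10
print(inner.try_reserve(Fraction(1, 5)))      # would reach 2/5 > 3/10
print(container.subtree_reserved)
```

Here the delegated "inner" cgroup can distribute reservations however it
likes, but it can never exceed what "container" was given by the host.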
Thanks.
--
tejun
end of thread, other threads:[~2026-05-13 19:10 UTC | newest]
Thread overview: 19+ messages
[not found] <20260430213835.62217-1-yurand2000@gmail.com>
[not found] ` <20260430213835.62217-21-yurand2000@gmail.com>
2026-05-05 15:15 ` [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups Peter Zijlstra
2026-05-05 19:56 ` Tejun Heo
2026-05-07 10:53 ` Peter Zijlstra
2026-05-07 15:03 ` Juri Lelli
2026-05-07 15:05 ` Peter Zijlstra
2026-05-07 16:39 ` luca abeni
2026-05-11 9:29 ` Juri Lelli
2026-05-11 17:52 ` Tejun Heo
2026-05-07 16:44 ` luca abeni
2026-05-11 9:40 ` luca abeni
2026-05-11 18:15 ` Tejun Heo
2026-05-11 17:37 ` Tejun Heo
2026-05-07 14:30 ` luca abeni
2026-05-11 18:28 ` Tejun Heo
2026-05-12 17:38 ` Yuri Andriaccio
2026-05-12 18:19 ` Tejun Heo
2026-05-12 18:20 ` Tejun Heo
2026-05-13 12:08 ` Yuri Andriaccio
2026-05-13 19:10 ` Tejun Heo