* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 10:53 ` Peter Zijlstra
@ 2026-05-07 15:03 ` Juri Lelli
2026-05-07 15:05 ` Peter Zijlstra
2026-05-07 16:39 ` luca abeni
2026-05-07 16:44 ` luca abeni
` (2 subsequent siblings)
3 siblings, 2 replies; 17+ messages in thread
From: Juri Lelli @ 2026-05-07 15:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
On 07/05/26 12:53, Peter Zijlstra wrote:
> On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
...
> > - However, the cpu controller is a threaded controller which means that it
> > can have threaded sub-hierarchy where the no-internal-process rule doesn't
> > apply. This was created explicitly for cpu controller. The proposed change
> > blocks it effectively forcing cpu controller into regular domain
> > controller behavior subject to no-internal-process rule. Note these are
> > enforced at controller granularity and this means that users who use the
> > threaded mode will be forced to pick between the two.
>
> Right... this then means we need two controls, one to do hierarchical
> bandwidth distribution, and one to assign bandwidth to the internal
> group -- which is then subject to its own bandwidth distribution
> constraint.
>
> This might be a little confusing, but there is no way around that
> AFAICT.
Just to check if I'm following, you are thinking something like below?
groupA/
  cpu.rt.max = "50 50 100" <- 0.5 from root
  cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at this
                                   level
  + threadA <
  + threadB <
  +- group1/
       cpu.rt.max = "30 30 100" <- 0.3 from groupA
       + threadC
And we still keep it flat, so 2 dl-entities (per CPU), one handles
threads at groupA level and the other threads inside group1?
Thanks,
Juri
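The numbers in the layout above can be checked with a little arithmetic. What follows is only a sketch: it assumes the values are "runtime deadline period" triples (inferred from the "0.5 from root" annotation) and that the constraint is "internal reservation plus children's cpu.rt.max must not exceed the group's own cpu.rt.max". The thread does not spell out either point, and the function names are hypothetical.

```python
from fractions import Fraction

def utilization(value: str) -> Fraction:
    """Bandwidth of a 'runtime deadline period' triple (assumed field order)."""
    runtime, _deadline, period = (int(x) for x in value.split())
    return Fraction(runtime, period)

def fits(group_max: str, internal: str, children_max: list) -> bool:
    """Assumed rule: internal + sum(children) <= the group's cpu.rt.max."""
    used = utilization(internal)
    for child in children_max:
        used += utilization(child)
    return used <= utilization(group_max)

# Values from the example: groupA gets 0.5, of which 0.2 is internal
# and 0.3 goes to group1.
print(fits("50 50 100", "20 20 100", ["30 30 100"]))  # True
# Raising the internal share to 0.3 would overcommit groupA.
print(fits("50 50 100", "30 30 100", ["30 30 100"]))  # False
```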
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 15:03 ` Juri Lelli
@ 2026-05-07 15:05 ` Peter Zijlstra
2026-05-07 16:39 ` luca abeni
1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-07 15:05 UTC (permalink / raw)
To: Juri Lelli
Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
On Thu, May 07, 2026 at 05:03:41PM +0200, Juri Lelli wrote:
> On 07/05/26 12:53, Peter Zijlstra wrote:
> > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
>
> ...
>
> > > - However, the cpu controller is a threaded controller which means that it
> > > can have threaded sub-hierarchy where the no-internal-process rule doesn't
> > > apply. This was created explicitly for cpu controller. The proposed change
> > > blocks it effectively forcing cpu controller into regular domain
> > > controller behavior subject to no-internal-process rule. Note these are
> > > enforced at controller granularity and this means that users who use the
> > > threaded mode will be forced to pick between the two.
> >
> > Right... this then means we need two controls, one to do hierarchical
> > bandwidth distribution, and one to assign bandwidth to the internal
> > group -- which is then subject to its own bandwidth distribution
> > constraint.
> >
> > This might be a little confusing, but there is no way around that
> > AFAICT.
>
> Just to check if I'm following, you are thinking something like below?
>
> groupA/
>   cpu.rt.max = "50 50 100" <- 0.5 from root
>   cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at this
>                                    level
>   + threadA <
>   + threadB <
>   +- group1/
>        cpu.rt.max = "30 30 100" <- 0.3 from groupA
>        + threadC
>
> And we still keep it flat, so 2 dl-entities (per CPU), one handles
> threads at groupA level and the other threads inside group1?
Exactly!
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 15:03 ` Juri Lelli
2026-05-07 15:05 ` Peter Zijlstra
@ 2026-05-07 16:39 ` luca abeni
2026-05-11 9:29 ` Juri Lelli
1 sibling, 1 reply; 17+ messages in thread
From: luca abeni @ 2026-05-07 16:39 UTC (permalink / raw)
To: Juri Lelli
Cc: Peter Zijlstra, Tejun Heo, Yuri Andriaccio, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hi,
On Thu, 7 May 2026 17:03:41 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
> On 07/05/26 12:53, Peter Zijlstra wrote:
> > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
>
> ...
>
> > > - However, the cpu controller is a threaded controller which
> > > means that it can have threaded sub-hierarchy where the
> > > no-internal-process rule doesn't apply. This was created
> > > explicitly for cpu controller. The proposed change blocks it
> > > effectively forcing cpu controller into regular domain controller
> > > behavior subject to no-internal-process rule. Note these are
> > > enforced at controller granularity and this means that users who
> > > use the threaded mode will be forced to pick between the two.
> >
> > Right... this then means we need two controls, one to do
> > hierarchical bandwidth distribution, and one to assign bandwidth to
> > the internal group -- which is then subject to its own bandwidth
> > distribution constraint.
> >
> > This might be a little confusing, but there is no way around that
> > AFAICT.
>
> Just to check if I'm following, you are thinking something like below?
>
> groupA/
>   cpu.rt.max = "50 50 100" <- 0.5 from root
>   cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at
>                                    this level
>   + threadA <
>   + threadB <
>   +- group1/
>        cpu.rt.max = "30 30 100" <- 0.3 from groupA
>        + threadC
>
> And we still keep it flat, so 2 dl-entities (per CPU), one handles
> threads at groupA level and the other threads inside group1?
An alternative idea I was thinking about: we create 2 dl entities (one
for "groupA" and one for "group1"); we set cpu.rt.max for groupA, and
we subtract group1's utilization from it (so, if groupA's cpu.rt.max is
"50 100" and group1's cpu.rt.max is "30 100", groupA is served by a dl
entity (50-30,100)=(20,100) while group1 is served by a dl entity
(30,100)).
Basically, with this idea the "internal" reservation is automatically
computed based on rt.max and on the children cgroups. A possible issue
is that if the children consume all the groupA's utilization the groupA
RT tasks remain with 0 runtime (and never execute).
Luca
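The automatic variant described above amounts to a subtraction. The sketch below reproduces the (50-30,100)=(20,100) example; it assumes every reservation uses the same period, which holds in the example but is not stated as a requirement in the thread. The function name is hypothetical.

```python
def internal_reservation(parent, children):
    """Return the (runtime, period) left for the parent's own threads.

    Sketch only: assumes all reservations share one period.
    """
    runtime, period = parent
    for child_runtime, child_period in children:
        if child_period != period:
            raise ValueError("this sketch only handles equal periods")
        runtime -= child_runtime
    if runtime < 0:
        raise ValueError("children exceed the parent's cpu.rt.max")
    return (runtime, period)

# groupA has "50 100", group1 has "30 100": groupA's threads get (20, 100).
print(internal_reservation((50, 100), [(30, 100)]))             # (20, 100)
# If the children use everything, the parent's RT tasks get 0 runtime.
print(internal_reservation((50, 100), [(30, 100), (20, 100)]))  # (0, 100)
```

The second call shows the issue raised in the message: nothing in the subtraction stops the children from leaving groupA's own tasks with no runtime.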
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 16:39 ` luca abeni
@ 2026-05-11 9:29 ` Juri Lelli
2026-05-11 17:52 ` Tejun Heo
0 siblings, 1 reply; 17+ messages in thread
From: Juri Lelli @ 2026-05-11 9:29 UTC (permalink / raw)
To: luca abeni
Cc: Peter Zijlstra, Tejun Heo, Yuri Andriaccio, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
On 07/05/26 18:39, luca abeni wrote:
> Hi,
>
> On Thu, 7 May 2026 17:03:41 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
>
> > On 07/05/26 12:53, Peter Zijlstra wrote:
> > > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
> >
> > ...
> >
> > > > - However, the cpu controller is a threaded controller which
> > > > means that it can have threaded sub-hierarchy where the
> > > > no-internal-process rule doesn't apply. This was created
> > > > explicitly for cpu controller. The proposed change blocks it
> > > > effectively forcing cpu controller into regular domain controller
> > > > behavior subject to no-internal-process rule. Note these are
> > > > enforced at controller granularity and this means that users who
> > > > use the threaded mode will be forced to pick between the two.
> > >
> > > Right... this then means we need two controls, one to do
> > > hierarchical bandwidth distribution, and one to assign bandwidth to
> > > the internal group -- which is then subject to its own bandwidth
> > > distribution constraint.
> > >
> > > This might be a little confusing, but there is no way around that
> > > AFAICT.
> >
> > Just to check if I'm following, you are thinking something like below?
> >
> > groupA/
> >   cpu.rt.max = "50 50 100" <- 0.5 from root
> >   cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at
> >                                    this level
> >   + threadA <
> >   + threadB <
> >   +- group1/
> >        cpu.rt.max = "30 30 100" <- 0.3 from groupA
> >        + threadC
> >
> > And we still keep it flat, so 2 dl-entities (per CPU), one handles
> > threads at groupA level and the other threads inside group1?
>
> An alternative idea I was thinking about: we create 2 dl entities (one
> for "groupA" and one for "group1"); we set cpu.rt.max for groupA, and
> we subtract group1's utilization from it (so, if groupA's cpu.rt.max is
> "50 100" and group1's cpu.rt.max is "30 100", groupA is served by a dl
> entity (50-30,100)=(20,100) while group1 is served by a dl entity
> (30,100)).
>
> Basically, with this idea the "internal" reservation is automatically
> computed based on rt.max and on the children cgroups. A possible issue
> is that if the children consume all the groupA's utilization the groupA
> RT tasks remain with 0 runtime (and never execute).
While I like the automatic approach, I also fear that it might be more
difficult to maintain/use from a systemd admin perspective, e.g. I
cannot make a subgroup reservation bigger because there are threads
running in the parent group which consume all the remaining (internal)
bandwidth. If we make it explicit it seems easier to see where bandwidth
is allocated at all levels.
Peter? Tejun? What do we want to do with this interface?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-11 9:29 ` Juri Lelli
@ 2026-05-11 17:52 ` Tejun Heo
0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2026-05-11 17:52 UTC (permalink / raw)
To: Juri Lelli
Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hello,
On Mon, May 11, 2026 at 11:29:47AM +0200, Juri Lelli wrote:
...
> While I like the automatic approach, I also fear that it might be more
> difficult to maintain/use from a systemd admin perspective, e.g. I
> cannot make a subgroup reservation bigger because there are threads
> running in the parent group which consume all the remaining (internal)
> bandwidth. If we make it explicit it seems easier to see where bandwidth
> is allocated at all levels.
>
> Peter? Tejun? What do we want to do with this interface?
blkcg on cgroup1 did something similar for a while. It had a separate subdir
for knobs that apply to "internal threads". Effectively, this becomes
creating a separate controller group for every cgroup as a sibling to its
children. It does work obviously but it is pretty ugly and unintuitive, both
in interface and implementation, and I'm skeptical this was actually useful
in any meaningful way. Nobody complained when we ripped it out.
If rt were to become its own cgroup controller, maybe one can just side-step
this by not supporting threaded mode at least at the beginning. If people
ask for it, hopefully we'll be able to develop better understanding of their
use cases and drive design that way. In practice, I don't think threaded mode
gets used all that much because usually only application processes
themselves know about their own threads, are not in the business of creating
their own cgroups (delegation to each application isn't common), and have
other ways of controlling their own threads. So, there's some chance that
this may not actually come up.
If rt stays as a part of cpu controller, my preference would be keeping the
config implicit for threaded mode at least at the beginning, i.e. don't get
in the way of people using threaded mode by blocking it but having some
reasonable and clear default (e.g. internal tasks have priority as suggested
or internal tasks get whatever is left over which may make more sense in the
allocation model) may be sufficient. If not, like in the other case, we can
make specific design decisions based on concrete use cases later.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 10:53 ` Peter Zijlstra
2026-05-07 15:03 ` Juri Lelli
@ 2026-05-07 16:44 ` luca abeni
2026-05-11 9:40 ` luca abeni
2026-05-11 17:37 ` Tejun Heo
3 siblings, 0 replies; 17+ messages in thread
From: luca abeni @ 2026-05-07 16:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hi,
On Thu, 7 May 2026 12:53:31 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > - This has the same problem with cgroup1's rt cgroup sched support
> > where there is no way to have a permissive default configuration,
> > which means that users who don't really care about distributing rt
> > shares hierarchically would get blocked from running rt processes
> > by default, which basically forces distros to disable rt cgroup
> > sched support. This is not new but it'd be a shame to put in all
> > the work and the end result is that most people don't even have
> > access to the feature.
>
> Right, but cgroup-v2 allows enabling/disabling specific controllers
> for a (sub)-hierarchy, right? So if the controller is not enabled (by
> default), it will fall back to putting the tasks in whatever parent
> does have it on, and by default the root group would have and would
> accept tasks.
If I understand correctly, this is similar to what I was thinking about:
having a default that allows creating FIFO/RR tasks (and executing them
without runtime control - so, without being served by a dl server).
> Additionally, I think we want a flag to allow non-priv tasks to use RT
> inside the controller -- after all, these tasks would be subject to
> strict bandwidth controls and cannot burn the system like
> unbounded/root FIFO tasks can.
This is something Yuri and I wanted to propose as a follow-up patch,
once there is an agreement on the patchset (should be a pretty simple
change :)
Luca
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 10:53 ` Peter Zijlstra
2026-05-07 15:03 ` Juri Lelli
2026-05-07 16:44 ` luca abeni
@ 2026-05-11 9:40 ` luca abeni
2026-05-11 18:15 ` Tejun Heo
2026-05-11 17:37 ` Tejun Heo
3 siblings, 1 reply; 17+ messages in thread
From: luca abeni @ 2026-05-11 9:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hi all,
On Thu, 7 May 2026 12:53:31 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > - This has the same problem with cgroup1's rt cgroup sched support
> > where there is no way to have a permissive default configuration,
> > which means that users who don't really care about distributing rt
> > shares hierarchically would get blocked from running rt processes
> > by default, which basically forces distros to disable rt cgroup
> > sched support. This is not new but it'd be a shame to put in all
> > the work and the end result is that most people don't even have
> > access to the feature.
>
> Right, but cgroup-v2 allows enabling/disabling specific controllers
> for a (sub)-hierarchy, right? So if the controller is not enabled (by
> default), it will fall back to putting the tasks in whatever parent
> does have it on, and by default the root group would have and would
> accept tasks.
We are discussing this issue with Yuri, and we have a doubt: if we
disable the RT-CPU controller for a cgroup, would it be possible to
enable it for its children?
(In other words: if we want the RT-CPU controller to be enabled for
some "leaf" cgroups, we need to enable it for their parents, right?)
Thanks,
Luca
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-11 9:40 ` luca abeni
@ 2026-05-11 18:15 ` Tejun Heo
0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2026-05-11 18:15 UTC (permalink / raw)
To: luca abeni
Cc: Peter Zijlstra, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
On Mon, May 11, 2026 at 11:40:04AM +0200, luca abeni wrote:
> We are discussing this issue with Yuri, and we have a doubt: if we
> disable the RT-CPU controller for a cgroup, would it be possible to
> enable it for its children?
> (In other words: if we want the RT-CPU controller to be enabled for
> some "leaf" cgroups, we need to enable it for their parents, right?)
Yeah, a cgroup has a controller available to it iff its parent enables that
controller, so all ancestors would have to enable it.
Thanks.
--
tejun
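The rule stated above (a controller is available in a cgroup only if the parent enables it, so every ancestor has to enable it) can be modelled in a few lines. This is a toy model, not kernel code: the `Cgroup` class and its methods are hypothetical, `subtree_control` mirrors the cgroup v2 `cgroup.subtree_control` file, and "cpu" stands in for whichever controller ends up carrying the rt knobs.

```python
class Cgroup:
    def __init__(self, name, parent=None, root_controllers=()):
        self.name = name
        self.parent = parent
        self.subtree_control = set()          # controllers enabled for children
        self._root_controllers = set(root_controllers)

    def controllers(self):
        """Controllers available here (what cgroup.controllers would list)."""
        if self.parent is None:
            return set(self._root_controllers)
        return set(self.parent.subtree_control)

    def enable(self, controller):
        """Enable a controller for the children of this cgroup."""
        if controller not in self.controllers():
            raise ValueError(f"{controller} is not available in {self.name}")
        self.subtree_control.add(controller)

root = Cgroup("root", root_controllers={"cpu"})
parent = Cgroup("parent", root)
leaf = Cgroup("leaf", parent)

root.enable("cpu")
print("cpu" in parent.controllers())  # True
print("cpu" in leaf.controllers())    # False: parent has not enabled it yet
parent.enable("cpu")
print("cpu" in leaf.controllers())    # True
```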
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 10:53 ` Peter Zijlstra
` (2 preceding siblings ...)
2026-05-11 9:40 ` luca abeni
@ 2026-05-11 17:37 ` Tejun Heo
3 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2026-05-11 17:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hello, Peter.
On Thu, May 07, 2026 at 12:53:31PM +0200, Peter Zijlstra wrote:
...
> Looking at cpu_period_quota_parse() this thing takes two u64 values for:
> {runtime, period} but allows runtime to be the string "max".
>
> I think we'd want an optional extension to that and allow 3 values for:
> {runtime, period, deadline}, where if the deadline is not given, it will
> be the same as period.
Yeah, I don't know what's needed here but extending the interface as
necessary is completely fine.
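For illustration, the optional third value could be parsed along these lines. This is a Python model of the idea, not the kernel's cpu_period_quota_parse(); the field order {runtime, period, deadline} follows the quoted proposal, while the `RtBandwidth` type, the function name, and the validation rules are assumptions.

```python
from typing import NamedTuple, Optional

class RtBandwidth(NamedTuple):
    runtime: Optional[int]  # None stands for the string "max"
    period: int
    deadline: int

def parse_rt_max(buf: str) -> RtBandwidth:
    """Parse 'runtime period [deadline]'; deadline defaults to period."""
    fields = buf.split()
    if len(fields) not in (2, 3):
        raise ValueError("expected 'runtime period [deadline]'")
    runtime = None if fields[0] == "max" else int(fields[0])
    period = int(fields[1])
    deadline = int(fields[2]) if len(fields) == 3 else period
    if period <= 0 or deadline <= 0:
        raise ValueError("period and deadline must be positive")
    if runtime is not None and runtime < 0:
        raise ValueError("runtime must not be negative")
    return RtBandwidth(runtime, period, deadline)

print(parse_rt_max("50 100"))      # deadline falls back to the period
print(parse_rt_max("50 100 80"))   # explicit deadline
print(parse_rt_max("max 100"))     # runtime given as "max"
```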
> Right... this then means we need two controls, one to do hierarchical
> bandwidth distribution, and one to assign bandwidth to the internal
> group -- which is then subject to its own bandwidth distribution
> constraint.
>
> This might be a little confusing, but there is no way around that
> AFAICT.
Separating out the rt as a separate controller is one way and if the
configuration wants to stick to strict allocation model where nothing is
available by default unless explicitly allocated, this would be the only
way. Interface-wise, I think this is going to be fine but I suspect this
would likely complicate the internal implementation quite a bit as now rt can't
piggyback on existing sched core cgroup infra - no task_group or
synchronization built around them - and has to build everything on its own.
It's not the end of the world but not ideal either.
> > - This has the same problem with cgroup1's rt cgroup sched support where
> > there is no way to have a permissive default configuration, which means
> > that users who don't really care about distributing rt shares
> > hierarchically would get blocked from running rt processes by default,
> > which basically forces distros to disable rt cgroup sched support. This is
> > not new but it'd be a shame to put in all the work and the end result is
> > that most people don't even have access to the feature.
>
> Right, but cgroup-v2 allows enabling/disabling specific controllers for
> a (sub)-hierarchy, right? So if the controller is not enabled (by
> default), it will fall back to putting the tasks in whatever parent does
> have it on, and by default the root group would have and would accept
> tasks.
>
> Additionally, I think we want a flag to allow non-priv tasks to use RT
> inside the controller -- after all, these tasks would be subject to
> strict bandwidth controls and cannot burn the system like unbounded/root
> FIFO tasks can.
>
> Does that all sound workable?
Yeah, if rt becomes its own controller, I don't see any fundamental
roadblocks. It'd involve a bunch of churn which may add to maintenance
overhead but it should work. An alternative would be coming up with some way
to express the default no-enforcement state through the config knobs. I'm
sure this would be doable too and if folks can figure out a reasonable
interface, it should be able to obtain basically the same functionality with
a lot less code.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 17+ messages in thread