Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups

Linux cgroups development

* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
       [not found] ` <20260430213835.62217-21-yurand2000@gmail.com>
@ 2026-05-05 15:15   ` Peter Zijlstra
  2026-05-05 19:56     ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-05 15:15 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Yuri Andriaccio, tj, hannes, mkoutny,
	cgroups

On Thu, Apr 30, 2026 at 11:38:24PM +0200, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
> 
> Allow for cgroup hierarchies with more than two levels.
> 
> Introduce the concept of live and active groups:
> - A group is live if it is a leaf group or if all its children have zero
>   runtime.
> - A live group with non-zero runtime can be used to schedule tasks.
> - An active cgroup is a live group with running tasks.
> - A non-live group cannot be used to run tasks; it is only used for
>   bandwidth accounting, i.e. the sum of its children's bandwidth must be
>   less than or equal to the bandwidth of the parent. This allows cgroups
>   to be used for bandwidth management for different users.
> - While the root cgroup specifies the total allocatable bandwidth of rt
>   cgroups, a further accounting is performed to keep track of the live
>   bandwidth, i.e. the sum of the bandwidth of live groups. The hierarchy
>   invariant states that the live bandwidth must always be less than or
>   equal to the total allocatable bw.
> 
> Add is_live_sched_group() and sched_group_has_live_siblings() in
> deadline.c. These utility functions are used by dl_init_tg to perform
> updates only when necessary:
> - Only live groups may update the active dl bandwidth of dl entities
>   (call to dl_rq_change_utilization), while non-live groups must not use
>   servers, and thus must not change the active dl bandwidth.
> - The total bandwidth accounting must be changed to follow the
>   live/non-live rules:
>   - When disabling (runtime zero) the last child of a group, the parent
>     becomes a live group, and so the parent's bw must be accounted back.
>   - When enabling (runtime non-zero) the first child, the parent becomes a
>     non-live group, and so the parent's bandwidth must be removed.
> 
> Update tg_set_rt_bandwidth() to change the runtime of a group to a
> non-zero value only if its parent is inactive, thus forcing it to become
> non-live if it was previously live (it would've already been non-live if a
> sibling cgroup was live). An exception is made for groups which have the
> root cgroup as parent.
> 
> Update sched_rt_can_attach() to allow attaching only on live groups.
> 
> Update dl_init_tg() to take a task_group pointer and a cpu's id rather
> than passing directly the pointer to the cpu's deadline server. The
> task_group pointer is necessary to check and update the live bandwidth
> accounting.
> 
> Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>

This probably wants to have the cgroup folks on Cc (added now) to make
sure the semantics are in line with cgroup-v2 expectations.
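
For the cgroup folks joining here, the live/non-live rules from the quoted
commit message can be sketched roughly as below. This is only an illustrative
model, not the patch's code: the Group class, bandwidth(), is_live(),
live_bandwidth() and hierarchy_ok() are hypothetical names, and the root is
assumed to only provide the total allocatable bandwidth.

```python
from dataclasses import dataclass, field
from fractions import Fraction
from typing import List


@dataclass
class Group:
    """Hypothetical stand-in for an RT cgroup with a (runtime, period) pair."""
    runtime: int
    period: int
    children: List["Group"] = field(default_factory=list)


def bandwidth(g: Group) -> Fraction:
    return Fraction(g.runtime, g.period)


def is_live(g: Group) -> bool:
    # Live: a leaf group, or a group whose children all have zero runtime.
    # all() over an empty list is True, which covers the leaf case.
    return all(c.runtime == 0 for c in g.children)


def live_bandwidth(root: Group) -> Fraction:
    # Sum the bandwidth of every live group below the root.
    total = Fraction(0)
    stack = list(root.children)
    while stack:
        g = stack.pop()
        if is_live(g):
            total += bandwidth(g)
        stack.extend(g.children)
    return total


def hierarchy_ok(root: Group) -> bool:
    # Invariant: live bandwidth never exceeds the total allocatable bandwidth.
    return live_bandwidth(root) <= bandwidth(root)


# Made-up numbers for illustration only.
a1 = Group(30, 100)
a2 = Group(0, 100)
a = Group(50, 100, [a1, a2])
b = Group(20, 100)
root = Group(95, 100, [a, b])
```

With these made-up numbers, group a is non-live while a1 has non-zero runtime,
so only a1 and b contribute. Setting a1's runtime to zero makes a live again
and its own bandwidth is accounted back.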

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-05 15:15   ` [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups Peter Zijlstra
@ 2026-05-05 19:56     ` Tejun Heo
  2026-05-07 10:53       ` Peter Zijlstra
  2026-05-07 14:30       ` luca abeni
  0 siblings, 2 replies; 17+ messages in thread
From: Tejun Heo @ 2026-05-05 19:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
	hannes, mkoutny, cgroups

Hello,

Some high level comments:

- Please align it with existing cgroup2 interface files. See cpu.max. This
  can be e.g. cpu.rt.max with about the same semantics.

- cgroup2 enforces that internal cgroups w/ controllers enabled cannot have
  threads in them. No need to enforce that separately.

- However, the cpu controller is a threaded controller which means that it
  can have threaded sub-hierarchy where the no-internal-process rule doesn't
  apply. This was created explicitly for cpu controller. The proposed change
  blocks it effectively forcing cpu controller into regular domain
  controller behavior subject to no-internal-process rule. Note these are
  enforced at controller granularity and this means that users who use the
  threaded mode will be forced to pick between the two.

- This has the same problem with cgroup1's rt cgroup sched support where
  there is no way to have a permissive default configuration, which means
  that users who don't really care about distributing rt shares
  hierarchically would get blocked from running rt processes by default,
  which basically forces distros to disable rt cgroup sched support. This is
  not new but it'd be a shame to put in all the work and the end result is
  that most people don't even have access to the feature.

Here's my suggestion if there is desire for this to become something most
people have easy access to:

- Don't make it impossible to use in conjunction with other resource control
  mechanisms especially not CPU controller itself. Don't force people to
  choose between threaded mode and rt control. Allow them to co-exist in a
  reasonable manner.

- The same in the wider scope. Don't let it get in the way of people who
  don't care about it. Compromising on interface / failure mode is better
  than people not being able to use it in most cases.

Thanks.

-- 
tejun


* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-05 19:56     ` Tejun Heo
@ 2026-05-07 10:53       ` Peter Zijlstra
  2026-05-07 15:03         ` Juri Lelli
                           ` (3 more replies)
  2026-05-07 14:30       ` luca abeni
  1 sibling, 4 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-07 10:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
	hannes, mkoutny, cgroups

On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
> Hello,
> 
> Some high level comments:
> 
> - Please align it with existing cgroup2 interface files. See cpu.max. This
>   can be e.g. cpu.rt.max with about the same semantics.
> 
> - cgroup2 enforces that internal cgroups w/ controllers enabled cannot have
>   threads in them. No need to enforce that separately.

Looking at cpu_period_quota_parse() this thing takes two u64 values for:
{runtime, period} but allows runtime to be the string "max".

I think we'd want an optional extension to that and allow 3 values for:
{runtime, period, deadline}, where if the deadline is not given, it will
be the same as period.
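
A minimal userspace sketch of that parsing rule, just to pin down the
semantics. It is not cpu_period_quota_parse() itself: RtMax and parse_rt_max
are hypothetical names, and requiring the period to always be present is a
simplifying assumption.

```python
from typing import NamedTuple, Optional


class RtMax(NamedTuple):
    runtime: Optional[int]  # None stands for the string "max"
    period: int
    deadline: int


def parse_rt_max(text: str) -> RtMax:
    """Parse 'runtime period [deadline]'; deadline defaults to period."""
    fields = text.split()
    if len(fields) not in (2, 3):
        raise ValueError("expected 'runtime period [deadline]'")
    runtime = None if fields[0] == "max" else int(fields[0])
    period = int(fields[1])
    # Optional third value: fall back to the period when it is absent.
    deadline = int(fields[2]) if len(fields) == 3 else period
    if period <= 0 or deadline <= 0:
        raise ValueError("period and deadline must be positive")
    if runtime is not None and runtime < 0:
        raise ValueError("runtime must be non-negative or 'max'")
    return RtMax(runtime, period, deadline)
```

So "50 100" and "50 100 100" would describe the same reservation, while
"max 100" keeps the existing no-limit meaning.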

In previous versions there was also an option to specify a cpumask,
getting rid of that is one of the reasons I suggested making this thing
a cgroup-v2 thing, then we can use the cpuset controller's effective
mask.

> - However, the cpu controller is a threaded controller which means that it
>   can have threaded sub-hierarchy where the no-internal-process rule doesn't
>   apply. This was created explicitly for cpu controller. The proposed change
>   blocks it effectively forcing cpu controller into regular domain
>   controller behavior subject to no-internal-process rule. Note these are
>   enforced at controller granularity and this means that users who use the
>   threaded mode will be forced to pick between the two.

Right... this then means we need two controls, one to do hierarchical
bandwidth distribution, and one to assign bandwidth to the internal
group -- which is then subject to its own bandwidth distribution
constraint.

This might be a little confusing, but there is no way around that
AFAICT.

> - This has the same problem with cgroup1's rt cgroup sched support where
>   there is no way to have a permissive default configuration, which means
>   that users who don't really care about distributing rt shares
>   hierarchically would get blocked from running rt processes by default,
>   which basically forces distros to disable rt cgroup sched support. This is
>   not new but it'd be a shame to put in all the work and the end result is
>   that most people don't even have access to the feature.

Right, but cgroup-v2 allows enabling/disabling specific controllers for
a (sub)-hierarchy, right? So if the controller is not enabled (by
default), it will fall back to putting the tasks in whatever parent does
have it on, and by default the root group would have it on and would accept
tasks.

Additionally, I think we want a flag to allow non-priv tasks to use RT
inside the controller -- after all, these tasks would be subject to
strict bandwidth controls and cannot burn the system like unbounded/root
FIFO tasks can.

Does that all sound workable?


* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-07 10:53       ` Peter Zijlstra
@ 2026-05-07 15:03         ` Juri Lelli
  2026-05-07 15:05           ` Peter Zijlstra
  2026-05-07 16:39           ` luca abeni
  2026-05-07 16:44         ` luca abeni
                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 17+ messages in thread
From: Juri Lelli @ 2026-05-07 15:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
	hannes, mkoutny, cgroups

On 07/05/26 12:53, Peter Zijlstra wrote:
> On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:

...

> > - However, the cpu controller is a threaded controller which means that it
> >   can have threaded sub-hierarchy where the no-internal-process rule doesn't
> >   apply. This was created explicitly for cpu controller. The proposed change
> >   blocks it effectively forcing cpu controller into regular domain
> >   controller behavior subject to no-internal-process rule. Note these are
> >   enforced at controller granularity and this means that users who use the
> >   threaded mode will be forced to pick between the two.
> 
> Right... this then means we need two controls, one to do hierarchical
> bandwidth distribution, and one to assign bandwidth to the internal
> group -- which is then subject to its own bandwidth distribution
> constraint.
> 
> This might be a little confusing, but there is no way around that
> AFAICT.

Just to check if I'm following, you are thinking something like below?

groupA/
  cpu.rt.max = "50 50 100"       <- 0.5 from root
  cpu.rt.internal = "20 20 100"  <- 0.2 from groupA for threads at this
                                        level
  + threadA                               <
  + threadB                               <
  +- group1/
       cpu.rt.max = "30 30 100"  <- 0.3 from groupA
       + threadC

And we still keep it flat, so 2 dl-entities (per CPU), one handles
threads at groupA level and the other threads inside group1?
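
In terms of admission, I read the explicit variant as the check sketched
below. This is a rough model working on utilization fractions only;
fits_in_group is a hypothetical name, and how the three values in each string
map onto runtime/period/deadline is left out on purpose.

```python
from fractions import Fraction
from typing import Iterable


def fits_in_group(max_bw: Fraction, internal_bw: Fraction,
                  children_bw: Iterable[Fraction]) -> bool:
    """The internal reservation plus the children must fit in cpu.rt.max."""
    return internal_bw + sum(children_bw, Fraction(0)) <= max_bw
```

For the example above: 0.2 (internal) + 0.3 (group1) fills the 0.5 of groupA
exactly, so growing group1 would need the internal share to shrink first.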

Thanks,
Juri



* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-07 15:03         ` Juri Lelli
@ 2026-05-07 15:05           ` Peter Zijlstra
  2026-05-07 16:39           ` luca abeni
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-07 15:05 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
	hannes, mkoutny, cgroups

On Thu, May 07, 2026 at 05:03:41PM +0200, Juri Lelli wrote:
> On 07/05/26 12:53, Peter Zijlstra wrote:
> > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
> 
> ...
> 
> > > - However, the cpu controller is a threaded controller which means that it
> > >   can have threaded sub-hierarchy where the no-internal-process rule doesn't
> > >   apply. This was created explicitly for cpu controller. The proposed change
> > >   blocks it effectively forcing cpu controller into regular domain
> > >   controller behavior subject to no-internal-process rule. Note these are
> > >   enforced at controller granularity and this means that users who use the
> > >   threaded mode will be forced to pick between the two.
> > 
> > Right... this then means we need two controls, one to do hierarchical
> > bandwidth distribution, and one to assign bandwidth to the internal
> > group -- which is then subject to its own bandwidth distribution
> > constraint.
> > 
> > This might be a little confusing, but there is no way around that
> > AFAICT.
> 
> Just to check if I'm following, you are thinking something like below?
> 
> groupA/
>   cpu.rt.max = "50 50 100"       <- 0.5 from root
>   cpu.rt.internal = "20 20 100"  <- 0.2 from groupA for threads at this
>                                         level
>   + threadA                               <
>   + threadB                               <
>   +- group1/
>        cpu.rt.max = "30 30 100"  <- 0.3 from groupA
>        + threadC
> 
> And we still keep it flat, so 2 dl-entities (per CPU), one handles
> threads at groupA level and the other threads inside group1?

Exactly!


* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-07 15:03         ` Juri Lelli
  2026-05-07 15:05           ` Peter Zijlstra
@ 2026-05-07 16:39           ` luca abeni
  2026-05-11  9:29             ` Juri Lelli
  1 sibling, 1 reply; 17+ messages in thread
From: luca abeni @ 2026-05-07 16:39 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, Tejun Heo, Yuri Andriaccio, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
	hannes, mkoutny, cgroups

Hi,

On Thu, 7 May 2026 17:03:41 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> On 07/05/26 12:53, Peter Zijlstra wrote:
> > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:  
> 
> ...
> 
> > > - However, the cpu controller is a threaded controller which
> > > means that it can have threaded sub-hierarchy where the
> > > no-internal-process rule doesn't apply. This was created
> > > explicitly for cpu controller. The proposed change blocks it
> > > effectively forcing cpu controller into regular domain controller
> > > behavior subject to no-internal-process rule. Note these are
> > > enforced at controller granularity and this means that users who
> > > use the threaded mode will be forced to pick between the two.  
> > 
> > Right... this then means we need two controls, one to do
> > hierarchical bandwidth distribution, and one to assign bandwidth to
> > the internal group -- which is then subject to its own bandwidth
> > distribution constraint.
> > 
> > This might be a little confusing, but there is no way around that
> > AFAICT.  
> 
> Just to check if I'm following, you are thinking something like below?
> 
> groupA/
>   cpu.rt.max = "50 50 100"       <- 0.5 from root
>   cpu.rt.internal = "20 20 100"  <- 0.2 from groupA for threads at
>                                         this level
>   + threadA                               <
>   + threadB                               <
>   +- group1/
>        cpu.rt.max = "30 30 100"  <- 0.3 from groupA
>        + threadC
> 
> And we still keep it flat, so 2 dl-entities (per CPU), one handles
> threads at groupA level and the other threads inside group1?

An alternative idea I was thinking about: we create 2 dl entities (one
for "groupA" and one for "group1"); we set cpu.rt.max for groupA, and
we subtract group1's utilization from it (so, if groupA's cpu.rt.max is
"50 100" and group1's cpu.rt.max is "30 100", groupA is served by a dl
entity (50-30,100)=(20,100) while group1 is served by a dl entity
(30,100)).

Basically, with this idea the "internal" reservation is automatically
computed based on rt.max and on the children cgroups. A possible issue
is that if the children consume all the groupA's utilization the groupA
RT tasks remain with 0 runtime (and never execute).
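
A rough sketch of the computation I have in mind, assuming (runtime, period)
pairs and no separate deadline. implicit_internal is a hypothetical name, and
expressing the leftover over the parent's period is my assumption for the case
where the periods differ.

```python
from fractions import Fraction
from typing import Iterable, Tuple

Reservation = Tuple[int, int]  # (runtime, period)


def implicit_internal(parent: Reservation, children: Iterable[Reservation]):
    """Return the (runtime, period) left to the parent's own RT tasks."""
    p_runtime, p_period = parent
    leftover = Fraction(p_runtime, p_period) - sum(
        (Fraction(r, p) for r, p in children), Fraction(0))
    if leftover < 0:
        raise ValueError("children exceed the parent's cpu.rt.max")
    # Express the leftover utilization as a runtime over the parent's period.
    return (leftover * p_period, p_period)
```

With groupA at (50,100) and group1 at (30,100) this gives (20,100). With a
child at (50,100) it gives (0,100), which is the starvation case mentioned
above.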


				Luca


* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-07 16:39           ` luca abeni
@ 2026-05-11  9:29             ` Juri Lelli
  2026-05-11 17:52               ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Juri Lelli @ 2026-05-11  9:29 UTC (permalink / raw)
  To: luca abeni
  Cc: Peter Zijlstra, Tejun Heo, Yuri Andriaccio, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
	hannes, mkoutny, cgroups

On 07/05/26 18:39, luca abeni wrote:
> Hi,
> 
> On Thu, 7 May 2026 17:03:41 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > On 07/05/26 12:53, Peter Zijlstra wrote:
> > > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:  
> > 
> > ...
> > 
> > > > - However, the cpu controller is a threaded controller which
> > > > means that it can have threaded sub-hierarchy where the
> > > > no-internal-process rule doesn't apply. This was created
> > > > explicitly for cpu controller. The proposed change blocks it
> > > > effectively forcing cpu controller into regular domain controller
> > > > behavior subject to no-internal-process rule. Note these are
> > > > enforced at controller granularity and this means that users who
> > > > use the threaded mode will be forced to pick between the two.  
> > > 
> > > Right... this then means we need two controls, one to do
> > > hierarchical bandwidth distribution, and one to assign bandwidth to
> > > the internal group -- which is then subject to its own bandwidth
> > > distribution constraint.
> > > 
> > > This might be a little confusing, but there is no way around that
> > > AFAICT.  
> > 
> > Just to check if I'm following, you are thinking something like below?
> > 
> > groupA/
> >   cpu.rt.max = "50 50 100"       <- 0.5 from root
> >   cpu.rt.internal = "20 20 100"  <- 0.2 from groupA for threads at
> >                                         this level
> >   + threadA                               <
> >   + threadB                               <
> >   +- group1/
> >        cpu.rt.max = "30 30 100"  <- 0.3 from groupA
> >        + threadC
> > 
> > And we still keep it flat, so 2 dl-entities (per CPU), one handles
> > threads at groupA level and the other threads inside group1?
> 
> An alternative idea I was thinking about: we create 2 dl entities (one
> for "groupA" and one for "group1"); we set cpu.rt.max for groupA, and
> we subtract group1's utilization from it (so, if groupA's cpu.rt.max is
> "50 100" and group1's cpu.rt.max is "30 100", groupA is served by a dl
> entity (50-30,100)=(20,100) while group1 is served by a dl entity
> (30,100)).
> 
> Basically, with this idea the "internal" reservation is automatically
> computed based on rt.max and on the children cgroups. A possible issue
> is that if the children consume all the groupA's utilization the groupA
> RT tasks remain with 0 runtime (and never execute).

While I like the automatic approach, I also fear that it might be more
difficult to maintain/use from a systemd admin perspective, e.g. I
cannot make a subgroup reservation bigger because there are threads
running in the parent group which consume all the remaining (internal)
bandwidth. If we make it explicit it seems easier to see where bandwidth
is allocated at all levels.

Peter? Tejun? What do we want to do with this interface?



* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-11  9:29             ` Juri Lelli
@ 2026-05-11 17:52               ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2026-05-11 17:52 UTC (permalink / raw)
  To: Juri Lelli
  Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
	hannes, mkoutny, cgroups

Hello,

On Mon, May 11, 2026 at 11:29:47AM +0200, Juri Lelli wrote:
...
> While I like the automatic approach, I also fear that it might be more
> difficult to maintain/use from a systemd admin perspective, e.g. I
> cannot make a subgroup reservation bigger because there are threads
> running in the parent group which consume all the remaining (internal)
> bandwidth. If we make it explicit it seems easier to see where bandwidth
> is allocated at all levels.
> 
> Peter? Tejun? What do we want to do with this interface?

blkcg on cgroup1 did something similar for a while. It had a separate subdir
for knobs that apply to "internal threads". Effectively, this becomes
creating a separate controller group for every cgroup as a sibling to its
children. It does work obviously but it is pretty ugly and unintuitive, both
in interface and implementation, and I'm skeptical this was actually useful
in any meaningful way. Nobody complained when we ripped it out.

If rt were to become its own cgroup controller, maybe one can just side-step
this by not supporting threaded mode at least at the beginning. If people
ask for it, hopefully we'll be able to develop better understanding of their
usecases and drive design that way. In practice, I don't think threaded mode
gets used all that much because usually only application processes
themselves know about their own threads, are not in the business of creating
their own cgroups (delegation to each application isn't common), and have
other ways of controlling their own threads. So, there's some chance that
this may not actually come up.

If rt stays as a part of cpu controller, my preference would be keeping the
config implicit for threaded mode at least at the beginning, i.e. don't get
in the way of people using threaded mode by blocking it. Having some
reasonable and clear default (e.g. internal tasks have priority as suggested
or internal tasks get whatever is left over which may make more sense in the
allocation model) may be sufficient. If not, like in the other case, we can
make specific design decisions based on concrete use cases later.

Thanks.

-- 
tejun


* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-07 10:53       ` Peter Zijlstra
  2026-05-07 15:03         ` Juri Lelli
@ 2026-05-07 16:44         ` luca abeni
  2026-05-11  9:40         ` luca abeni
  2026-05-11 17:37         ` Tejun Heo
  3 siblings, 0 replies; 17+ messages in thread
From: luca abeni @ 2026-05-07 16:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
	hannes, mkoutny, cgroups

Hi,

On Thu, 7 May 2026 12:53:31 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > - This has the same problem with cgroup1's rt cgroup sched support
> > where there is no way to have a permissive default configuration,
> > which means that users who don't really care about distributing rt
> > shares hierarchically would get blocked from running rt processes
> > by default, which basically forces distros to disable rt cgroup
> > sched support. This is not new but it'd be a shame to put in all
> > the work and the end result is that most people don't even have
> > access to the feature.  
> 
> Right, but cgroup-v2 allows enabling/disabling specific controllers
> for a (sub)-hierarchy, right? So if the controller is not enabled (by
> default), it will fall back to putting the tasks in whatever parent
> does have it on, and by default the root group would have and would
> accept tasks.

If I understand correctly, this is similar to what I was thinking about:
having a default that allows creating FIFO/RR tasks (and executing them
without runtime control, i.e. without being served by a dl server).


> Additionally, I think we want a flag to allow non-priv tasks to use RT
> inside the controller -- after all, these tasks would be subject to
> strict bandwidth controls and cannot burn the system like
> unbounded/root FIFO tasks can.

This is something Yuri and I wanted to propose as a follow-up patch,
once there is an agreement on the patchset (should be a pretty simple
change :)



				Luca


* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-07 10:53       ` Peter Zijlstra
  2026-05-07 15:03         ` Juri Lelli
  2026-05-07 16:44         ` luca abeni
@ 2026-05-11  9:40         ` luca abeni
  2026-05-11 18:15           ` Tejun Heo
  2026-05-11 17:37         ` Tejun Heo
  3 siblings, 1 reply; 17+ messages in thread
From: luca abeni @ 2026-05-11  9:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
	hannes, mkoutny, cgroups

Hi all,

On Thu, 7 May 2026 12:53:31 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > - This has the same problem with cgroup1's rt cgroup sched support
> > where there is no way to have a permissive default configuration,
> > which means that users who don't really care about distributing rt
> > shares hierarchically would get blocked from running rt processes
> > by default, which basically forces distros to disable rt cgroup
> > sched support. This is not new but it'd be a shame to put in all
> > the work and the end result is that most people don't even have
> > access to the feature.  
> 
> Right, but cgroup-v2 allows enabling/disabling specific controllers
> for a (sub)-hierarchy, right? So if the controller is not enabled (by
> default), it will fall back to putting the tasks in whatever parent
> does have it on, and by default the root group would have and would
> accept tasks.

We are discussing this issue with Yuri, and we have a doubt: if we
disable the RT-CPU controller for a cgroup, would it be possible to
enable it for its children?
(In other words: if we want the RT-CPU controller to be enabled for
some "leaf" cgroups, we need to enable it for their parents, right?)



			Thanks,
				Luca


* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-11  9:40         ` luca abeni
@ 2026-05-11 18:15           ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2026-05-11 18:15 UTC (permalink / raw)
  To: luca abeni
  Cc: Peter Zijlstra, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
	hannes, mkoutny, cgroups

On Mon, May 11, 2026 at 11:40:04AM +0200, luca abeni wrote:
> We are discussing this issue with Yuri, and we have a doubt: if we
> disable the RT-CPU controller for a cgroup, would it be possible to
> enable it for its children?
> (In other words: if we want the RT-CPU controller to be enabled for
> some "leaf" cgroups, we need to enable it for their parents, right?)

Yeah, a cgroup has a controller available to it iff its parent enables that
controller, so all ancestors would have to enable it.
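
As a toy model of that rule (not kernel code): the dict below stands in for
each cgroup's cgroup.subtree_control, ROOT_CONTROLLERS is a made-up set of
controllers available at the root, and controller_available is a hypothetical
helper.

```python
import posixpath
from typing import Dict, Set

# Assumption for illustration: what the root cgroup has available.
ROOT_CONTROLLERS = {"cpu", "memory", "io"}


def controller_available(subtree_control: Dict[str, Set[str]],
                         path: str, controller: str) -> bool:
    """True if `controller` would show up in cgroup.controllers of `path`."""
    if path == "/":
        return controller in ROOT_CONTROLLERS
    parent = posixpath.dirname(path)
    # The parent must have the controller available itself and must have
    # enabled it for its children, so the check walks up to the root.
    return (controller_available(subtree_control, parent, controller)
            and controller in subtree_control.get(parent, set()))
```

So with the root enabling "cpu" but /a not enabling it, /a has it available
while /a/b does not, until /a enables it in turn.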

Thanks.

-- 
tejun


* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-07 10:53       ` Peter Zijlstra
                           ` (2 preceding siblings ...)
  2026-05-11  9:40         ` luca abeni
@ 2026-05-11 17:37         ` Tejun Heo
  3 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2026-05-11 17:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
	hannes, mkoutny, cgroups

Hello, Peter.

On Thu, May 07, 2026 at 12:53:31PM +0200, Peter Zijlstra wrote:
...
> Looking at cpu_period_quota_parse() this thing takes two u64 values for:
> {runtime, period} but allows runtime to be the string "max".
> 
> I think we'd want an optional extension to that and allow 3 values for:
> {runtime, period, deadline}, where if the deadline is not given, it will
> be the same as period.

Yeah, I don't know what's needed here but extending the interface as
necessary is completely fine.

> Right... this then means we need two controls, one to do hierarchical
> bandwidth distribution, and one to assign bandwidth to the internal
> group -- which is then subject to its own bandwidth distribution
> constraint.
> 
> This might be a little confusing, but there is no way around that
> AFAICT.

Separating out the rt as a separate controller is one way and if the
configuration wants to stick to strict allocation model where nothing is
available by default unless explicitly allocated, this would be the only
way. Interface-wise, I think this is going to be fine but I suspect this
would likely complicate the internal implementation quite a bit as now rt can't
piggyback on existing sched core cgroup infra - no task_group or
synchronization built around them - and has to build everything on its own.
It's not the end of the world but not ideal either.

> > - This has the same problem with cgroup1's rt cgroup sched support where
> >   there is no way to have a permissive default configuration, which means
> >   that users who don't really care about distributing rt shares
> >   hierarchically would get blocked from running rt processes by default,
> >   which basically forces distros to disable rt cgroup sched support. This is
> >   not new but it'd be a shame to put in all the work and the end result is
> >   that most people don't even have access to the feature.
> 
> Right, but cgroup-v2 allows enabling/disabling specific controllers for
> a (sub)-hierarchy, right? So if the controller is not enabled (by
> default), it will fall back to putting the tasks in whatever parent does
> have it on, and by default the root group would have and would accept
> tasks.
> 
> Additionally, I think we want a flag to allow non-priv tasks to use RT
> inside the controller -- after all, these tasks would be subject to
> strict bandwidth controls and cannot burn the system like unbounded/root
> FIFO tasks can.
> 
> Does that all sound workable?

Yeah, if rt becomes its own controller, I don't see any fundamental
roadblocks. It'd involve a bunch of churn which may add to maintenance
overhead but it should work. An alternative would be coming up with some way
to express the default no-enforcement state through the config knobs. I'm
sure this would be doable too and if folks can figure out a reasonable
interface, it should be able to obtain basically the same functionality with
a lot less code.

Thanks.

-- 
tejun


* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-05 19:56     ` Tejun Heo
  2026-05-07 10:53       ` Peter Zijlstra
@ 2026-05-07 14:30       ` luca abeni
  2026-05-11 18:28         ` Tejun Heo
  1 sibling, 1 reply; 17+ messages in thread
From: luca abeni @ 2026-05-07 14:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
	hannes, mkoutny, cgroups

Hi Tejun,

first of all, thanks for your comments! I think this is the kind of
discussion that we need to have...
Right now we have something that works "well enough" for real-time, but
we want to make it useful in general, so that distributions will not
disable it by default.

I need to better study your suggestions (I do not know cgroup v2
much...), but I have some questions to better understand possible
solutions:

On Tue, 5 May 2026 09:56:58 -1000
Tejun Heo <tj@kernel.org> wrote:
[...]
> - cgroup2 enforces that internal cgroups w/ controllers enabled
> cannot have threads in them. No need to enforce that separately.
> 
> - However, the cpu controller is a threaded controller which means
> that it can have threaded sub-hierarchy where the no-internal-process
> rule doesn't apply. This was created explicitly for cpu controller.
> The proposed change blocks it effectively forcing cpu controller into
> regular domain controller behavior subject to no-internal-process
> rule. Note these are enforced at controller granularity and this
> means that users who use the threaded mode will be forced to pick
> between the two.

Just to better understand: would it make sense to allow non-{FIFO,RR}
tasks to be in non-leaf cgroups (as allowed by the threaded CPU
controller), while enforcing that FIFO/RR tasks can only be in leaf
cgroups? Or would this be a hack that compromises the rt-CPU controller
usefulness?


> - This has the same problem with cgroup1's rt cgroup sched support
> where there is no way to have a permissive default configuration,
> which means that users who don't really care about distributing rt
> shares hierarchically would get blocked from running rt processes by
> default, which basically forces distros to disable rt cgroup sched
> support. This is not new but it'd be a shame to put in all the work
> and the end result is that most people don't even have access to the
> feature.

Yes, we have a bad default here.
Would a default like "allow running FIFO/RR tasks without runtime
enforcement" (this is what happens to FIFO/RR tasks running in the root
control group) be acceptable?


			Thanks,
				Luca

> 
> Here's my suggestion if there is desire for this to become something
> most people have easy access to:
> 
> - Don't make it impossible to use in conjunction with other resource
> control mechanisms especially not CPU controller itself. Don't force
> people to choose between threaded mode and rt control. Allow them to
> co-exist in a reasonable manner.
> 
> - The same in the wider scope. Don't let it get in the way of people
> who don't care about it. Compromising on interface / failure mode is
> better than people not being able to use it in most cases.
> 
> Thanks.
> 


* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-07 14:30       ` luca abeni
@ 2026-05-11 18:28         ` Tejun Heo
  2026-05-12 17:38           ` Yuri Andriaccio
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2026-05-11 18:28 UTC (permalink / raw)
  To: luca abeni
  Cc: Peter Zijlstra, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
	hannes, mkoutny, cgroups

Hello,

On Thu, May 07, 2026 at 04:30:58PM +0200, luca abeni wrote:
...
> Just to better understand: would it make sense to allow non-{FIFO,RR}
> tasks to be in non-leaf cgroups (as allowed by the threaded CPU
> controller), while enforcing that FIFO/RR tasks can only be in leaf
> cgroups? Or would this be a hack that compromises the rt-CPU controller
> usefulness?

Code-wise, sure, but I don't think an interface like that would be a good
one. From a user's pov, this amounts to adding restrictions on both whether a
controller can be enabled and whether tasks can be moved into some cgroups.
UNIX error reporting being what it is, this would come down to getting
-EINVAL or -EBUSY or whatever out of those operations. I don't think it's a
good idea to add subtle failure modes to these already pretty complex (but
currently w/ clearly-defined shared rules) operations. To users, this would
look like random arbitrary failures that are nearly impossible to decode
without tracing code.

If you want to enforce no-internal-threads, separating it out to its own
controller that doesn't support threaded mode would be the right direction.

Note that the only hard requirement here is that you don't want to get in
the way of people who are NOT interested in threaded rt control. If you
block enabling CPU control for e.g. cpu.max or block thread migration into a
cgroup, you'd be in the way; however, if all you say is "I don't support
sub-allocation in threaded mode" and e.g. just fail writes to the knobs in
threaded cgroups, that does not get in the way. So, it's not like you *have*
to support full threaded mode. You just need to avoid hindering non-rt
operations.
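
The distinction above can be modelled in a few lines. This is only an
illustrative Python sketch, not kernel code: the Cgroup class, the
write_rt_alloc() method, and the choice of EOPNOTSUPP as the error are all
assumptions made for the example. Only the rt knob refuses to work in a
threaded cgroup; cpu.max writes and thread migration are left untouched.

```python
import errno


class Cgroup:
    """Hypothetical stand-in for a cgroup; not a kernel data structure."""

    def __init__(self, name, threaded=False):
        self.name = name
        self.threaded = threaded
        self.cpu_max = "max 100000"   # cgroup v2 default for cpu.max
        self.rt_alloc = None          # hypothetical rt allocation knob
        self.threads = set()

    def write_cpu_max(self, value):
        # Non-rt CPU control keeps working in every cgroup type.
        self.cpu_max = value

    def attach_thread(self, tid):
        # Thread migration is never blocked by the rt knobs.
        self.threads.add(tid)

    def write_rt_alloc(self, value):
        # Only rt sub-allocation is refused in threaded cgroups.
        # EOPNOTSUPP is an assumed error code for illustration.
        if self.threaded:
            raise OSError(errno.EOPNOTSUPP,
                          "rt sub-allocation not supported in threaded cgroups")
        self.rt_alloc = value
```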

> Yes, we have a bad default here.
> Would a default like "allow running FIFO/RR tasks without runtime
> enforcement" (this is what happens to FIFO/RR tasks running in the root
> control group) be acceptable?

Yes, if you can express that in a reasonable way in the config knobs, that'd
likely be an easier way. I don't know how to transition from
allowed-by-default to explicitly-allocated in such an interface tho. Making
that reasonable and smooth would be the key factor in whether such an approach
can be taken.

Thanks.

-- 
tejun

* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-11 18:28         ` Tejun Heo
@ 2026-05-12 17:38           ` Yuri Andriaccio
  2026-05-12 18:19             ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Yuri Andriaccio @ 2026-05-12 17:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel, hannes,
	mkoutny, cgroups

Hello,

I've been thinking about and experimenting with some of the ideas for the rt
controller, and I've come up with the following interface, keeping
everything in the standard cpu controller:

- cpu.rt.max <runtime_us> <period_us>
   Sets the bandwidth reserved for the hierarchy rooted at that cgroup,
   but does not set up any deadline servers.
   The default value for this file is '0 0'.
- cpu.rt.min <runtime_us | 'root'> <period_us>
   If the runtime part is equal to 'root', the tasks are scheduled on the
   root runqueue.
   If the runtime is equal to zero, no FIFO/RR tasks can be scheduled.
   If the runtime is > zero, FIFO/RR tasks are scheduled under
   reservation/HCBS.
   This file is not available in the root cgroup, as the root cgroup does
   not make use of dl-servers and only reserves the total bandwidth for
   the hierarchy.
   The default value for this file is 'root 0', meaning that tasks in
   this cgroup are by default scheduled on the root runqueue.
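
The three cases for cpu.rt.min can be sketched as a small parser. This is
only an illustrative model in Python, not kernel code: the function name
parse_rt_min and the mode labels 'root', 'disabled', and 'reserved' are
hypothetical names chosen for this example, and rejecting a non-zero runtime
with a zero period is an assumption rather than part of the proposal.

```python
def parse_rt_min(text):
    """Classify a cpu.rt.min value such as 'root 0' or '10000 100000'."""
    fields = text.split()
    if len(fields) != 2:
        raise ValueError("expected '<runtime_us | root> <period_us>'")
    runtime, period = fields
    period_us = int(period)
    if period_us < 0:
        raise ValueError("period must not be negative")
    if runtime == "root":
        # Tasks are scheduled on the root runqueue (the proposed default).
        return ("root", None, period_us)
    runtime_us = int(runtime)
    if runtime_us < 0:
        raise ValueError("runtime must not be negative")
    if runtime_us == 0:
        # No FIFO/RR tasks can be scheduled in this cgroup.
        return ("disabled", 0, period_us)
    if period_us == 0:
        # Assumption: a reservation needs a non-zero period.
        raise ValueError("a non-zero runtime needs a non-zero period")
    # FIFO/RR tasks are scheduled under reservation/HCBS.
    return ("reserved", runtime_us, period_us)
```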

Of course, all the admission tests have been updated accordingly: as an
example, a cgroup's rt.max bw must be >= the sum of the rt.max bws of its
children + its rt.min bw. I'm also skipping some details which are only
meaningful if we decide to adopt this solution.
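
As a rough illustration of that example admission test, here is a minimal
Python model. It is not the patch's code: the Group class and the helper
names bw(), admissible(), and tree_admissible() are hypothetical, and
treating a 'root' rt.min (modelled as None) as consuming no reserved
bandwidth is an assumption.

```python
from dataclasses import dataclass, field
from fractions import Fraction
from typing import List, Optional


@dataclass
class Group:
    """Hypothetical model of one cgroup's cpu.rt.max / cpu.rt.min values."""
    name: str
    max_runtime_us: int = 0                 # cpu.rt.max, default '0 0'
    max_period_us: int = 0
    min_runtime_us: Optional[int] = None    # None models the 'root' keyword
    min_period_us: int = 0
    children: List["Group"] = field(default_factory=list)


def bw(runtime_us, period_us):
    """Bandwidth as an exact fraction; zero or missing values mean no bandwidth."""
    if not runtime_us or not period_us:
        return Fraction(0)
    return Fraction(runtime_us, period_us)


def admissible(g):
    """rt.max bw >= sum of the children's rt.max bws + the group's own rt.min bw."""
    needed = bw(g.min_runtime_us, g.min_period_us)
    for child in g.children:
        needed += bw(child.max_runtime_us, child.max_period_us)
    return bw(g.max_runtime_us, g.max_period_us) >= needed


def tree_admissible(g):
    """Apply the same check to every group in the subtree rooted at g."""
    return admissible(g) and all(tree_admissible(c) for c in g.children)
```

For instance, a parent with rt.max '500000 1000000' can hold two children
with rt.max bandwidths of 0.3 and 0.2, but only as long as its own rt.min
reserves nothing on top of that.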

What do you think of this interface?

Thanks,
Yuri

* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-12 17:38           ` Yuri Andriaccio
@ 2026-05-12 18:19             ` Tejun Heo
  2026-05-12 18:20               ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2026-05-12 18:19 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel, hannes,
	mkoutny, cgroups

Hello,

How is a delegated subtree prevented from setting cpu.rt.min = 'root' and
escaping its ancestors' cpu.rt.max budget?

Thanks.

--
tejun

* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
  2026-05-12 18:19             ` Tejun Heo
@ 2026-05-12 18:20               ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2026-05-12 18:20 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel, hannes,
	mkoutny, cgroups

On Tue, May 12, 2026 at 08:19:02AM -1000, Tejun Heo wrote:
> How is a delegated subtree prevented from setting cpu.rt.min = 'root' and
> escaping its ancestors' cpu.rt.max budget?

Hmm.. I guess the same problem exists w/ separate rt controller too. If the
users on the system already started using rt, how do you enable the
controller from the top down with budgets already being used down in the
hierarchy?

Thanks.

-- 
tejun

Thread overview: 17+ messages
     [not found] <20260430213835.62217-1-yurand2000@gmail.com>
     [not found] ` <20260430213835.62217-21-yurand2000@gmail.com>
2026-05-05 15:15   ` [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups Peter Zijlstra
2026-05-05 19:56     ` Tejun Heo
2026-05-07 10:53       ` Peter Zijlstra
2026-05-07 15:03         ` Juri Lelli
2026-05-07 15:05           ` Peter Zijlstra
2026-05-07 16:39           ` luca abeni
2026-05-11  9:29             ` Juri Lelli
2026-05-11 17:52               ` Tejun Heo
2026-05-07 16:44         ` luca abeni
2026-05-11  9:40         ` luca abeni
2026-05-11 18:15           ` Tejun Heo
2026-05-11 17:37         ` Tejun Heo
2026-05-07 14:30       ` luca abeni
2026-05-11 18:28         ` Tejun Heo
2026-05-12 17:38           ` Yuri Andriaccio
2026-05-12 18:19             ` Tejun Heo
2026-05-12 18:20               ` Tejun Heo
