From: Tejun Heo <tj@kernel.org>
To: Yuri Andriaccio <yuri.andriaccio@santannapisa.it>
Cc: luca abeni <luca.abeni@santannapisa.it>,
Peter Zijlstra <peterz@infradead.org>,
Yuri Andriaccio <yurand2000@gmail.com>,
Ingo Molnar <mingo@redhat.com>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>,
linux-kernel@vger.kernel.org, hannes@cmpxchg.org,
mkoutny@suse.com, cgroups@vger.kernel.org
Subject: Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
Date: Wed, 13 May 2026 09:10:39 -1000
Message-ID: <agTMrz_nMV880pe0@slm.duckdns.org>
In-Reply-To: <0d3336a7-ae42-4359-bfe7-48a7d6796d06@santannapisa.it>
Hello,
On Wed, May 13, 2026 at 02:08:52PM +0200, Yuri Andriaccio wrote:
> > How is a delegated subtree prevented from setting cpu.rt.min = 'root' and
> > escaping its ancestors' cpu.rt.max budget?
>
> Is it strictly required that a child cgroup must have 'less runtime' than
> its parent? To be more precise I mean scheduling tasks on the root runqueue
> instead of using dl-servers. Small note: given that HCBS cgroups use
> dl-servers, and thus run at higher priority than FIFO/RR scheduled on the
> root runqueue, if a cgroup rt.min is 'root' would yes escape its ancestor
> budget but it may also possibly get starved because of the priority levels.
The high-level invariant that we must maintain is that any given cgroup has
control over resource usage in its subtree. If that doesn't hold, the whole
thing is not very useful.

For example, there are multiple containers in the system and each wants to
manage its own internal resource distribution, which is a relatively common
scenario in server deployments. This is implemented by putting each
container in a cgroup and delegating the subtree to the nested container
manager. At the host level, you don't know or have control over what's going
on in each container but you can control how much each container consumes in
total so that each gets what it's allotted and doesn't get in the way of
others.

While the delegation scenario is a clear example, even in regular usage, it
gets really confusing if hierarchical resource distribution isn't actually
hierarchical. If you let a child escape to root at its own discretion, you
might as well not have all the complexity of hierarchical resource control.
> If we require that child cgroups cannot escape their parent's bandwidth,
> even when using 'root', then the cpu.rt.max file must be disallowed in the
> root cgroup (removing the possibility to reserve bandwidth for HCBS, and so
> doing the admission test similarly to when SCHED_DEADLINE tasks are
> executed), and cpu.rt.max would use either 'root' if the whole subtree must
> be scheduled onto the root runqueue or a <runtime> <period> combination to
> reserve bandwidth for the whole subtree. The cpu.rt.min would then only be
> used to reserve internal bandwidth for the cgroup itself. This also means
> that a whole subtree either uses HCBS everywhere or the root runqueue
> everywhere.
>
> > If the users on the system already started using rt, how do you enable the
> > controller from the top down with budgets already being used down in the
> > hierarchy?
>
> In my original idea rt tasks would only interfere with their own cgroup
> configuration, but not with the subtree or their parents. When cpu.rt.min =
> 'root', you are free to change cpu.rt.max values to whatever you like in any
> place of the hierarchy, and tasks inside the rt.min = 'root' cgroup would
> not be affected as they are run in the root runqueue.
>
> If you want to switch a cgroup from/to 'root' and HCBS, you'd have to either
> move all the RT tasks out of the cgroup, set rt.min, and then move them back
> in, or change temporarily their scheduling policy to non-rt (SCHED_OTHER,
> SCHED_DEADLINE, whatever) and then back.
>
> Hopefully I've answered your questions. Which solution do you think makes
> the most sense?
I'm not sure either makes sense. There's not much point in having a
hierarchical controller in the first one (just require direct system-level
distribution) and I don't think the second one is very usable. I mean, try
to imagine being a user. You have to hunt down all rt tasks and twiddle
every one of them one way or another to change some config and then have to
worry about racing forks and class changes. At that point, you might as well
just control it centrally without the hierarchical stuff. You'd have to be
really dedicated or desperate, which means that not many are going to use
it, which then brings up the question: why are we doing this at all?

I wonder whether this can just be a regular max interface, i.e. limit the
maximum reservation in the subtree rather than exact reservation allocation.
Then the cgroup can report total reservations in the subtree and admission
control can just reject anything going over.
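
A rough user-space sketch of that idea, purely illustrative and not kernel
code. Everything in it is an assumption made for the example: the struct
rt_group type, its fields, the rt_group_admit() and rt_group_release()
helpers, and expressing bandwidth as parts per million of a CPU. Each group
carries a cap and a running total for its subtree, and a reservation is
admitted only if it fits under the cap of the group and of every ancestor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a cgroup node; these names do not exist in the kernel. */
struct rt_group {
	struct rt_group *parent;	/* NULL for the root group */
	uint64_t max_bw;		/* cap on reservations in this subtree (assumed ppm) */
	uint64_t reserved_bw;		/* total currently reserved in this subtree */
};

/*
 * Try to reserve @bw for something running in @g. The request is rejected
 * if it would push @g or any of its ancestors over its cap; otherwise the
 * reservation is charged to @g and all of its ancestors.
 */
static bool rt_group_admit(struct rt_group *g, uint64_t bw)
{
	struct rt_group *pos;

	/* First pass: check every level without modifying anything. */
	for (pos = g; pos; pos = pos->parent)
		if (bw > pos->max_bw || pos->reserved_bw > pos->max_bw - bw)
			return false;

	/* Second pass: charge the reservation up the hierarchy. */
	for (pos = g; pos; pos = pos->parent)
		pos->reserved_bw += bw;

	return true;
}

/* Undo a reservation previously admitted through rt_group_admit(). */
static void rt_group_release(struct rt_group *g, uint64_t bw)
{
	struct rt_group *pos;

	for (pos = g; pos; pos = pos->parent) {
		assert(pos->reserved_bw >= bw);
		pos->reserved_bw -= bw;
	}
}
```

In this model a child can set whatever cap it likes for itself, but it can
never reserve more than the tightest cap among its ancestors, so each level
keeps control over the total consumed in its subtree.
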
Thanks.
--
tejun