Date: Wed, 13 May 2026 09:10:39 -1000
From: Tejun Heo
To: Yuri Andriaccio
Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
    Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Valentin Schneider, linux-kernel@vger.kernel.org,
    hannes@cmpxchg.org, mkoutny@suse.com, cgroups@vger.kernel.org
Subject: Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
References: <20260430213835.62217-1-yurand2000@gmail.com>
    <20260430213835.62217-21-yurand2000@gmail.com>
    <20260505151523.GF3102624@noisy.programming.kicks-ass.net>
    <20260507163058.2c435922@nowhere>
    <0d3336a7-ae42-4359-bfe7-48a7d6796d06@santannapisa.it>
In-Reply-To: <0d3336a7-ae42-4359-bfe7-48a7d6796d06@santannapisa.it>
X-Mailing-List: linux-kernel@vger.kernel.org

Hello,

On Wed, May 13, 2026 at 02:08:52PM +0200, Yuri Andriaccio wrote:
> > How is a delegated subtree prevented from setting cpu.rt.min = 'root' and
> > escaping its ancestors' cpu.rt.max budget?
>
> Is it strictly required that a child cgroup must have 'less runtime' than
> its parent? To be more precise I mean scheduling tasks on the root runqueue
> instead of using dl-servers. Small note: given that HCBS cgroups use
> dl-servers, and thus run at higher priority than FIFO/RR scheduled on the
> root runqueue, if a cgroup rt.min is 'root' would yes escape its ancestor
> budget but it may also possibly get starved because of the priority levels.

The high-level invariant that we must maintain is that any given cgroup has
control over resource usages in its subtree. If that doesn't work, the whole
thing is not very useful. e.g.
There are multiple containers in the system and each wants to manage its own
internal resource distribution, which is a relatively common scenario in
server deployments. This is implemented by putting each container in a cgroup
and delegating the subtree to the nested container manager. At the host
level, you don't know or have control over what's going on in each container
but you can control how much each container consumes in total so that each
gets what it's allotted and doesn't get in the way of others.

While the delegation scenario is a clear example, even in regular usage, it
gets really confusing if hierarchical resource distribution isn't actually
hierarchical. If you let a child escape to root at its own discretion, you
might as well not have all the complexity of hierarchical resource control.

> If we require that child cgroups cannot escape their parent's bandwidth,
> even when using 'root', then the cpu.rt.max file must be disallowed in the
> root cgroup (removing the possibility to reserve bandwidth for HCBS, and so
> doing the admission test similarly to when SCHED_DEADLINE tasks are
> executed), and cpu.rt.max would use either 'root' if the whole subtree must
> be scheduled onto the root runqueue or a combination to
> reserve bandwidth for the whole subtree. The cpu.rt.min would then only be
> used to reserve internal bandwidth for the cgroup itself. This also means
> that a whole subtree either uses HCBS everywhere or the root runqueue
> everywhere.
>
> > If the users on the system already started using rt, how do you enable the
> > controller from the top down with budgets already being used down in the
> > hierarchy?
>
> In my original idea rt tasks would only interfere with their own cgroup
> configuration, but not with the subtree or their parents.
> When cpu.rt.min = 'root', you are free to change cpu.rt.max values to
> whatever you like in any place of the hierarchy, and tasks inside the
> rt.min = 'root' cgroup would not be affected as they are run in the root
> runqueue.
>
> If you want to switch a cgroup from/to 'root' and HCBS, you'd have to either
> move all the RT tasks out of the cgroup, set rt.min, and then move them back
> in, or change temporarily their scheduling policy to non-rt (SCHED_OTHER,
> SCHED_DEADLINE, whatever) and then back.
>
> Hopefully I've answered your questions. Which solution do you think makes
> the most sense?

I'm not sure either makes sense. There's not much point in having a
hierarchical controller in the first one (just require direct system-level
distribution) and I don't think the second one is very usable. I mean, try to
imagine being a user. You have to hunt down all rt tasks and twiddle every
one, one way or another, to change some config and then have to worry about
racing forks and class changes. At that point, you might as well just control
it centrally without the hierarchical stuff. You'd have to be really
dedicated or desperate, which means that not many are going to use it, which
then brings up the question: why are we doing this at all?

I wonder whether this can just be a regular max interface - i.e. limit the
maximum reservation in the subtree rather than exact reservation allocation.
Then cgroup can report total reservations in the subtree and admission
control can just reject anything going over.

Thanks.

--
tejun
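
For illustration only, here is a rough user-space model of the "regular max interface" idea from the last paragraph of the mail: each cgroup caps how much bandwidth may be reserved anywhere in its subtree, the running subtree total is tracked per cgroup, and admission rejects any request that would push some ancestor over its cap. This is a toy sketch and not kernel code. The names `Cgroup`, `rt_max`, `subtree_reserved`, `try_reserve` and `release`, and the per-mille bandwidth unit, are assumptions made up for the example; nothing here reflects the actual interface of the patch series.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Cgroup:
    """Toy cgroup node (hypothetical model, not the kernel's data structure)."""
    name: str
    parent: Optional["Cgroup"] = None
    # Assumed limit: maximum bandwidth (per-mille of CPU time) that may be
    # reserved in this whole subtree. None models "max", i.e. no limit here.
    rt_max: Optional[int] = None
    # Bandwidth currently reserved in this subtree (self plus descendants).
    subtree_reserved: int = 0

    def self_and_ancestors(self) -> Iterator["Cgroup"]:
        node: Optional[Cgroup] = self
        while node is not None:
            yield node
            node = node.parent


def try_reserve(cg: Cgroup, bw: int) -> bool:
    """Admit a reservation of `bw` in `cg` only if no ancestor cap is exceeded."""
    # Check every level first so a rejected request leaves no partial state.
    for node in cg.self_and_ancestors():
        if node.rt_max is not None and node.subtree_reserved + bw > node.rt_max:
            return False
    # All levels accepted: charge the reservation to each subtree total.
    for node in cg.self_and_ancestors():
        node.subtree_reserved += bw
    return True


def release(cg: Cgroup, bw: int) -> None:
    """Undo a previously admitted reservation of `bw` made in `cg`."""
    for node in cg.self_and_ancestors():
        node.subtree_reserved -= bw


if __name__ == "__main__":
    root = Cgroup("root", rt_max=900)
    container = Cgroup("container", parent=root, rt_max=300)
    nested = Cgroup("nested", parent=container)  # delegated, no own cap

    print(try_reserve(nested, 200))  # True: fits under container's 300
    print(try_reserve(nested, 200))  # False: 400 would exceed container's 300
```

In this model a delegated child cannot escape its ancestors: whatever it configures locally, the walk up the tree still enforces every cap above it, and existing reservations only matter as totals, so no per-task reshuffling is needed when a limit is set.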