From: Yuri Andriaccio
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Valentin Schneider
Cc: linux-kernel@vger.kernel.org, Luca Abeni, Yuri Andriaccio
Subject: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
Date: Thu, 30 Apr 2026 23:38:24 +0200
Message-ID: <20260430213835.62217-21-yurand2000@gmail.com>
In-Reply-To: <20260430213835.62217-1-yurand2000@gmail.com>
References: <20260430213835.62217-1-yurand2000@gmail.com>

From: luca abeni

Allow for cgroup hierarchies with more than two levels.

Introduce the concept of live and active groups:
- A group is live if it is a leaf group or if all its children have zero
  runtime.
- A live group with non-zero runtime can be used to schedule tasks.
- An active cgroup is a live group with running tasks.
- A non-live group cannot be used to run tasks; it is only used for
  bandwidth accounting, i.e. the sum of its children's bandwidth must be
  less than or equal to the bandwidth of the parent. This allows cgroups
  to be used for bandwidth management for different users.
- While the root cgroup specifies the total allocatable bandwidth of RT
  cgroups, further accounting is performed to keep track of the live
  bandwidth, i.e. the sum of the bandwidth of live groups. The hierarchy
  invariant states that the live bandwidth must always be less than or
  equal to the total allocatable bandwidth.

Add is_live_sched_group() and sched_group_has_live_siblings() in deadline.c.
These utility functions are used by dl_init_tg() to perform updates only
when necessary:
- Only live groups may update the active dl bandwidth of dl entities
  (call to dl_rq_change_utilization()), while non-live groups must not use
  servers, and thus must not change the active dl bandwidth.
- The total bandwidth accounting must be changed to follow the
  live/non-live rules:
  - When disabling (runtime zero) the last child of a group, the parent
    becomes a live group, and so the parent's bandwidth must be accounted
    back.
  - When enabling (runtime non-zero) the first child, the parent becomes a
    non-live group, and so the parent's bandwidth must be removed.

Update tg_set_rt_bandwidth() to change the runtime of a group to a non-zero
value only if its parent is inactive (i.e. has no RT tasks), thus forcing
the parent to become non-live if it was previously live (it would have
already been non-live if a sibling cgroup was live). An exception is made
for groups which have the root cgroup as parent.

Update sched_rt_can_attach() to allow attaching only to live groups.

Update dl_init_tg() to take a task_group pointer and a CPU id rather than
a direct pointer to the CPU's deadline server. The task_group pointer is
necessary to check and update the live bandwidth accounting.

Co-developed-by: Yuri Andriaccio
Signed-off-by: Yuri Andriaccio
Signed-off-by: luca abeni
---
 kernel/sched/core.c     |  6 ----
 kernel/sched/deadline.c | 63 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/rt.c       | 17 ++++++++---
 kernel/sched/sched.h    |  3 +-
 4 files changed, 74 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41758824b460..fd532bb46995 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9205,12 +9205,6 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return &root_task_group.css;
 	}
 
-	/* Do not allow cpu_cgroup hierachies with depth greater than 2. */
-#ifdef CONFIG_RT_GROUP_SCHED
-	if (parent != &root_task_group)
-		return ERR_PTR(-EINVAL);
-#endif
-
 	tg = sched_create_group(parent);
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 74bff7fb7b92..5967b5350166 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -376,11 +376,46 @@ int dl_check_tg(unsigned long total)
 	return 1;
 }
 
-void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
+/*
+ * A cgroup is deemed live if:
+ * - It is a leaf cgroup.
+ * - All its children have zero runtime.
+ */
+bool is_live_sched_group(struct task_group *tg)
+{
+	struct task_group *child;
+	bool is_active = 1;
+
+	/* if there are no children, this is a leaf group, thus it is live */
+	guard(rcu)();
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		if (child->dl_bandwidth.dl_runtime > 0)
+			is_active = 0;
+	}
+	return is_active;
+}
+
+static inline bool sched_group_has_live_siblings(struct task_group *tg)
+{
+	struct task_group *child;
+	bool has_active_siblings = 0;
+
+	guard(rcu)();
+	list_for_each_entry_rcu(child, &tg->parent->children, siblings) {
+		if (child != tg && child->dl_bandwidth.dl_runtime > 0)
+			has_active_siblings = 1;
+	}
+	return has_active_siblings;
+}
+
+void dl_init_tg(struct task_group *tg, int cpu, u64 rt_runtime, u64 rt_period)
 {
+	struct sched_dl_entity *dl_se = tg->dl_se[cpu];
 	struct rq *rq = container_of_const(dl_se->dl_rq, struct rq, dl);
-	int is_active;
-	u64 new_bw;
+	int is_active, is_live_group;
+	u64 old_runtime, new_bw;
+
+	is_live_group = (int)is_live_sched_group(tg);
 
 	guard(raw_spin_rq_lock_irq)(rq);
 	is_active = dl_se->my_q->rt.rt_nr_running > 0;
@@ -388,8 +423,10 @@ void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
 	update_rq_clock(rq);
 	dl_server_stop(dl_se);
 
+	old_runtime = dl_se->dl_runtime;
 	new_bw = to_ratio(rt_period, rt_runtime);
-	dl_rq_change_utilization(rq, dl_se, new_bw);
+	if (is_live_group)
+		dl_rq_change_utilization(rq, dl_se, new_bw);
 
 	dl_se->dl_runtime = rt_runtime;
 	dl_se->dl_deadline = rt_period;
@@ -401,6 +438,24 @@ void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
 	dl_se->dl_bw = new_bw;
 	dl_se->dl_density = new_bw;
 
+	/*
+	 * Handle parent bandwidth accounting when child runtime changes:
+	 * - When disabling the last child, the parent becomes a live group,
+	 *   and so the parent's bandwidth must be accounted back.
+	 * - When enabling the first child, the parent becomes a non-live group,
+	 *   and so the parent's bandwidth must be removed.
+	 * Only live groups (those without active children) have non-zero bandwidth.
+	 */
+	if (tg->parent && tg->parent != &root_task_group) {
+		if (rt_runtime == 0 && old_runtime != 0 &&
+		    !sched_group_has_live_siblings(tg)) {
+			__add_rq_bw(tg->parent->dl_se[cpu]->dl_bw, dl_se->dl_rq);
+		} else if (rt_runtime != 0 && old_runtime == 0 &&
+			   !sched_group_has_live_siblings(tg)) {
+			__sub_rq_bw(tg->parent->dl_se[cpu]->dl_bw, dl_se->dl_rq);
+		}
+	}
+
 	if (is_active)
 		dl_server_start(dl_se);
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 5caddc5c2876..2be22024e66d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -101,7 +101,7 @@ void unregister_rt_sched_group(struct task_group *tg)
 			continue;
 
 		if (tg->dl_se[i]->dl_runtime)
-			dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
+			dl_init_tg(tg, i, 0, tg->dl_se[i]->dl_period);
 	}
 }
 
@@ -129,7 +129,7 @@ void free_rt_sched_group(struct task_group *tg)
 		 * to 0 immediately before freeing it.
 		 */
 		if (tg->dl_se[i]->dl_runtime)
-			dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
+			dl_init_tg(tg, i, 0, tg->dl_se[i]->dl_period);
 
 		raw_spin_rq_lock_irqsave(cpu_rq(i), flags);
 		hrtimer_cancel(&tg->dl_se[i]->dl_timer);
@@ -2154,6 +2154,14 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 	static DEFINE_MUTEX(rt_constraints_mutex);
 	int i, err = 0;
 
+	/*
+	 * Do not allow setting an RT runtime > 0 if the parent has RT tasks
+	 * (and is not the root group)
+	 */
+	if (rt_runtime && tg != &root_task_group &&
+	    tg->parent != &root_task_group && tg_has_rt_tasks(tg->parent))
+		return -EINVAL;
+
 	/*
 	 * Bound quota to defend quota against overflow during bandwidth shift.
 	 */
@@ -2173,7 +2181,7 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 		return 0;
 
 	for_each_possible_cpu(i) {
-		dl_init_tg(tg->dl_se[i], rt_runtime, rt_period);
+		dl_init_tg(tg, i, rt_runtime, rt_period);
 	}
 
 	return 0;
@@ -2244,7 +2252,8 @@ int sched_rt_can_attach(struct task_group *tg)
 	if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0)
 		return 0;
 
-	return 1;
+	/* tasks can be attached only if the taskgroup has no live children. */
+	return (int)is_live_sched_group(tg);
 }
 
 #else /* !CONFIG_RT_GROUP_SCHED */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a4435f107cfe..9814be8348cd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -409,7 +409,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
 		    dl_server_pick_f pick_task);
 extern void sched_init_dl_servers(void);
 extern int dl_check_tg(unsigned long total);
-extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
+extern void dl_init_tg(struct task_group *tg, int cpu, u64 rt_runtime, u64 rt_period);
+extern bool is_live_sched_group(struct task_group *tg);
 extern void fair_server_init(struct rq *rq);
 extern void ext_server_init(struct rq *rq);
 
-- 
2.53.0
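
The live/non-live accounting rules from the commit message can be
illustrated with a small userspace model. This is only a sketch under
stated assumptions, not kernel code: struct group, group_add_child(),
group_is_live(), group_has_live_siblings() and group_set_runtime() are
hypothetical names, bandwidth is approximated by the raw runtime value
(a fixed period is assumed), and a single live_bw counter stands in for
the per-runqueue accounting done by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 8

/* Hypothetical stand-in for struct task_group; not the kernel type. */
struct group {
	struct group *parent;			/* NULL for the root group */
	struct group *children[MAX_CHILDREN];
	size_t nr_children;
	uint64_t runtime;			/* 0 means "disabled" */
};

static void group_add_child(struct group *parent, struct group *child)
{
	assert(parent->nr_children < MAX_CHILDREN);
	child->parent = parent;
	parent->children[parent->nr_children++] = child;
}

/* A group is live if it is a leaf or all of its children have zero runtime. */
static bool group_is_live(const struct group *g)
{
	for (size_t i = 0; i < g->nr_children; i++) {
		if (g->children[i]->runtime > 0)
			return false;
	}
	return true;
}

/* True if any sibling of g (g itself excluded) has non-zero runtime. */
static bool group_has_live_siblings(const struct group *g)
{
	assert(g->parent != NULL);
	for (size_t i = 0; i < g->parent->nr_children; i++) {
		const struct group *sib = g->parent->children[i];

		if (sib != g && sib->runtime > 0)
			return true;
	}
	return false;
}

/* Change g's runtime and adjust *live_bw following the commit message rules. */
static void group_set_runtime(struct group *g, uint64_t runtime,
			      uint64_t *live_bw)
{
	uint64_t old_runtime = g->runtime;
	struct group *parent = g->parent;

	/* Only a live group contributes its own bandwidth. */
	if (group_is_live(g))
		*live_bw = *live_bw - old_runtime + runtime;
	g->runtime = runtime;

	/* No parent adjustment for the root or for groups directly below it. */
	if (parent == NULL || parent->parent == NULL)
		return;

	if (runtime == 0 && old_runtime != 0 && !group_has_live_siblings(g))
		*live_bw += parent->runtime;	/* parent becomes live again */
	else if (runtime != 0 && old_runtime == 0 && !group_has_live_siblings(g))
		*live_bw -= parent->runtime;	/* parent stops being live */
}
```

With a root, one group "user" below it and two children "a" and "b" of
"user", setting user=50, a=20, b=10, a=0, b=0 in that order moves the
modelled live bandwidth through 50, 20, 30, 10 and back to 50: enabling
the first child removes the parent's share, and disabling the last child
accounts it back.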