From: Yuri Andriaccio <yurand2000@gmail.com>
To: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>
Cc: linux-kernel@vger.kernel.org,
	Luca Abeni <luca.abeni@santannapisa.it>,
	Yuri Andriaccio <yuri.andriaccio@santannapisa.it>
Subject: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
Date: Thu, 30 Apr 2026 23:38:24 +0200
Message-ID: <20260430213835.62217-21-yurand2000@gmail.com>
In-Reply-To: <20260430213835.62217-1-yurand2000@gmail.com>

From: luca abeni <luca.abeni@santannapisa.it>

Allow for cgroup hierarchies with more than two levels.

Introduce the concept of live and active groups:
- A group is live if it is a leaf group or if all its children have zero
  runtime.
- A live group with non-zero runtime can be used to schedule tasks.
- An active cgroup is a live group with running tasks.
- A non-live group cannot be used to run tasks; it is used only for
  bandwidth accounting, i.e. the sum of its children's bandwidth must be
  less than or equal to the bandwidth of the parent. This allows cgroups
  to be used to manage bandwidth for different users.
- While the root cgroup specifies the total allocatable bandwidth of RT
  cgroups, additional accounting keeps track of the live bandwidth, i.e.
  the sum of the bandwidth of live groups. The hierarchy invariant
  states that the live bandwidth must always be less than or equal to
  the total allocatable bandwidth.
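
As a rough illustration of the definitions above, the following is a
minimal user-space sketch, not kernel code. The struct toy_group type
and the toy_*() helpers are hypothetical names invented for this
example; the patch itself works on struct task_group and walks
tg->children under RCU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of a group; NOT the kernel's task_group. */
struct toy_group {
	uint64_t runtime;          /* 0 means the group is disabled */
	int nr_running;            /* runnable RT tasks attached to the group */
	struct toy_group *child;   /* first child, NULL for a leaf */
	struct toy_group *sibling; /* next sibling, NULL at end of list */
};

/* Live: a leaf group, or a group whose children all have zero runtime. */
static bool toy_is_live(const struct toy_group *g)
{
	assert(g != NULL);
	for (const struct toy_group *c = g->child; c; c = c->sibling)
		if (c->runtime > 0)
			return false;
	return true;
}

/* A live group with non-zero runtime can be used to schedule tasks. */
static bool toy_can_schedule(const struct toy_group *g)
{
	return toy_is_live(g) && g->runtime > 0;
}

/* Active: a live group with running tasks. */
static bool toy_is_active(const struct toy_group *g)
{
	return toy_is_live(g) && g->nr_running > 0;
}
```

In this model a parent with one enabled child is non-live, and it
becomes live again as soon as that child's runtime is set to zero.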

Add is_live_sched_group() and sched_group_has_live_siblings() in
deadline.c. These utility functions are used by dl_init_tg() to perform
updates only when necessary:
- Only live groups may update the active dl bandwidth of dl entities
  (call to dl_rq_change_utilization), while non-live groups must not use
  servers, and thus must not change the active dl bandwidth.
- The total bandwidth accounting must be changed to follow the
  live/non-live rules:
  - When disabling (runtime zero) the last child of a group, the parent
    becomes a live group, and so the parent's bw must be accounted back.
  - When enabling (runtime non-zero) the first child, the parent becomes a
    non-live group, and so the parent's bandwidth must be removed.
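
The two transitions above can be sketched in user space as follows.
This is only an assumed model: struct toy_acct and
toy_child_runtime_changed() are hypothetical names, and the single
live_bw counter stands in for the per-CPU dl_rq bandwidth that the
patch adjusts with __add_rq_bw() and __sub_rq_bw(). The sketch tracks
only the parent's contribution, not the child's own bandwidth.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accounting model for one CPU; names are illustrative only. */
struct toy_acct {
	uint64_t live_bw; /* bandwidth currently accounted for live groups */
};

/*
 * Adjust the parent's contribution when one child's runtime changes.
 * other_enabled_siblings is true if some other child of the same parent
 * has non-zero runtime, in which case the parent was and stays non-live.
 */
static void toy_child_runtime_changed(struct toy_acct *acct, uint64_t parent_bw,
				      uint64_t old_runtime, uint64_t new_runtime,
				      bool other_enabled_siblings)
{
	assert(acct != NULL);

	if (other_enabled_siblings)
		return;

	if (old_runtime != 0 && new_runtime == 0)
		acct->live_bw += parent_bw; /* last child disabled: parent is live */
	else if (old_runtime == 0 && new_runtime != 0)
		acct->live_bw -= parent_bw; /* first child enabled: parent is non-live */
}
```

Runtime changes that keep a child enabled, or keep it disabled, leave
the parent's contribution untouched in this model.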

Update tg_set_rt_bandwidth() to allow changing the runtime of a group to
a non-zero value only if its parent is inactive (has no RT tasks), since
doing so makes the parent non-live if it was previously live (it would
already have been non-live if a sibling cgroup had non-zero runtime). An
exception is made for groups whose parent is the root cgroup.
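
The admission rule can be sketched as a small user-space predicate.
This is an assumption-laden model: struct toy_tg, its nr_rt_tasks field
and toy_check_set_runtime() are hypothetical, with nr_rt_tasks standing
in for what the kernel determines through tg_has_rt_tasks().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a task group; NOT the kernel's task_group. */
struct toy_tg {
	struct toy_tg *parent; /* NULL for the root group */
	int nr_rt_tasks;       /* RT tasks attached directly to this group */
};

/*
 * Return 0 if rt_runtime may be set on tg, -EINVAL otherwise.
 * A non-zero runtime is refused when the parent still has RT tasks,
 * unless tg is the root group or a direct child of the root group.
 */
static int toy_check_set_runtime(const struct toy_tg *tg,
				 const struct toy_tg *root,
				 uint64_t rt_runtime)
{
	assert(tg != NULL && root != NULL);

	if (rt_runtime == 0)
		return 0; /* disabling a group is always allowed here */
	if (tg == root || tg->parent == root)
		return 0; /* exception for root and its direct children */
	if (tg->parent->nr_rt_tasks > 0)
		return -EINVAL; /* parent is active, it cannot become non-live */
	return 0;
}
```

Under this model, the tasks in the parent have to be moved elsewhere
before any of its children can be given a non-zero runtime.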

Update sched_rt_can_attach() to allow attaching tasks only to live groups.

Update dl_init_tg() to take a task_group pointer and a CPU id rather
than a direct pointer to the CPU's deadline server. The task_group
pointer is necessary to check and update the live bandwidth
accounting.

Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 kernel/sched/core.c     |  6 ----
 kernel/sched/deadline.c | 63 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/rt.c       | 17 ++++++++---
 kernel/sched/sched.h    |  3 +-
 4 files changed, 74 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41758824b460..fd532bb46995 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9205,12 +9205,6 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return &root_task_group.css;
 	}

-	/* Do not allow cpu_cgroup hierachies with depth greater than 2. */
-#ifdef CONFIG_RT_GROUP_SCHED
-	if (parent != &root_task_group)
-		return ERR_PTR(-EINVAL);
-#endif
-
 	tg = sched_create_group(parent);
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 74bff7fb7b92..5967b5350166 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -376,11 +376,46 @@ int dl_check_tg(unsigned long total)
 	return 1;
 }

-void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
+/*
+ * A cgroup is deemed live if:
+ * - It is a leaf cgroup.
+ * - All its children have zero runtime.
+ */
+bool is_live_sched_group(struct task_group *tg)
+{
+	struct task_group *child;
+	bool is_active = 1;
+
+	/* if there are no children, this is a leaf group, thus it is live */
+	guard(rcu)();
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		if (child->dl_bandwidth.dl_runtime > 0)
+			is_active = 0;
+	}
+	return is_active;
+}
+
+static inline bool sched_group_has_live_siblings(struct task_group *tg)
+{
+	struct task_group *child;
+	bool has_active_siblings = 0;
+
+	guard(rcu)();
+	list_for_each_entry_rcu(child, &tg->parent->children, siblings) {
+		if (child != tg && child->dl_bandwidth.dl_runtime > 0)
+			has_active_siblings = 1;
+	}
+	return has_active_siblings;
+}
+
+void dl_init_tg(struct task_group *tg, int cpu, u64 rt_runtime, u64 rt_period)
 {
+	struct sched_dl_entity *dl_se = tg->dl_se[cpu];
 	struct rq *rq = container_of_const(dl_se->dl_rq, struct rq, dl);
-	int is_active;
-	u64 new_bw;
+	int is_active, is_live_group;
+	u64 old_runtime, new_bw;
+
+	is_live_group = (int)is_live_sched_group(tg);

 	guard(raw_spin_rq_lock_irq)(rq);
 	is_active = dl_se->my_q->rt.rt_nr_running > 0;
@@ -388,8 +423,10 @@ void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
 	update_rq_clock(rq);
 	dl_server_stop(dl_se);

+	old_runtime = dl_se->dl_runtime;
 	new_bw = to_ratio(rt_period, rt_runtime);
-	dl_rq_change_utilization(rq, dl_se, new_bw);
+	if (is_live_group)
+		dl_rq_change_utilization(rq, dl_se, new_bw);

 	dl_se->dl_runtime  = rt_runtime;
 	dl_se->dl_deadline = rt_period;
@@ -401,6 +438,24 @@ void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
 	dl_se->dl_bw = new_bw;
 	dl_se->dl_density = new_bw;

+	/*
+	 * Handle parent bandwidth accounting when child runtime changes:
+	 * - When disabling the last child, the parent becomes a live group,
+	 *   and so the parent's bandwidth must be accounted back.
+	 * - When enabling the first child, the parent becomes a non-live group,
+	 *   and so the parent's bandwidth must be removed.
+	 * Only live groups (those without enabled children) contribute bandwidth.
+	 */
+	if (tg->parent && tg->parent != &root_task_group) {
+		if (rt_runtime == 0 && old_runtime != 0 &&
+		    !sched_group_has_live_siblings(tg)) {
+			__add_rq_bw(tg->parent->dl_se[cpu]->dl_bw, dl_se->dl_rq);
+		} else if (rt_runtime != 0 && old_runtime == 0 &&
+			   !sched_group_has_live_siblings(tg)) {
+			__sub_rq_bw(tg->parent->dl_se[cpu]->dl_bw, dl_se->dl_rq);
+		}
+	}
+
 	if (is_active)
 		dl_server_start(dl_se);
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 5caddc5c2876..2be22024e66d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -101,7 +101,7 @@ void unregister_rt_sched_group(struct task_group *tg)
 			continue;

 		if (tg->dl_se[i]->dl_runtime)
-			dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
+			dl_init_tg(tg, i, 0, tg->dl_se[i]->dl_period);
 	}
 }

@@ -129,7 +129,7 @@ void free_rt_sched_group(struct task_group *tg)
 		 * to 0 immediately before freeing it.
 		 */
 		if (tg->dl_se[i]->dl_runtime)
-			dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
+			dl_init_tg(tg, i, 0, tg->dl_se[i]->dl_period);

 		raw_spin_rq_lock_irqsave(cpu_rq(i), flags);
 		hrtimer_cancel(&tg->dl_se[i]->dl_timer);
@@ -2154,6 +2154,14 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 	static DEFINE_MUTEX(rt_constraints_mutex);
 	int i, err = 0;

+	/*
+	 * Do not allow setting an RT runtime > 0 if the parent has RT tasks
+	 * (and is not the root group)
+	 */
+	if (rt_runtime && tg != &root_task_group &&
+		tg->parent != &root_task_group && tg_has_rt_tasks(tg->parent))
+		return -EINVAL;
+
 	/*
 	 * Bound quota to defend quota against overflow during bandwidth shift.
 	 */
@@ -2173,7 +2181,7 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 		return 0;

 	for_each_possible_cpu(i) {
-		dl_init_tg(tg->dl_se[i], rt_runtime, rt_period);
+		dl_init_tg(tg, i, rt_runtime, rt_period);
 	}

 	return 0;
@@ -2244,7 +2252,8 @@ int sched_rt_can_attach(struct task_group *tg)
 	if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0)
 		return 0;

-	return 1;
+	/* tasks can be attached only if the taskgroup has no live children. */
+	return (int)is_live_sched_group(tg);
 }

 #else /* !CONFIG_RT_GROUP_SCHED */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a4435f107cfe..9814be8348cd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -409,7 +409,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
 		    dl_server_pick_f pick_task);
 extern void sched_init_dl_servers(void);
 extern int dl_check_tg(unsigned long total);
-extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
+extern void dl_init_tg(struct task_group *tg, int cpu, u64 rt_runtime, u64 rt_period);
+extern bool is_live_sched_group(struct task_group *tg);

 extern void fair_server_init(struct rq *rq);
 extern void ext_server_init(struct rq *rq);
--
2.53.0

