From: Yuri Andriaccio <yurand2000@gmail.com>
To: Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>
Cc: linux-kernel@vger.kernel.org,
Luca Abeni <luca.abeni@santannapisa.it>,
Yuri Andriaccio <yuri.andriaccio@santannapisa.it>
Subject: [RFC PATCH v5 15/29] sched/rt: Update rt-cgroup schedulability checks
Date: Thu, 30 Apr 2026 23:38:19 +0200
Message-ID: <20260430213835.62217-16-yurand2000@gmail.com>
In-Reply-To: <20260430213835.62217-1-yurand2000@gmail.com>
From: luca abeni <luca.abeni@santannapisa.it>
Update sched_group_rt_runtime/period and sched_group_set_rt_runtime/period
to use the newly defined data structures and to perform the checks needed
when updating the runtime or period of a given group.
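
For reference, these handlers back the usual cgroup interface files, so a
group's parameters can be set from userspace e.g. as follows (paths and
file names assume the classic cgroup-v1 cpu controller; "g0" is just an
example group):

  echo 100000 > /sys/fs/cgroup/cpu/g0/cpu.rt_period_us
  echo  50000 > /sys/fs/cgroup/cpu/g0/cpu.rt_runtime_us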

The 'set' functions call tg_set_rt_bandwidth(), which is also updated:
- Use the newly added HCBS dl_bandwidth structure instead of rt_bandwidth.
- Update __rt_schedulable() to check for numerical issues:
  - Reuse __checkparam_dl().
  - Add an allow_zero_runtime parameter to __checkparam_dl(), since cgroups
    may zero their runtime while DEADLINE tasks may not (see the sketch
    after this list).
- Use an RCU lock guard instead of rcu_read_lock/unlock.
- Update tg_rt_schedulable(), used when walking the cgroup tree to check
  that all invariants are met:
  - Read most of the data from the newly added dl_bandwidth structure.
  - If the task group is the root group, run a total bandwidth check with
    the newly added dl_check_tg() function.
  - After all checks succeed, if the changed group is not the root cgroup,
    propagate the assigned runtime and period to all the local deadline
    servers.
- Additionally, use mutex guards instead of manual locking/unlocking.
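
For clarity, the parameter check added to __rt_schedulable() packs the
cgroup's values into a sched_attr and validates it, as in the hunk below:

  struct sched_attr attr = {
          .sched_flags    = 0,
          .sched_runtime  = runtime,      /* zero is allowed for cgroups */
          .sched_deadline = period,
          .sched_period   = period,
  };

  if (!__checkparam_dl(&attr, true))      /* allow_zero_runtime = true */
          return -EINVAL;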

Add dl_check_tg(), which performs an admission-control test similar to
__dl_overflow(), but for the case where the cgroup's total bandwidth is
being updated, rather than a new DEADLINE task being scheduled or a
non-cgroup deadline server being updated.
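
The per-CPU condition to admit the new total bandwidth is (see the hunk
below):

  cap_scale(dl_b->bw, cap) >= dl_b->total_bw + cap_scale(total, cap)

As a worked example (illustrative numbers, assuming full CPU capacity so
that cap_scale() reduces to the identity): with a 95% global limit and 50%
of the bandwidth already admitted, a cgroup total of 40% is accepted
(0.95 >= 0.50 + 0.40), while 50% would be rejected.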

Add an rcu_sched lock guard wrapping
rcu_read_lock_sched()/rcu_read_unlock_sched().
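
Its usage mirrors the existing rcu guard; a minimal sketch:

  {
          guard(rcu_sched)();     /* rcu_read_lock_sched() */
          /* preemption-disabled, RCU-sched read-side section */
  }                               /* rcu_read_unlock_sched() on scope exit */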

Finally, prevent the creation of cgroup hierarchies with depth greater
than two; deeper hierarchies will be addressed in a future patch. A
depth-two hierarchy is sufficient for now to test the patchset.
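
To illustrate what cpu_cgroup_css_alloc() now accepts (group names are
just examples):

  cpu/              root_task_group         allowed
  cpu/rt0/          child of the root       allowed
  cpu/rt0/nested/   depth greater than two  rejected with -EINVAL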
Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
include/linux/rcupdate.h | 1 +
kernel/sched/core.c | 6 ++++
kernel/sched/deadline.c | 43 +++++++++++++++++-----
kernel/sched/rt.c | 77 +++++++++++++++++++---------------------
kernel/sched/sched.h | 3 +-
kernel/sched/syscalls.c | 2 +-
6 files changed, 82 insertions(+), 50 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 04f3f86a4145..032cfa763047 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1191,6 +1191,7 @@ extern int rcu_expedited;
extern int rcu_normal;
DEFINE_LOCK_GUARD_0(rcu, rcu_read_lock(), rcu_read_unlock())
+DEFINE_LOCK_GUARD_0(rcu_sched, rcu_read_lock_sched(), rcu_read_unlock_sched())
DECLARE_LOCK_GUARD_0_ATTRS(rcu, __acquires_shared(RCU), __releases_shared(RCU))
#endif /* __LINUX_RCUPDATE_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98a53b60e21f..0c7032d254ba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9205,6 +9205,12 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
return &root_task_group.css;
}
+ /* Do not allow cpu_cgroup hierarchies with depth greater than 2. */
+#ifdef CONFIG_RT_GROUP_SCHED
+ if (parent != &root_task_group)
+ return ERR_PTR(-EINVAL);
+#endif
+
tg = sched_create_group(parent);
if (IS_ERR(tg))
return ERR_PTR(-ENOMEM);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c82810732106..74bff7fb7b92 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -343,7 +343,39 @@ void cancel_inactive_timer(struct sched_dl_entity *dl_se)
cancel_dl_timer(dl_se, &dl_se->inactive_timer);
}
+/*
+ * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
+ * sched_domains_mutex.
+ */
+u64 dl_cookie;
+
#ifdef CONFIG_RT_GROUP_SCHED
+int dl_check_tg(unsigned long total)
+{
+ int which_cpu;
+ int cap;
+ struct dl_bw *dl_b;
+ u64 gen = ++dl_cookie;
+
+ for_each_possible_cpu(which_cpu) {
+ guard(rcu_sched)();
+
+ if (!dl_bw_visited(which_cpu, gen)) {
+ cap = dl_bw_capacity(which_cpu);
+ dl_b = dl_bw_of(which_cpu);
+
+ guard(raw_spinlock_irqsave)(&dl_b->lock);
+
+ if (dl_b->bw != -1 &&
+ cap_scale(dl_b->bw, cap) < dl_b->total_bw + cap_scale(total, cap))
+ return 0;
+ }
+ }
+
+ return 1;
+}
+
void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
{
struct rq *rq = container_of_const(dl_se->dl_rq, struct rq, dl);
@@ -3469,12 +3501,6 @@ DEFINE_SCHED_CLASS(dl) = {
#endif
};
-/*
- * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
- * sched_domains_mutex.
- */
-u64 dl_cookie;
-
int sched_dl_global_validate(void)
{
u64 runtime = global_rt_runtime();
@@ -3670,7 +3696,7 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
* below 2^63 ns (we have to check both sched_deadline and
* sched_period, as the latter can be zero).
*/
-bool __checkparam_dl(const struct sched_attr *attr)
+bool __checkparam_dl(const struct sched_attr *attr, bool allow_zero_runtime)
{
u64 period, max, min;
@@ -3686,7 +3712,8 @@ bool __checkparam_dl(const struct sched_attr *attr)
* Since we truncate DL_SCALE bits, make sure we're at least
* that big.
*/
- if (attr->sched_runtime < (1ULL << DL_SCALE))
+ if ((!allow_zero_runtime || attr->sched_runtime != 0) &&
+ attr->sched_runtime < (1ULL << DL_SCALE))
return false;
/*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 67fbf4bbe461..c994447f5b1c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2035,11 +2035,6 @@ DEFINE_SCHED_CLASS(rt) = {
};
#ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
static inline int tg_has_rt_tasks(struct task_group *tg)
{
struct task_struct *task;
@@ -2073,8 +2068,8 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
unsigned long total, sum = 0;
u64 period, runtime;
- period = ktime_to_ns(tg->rt_bandwidth.rt_period);
- runtime = tg->rt_bandwidth.rt_runtime;
+ period = tg->dl_bandwidth.dl_period;
+ runtime = tg->dl_bandwidth.dl_runtime;
if (tg == d->tg) {
period = d->rt_period;
@@ -2090,8 +2085,7 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
/*
* Ensure we don't starve existing RT tasks if runtime turns zero.
*/
- if (rt_bandwidth_enabled() && !runtime &&
- tg->rt_bandwidth.rt_runtime && tg_has_rt_tasks(tg))
+ if (dl_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
return -EBUSY;
if (WARN_ON(!rt_group_sched_enabled() && tg != &root_task_group))
@@ -2105,12 +2099,17 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
if (total > to_ratio(global_rt_period(), global_rt_runtime()))
return -EINVAL;
+ if (tg == &root_task_group) {
+ if (!dl_check_tg(total))
+ return -EBUSY;
+ }
+
/*
* The sum of our children's runtime should not exceed our own.
*/
list_for_each_entry_rcu(child, &tg->children, siblings) {
- period = ktime_to_ns(child->rt_bandwidth.rt_period);
- runtime = child->rt_bandwidth.rt_runtime;
+ period = child->dl_bandwidth.dl_period;
+ runtime = child->dl_bandwidth.dl_runtime;
if (child == d->tg) {
period = d->rt_period;
@@ -2128,24 +2127,30 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
{
- int ret;
-
struct rt_schedulable_data data = {
.tg = tg,
.rt_period = period,
.rt_runtime = runtime,
};
- rcu_read_lock();
- ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
- rcu_read_unlock();
+ struct sched_attr attr = {
+ .sched_flags = 0,
+ .sched_runtime = runtime,
+ .sched_deadline = period,
+ .sched_period = period,
+ };
- return ret;
+ if (!__checkparam_dl(&attr, true))
+ return -EINVAL;
+
+ guard(rcu)();
+ return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
}
static int tg_set_rt_bandwidth(struct task_group *tg,
u64 rt_period, u64 rt_runtime)
{
+ static DEFINE_MUTEX(rt_constraints_mutex);
int i, err = 0;
/*
@@ -2155,44 +2160,36 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
if (tg == &root_task_group && rt_runtime == 0)
return -EINVAL;
- /* No period doesn't make any sense. */
- if (rt_period == 0)
- return -EINVAL;
-
/*
* Bound quota to defend quota against overflow during bandwidth shift.
*/
if (rt_runtime != RUNTIME_INF && rt_runtime > max_rt_runtime)
return -EINVAL;
- mutex_lock(&rt_constraints_mutex);
+ guard(mutex)(&rt_constraints_mutex);
err = __rt_schedulable(tg, rt_period, rt_runtime);
if (err)
- goto unlock;
+ return err;
- raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock);
- tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);
- tg->rt_bandwidth.rt_runtime = rt_runtime;
+ guard(raw_spinlock_irq)(&tg->dl_bandwidth.dl_runtime_lock);
+ tg->dl_bandwidth.dl_period = rt_period;
+ tg->dl_bandwidth.dl_runtime = rt_runtime;
- for_each_possible_cpu(i) {
- struct rt_rq *rt_rq = tg->rt_rq[i];
+ if (tg == &root_task_group)
+ return 0;
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- rt_rq->rt_runtime = rt_runtime;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
+ for_each_possible_cpu(i) {
+ dl_init_tg(tg->dl_se[i], rt_runtime, rt_period);
}
- raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
-unlock:
- mutex_unlock(&rt_constraints_mutex);
- return err;
+ return 0;
}
int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
{
u64 rt_runtime, rt_period;
- rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);
+ rt_period = tg->dl_bandwidth.dl_period;
rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;
if (rt_runtime_us < 0)
rt_runtime = RUNTIME_INF;
@@ -2206,10 +2203,10 @@ long sched_group_rt_runtime(struct task_group *tg)
{
u64 rt_runtime_us;
- if (tg->rt_bandwidth.rt_runtime == RUNTIME_INF)
+ if (tg->dl_bandwidth.dl_runtime == RUNTIME_INF)
return -1;
- rt_runtime_us = tg->rt_bandwidth.rt_runtime;
+ rt_runtime_us = tg->dl_bandwidth.dl_runtime;
do_div(rt_runtime_us, NSEC_PER_USEC);
return rt_runtime_us;
}
@@ -2222,7 +2219,7 @@ int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us)
return -EINVAL;
rt_period = rt_period_us * NSEC_PER_USEC;
- rt_runtime = tg->rt_bandwidth.rt_runtime;
+ rt_runtime = tg->dl_bandwidth.dl_runtime;
return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
}
@@ -2231,7 +2228,7 @@ long sched_group_rt_period(struct task_group *tg)
{
u64 rt_period_us;
- rt_period_us = ktime_to_ns(tg->rt_bandwidth.rt_period);
+ rt_period_us = tg->dl_bandwidth.dl_period;
do_div(rt_period_us, NSEC_PER_USEC);
return rt_period_us;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fceb02a04858..78f080275bf0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -364,7 +364,7 @@ extern void sched_dl_do_global(void);
extern int sched_dl_overflow(struct task_struct *p, int policy, const struct sched_attr *attr);
extern void __setparam_dl(struct task_struct *p, const struct sched_attr *attr);
extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr);
-extern bool __checkparam_dl(const struct sched_attr *attr);
+extern bool __checkparam_dl(const struct sched_attr *attr, bool allow_zero_runtime);
extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
extern int dl_bw_deactivate(int cpu);
@@ -423,6 +423,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
struct rq *served_rq,
dl_server_pick_f pick_task);
extern void sched_init_dl_servers(void);
+extern int dl_check_tg(unsigned long total);
extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
extern void fair_server_init(struct rq *rq);
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 15653840c812..d30aee2e90c4 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -528,7 +528,7 @@ int __sched_setscheduler(struct task_struct *p,
*/
if (attr->sched_priority > MAX_RT_PRIO-1)
return -EINVAL;
- if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
+ if ((dl_policy(policy) && !__checkparam_dl(attr, false)) ||
(rt_policy(policy) != (attr->sched_priority != 0)))
return -EINVAL;
--
2.53.0