From: Yuri Andriaccio <yurand2000@gmail.com>
To: "Ingo Molnar" <mingo@redhat.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Juri Lelli" <juri.lelli@redhat.com>,
"Vincent Guittot" <vincent.guittot@linaro.org>,
"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
"Valentin Schneider" <vschneid@redhat.com>,
"Tejun Heo" <tj@kernel.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Koutný" <mkoutny@suse.com>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
Luca Abeni <luca.abeni@santannapisa.it>,
Yuri Andriaccio <yuri.andriaccio@santannapisa.it>
Subject: [RFC PATCH v6 17/25] sched/rt: Update rt-cgroup schedulability checks
Date: Mon, 8 Jun 2026 14:15:36 +0200 [thread overview]
Message-ID: <20260608121546.69910-18-yurand2000@gmail.com> (raw)
In-Reply-To: <20260608121546.69910-1-yurand2000@gmail.com>
From: luca abeni <luca.abeni@santannapisa.it>
Introduce cgroup-v2 control files:
- cpu.rt.max:
Get/set the bandwidth of the given cgroup, or inherith from parent.
- cpu.rt.internal:
Get the actual remaning bandwidth for the group, removing the bw of the
group's children.
Introduce a number of functions to update the cgroup settings across the
whole hierarchy:
- tg_subtree_has_rt_tasks()
Checks if the active context rooted at tg is running rt workload.
Child groups which do not share the same active context are ignored.
- tg_compute_children_bw()
Computes the total bandwidth of the active context rooted at tg minux
the root of the context itself.
- tg_rt_schedulable()
Runs admission tests for the current cgroup tree and the given
bandwidth update.
- tg_update_active_context()
Updates the active context of a given subtree with a new one.
- tg_rt_bandwidth() / tg_rt_internal_bandwidth()
Read the max (internal) bandwidth set to the cgroup.
- tg_set_rt_bandwidth()
Set the bandwidth of the group.
Update sched_rt_can_attach to run only tasks in the root cgroup or HCBS
cgroups which have non-zero runtime.
Update and reuse __checkparam_dl to check for numerical issues regarding
the dl_server's parameters.
Add from_ratio function to convert from period and bw to runtime, inverse
of the to_ratio function.
Add dl_check_tg(), which performs an admission control test similar to
__dl_overflow, but this time we are updating the cgroup's total bandwidth
rather than scheduling a new DEADLINE task or updating a non-cgroup
deadline server.
Add rcu_sched lock guard for rcu_read_{lock/unlock}_sched.
Add sched_domains lock guard for sched_domains_mutex_{lock/unlock}.
Add lock/unlock methods for sched_rt_handler_mutex and its lock guard.
Add asserts for held sched_domains_mutex and sched_rt_handler_mutex.
Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
include/linux/rcupdate.h | 1 +
include/linux/sched.h | 2 +
kernel/sched/core.c | 55 ++++++
kernel/sched/deadline.c | 60 ++++--
kernel/sched/rt.c | 393 +++++++++++++++++++++++++++++++--------
kernel/sched/sched.h | 18 +-
kernel/sched/syscalls.c | 2 +-
7 files changed, 445 insertions(+), 86 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index bfa765132de8..70432ca3dbb9 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1179,6 +1179,7 @@ extern int rcu_expedited;
extern int rcu_normal;
DEFINE_LOCK_GUARD_0(rcu, rcu_read_lock(), rcu_read_unlock())
+DEFINE_LOCK_GUARD_0(rcu_sched, rcu_read_lock_sched(), rcu_read_unlock_sched())
DECLARE_LOCK_GUARD_0_ATTRS(rcu, __acquires_shared(RCU), __releases_shared(RCU))
#endif /* __LINUX_RCUPDATE_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b20451fcda55..0021069581c2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2522,4 +2522,6 @@ extern void migrate_enable(void);
DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
+DEFINE_LOCK_GUARD_0(sched_domains, sched_domains_mutex_lock(), sched_domains_mutex_unlock())
+
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a8a81c69b3d3..1ad1efe1dca7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4815,6 +4815,14 @@ u64 to_ratio(u64 period, u64 runtime)
return div64_u64(runtime << BW_SHIFT, period);
}
+u64 from_ratio(u64 period, u64 bw)
+{
+ if (bw == BW_UNIT)
+ return RUNTIME_INF;
+
+ return (bw * period) >> BW_SHIFT;
+}
+
/*
* wake_up_new_task - wake up a newly created task for the first time.
*
@@ -10415,6 +10423,41 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
}
#endif /* CONFIG_CFS_BANDWIDTH */
+#ifdef CONFIG_RT_GROUP_SCHED
+static int cpu_rt_max_show(struct seq_file *sf, void *v)
+{
+ struct task_group *tg = css_tg(seq_css(sf));
+ long period_us, runtime_us;
+
+ tg_rt_bandwidth(tg, &period_us, &runtime_us);
+ cpu_period_quota_print(sf, period_us, runtime_us);
+ return 0;
+}
+
+static int cpu_rt_internal_show(struct seq_file *sf, void *v)
+{
+ struct task_group *tg = css_tg(seq_css(sf));
+ long period_us, runtime_us;
+
+ tg_rt_internal_bandwidth(tg, &period_us, &runtime_us);
+ cpu_period_quota_print(sf, period_us, runtime_us);
+ return 0;
+}
+
+static ssize_t cpu_rt_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct task_group *tg = css_tg(of_css(of));
+ u64 period_us, runtime_us;
+ int ret;
+
+ ret = cpu_period_quota_parse(buf, &period_us, &runtime_us);
+ if (!ret)
+ ret = tg_set_rt_bandwidth(tg, period_us, runtime_us);
+ return ret ?: nbytes;
+}
+#endif /* CONFIG_RT_GROUP_SCHED */
+
static struct cftype cpu_files[] = {
#ifdef CONFIG_GROUP_SCHED_WEIGHT
{
@@ -10450,6 +10493,18 @@ static struct cftype cpu_files[] = {
.write_u64 = cpu_burst_write_u64,
},
#endif /* CONFIG_CFS_BANDWIDTH */
+#ifdef CONFIG_RT_GROUP_SCHED
+ {
+ .name = "rt.max",
+ .seq_show = cpu_rt_max_show,
+ .write = cpu_rt_max_write,
+ },
+ {
+ .name = "rt.internal",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_rt_internal_show,
+ },
+#endif /* CONFIG_RT_GROUP_SCHED */
#ifdef CONFIG_UCLAMP_TASK_GROUP
{
.name = "uclamp.min",
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a63253ec6441..b7102f643171 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -346,10 +346,45 @@ void cancel_inactive_timer(struct sched_dl_entity *dl_se)
cancel_dl_timer(dl_se, &dl_se->inactive_timer);
}
+/*
+ * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
+ * sched_domains_mutex.
+ */
+u64 dl_cookie;
+
#ifdef CONFIG_RT_GROUP_SCHED
+int dl_check_tg(unsigned long total)
+{
+ int which_cpu;
+ int cap;
+ struct dl_bw *dl_b;
+ u64 gen = ++dl_cookie;
+
+ lockdep_assert_held(&sched_domains_mutex);
+ lockdep_assert_held(&sched_rt_handler_mutex);
+
+ for_each_possible_cpu(which_cpu) {
+ guard(rcu_sched)();
+
+ if (!dl_bw_visited(which_cpu, gen)) {
+ cap = dl_bw_capacity(which_cpu);
+ dl_b = dl_bw_of(which_cpu);
+
+ guard(raw_spinlock_irqsave)(&dl_b->lock);
+
+ if (dl_b->bw != -1 &&
+ cap_scale(dl_b->bw, cap) < dl_b->total_bw + cap_scale(total, cap))
+ return 0;
+ }
+
+ }
+
+ return 1;
+}
+
void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
{
- struct rq *rq = container_of_const(dl_se->dl_rq, struct rq, dl);
+ struct rq *rq = rq_of_dl_se(dl_se);
int is_active;
u64 new_bw;
@@ -3497,12 +3532,6 @@ DEFINE_SCHED_CLASS(dl) = {
#endif
};
-/*
- * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
- * sched_domains_mutex.
- */
-u64 dl_cookie;
-
int sched_dl_global_validate(void)
{
u64 runtime = global_rt_runtime();
@@ -3514,6 +3543,9 @@ int sched_dl_global_validate(void)
int cpu, cap, cpus, ret = 0;
unsigned long flags;
+ lockdep_assert_held(&sched_domains_mutex);
+ lockdep_assert_held(&sched_rt_handler_mutex);
+
/*
* Here we want to check the bandwidth not being set to some
* value smaller than the currently allocated bandwidth in
@@ -3566,6 +3598,9 @@ void sched_dl_do_global(void)
int cpu;
unsigned long flags;
+ lockdep_assert_held(&sched_domains_mutex);
+ lockdep_assert_held(&sched_rt_handler_mutex);
+
if (global_rt_runtime() != RUNTIME_INF)
new_bw = to_ratio(global_rt_period(), global_rt_runtime());
@@ -3711,7 +3746,7 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr, unsigned int
* below 2^63 ns (we have to check both sched_deadline and
* sched_period, as the latter can be zero).
*/
-bool __checkparam_dl(const struct sched_attr *attr)
+bool __checkparam_dl(const struct sched_attr *attr, bool allow_zero_runtime)
{
u64 period, max, min;
@@ -3720,14 +3755,16 @@ bool __checkparam_dl(const struct sched_attr *attr)
return true;
/* deadline != 0 */
- if (attr->sched_deadline == 0)
+ if ((!allow_zero_runtime || attr->sched_runtime != 0) &&
+ attr->sched_deadline == 0)
return false;
/*
* Since we truncate DL_SCALE bits, make sure we're at least
* that big.
*/
- if (attr->sched_runtime < (1ULL << DL_SCALE))
+ if ((!allow_zero_runtime || attr->sched_runtime != 0) &&
+ attr->sched_runtime < (1ULL << DL_SCALE))
return false;
/*
@@ -3750,7 +3787,8 @@ bool __checkparam_dl(const struct sched_attr *attr)
max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
- if (period < min || period > max)
+ if ((!allow_zero_runtime || period != 0) &&
+ (period < min || period > max))
return false;
return true;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4f1e7af2e88d..a32b1f68e645 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1,4 +1,3 @@
-#pragma GCC diagnostic ignored "-Wunused-function"
// SPDX-License-Identifier: GPL-2.0
/*
* Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR
@@ -2111,9 +2110,6 @@ DEFINE_SCHED_CLASS(rt) = {
};
#ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
static inline int tg_has_rt_tasks(struct task_group *tg)
{
struct task_struct *task;
@@ -2134,38 +2130,114 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
return ret;
}
-struct rt_schedulable_data {
+static int __tg_subtree_has_rt_tasks(struct task_group *tg, void *data) {
+ struct task_group *ctx = data;
+
+ if (dl_bandwidth_read(tg)->active_context == ctx && tg_has_rt_tasks(tg))
+ return 1;
+ else
+ return 0;
+}
+
+static int tg_subtree_has_rt_tasks(struct task_group *tg) {
+ lockdep_assert(rcu_read_lock_held());
+ return walk_tg_tree_from(tg, __tg_subtree_has_rt_tasks, tg_nop,
+ dl_bandwidth_read(tg)->active_context);
+}
+
+struct tg_update_data {
struct task_group *tg;
u64 rt_period;
u64 rt_runtime;
};
-static int tg_rt_schedulable(struct task_group *tg, void *data)
+struct tg_compute_children_bw_data {
+ struct tg_update_data update;
+ struct task_group *active_context;
+ u64 bw_sum;
+};
+
+static int __tg_compute_children_bw(struct task_group *tg, void *data) {
+ struct tg_compute_children_bw_data *d = data;
+ const struct dl_bandwidth *dl_b = dl_bandwidth_read(tg);
+ u64 period, runtime;
+
+ /* Skip the current task group from the sum. */
+ if (tg == d->active_context)
+ return 0;
+
+ period = dl_b->dl_period;
+ runtime = dl_b->dl_runtime;
+ if (tg == d->update.tg) {
+ period = d->update.rt_period;
+ runtime = d->update.rt_runtime;
+ }
+
+ if (runtime == RUNTIME_INF ||
+ dl_bandwidth_read(tg->parent)->active_context != d->active_context)
+ return 0;
+
+ d->bw_sum += to_ratio(period, runtime);
+ return 0;
+}
+
+static unsigned long tg_compute_children_bw(struct task_group *tg,
+ struct tg_update_data *data)
+{
+ struct tg_compute_children_bw_data sum_data = {
+ .active_context = tg,
+ .bw_sum = 0,
+ .update = (struct tg_update_data) {
+ .tg = data->tg,
+ .rt_period = data->rt_period,
+ .rt_runtime = data->rt_runtime,
+ }
+ };
+
+ lockdep_assert(rcu_read_lock_held());
+ walk_tg_tree_from(tg, __tg_compute_children_bw, tg_nop, &sum_data);
+ return sum_data.bw_sum;
+}
+
+struct rt_schedulable_data {
+ struct tg_update_data update;
+ u64 rt_runtime_remainder;
+};
+
+static int __tg_rt_schedulable(struct task_group *tg, void *data)
{
struct rt_schedulable_data *d = data;
- struct task_group *child;
+ const struct dl_bandwidth *dl_b;
u64 total, sum = 0;
u64 period, runtime;
- period = ktime_to_ns(tg->rt_bandwidth.rt_period);
- runtime = tg->rt_bandwidth.rt_runtime;
+ dl_b = dl_bandwidth_read(tg);
+ period = dl_b->dl_period;
+ runtime = dl_b->dl_runtime;
- if (tg == d->tg) {
- period = d->rt_period;
- runtime = d->rt_runtime;
+ if (tg == d->update.tg) {
+ period = d->update.rt_period;
+ runtime = d->update.rt_runtime;
}
+ /*
+ * "max" groups are always schedulable, as they defer their access
+ * control to their first non-max parent.
+ */
+ if (runtime == RUNTIME_INF)
+ return 0;
+
/*
* Cannot have more runtime than the period.
*/
- if (runtime > period && runtime != RUNTIME_INF)
+ if (runtime > period)
return -EINVAL;
/*
* Ensure we don't starve existing RT tasks if runtime turns zero.
*/
- if (rt_bandwidth_enabled() && !runtime &&
- tg->rt_bandwidth.rt_runtime && tg_has_rt_tasks(tg))
+ if (dl_bandwidth_enabled() && !runtime && tg != &root_task_group &&
+ tg_subtree_has_rt_tasks(tg))
return -EBUSY;
total = to_ratio(period, runtime);
@@ -2176,58 +2248,146 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
if (total > to_ratio(global_rt_period(), global_rt_runtime()))
return -EINVAL;
+ if (tg == &root_task_group) {
+ if (!dl_check_tg(total))
+ return -EBUSY;
+ }
+
/*
- * The sum of our children's runtime should not exceed our own.
+ * The sum of our children's runtime, plus our own bw, should not
+ * exceed our own max.
*/
- list_for_each_entry_rcu(child, &tg->children, siblings) {
- period = ktime_to_ns(child->rt_bandwidth.rt_period);
- runtime = child->rt_bandwidth.rt_runtime;
+ sum = tg_compute_children_bw(tg, &d->update);
+ if (sum > total)
+ return -EINVAL;
- if (child == d->tg) {
- period = d->rt_period;
- runtime = d->rt_runtime;
- }
+ /*
+ * Compute remaining runtime
+ */
+ if (tg == d->update.tg)
+ d->rt_runtime_remainder = from_ratio(period, total - sum);
+
+ return 0;
+}
- sum += to_ratio(period, runtime);
+static int tg_rt_schedulable(struct tg_update_data *data, u64 *remainder_runtime)
+{
+ int err;
+ struct rt_schedulable_data d = {
+ .update = (struct tg_update_data) {
+ .tg = data->tg,
+ .rt_period = data->rt_period,
+ .rt_runtime = data->rt_runtime,
+ },
+ .rt_runtime_remainder = 0,
+ };
+
+ /*
+ * Walk the cgroup tree and check schedulability constraints.
+ */
+ lockdep_assert(rcu_read_lock_held());
+ err = walk_tg_tree(__tg_rt_schedulable, tg_nop, &d);
+ if (err)
+ return err;
+
+ *remainder_runtime = d.rt_runtime_remainder;
+ return 0;
+}
+
+struct tg_update_active_context_data {
+ struct task_group *new_active_context;
+ struct task_group *old_active_context;
+};
+
+static int __tg_update_active_context(struct task_group *tg, void *data) {
+ struct tg_update_active_context_data *d = data;
+
+ if (dl_bandwidth_read(tg)->active_context == d->old_active_context) {
+ guard(raw_spinlock_irq)(dl_bw_lock_of_tg(tg));
+ dl_bandwidth_write(tg)->active_context = d->new_active_context;
}
- if (sum > total)
- return -EINVAL;
+ return 0;
+}
+
+static void tg_update_active_context(struct task_group *tg,
+ struct task_group *old_context,
+ struct task_group *new_context)
+{
+ struct tg_update_active_context_data data = {
+ .new_active_context = new_context,
+ .old_active_context = old_context,
+ };
+ lockdep_assert(rcu_read_lock_held());
+ walk_tg_tree_from(tg, __tg_update_active_context, tg_nop, &data);
+}
+
+int tg_rt_bandwidth(struct task_group *tg,
+ long *rt_period_us, long *rt_runtime_us)
+{
+ const struct dl_bandwidth *dl_b;
+
+ guard(raw_spinlock_irq)(dl_bw_lock_of_tg(tg));
+ dl_b = dl_bandwidth_read(tg);
+
+ *rt_runtime_us = -1;
+ if (dl_b->dl_runtime != RUNTIME_INF) {
+ *rt_runtime_us = dl_b->dl_runtime;
+ do_div(*rt_runtime_us, NSEC_PER_USEC);
+ }
+
+ *rt_period_us = dl_b->dl_period;
+ do_div(*rt_period_us, NSEC_PER_USEC);
return 0;
}
-static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
+int tg_rt_internal_bandwidth(struct task_group *tg,
+ long *rt_period_us, long *rt_runtime_us)
{
- int ret;
+ const struct dl_bandwidth *dl_b;
- struct rt_schedulable_data data = {
- .tg = tg,
- .rt_period = period,
- .rt_runtime = runtime,
- };
+ guard(raw_spinlock_irq)(dl_bw_lock_of_tg(tg));
+ dl_b = dl_bandwidth_read(tg);
- rcu_read_lock();
- ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
- rcu_read_unlock();
+ *rt_runtime_us = dl_b->dl_internal_runtime;
+ do_div(*rt_runtime_us, NSEC_PER_USEC);
- return ret;
+ *rt_period_us = dl_b->dl_period;
+ do_div(*rt_period_us, NSEC_PER_USEC);
+
+ return 0;
}
-static int tg_set_rt_bandwidth(struct task_group *tg,
- u64 rt_period, u64 rt_runtime)
+int tg_set_rt_bandwidth(struct task_group *tg,
+ u64 rt_period_us, u64 rt_runtime_us)
{
- int i, err = 0;
+ struct tg_update_data update;
+ struct task_group *parent_ctx;
+ struct dl_bandwidth *dl_b;
+ u64 rt_period, rt_runtime, old_rt_runtime;
+ u64 rt_actual_runtime = 0;
+ u64 bw, children_bw;
+ struct sched_attr attr;
+ int err, i;
- /*
- * Disallowing the root group RT runtime is BAD, it would disallow the
- * kernel creating (and or operating) RT threads.
- */
- if (tg == &root_task_group && rt_runtime == 0)
+ if (rt_runtime_us == RUNTIME_INF)
+ rt_runtime = RUNTIME_INF;
+ else if ((u64)rt_runtime_us > U64_MAX / NSEC_PER_USEC)
return -EINVAL;
+ else
+ rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;
- /* No period doesn't make any sense. */
- if (rt_period == 0)
+ if ((u64)rt_period_us > U64_MAX / NSEC_PER_USEC)
+ return -EINVAL;
+ else
+ rt_period = (u64)rt_period_us * NSEC_PER_USEC;
+
+ /*
+ * The root_task_group bandwidth settings are only used to reserve bw
+ * for HCBS cgroups; runtime == "max" has no meaning there.
+ */
+ if (rt_runtime == RUNTIME_INF && tg == &root_task_group)
return -EINVAL;
/*
@@ -2236,34 +2396,119 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
if (rt_runtime != RUNTIME_INF && rt_runtime > max_rt_runtime)
return -EINVAL;
- mutex_lock(&rt_constraints_mutex);
- err = __rt_schedulable(tg, rt_period, rt_runtime);
+ /*
+ * Check if the runtime and period min and max values are admissible.
+ */
+ attr = (struct sched_attr){
+ .sched_flags = 0,
+ .sched_runtime = rt_runtime,
+ .sched_deadline = rt_period,
+ .sched_period = rt_period,
+ };
+
+ if (rt_runtime != RUNTIME_INF && !__checkparam_dl(&attr, true))
+ return -EINVAL;
+
+ update = (struct tg_update_data) {
+ .tg = tg,
+ .rt_period = rt_period,
+ .rt_runtime = rt_runtime,
+ };
+
+ guard(mutex)(&rt_constraints_mutex);
+ old_rt_runtime = dl_bandwidth_read(tg)->dl_runtime;
+
+ /*
+ * Disallow changing from/to "max" and a HCBS reservation if the group
+ * and all of its "max" children have active tasks.
+ */
+ guard(sched_rt_handler)();
+ guard(sched_domains)();
+ guard(rcu)();
+ if (((rt_runtime == RUNTIME_INF && old_rt_runtime != RUNTIME_INF) ||
+ (rt_runtime != RUNTIME_INF && old_rt_runtime == RUNTIME_INF)) &&
+ tg_subtree_has_rt_tasks(tg))
+ return -EINVAL;
+
+ err = tg_rt_schedulable(&update, &rt_actual_runtime);
if (err)
- goto unlock;
+ return err;
- raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock);
- tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);
- tg->rt_bandwidth.rt_runtime = rt_runtime;
+ scoped_guard(raw_spinlock_irq, dl_bw_lock_of_tg(tg)) {
+ dl_b = dl_bandwidth_write(tg);
+ dl_b->dl_period = rt_period;
+ dl_b->dl_runtime = rt_runtime;
+ dl_b->dl_internal_runtime = rt_actual_runtime;
+ }
+
+ if (tg == &root_task_group)
+ return 0;
+ parent_ctx = dl_bandwidth_read(tg->parent)->active_context;
+
+ /*
+ * If changing from/to "max" and a HCBS reservation, must update the
+ * active_context of self and all of its subtree.
+ */
+ if ((rt_runtime == RUNTIME_INF && old_rt_runtime != RUNTIME_INF) ||
+ (rt_runtime != RUNTIME_INF && old_rt_runtime == RUNTIME_INF))
+ {
+ if (rt_runtime == RUNTIME_INF)
+ tg_update_active_context(tg, dl_b->active_context, parent_ctx);
+ else
+ tg_update_active_context(tg, dl_b->active_context, tg);
+
+ }
+
+ WARN_ON(rt_runtime == RUNTIME_INF && rt_actual_runtime != 0);
for_each_possible_cpu(i) {
- struct rt_rq *rt_rq = tg->rt_rq[i];
+ dl_init_tg(tg->dl_se[i], rt_actual_runtime, rt_period);
+ }
+
+ /*
+ * Update the dl_servers of the parent's active context
+ */
+ if (parent_ctx == &root_task_group)
+ return 0;
+
+ scoped_guard(raw_spinlock_irq, dl_bw_lock_of_tg(parent_ctx)) {
+ dl_b = dl_bandwidth_write(parent_ctx);
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- rt_rq->rt_runtime = rt_runtime;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
+ bw = to_ratio(dl_b->dl_period, dl_b->dl_runtime);
+ children_bw = tg_compute_children_bw(parent_ctx, &update);
+
+ rt_period = dl_b->dl_period;
+ rt_actual_runtime = from_ratio(rt_period, bw - children_bw);
+ dl_b->dl_internal_runtime = rt_actual_runtime;
}
- raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
-unlock:
- mutex_unlock(&rt_constraints_mutex);
- return err;
+ for_each_possible_cpu(i) {
+ dl_init_tg(parent_ctx->dl_se[i], rt_actual_runtime, rt_period);
+ }
+
+ return 0;
}
int sched_rt_can_attach(struct task_group *tg)
{
+ struct task_group *ctx;
+
+ /* If rt group sched is disabled, tasks are always run in the root rq */
+ if (!rt_group_sched_enabled())
+ return 1;
+
+ /* Can always run on the root task group */
+ scoped_guard(raw_spinlock_irqsave, dl_bw_lock_of_tg(tg)) {
+ ctx = dl_bandwidth_read(tg)->active_context;
+ if (ctx == &root_task_group)
+ return 1;
+ }
+
/* Don't accept real-time tasks when there is no way for them to run */
- if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0)
- return 0;
+ scoped_guard(raw_spinlock_irqsave, dl_bw_lock_of_tg(ctx)) {
+ if (dl_bandwidth_read(ctx)->dl_runtime == 0)
+ return 0;
+ }
return 1;
}
@@ -2279,24 +2524,26 @@ static int sched_rt_global_validate(void)
NSEC_PER_USEC > max_rt_runtime)))
return -EINVAL;
-#ifdef CONFIG_RT_GROUP_SCHED
- if (!rt_group_sched_enabled())
- return 0;
-
- scoped_guard(mutex, &rt_constraints_mutex)
- return __rt_schedulable(NULL, 0, 0);
-#endif
return 0;
}
+DEFINE_MUTEX(sched_rt_handler_mutex);
+
+void sched_rt_handler_mutex_lock() {
+ mutex_lock(&sched_rt_handler_mutex);
+}
+
+void sched_rt_handler_mutex_unlock() {
+ mutex_unlock(&sched_rt_handler_mutex);
+}
+
static int sched_rt_handler(const struct ctl_table *table, int write, void *buffer,
size_t *lenp, loff_t *ppos)
{
int old_period, old_runtime;
- static DEFINE_MUTEX(mutex);
int ret;
- mutex_lock(&mutex);
+ sched_rt_handler_mutex_lock();
sched_domains_mutex_lock();
old_period = sysctl_sched_rt_period;
old_runtime = sysctl_sched_rt_runtime;
@@ -2320,7 +2567,7 @@ static int sched_rt_handler(const struct ctl_table *table, int write, void *buff
sysctl_sched_rt_runtime = old_runtime;
}
sched_domains_mutex_unlock();
- mutex_unlock(&mutex);
+ sched_rt_handler_mutex_unlock();
/*
* After changing maximum available bandwidth for DEADLINE, we need to
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efe52e162ba5..394f40dc26db 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -366,7 +366,7 @@ extern void sched_dl_do_global(void);
extern int sched_dl_overflow(struct task_struct *p, int policy, const struct sched_attr *attr);
extern void __setparam_dl(struct task_struct *p, const struct sched_attr *attr);
extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr, unsigned int flags);
-extern bool __checkparam_dl(const struct sched_attr *attr);
+extern bool __checkparam_dl(const struct sched_attr *attr, bool allow_zero_runtime);
extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
extern int dl_bw_deactivate(int cpu);
@@ -425,6 +425,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
struct rq *served_rq,
dl_server_pick_f pick_task);
extern void sched_init_dl_servers(void);
+extern int dl_check_tg(unsigned long total);
extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
extern void fair_server_init(struct rq *rq);
@@ -607,6 +608,12 @@ extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
extern bool cfs_task_bw_constrained(struct task_struct *p);
+extern int tg_rt_bandwidth(struct task_group *tg,
+ long *rt_period_us, long *rt_runtime_us);
+extern int tg_rt_internal_bandwidth(struct task_group *tg,
+ long *rt_period_us, long *rt_runtime_us);
+extern int tg_set_rt_bandwidth(struct task_group *tg,
+ u64 rt_period_us, u64 rt_runtime_us);
extern int sched_rt_can_attach(struct task_group *tg);
extern struct task_group *sched_create_group(struct task_group *parent);
@@ -2045,6 +2052,14 @@ DEFINE_LOCK_GUARD_1(raw_spin_rq_lock_irq, struct rq,
raw_spin_rq_lock_irq(_T->lock),
raw_spin_rq_unlock_irq(_T->lock))
+extern struct mutex sched_rt_handler_mutex;
+extern void sched_rt_handler_mutex_lock(void);
+extern void sched_rt_handler_mutex_unlock(void);
+
+DEFINE_LOCK_GUARD_0(sched_rt_handler,
+ sched_rt_handler_mutex_lock(),
+ sched_rt_handler_mutex_unlock())
+
#ifdef CONFIG_NUMA
enum numa_topology_type {
@@ -2938,6 +2953,7 @@ extern void init_cfs_throttle_work(struct task_struct *p);
#define MAX_BW ((1ULL << MAX_BW_BITS) - 1)
extern u64 to_ratio(u64 period, u64 runtime);
+extern u64 from_ratio(u64 period, u64 bw);
extern void init_entity_runnable_average(struct sched_entity *se);
extern void post_init_entity_util_avg(struct task_struct *p);
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 773f744c0460..e5b8d2f42ea8 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -528,7 +528,7 @@ int __sched_setscheduler(struct task_struct *p,
*/
if (attr->sched_priority > MAX_RT_PRIO-1)
return -EINVAL;
- if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
+ if ((dl_policy(policy) && !__checkparam_dl(attr, false)) ||
(rt_policy(policy) != (attr->sched_priority != 0)))
return -EINVAL;
--
2.54.0
next prev parent reply other threads:[~2026-06-08 12:16 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-08 12:15 [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 01/25] sched/deadline: Fix replenishment logic for non-deferred servers Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 02/25] sched/rt: Update default bandwidth for real-time tasks to ONE Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 03/25] sched/deadline: Do not access dl_se->rq directly Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 04/25] sched/deadline: Distinguish between dl_rq and my_q Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 05/25] sched/rt: Pass an rt_rq instead of an rq where needed Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 06/25] sched/rt: Move functions from rt.c to sched.h Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 07/25] sched/rt: Disable RT_GROUP_SCHED Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 08/25] sched/rt: Remove unnecessary runqueue pointer in struct rt_rq Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 09/25] sched/rt: Introduce HCBS specific structs in task_group Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 10/25] sched/core: Initialize HCBS specific structures Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 11/25] sched/deadline: Add dl_init_tg Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 12/25] sched/rt: Add {alloc/unregister/free}_rt_sched_group Yuri Andriaccio
2026-06-11 8:42 ` Juri Lelli
2026-06-08 12:15 ` [RFC PATCH v6 13/25] sched/deadline: Account rt-cgroups bandwidth in deadline tasks schedulability tests Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 14/25] sched/rt: Implement dl-server operations for rt-cgroups Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 15/25] sched/rt: Update task event callbacks for HCBS scheduling Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 16/25] sched/rt: Remove support for cgroups-v1 Yuri Andriaccio
2026-06-08 12:15 ` Yuri Andriaccio [this message]
2026-06-08 12:15 ` [RFC PATCH v6 18/25] sched/rt: Update task's RT runqueue when switching scheduling class Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 19/25] sched/rt: Remove old RT_GROUP_SCHED data structures Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 20/25] sched/rt: Add HCBS migration code to related functions Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 21/25] sched/rt: Hook HCBS migration functions Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 22/25] sched/core: Execute enqueued balance callbacks when changing allowed CPUs Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 23/25] sched/rt: Try pull task on empty server pick Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 24/25] sched/core: Execute enqueued balance callbacks after migrate_disable_switch Yuri Andriaccio
2026-06-08 12:15 ` [RFC PATCH v6 25/25] Documentation: Update documentation for real-time cgroups Yuri Andriaccio
2026-06-09 15:46 ` [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server Juri Lelli
2026-06-09 16:23 ` Yuri Andriaccio
2026-06-10 9:21 ` Juri Lelli
2026-06-15 20:38 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260608121546.69910-18-yurand2000@gmail.com \
--to=yurand2000@gmail.com \
--cc=bsegall@google.com \
--cc=cgroups@vger.kernel.org \
--cc=dietmar.eggemann@arm.com \
--cc=hannes@cmpxchg.org \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=luca.abeni@santannapisa.it \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=mkoutny@suse.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=tj@kernel.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
--cc=yuri.andriaccio@santannapisa.it \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox