* [RFC PATCH v2 01/25] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 02/25] sched/deadline: Do not access dl_se->rq directly Yuri Andriaccio
` (24 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
Fair-servers are currently used in place of the old RT_THROTTLING mechanism to
prevent the starvation of SCHED_OTHER (and other lower-priority) tasks when
real-time FIFO/RR processes try to fully utilize the CPU. To support the
RT_THROTTLING mechanism, the maximum allocatable bandwidth for real-time tasks
has been limited to 95% of the CPU time.
The RT_THROTTLING mechanism is now removed in favor of fair-servers, which are
currently set to use, as expected, 5% of the CPU time. Still, they share the
same bandwidth that admits real-time tasks, which is still capped at 95% of the
total CPU time. This means that, with RT_THROTTLING removed, the bandwidth
remaining for SCHED_DEADLINE tasks and other dl-servers (FIFO/RR tasks are not
affected) is only 90%.
This patch reclaims the 5% of lost CPU time: that share remains reserved for
SCHED_OTHER tasks, but it should not be accounted together with the bandwidth
of real-time tasks. More generally, the fair-servers' bandwidth must not be
accounted together with that of other real-time tasks.
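To make the arithmetic explicit (default values, per CPU):

  max_rt_bw       = 95% of CPU time   (maximum real-time bandwidth)
  fair-server bw  =  5% of CPU time

  before: both charged to the same dl_bw  ->  95% - 5% = 90% left for
          SCHED_DEADLINE tasks and other dl-servers
  after:  fair-servers accounted separately  ->  the full 95% is available
          again, while the 5% stays reserved for SCHED_OTHER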
Updates:
- Do not account the fair-servers' bandwidth in the total allocated bandwidth
for real-time tasks.
- Remove the admission control test when allocating a fair-server.
- Do not account for fair-servers in the GRUB's bandwidth reclaiming mechanism.
- Limit the max bandwidth to (BW_UNIT - max_rt_bw) when changing the parameters
of a fair-server, preventing overcommitment.
- Add dl_bw_fair, which computes the total allocated bandwidth of the
fair-servers in the given root-domain.
- Update admission tests (in sched_dl_global_validate) when changing the
maximum allocatable bandwidth for real-time tasks, preventing overcommitment.
Since the fair-servers' bandwidth can be changed through debugfs, it is not
enforced that a fair-server's bw always equals (BW_UNIT - max_rt_bw); it only
has to be less than or equal to that value. This allows the fair-servers'
settings changed through debugfs to be retained when changing max_rt_bw.
This also means that, in order to increase the maximum bandwidth for real-time
tasks, the bw of the fair-servers must first be decreased through debugfs,
otherwise the admission tests will fail; vice versa, to increase the bw of the
fair-servers, the bw of real-time tasks must be reduced beforehand.
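The two admission conditions introduced here can be summarised with the
following stand-alone sketch (illustration only: BW_UNIT, SCHED_CAPACITY_SHIFT
and cap_scale() are re-declared to mirror the kernel's fixed-point helpers, and
the sample values are hypothetical):

#include <stdio.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define BW_SHIFT		20
#define BW_UNIT			(1 << BW_SHIFT)		/* 100% bandwidth */
#define cap_scale(v, cap)	(((v) * (cap)) >> SCHED_CAPACITY_SHIFT)

/* per-CPU check done in dl_server_apply_params() */
static int fair_server_fits(uint64_t new_bw, uint64_t max_rt_bw,
			    unsigned long cap, unsigned long cpus)
{
	uint64_t max_bw = cap_scale(BW_UNIT - max_rt_bw, cap) / cpus;

	return new_bw <= max_bw;
}

/* root-domain check done in sched_dl_global_validate() */
static int rt_bw_fits(uint64_t new_rt_bw, uint64_t dl_total_bw,
		      uint64_t fair_bw, unsigned long cap)
{
	if (cap_scale(new_rt_bw, cap) < dl_total_bw)
		return 0;	/* already allocated DL bandwidth does not fit */
	if (cap_scale(new_rt_bw, cap) + fair_bw > cap_scale(BW_UNIT, cap))
		return 0;	/* rt + fair-server bandwidth would overcommit */
	return 1;
}

int main(void)
{
	/* 95% rt bandwidth, 5% fair-server, one CPU at full capacity (1024) */
	printf("%d\n", fair_server_fits(BW_UNIT / 20, BW_UNIT * 95 / 100, 1024, 1));
	printf("%d\n", rt_bw_fits(BW_UNIT * 95 / 100, 0, BW_UNIT / 20, 1024));
	return 0;
}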
This v2 version addresses the compilation error on i386 reported at:
https://lore.kernel.org/oe-kbuild-all/202507220727.BmA1Osdg-lkp@intel.com/
v1: https://lore.kernel.org/all/20250721111131.309388-1-yurand2000@gmail.com/
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/deadline.c | 66 ++++++++++++++++++-----------------------
kernel/sched/sched.h | 1 -
kernel/sched/topology.c | 8 -----
3 files changed, 29 insertions(+), 46 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e2d51f4306b..8ba6bf3ef68 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -141,6 +141,24 @@ static inline int dl_bw_cpus(int i)
return cpus;
}
+static inline u64 dl_bw_fair(int i)
+{
+ struct root_domain *rd = cpu_rq(i)->rd;
+ u64 fair_server_bw = 0;
+
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+ "sched RCU must be held");
+
+ if (cpumask_subset(rd->span, cpu_active_mask))
+ i = cpumask_first(rd->span);
+
+ for_each_cpu_and(i, rd->span, cpu_active_mask) {
+ fair_server_bw += cpu_rq(i)->fair_server.dl_bw;
+ }
+
+ return fair_server_bw;
+}
+
static inline unsigned long __dl_bw_capacity(const struct cpumask *mask)
{
unsigned long cap = 0;
@@ -1657,25 +1675,9 @@ void sched_init_dl_servers(void)
}
}
-void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
-{
- u64 new_bw = dl_se->dl_bw;
- int cpu = cpu_of(rq);
- struct dl_bw *dl_b;
-
- dl_b = dl_bw_of(cpu_of(rq));
- guard(raw_spinlock)(&dl_b->lock);
-
- if (!dl_bw_cpus(cpu))
- return;
-
- __dl_add(dl_b, new_bw, dl_bw_cpus(cpu));
-}
-
int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
{
- u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
- u64 new_bw = to_ratio(period, runtime);
+ u64 max_bw, new_bw = to_ratio(period, runtime);
struct rq *rq = dl_se->rq;
int cpu = cpu_of(rq);
struct dl_bw *dl_b;
@@ -1688,17 +1690,14 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
cpus = dl_bw_cpus(cpu);
cap = dl_bw_capacity(cpu);
+ max_bw = div64_ul(cap_scale(BW_UNIT - dl_b->bw, cap), (unsigned long)cpus);
- if (__dl_overflow(dl_b, cap, old_bw, new_bw))
+ if (new_bw > max_bw)
return -EBUSY;
if (init) {
__add_rq_bw(new_bw, &rq->dl);
- __dl_add(dl_b, new_bw, cpus);
} else {
- __dl_sub(dl_b, dl_se->dl_bw, cpus);
- __dl_add(dl_b, new_bw, cpus);
-
dl_rq_change_utilization(rq, dl_se, new_bw);
}
@@ -2939,17 +2938,6 @@ void dl_clear_root_domain(struct root_domain *rd)
rd->dl_bw.total_bw = 0;
for_each_cpu(i, rd->span)
cpu_rq(i)->dl.extra_bw = cpu_rq(i)->dl.max_bw;
-
- /*
- * dl_servers are not tasks. Since dl_add_task_root_domain ignores
- * them, we need to account for them here explicitly.
- */
- for_each_cpu(i, rd->span) {
- struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
-
- if (dl_server(dl_se) && cpu_active(i))
- __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
- }
}
void dl_clear_root_domain_cpu(int cpu)
@@ -3133,9 +3121,10 @@ int sched_dl_global_validate(void)
u64 period = global_rt_period();
u64 new_bw = to_ratio(period, runtime);
u64 cookie = ++dl_cookie;
+ u64 fair_bw;
struct dl_bw *dl_b;
- int cpu, cpus, ret = 0;
- unsigned long flags;
+ int cpu, ret = 0;
+ unsigned long cap, flags;
/*
* Here we want to check the bandwidth not being set to some
@@ -3149,10 +3138,13 @@ int sched_dl_global_validate(void)
goto next;
dl_b = dl_bw_of(cpu);
- cpus = dl_bw_cpus(cpu);
+ cap = dl_bw_capacity(cpu);
+ fair_bw = dl_bw_fair(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- if (new_bw * cpus < dl_b->total_bw)
+ if (cap_scale(new_bw, cap) < dl_b->total_bw)
+ ret = -EBUSY;
+ if (cap_scale(new_bw, cap) + fair_bw > cap_scale(BW_UNIT, cap))
ret = -EBUSY;
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d3f33d10c58..8719ab8a817 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -390,7 +390,6 @@ extern void sched_init_dl_servers(void);
extern void dl_server_update_idle_time(struct rq *rq,
struct task_struct *p);
extern void fair_server_init(struct rq *rq);
-extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
u64 runtime, u64 period, bool init);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 977e133bb8a..4ea3365984a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -500,14 +500,6 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
set_rq_online(rq);
- /*
- * Because the rq is not a task, dl_add_task_root_domain() did not
- * move the fair server bw to the rd if it already started.
- * Add it now.
- */
- if (rq->fair_server.dl_server)
- __dl_server_attach_root(&rq->fair_server, rq);
-
rq_unlock_irqrestore(rq, &rf);
if (old_rd)
--
2.50.1
* [RFC PATCH v2 02/25] sched/deadline: Do not access dl_se->rq directly
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 01/25] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-08-14 8:30 ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 03/25] sched/deadline: Distinct between dl_rq and my_q Yuri Andriaccio
` (23 subsequent siblings)
25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Make deadline.c access the runqueue of a scheduling entity through the
sched_dl_entity data structure, using a local rq variable or the rq_of_dl_se()
helper instead of dereferencing dl_se->rq directly. This allows future patches
to store runqueues other than the global per-CPU runqueue in sched_dl_entity.
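The change is mechanical; as an illustration (a condensed sketch of the
dl_server_start() hunk below, not new code):

	/* before: the entity's runqueue is dereferenced directly */
	if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
		resched_curr(dl_se->rq);

	/* after: resolve the runqueue once through the accessor */
	rq = rq_of_dl_se(dl_se);
	if (!dl_task(rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
		resched_curr(rq);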
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/deadline.c | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8ba6bf3ef68..46b9b78cca2 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -892,7 +892,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
* and arm the defer timer.
*/
if (dl_se->dl_defer && !dl_se->dl_defer_running &&
- dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
+ dl_time_before(rq_clock(rq), dl_se->deadline - dl_se->runtime)) {
if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
/*
@@ -1202,11 +1202,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
* of time. The dl_server_min_res serves as a limit to avoid
* forwarding the timer for a too small amount of time.
*/
- if (dl_time_before(rq_clock(dl_se->rq),
+ if (dl_time_before(rq_clock(rq),
(dl_se->deadline - dl_se->runtime - dl_server_min_res))) {
/* reset the defer timer */
- fw = dl_se->deadline - rq_clock(dl_se->rq) - dl_se->runtime;
+ fw = dl_se->deadline - rq_clock(rq) - dl_se->runtime;
hrtimer_forward_now(timer, ns_to_ktime(fw));
return HRTIMER_RESTART;
@@ -1217,7 +1217,7 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
- if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &dl_se->rq->curr->dl))
+ if (!dl_task(rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
resched_curr(rq);
__push_dl_task(rq, rf);
@@ -1485,7 +1485,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
hrtimer_try_to_cancel(&dl_se->dl_timer);
- replenish_dl_new_period(dl_se, dl_se->rq);
+ replenish_dl_new_period(dl_se, rq);
/*
* Not being able to start the timer seems problematic. If it could not
@@ -1597,21 +1597,22 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
/* 0 runtime = fair server disabled */
if (dl_se->dl_runtime) {
dl_se->dl_server_idle = 0;
- update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+ update_curr_dl_se(rq_of_dl_se(dl_se), dl_se, delta_exec);
}
}
void dl_server_start(struct sched_dl_entity *dl_se)
{
- struct rq *rq = dl_se->rq;
+ struct rq *rq;
if (!dl_server(dl_se) || dl_se->dl_server_active)
return;
dl_se->dl_server_active = 1;
enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
- if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
- resched_curr(dl_se->rq);
+ rq = rq_of_dl_se(dl_se);
+ if (!dl_task(rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
+ resched_curr(rq);
}
void dl_server_stop(struct sched_dl_entity *dl_se)
@@ -1667,9 +1668,9 @@ void sched_init_dl_servers(void)
WARN_ON(dl_server(dl_se));
- dl_server_apply_params(dl_se, runtime, period, 1);
-
dl_se->dl_server = 1;
+ BUG_ON(dl_server_apply_params(dl_se, runtime, period, 1));
+
dl_se->dl_defer = 1;
setup_new_dl_entity(dl_se);
}
@@ -1678,7 +1679,7 @@ void sched_init_dl_servers(void)
int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
{
u64 max_bw, new_bw = to_ratio(period, runtime);
- struct rq *rq = dl_se->rq;
+ struct rq *rq = rq_of_dl_se(dl_se);
int cpu = cpu_of(rq);
struct dl_bw *dl_b;
unsigned long cap;
@@ -1752,7 +1753,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
p = dl_task_of(dl_se);
rq = task_rq_lock(p, &rf);
} else {
- rq = dl_se->rq;
+ rq = rq_of_dl_se(dl_se);
rq_lock(rq, &rf);
}
--
2.50.1
* Re: [RFC PATCH v2 02/25] sched/deadline: Do not access dl_se->rq directly
2025-07-31 10:55 ` [RFC PATCH v2 02/25] sched/deadline: Do not access dl_se->rq directly Yuri Andriaccio
@ 2025-08-14 8:30 ` Juri Lelli
0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 8:30 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
Hi,
On 31/07/25 12:55, Yuri Andriaccio wrote:
...
> @@ -1667,9 +1668,9 @@ void sched_init_dl_servers(void)
>
> WARN_ON(dl_server(dl_se));
>
> - dl_server_apply_params(dl_se, runtime, period, 1);
> -
> dl_se->dl_server = 1;
> + BUG_ON(dl_server_apply_params(dl_se, runtime, period, 1));
A WARN_ON(), possibly with a recovery strategy, is usually preferable.
Thanks,
Juri
* [RFC PATCH v2 03/25] sched/deadline: Distinct between dl_rq and my_q
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 01/25] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 02/25] sched/deadline: Do not access dl_se->rq directly Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 04/25] sched/rt: Pass an rt_rq instead of an rq where needed Yuri Andriaccio
` (22 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Create two runqueue fields in sched_dl_entity, to distinguish between the
global runqueue on which the entity is queued (dl_rq) and the runqueue which
the dl_server serves (my_q).
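For the fair server both fields end up pointing at the same CPU's runqueue, as
the updated fair_server_init() below shows; the split only pays off in later
patches, which can pass a different served runqueue (sketch mirroring the hunk
below):

	dl_server_init(dl_se, &rq->dl,	/* where the server entity is queued */
		       rq,		/* the runqueue it serves */
		       fair_server_has_tasks, fair_server_pick_task);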
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
include/linux/sched.h | 6 ++++--
kernel/sched/deadline.c | 11 +++++++----
kernel/sched/fair.c | 6 +++---
kernel/sched/sched.h | 3 ++-
4 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 40d2fa90df4..f0c8229afd1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -724,12 +724,14 @@ struct sched_dl_entity {
* Bits for DL-server functionality. Also see the comment near
* dl_server_update().
*
- * @rq the runqueue this server is for
+ * @dl_rq the runqueue on which this entity is (to be) queued
+ * @my_q the runqueue "owned" by this entity
*
* @server_has_tasks() returns true if @server_pick return a
* runnable task.
*/
- struct rq *rq;
+ struct dl_rq *dl_rq;
+ struct rq *my_q;
dl_server_has_tasks_f server_has_tasks;
dl_server_pick_f server_pick_task;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 46b9b78cca2..73ca5c0a086 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -75,11 +75,12 @@ static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
static inline struct rq *rq_of_dl_se(struct sched_dl_entity *dl_se)
{
- struct rq *rq = dl_se->rq;
+ struct rq *rq;
if (!dl_server(dl_se))
rq = task_rq(dl_task_of(dl_se));
-
+ else
+ rq = container_of(dl_se->dl_rq, struct rq, dl);
return rq;
}
@@ -1641,11 +1642,13 @@ static bool dl_server_stopped(struct sched_dl_entity *dl_se)
return false;
}
-void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
+ struct rq *served_rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick_task)
{
- dl_se->rq = rq;
+ dl_se->dl_rq = dl_rq;
+ dl_se->my_q = served_rq;
dl_se->server_has_tasks = has_tasks;
dl_se->server_pick_task = pick_task;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315..2723086538b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8861,12 +8861,12 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_stru
static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
{
- return !!dl_se->rq->cfs.nr_queued;
+ return !!dl_se->my_q->cfs.nr_queued;
}
static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
{
- return pick_task_fair(dl_se->rq);
+ return pick_task_fair(dl_se->my_q);
}
void fair_server_init(struct rq *rq)
@@ -8875,7 +8875,7 @@ void fair_server_init(struct rq *rq)
init_dl_entity(dl_se);
- dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_task);
+ dl_server_init(dl_se, &rq->dl, rq, fair_server_has_tasks, fair_server_pick_task);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8719ab8a817..a8073d0824d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -382,7 +382,8 @@ extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s6
extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
extern void dl_server_start(struct sched_dl_entity *dl_se);
extern void dl_server_stop(struct sched_dl_entity *dl_se);
-extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
+ struct rq *served_rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick_task);
extern void sched_init_dl_servers(void);
--
2.50.1
* [RFC PATCH v2 04/25] sched/rt: Pass an rt_rq instead of an rq where needed
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (2 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 03/25] sched/deadline: Distinct between dl_rq and my_q Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-08-14 8:46 ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 05/25] sched/rt: Move some functions from rt.c to sched.h Yuri Andriaccio
` (21 subsequent siblings)
25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Make rt.c access the runqueue through the rt_rq data structure, passing an
rt_rq pointer rather than an rq pointer where possible. This allows future
patches to define rt_rq data structures which do not refer only to the global
runqueue, but also to local cgroup runqueues (rt_rq is not always equal to
&rq->rt).
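The recurring pattern is to take an rt_rq and map it back to its owning rq only
where one is actually needed, e.g. (condensed from the enqueue_pushable_task()
hunk below):

	static void enqueue_pushable_task(struct rt_rq *rt_rq, struct task_struct *p)
	{
		plist_del(&p->pushable_tasks, &rt_rq->pushable_tasks);
		...
		if (!rt_rq->overloaded) {
			rt_set_overload(rq_of_rt_rq(rt_rq));
			rt_rq->overloaded = 1;
		}
	}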
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/rt.c | 83 +++++++++++++++++++++++++----------------------
1 file changed, 44 insertions(+), 39 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7936d433373..945e3d705cc 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -370,9 +370,9 @@ static inline void rt_clear_overload(struct rq *rq)
cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask);
}
-static inline int has_pushable_tasks(struct rq *rq)
+static inline int has_pushable_tasks(struct rt_rq *rt_rq)
{
- return !plist_head_empty(&rq->rt.pushable_tasks);
+ return !plist_head_empty(&rt_rq->pushable_tasks);
}
static DEFINE_PER_CPU(struct balance_callback, rt_push_head);
@@ -383,7 +383,7 @@ static void pull_rt_task(struct rq *);
static inline void rt_queue_push_tasks(struct rq *rq)
{
- if (!has_pushable_tasks(rq))
+ if (!has_pushable_tasks(&rq->rt))
return;
queue_balance_callback(rq, &per_cpu(rt_push_head, rq->cpu), push_rt_tasks);
@@ -394,37 +394,37 @@ static inline void rt_queue_pull_task(struct rq *rq)
queue_balance_callback(rq, &per_cpu(rt_pull_head, rq->cpu), pull_rt_task);
}
-static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
+static void enqueue_pushable_task(struct rt_rq *rt_rq, struct task_struct *p)
{
- plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
+ plist_del(&p->pushable_tasks, &rt_rq->pushable_tasks);
plist_node_init(&p->pushable_tasks, p->prio);
- plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks);
+ plist_add(&p->pushable_tasks, &rt_rq->pushable_tasks);
/* Update the highest prio pushable task */
- if (p->prio < rq->rt.highest_prio.next)
- rq->rt.highest_prio.next = p->prio;
+ if (p->prio < rt_rq->highest_prio.next)
+ rt_rq->highest_prio.next = p->prio;
- if (!rq->rt.overloaded) {
- rt_set_overload(rq);
- rq->rt.overloaded = 1;
+ if (!rt_rq->overloaded) {
+ rt_set_overload(rq_of_rt_rq(rt_rq));
+ rt_rq->overloaded = 1;
}
}
-static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
+static void dequeue_pushable_task(struct rt_rq *rt_rq, struct task_struct *p)
{
- plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
+ plist_del(&p->pushable_tasks, &rt_rq->pushable_tasks);
/* Update the new highest prio pushable task */
- if (has_pushable_tasks(rq)) {
- p = plist_first_entry(&rq->rt.pushable_tasks,
+ if (has_pushable_tasks(rt_rq)) {
+ p = plist_first_entry(&rt_rq->pushable_tasks,
struct task_struct, pushable_tasks);
- rq->rt.highest_prio.next = p->prio;
+ rt_rq->highest_prio.next = p->prio;
} else {
- rq->rt.highest_prio.next = MAX_RT_PRIO-1;
+ rt_rq->highest_prio.next = MAX_RT_PRIO-1;
- if (rq->rt.overloaded) {
- rt_clear_overload(rq);
- rq->rt.overloaded = 0;
+ if (rt_rq->overloaded) {
+ rt_clear_overload(rq_of_rt_rq(rt_rq));
+ rt_rq->overloaded = 0;
}
}
}
@@ -1431,6 +1431,7 @@ static void
enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_rt_entity *rt_se = &p->rt;
+ struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
if (flags & ENQUEUE_WAKEUP)
rt_se->timeout = 0;
@@ -1444,17 +1445,18 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
return;
if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
- enqueue_pushable_task(rq, p);
+ enqueue_pushable_task(rt_rq, p);
}
static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_rt_entity *rt_se = &p->rt;
+ struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
update_curr_rt(rq);
dequeue_rt_entity(rt_se, flags);
- dequeue_pushable_task(rq, p);
+ dequeue_pushable_task(rt_rq, p);
return true;
}
@@ -1639,14 +1641,14 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first)
{
struct sched_rt_entity *rt_se = &p->rt;
- struct rt_rq *rt_rq = &rq->rt;
+ struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
p->se.exec_start = rq_clock_task(rq);
if (on_rt_rq(&p->rt))
update_stats_wait_end_rt(rt_rq, rt_se);
/* The running task is never eligible for pushing */
- dequeue_pushable_task(rq, p);
+ dequeue_pushable_task(rt_rq, p);
if (!first)
return;
@@ -1710,7 +1712,7 @@ static struct task_struct *pick_task_rt(struct rq *rq)
static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_struct *next)
{
struct sched_rt_entity *rt_se = &p->rt;
- struct rt_rq *rt_rq = &rq->rt;
+ struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
if (on_rt_rq(&p->rt))
update_stats_wait_start_rt(rt_rq, rt_se);
@@ -1726,7 +1728,7 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_s
* if it is still active
*/
if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
- enqueue_pushable_task(rq, p);
+ enqueue_pushable_task(rt_rq, p);
}
/* Only try algorithms three times */
@@ -1736,16 +1738,16 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_s
* Return the highest pushable rq's task, which is suitable to be executed
* on the CPU, NULL otherwise
*/
-static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
+static struct task_struct *pick_highest_pushable_task(struct rt_rq *rt_rq, int cpu)
{
- struct plist_head *head = &rq->rt.pushable_tasks;
+ struct plist_head *head = &rt_rq->pushable_tasks;
struct task_struct *p;
- if (!has_pushable_tasks(rq))
+ if (!has_pushable_tasks(rt_rq))
return NULL;
plist_for_each_entry(p, head, pushable_tasks) {
- if (task_is_pushable(rq, p, cpu))
+ if (task_is_pushable(rq_of_rt_rq(rt_rq), p, cpu))
return p;
}
@@ -1845,14 +1847,15 @@ static int find_lowest_rq(struct task_struct *task)
return -1;
}
-static struct task_struct *pick_next_pushable_task(struct rq *rq)
+static struct task_struct *pick_next_pushable_task(struct rt_rq *rt_rq)
{
+ struct rq *rq = rq_of_rt_rq(rt_rq);
struct task_struct *p;
- if (!has_pushable_tasks(rq))
+ if (!has_pushable_tasks(rt_rq))
return NULL;
- p = plist_first_entry(&rq->rt.pushable_tasks,
+ p = plist_first_entry(&rt_rq->pushable_tasks,
struct task_struct, pushable_tasks);
BUG_ON(rq->cpu != task_cpu(p));
@@ -1905,7 +1908,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
*/
if (unlikely(is_migration_disabled(task) ||
!cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
- task != pick_next_pushable_task(rq))) {
+ task != pick_next_pushable_task(&rq->rt))) {
double_unlock_balance(rq, lowest_rq);
lowest_rq = NULL;
@@ -1939,7 +1942,7 @@ static int push_rt_task(struct rq *rq, bool pull)
if (!rq->rt.overloaded)
return 0;
- next_task = pick_next_pushable_task(rq);
+ next_task = pick_next_pushable_task(&rq->rt);
if (!next_task)
return 0;
@@ -2014,7 +2017,7 @@ static int push_rt_task(struct rq *rq, bool pull)
* run-queue and is also still the next task eligible for
* pushing.
*/
- task = pick_next_pushable_task(rq);
+ task = pick_next_pushable_task(&rq->rt);
if (task == next_task) {
/*
* The task hasn't migrated, and is still the next
@@ -2202,7 +2205,7 @@ void rto_push_irq_work_func(struct irq_work *work)
* We do not need to grab the lock to check for has_pushable_tasks.
* When it gets updated, a check is made if a push is possible.
*/
- if (has_pushable_tasks(rq)) {
+ if (has_pushable_tasks(&rq->rt)) {
raw_spin_rq_lock(rq);
while (push_rt_task(rq, true))
;
@@ -2231,6 +2234,7 @@ static void pull_rt_task(struct rq *this_rq)
int this_cpu = this_rq->cpu, cpu;
bool resched = false;
struct task_struct *p, *push_task;
+ struct rt_rq *src_rt_rq;
struct rq *src_rq;
int rt_overload_count = rt_overloaded(this_rq);
@@ -2260,6 +2264,7 @@ static void pull_rt_task(struct rq *this_rq)
continue;
src_rq = cpu_rq(cpu);
+ src_rt_rq = &src_rq->rt;
/*
* Don't bother taking the src_rq->lock if the next highest
@@ -2268,7 +2273,7 @@ static void pull_rt_task(struct rq *this_rq)
* logically higher, the src_rq will push this task away.
* And if its going logically lower, we do not care
*/
- if (src_rq->rt.highest_prio.next >=
+ if (src_rt_rq->highest_prio.next >=
this_rq->rt.highest_prio.curr)
continue;
@@ -2284,7 +2289,7 @@ static void pull_rt_task(struct rq *this_rq)
* We can pull only a task, which is pushable
* on its rq, and no others.
*/
- p = pick_highest_pushable_task(src_rq, this_cpu);
+ p = pick_highest_pushable_task(src_rt_rq, this_cpu);
/*
* Do we have an RT task that preempts
--
2.50.1
* Re: [RFC PATCH v2 04/25] sched/rt: Pass an rt_rq instead of an rq where needed
2025-07-31 10:55 ` [RFC PATCH v2 04/25] sched/rt: Pass an rt_rq instead of an rq where needed Yuri Andriaccio
@ 2025-08-14 8:46 ` Juri Lelli
0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 8:46 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
Hi!
On 31/07/25 12:55, Yuri Andriaccio wrote:
...
> @@ -383,7 +383,7 @@ static void pull_rt_task(struct rq *);
>
> static inline void rt_queue_push_tasks(struct rq *rq)
Can't the above also take an rt_rq?
> {
> - if (!has_pushable_tasks(rq))
> + if (!has_pushable_tasks(&rq->rt))
> return;
>
> queue_balance_callback(rq, &per_cpu(rt_push_head, rq->cpu), push_rt_tasks);
Thanks,
Juri
* [RFC PATCH v2 05/25] sched/rt: Move some functions from rt.c to sched.h
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (3 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 04/25] sched/rt: Pass an rt_rq instead of an rq where needed Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-08-14 8:50 ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 06/25] sched/rt: Disable RT_GROUP_SCHED Yuri Andriaccio
` (20 subsequent siblings)
25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Make the following functions non-static and move them to sched.h, so that they
can also be used in other source files:
- rt_task_of()
- rq_of_rt_rq()
- rt_rq_of_se()
- rq_of_rt_se()
There are no functional changes. This is needed by future patches.
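For illustration, once these helpers live in sched.h, other scheduler source
files (e.g. deadline.c in later patches of this series, assumed here) can
resolve the owning runqueue of an rt_rq without duplicating the helper:

	struct rq *rq = rq_of_rt_rq(rt_rq);	/* now visible outside rt.c */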
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/rt.c | 52 --------------------------------------------
kernel/sched/sched.h | 51 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+), 52 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 945e3d705cc..3ea92b08a0e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -168,34 +168,6 @@ static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
#define rt_entity_is_task(rt_se) (!(rt_se)->my_q)
-static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
-{
- WARN_ON_ONCE(!rt_entity_is_task(rt_se));
-
- return container_of(rt_se, struct task_struct, rt);
-}
-
-static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq)
-{
- /* Cannot fold with non-CONFIG_RT_GROUP_SCHED version, layout */
- WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
- return rt_rq->rq;
-}
-
-static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
-{
- WARN_ON(!rt_group_sched_enabled() && rt_se->rt_rq->tg != &root_task_group);
- return rt_se->rt_rq;
-}
-
-static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
-{
- struct rt_rq *rt_rq = rt_se->rt_rq;
-
- WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
- return rt_rq->rq;
-}
-
void unregister_rt_sched_group(struct task_group *tg)
{
if (!rt_group_sched_enabled())
@@ -296,30 +268,6 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
#define rt_entity_is_task(rt_se) (1)
-static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
-{
- return container_of(rt_se, struct task_struct, rt);
-}
-
-static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq)
-{
- return container_of(rt_rq, struct rq, rt);
-}
-
-static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
-{
- struct task_struct *p = rt_task_of(rt_se);
-
- return task_rq(p);
-}
-
-static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
-{
- struct rq *rq = rq_of_rt_se(rt_se);
-
- return &rq->rt;
-}
-
void unregister_rt_sched_group(struct task_group *tg) { }
void free_rt_sched_group(struct task_group *tg) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a8073d0824d..a1e6d2852ca 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3042,6 +3042,57 @@ extern void set_rq_offline(struct rq *rq);
extern bool sched_smp_initialized;
+#ifdef CONFIG_RT_GROUP_SCHED
+static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
+{
+#ifdef CONFIG_SCHED_DEBUG
+ WARN_ON_ONCE(rt_se->my_q);
+#endif
+ return container_of(rt_se, struct task_struct, rt);
+}
+
+static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq)
+{
+ return rt_rq->rq;
+}
+
+static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
+{
+ return rt_se->rt_rq;
+}
+
+static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
+{
+ struct rt_rq *rt_rq = rt_se->rt_rq;
+
+ return rt_rq->rq;
+}
+#else
+static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
+{
+ return container_of(rt_se, struct task_struct, rt);
+}
+
+static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq)
+{
+ return container_of(rt_rq, struct rq, rt);
+}
+
+static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
+{
+ struct task_struct *p = rt_task_of(rt_se);
+
+ return task_rq(p);
+}
+
+static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
+{
+ struct rq *rq = rq_of_rt_se(rt_se);
+
+ return &rq->rt;
+}
+#endif
+
DEFINE_LOCK_GUARD_2(double_rq_lock, struct rq,
double_rq_lock(_T->lock, _T->lock2),
double_rq_unlock(_T->lock, _T->lock2))
--
2.50.1
* Re: [RFC PATCH v2 05/25] sched/rt: Move some functions from rt.c to sched.h
2025-07-31 10:55 ` [RFC PATCH v2 05/25] sched/rt: Move some functions from rt.c to sched.h Yuri Andriaccio
@ 2025-08-14 8:50 ` Juri Lelli
0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 8:50 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
Hi!
On 31/07/25 12:55, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
>
> Make the following functions be non-static and move them in sched.h, so that
> they can be used also in other source files:
> - rt_task_of()
> - rq_of_rt_rq()
> - rt_rq_of_se()
> - rq_of_rt_se()
>
> There are no functional changes. This is needed by future patches.
>
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> ---
> kernel/sched/rt.c | 52 --------------------------------------------
> kernel/sched/sched.h | 51 +++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 51 insertions(+), 52 deletions(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 945e3d705cc..3ea92b08a0e 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -168,34 +168,6 @@ static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
>
> #define rt_entity_is_task(rt_se) (!(rt_se)->my_q)
>
> -static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
> -{
> - WARN_ON_ONCE(!rt_entity_is_task(rt_se));
> -
> - return container_of(rt_se, struct task_struct, rt);
> -}
> -
> -static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq)
> -{
> - /* Cannot fold with non-CONFIG_RT_GROUP_SCHED version, layout */
> - WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
> - return rt_rq->rq;
> -}
> -
> -static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
> -{
> - WARN_ON(!rt_group_sched_enabled() && rt_se->rt_rq->tg != &root_task_group);
> - return rt_se->rt_rq;
> -}
> -
> -static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
> -{
> - struct rt_rq *rt_rq = rt_se->rt_rq;
> -
> - WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
> - return rt_rq->rq;
> -}
Looks like we are losing WARN_ON checks with the change? Why is that?
...
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a8073d0824d..a1e6d2852ca 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3042,6 +3042,57 @@ extern void set_rq_offline(struct rq *rq);
>
> extern bool sched_smp_initialized;
>
> +#ifdef CONFIG_RT_GROUP_SCHED
> +static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
> +{
> +#ifdef CONFIG_SCHED_DEBUG
> + WARN_ON_ONCE(rt_se->my_q);
> +#endif
And gaining this one above. :)
Thanks,
Juri
* [RFC PATCH v2 06/25] sched/rt: Disable RT_GROUP_SCHED
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (4 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 05/25] sched/rt: Move some functions from rt.c to sched.h Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 07/25] sched/rt: Introduce HCBS specific structs in task_group Yuri Andriaccio
` (19 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
Disable the old RT_GROUP_SCHED scheduler. Note that this does not completely
remove all of the RT_GROUP_SCHED functionality; it just unhooks it and removes
most of the relevant functions. Some RT_GROUP_SCHED functions are kept because
they will be adapted for HCBS scheduling.
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/core.c | 6 -
kernel/sched/deadline.c | 34 --
kernel/sched/debug.c | 6 -
kernel/sched/rt.c | 848 ++--------------------------------------
kernel/sched/sched.h | 9 +-
kernel/sched/syscalls.c | 13 -
6 files changed, 26 insertions(+), 890 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ec00d08d46..42587a3c71f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8714,11 +8714,6 @@ void __init sched_init(void)
init_defrootdomain();
-#ifdef CONFIG_RT_GROUP_SCHED
- init_rt_bandwidth(&root_task_group.rt_bandwidth,
- global_rt_period(), global_rt_runtime());
-#endif /* CONFIG_RT_GROUP_SCHED */
-
#ifdef CONFIG_CGROUP_SCHED
task_group_cache = KMEM_CACHE(task_group, 0);
@@ -8770,7 +8765,6 @@ void __init sched_init(void)
* starts working after scheduler_running, which is not the case
* yet.
*/
- rq->rt.rt_runtime = global_rt_runtime();
init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif
rq->sd = NULL;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 73ca5c0a086..0640d0ca45b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1524,40 +1524,6 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
if (!is_leftmost(dl_se, &rq->dl))
resched_curr(rq);
}
-
- /*
- * The fair server (sole dl_server) does not account for real-time
- * workload because it is running fair work.
- */
- if (dl_se == &rq->fair_server)
- return;
-
-#ifdef CONFIG_RT_GROUP_SCHED
- /*
- * Because -- for now -- we share the rt bandwidth, we need to
- * account our runtime there too, otherwise actual rt tasks
- * would be able to exceed the shared quota.
- *
- * Account to the root rt group for now.
- *
- * The solution we're working towards is having the RT groups scheduled
- * using deadline servers -- however there's a few nasties to figure
- * out before that can happen.
- */
- if (rt_bandwidth_enabled()) {
- struct rt_rq *rt_rq = &rq->rt;
-
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- /*
- * We'll let actual RT tasks worry about the overflow here, we
- * have our own CBS to keep us inline; only account when RT
- * bandwidth is relevant.
- */
- if (sched_rt_bandwidth_account(rt_rq))
- rt_rq->rt_time += delta_exec;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
- }
-#endif /* CONFIG_RT_GROUP_SCHED */
}
/*
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 3f06ab84d53..f05decde708 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -892,12 +892,6 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
PU(rt_nr_running);
-#ifdef CONFIG_RT_GROUP_SCHED
- P(rt_throttled);
- PN(rt_time);
- PN(rt_runtime);
-#endif
-
#undef PN
#undef PU
#undef P
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3ea92b08a0e..a6282784978 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1,3 +1,4 @@
+#pragma GCC diagnostic ignored "-Wunused-function"
// SPDX-License-Identifier: GPL-2.0
/*
* Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR
@@ -82,117 +83,18 @@ void init_rt_rq(struct rt_rq *rt_rq)
rt_rq->highest_prio.next = MAX_RT_PRIO-1;
rt_rq->overloaded = 0;
plist_head_init(&rt_rq->pushable_tasks);
- /* We start is dequeued state, because no RT tasks are queued */
- rt_rq->rt_queued = 0;
-
-#ifdef CONFIG_RT_GROUP_SCHED
- rt_rq->rt_time = 0;
- rt_rq->rt_throttled = 0;
- rt_rq->rt_runtime = 0;
- raw_spin_lock_init(&rt_rq->rt_runtime_lock);
- rt_rq->tg = &root_task_group;
-#endif
}
#ifdef CONFIG_RT_GROUP_SCHED
-static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
-
-static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
-{
- struct rt_bandwidth *rt_b =
- container_of(timer, struct rt_bandwidth, rt_period_timer);
- int idle = 0;
- int overrun;
-
- raw_spin_lock(&rt_b->rt_runtime_lock);
- for (;;) {
- overrun = hrtimer_forward_now(timer, rt_b->rt_period);
- if (!overrun)
- break;
-
- raw_spin_unlock(&rt_b->rt_runtime_lock);
- idle = do_sched_rt_period_timer(rt_b, overrun);
- raw_spin_lock(&rt_b->rt_runtime_lock);
- }
- if (idle)
- rt_b->rt_period_active = 0;
- raw_spin_unlock(&rt_b->rt_runtime_lock);
-
- return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
-}
-
-void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
-{
- rt_b->rt_period = ns_to_ktime(period);
- rt_b->rt_runtime = runtime;
-
- raw_spin_lock_init(&rt_b->rt_runtime_lock);
-
- hrtimer_setup(&rt_b->rt_period_timer, sched_rt_period_timer, CLOCK_MONOTONIC,
- HRTIMER_MODE_REL_HARD);
-}
-
-static inline void do_start_rt_bandwidth(struct rt_bandwidth *rt_b)
-{
- raw_spin_lock(&rt_b->rt_runtime_lock);
- if (!rt_b->rt_period_active) {
- rt_b->rt_period_active = 1;
- /*
- * SCHED_DEADLINE updates the bandwidth, as a run away
- * RT task with a DL task could hog a CPU. But DL does
- * not reset the period. If a deadline task was running
- * without an RT task running, it can cause RT tasks to
- * throttle when they start up. Kick the timer right away
- * to update the period.
- */
- hrtimer_forward_now(&rt_b->rt_period_timer, ns_to_ktime(0));
- hrtimer_start_expires(&rt_b->rt_period_timer,
- HRTIMER_MODE_ABS_PINNED_HARD);
- }
- raw_spin_unlock(&rt_b->rt_runtime_lock);
-}
-
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
-{
- if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
- return;
-
- do_start_rt_bandwidth(rt_b);
-}
-
-static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
-{
- hrtimer_cancel(&rt_b->rt_period_timer);
-}
-
-#define rt_entity_is_task(rt_se) (!(rt_se)->my_q)
-
void unregister_rt_sched_group(struct task_group *tg)
{
- if (!rt_group_sched_enabled())
- return;
-
- if (tg->rt_se)
- destroy_rt_bandwidth(&tg->rt_bandwidth);
}
void free_rt_sched_group(struct task_group *tg)
{
- int i;
-
if (!rt_group_sched_enabled())
return;
-
- for_each_possible_cpu(i) {
- if (tg->rt_rq)
- kfree(tg->rt_rq[i]);
- if (tg->rt_se)
- kfree(tg->rt_se[i]);
- }
-
- kfree(tg->rt_rq);
- kfree(tg->rt_se);
}
void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
@@ -202,72 +104,23 @@ void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
struct rq *rq = cpu_rq(cpu);
rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
- rt_rq->rt_nr_boosted = 0;
rt_rq->rq = rq;
rt_rq->tg = tg;
tg->rt_rq[cpu] = rt_rq;
tg->rt_se[cpu] = rt_se;
-
- if (!rt_se)
- return;
-
- if (!parent)
- rt_se->rt_rq = &rq->rt;
- else
- rt_se->rt_rq = parent->my_q;
-
- rt_se->my_q = rt_rq;
- rt_se->parent = parent;
- INIT_LIST_HEAD(&rt_se->run_list);
}
int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
{
- struct rt_rq *rt_rq;
- struct sched_rt_entity *rt_se;
- int i;
-
if (!rt_group_sched_enabled())
return 1;
- tg->rt_rq = kcalloc(nr_cpu_ids, sizeof(rt_rq), GFP_KERNEL);
- if (!tg->rt_rq)
- goto err;
- tg->rt_se = kcalloc(nr_cpu_ids, sizeof(rt_se), GFP_KERNEL);
- if (!tg->rt_se)
- goto err;
-
- init_rt_bandwidth(&tg->rt_bandwidth, ktime_to_ns(global_rt_period()), 0);
-
- for_each_possible_cpu(i) {
- rt_rq = kzalloc_node(sizeof(struct rt_rq),
- GFP_KERNEL, cpu_to_node(i));
- if (!rt_rq)
- goto err;
-
- rt_se = kzalloc_node(sizeof(struct sched_rt_entity),
- GFP_KERNEL, cpu_to_node(i));
- if (!rt_se)
- goto err_free_rq;
-
- init_rt_rq(rt_rq);
- rt_rq->rt_runtime = tg->rt_bandwidth.rt_runtime;
- init_tg_rt_entry(tg, rt_rq, rt_se, i, parent->rt_se[i]);
- }
-
return 1;
-
-err_free_rq:
- kfree(rt_rq);
-err:
- return 0;
}
#else /* !CONFIG_RT_GROUP_SCHED: */
-#define rt_entity_is_task(rt_se) (1)
-
void unregister_rt_sched_group(struct task_group *tg) { }
void free_rt_sched_group(struct task_group *tg) { }
@@ -377,9 +230,6 @@ static void dequeue_pushable_task(struct rt_rq *rt_rq, struct task_struct *p)
}
}
-static void enqueue_top_rt_rq(struct rt_rq *rt_rq);
-static void dequeue_top_rt_rq(struct rt_rq *rt_rq, unsigned int count);
-
static inline int on_rt_rq(struct sched_rt_entity *rt_se)
{
return rt_se->on_rq;
@@ -426,16 +276,6 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
#ifdef CONFIG_RT_GROUP_SCHED
-static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
-{
- return rt_rq->rt_runtime;
-}
-
-static inline u64 sched_rt_period(struct rt_rq *rt_rq)
-{
- return ktime_to_ns(rt_rq->tg->rt_bandwidth.rt_period);
-}
-
typedef struct task_group *rt_rq_iter_t;
static inline struct task_group *next_task_group(struct task_group *tg)
@@ -461,457 +301,20 @@ static inline struct task_group *next_task_group(struct task_group *tg)
iter && (rt_rq = iter->rt_rq[cpu_of(rq)]); \
iter = next_task_group(iter))
-#define for_each_sched_rt_entity(rt_se) \
- for (; rt_se; rt_se = rt_se->parent)
-
-static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se)
-{
- return rt_se->my_q;
-}
-
static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags);
static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags);
-static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
-{
- struct task_struct *donor = rq_of_rt_rq(rt_rq)->donor;
- struct rq *rq = rq_of_rt_rq(rt_rq);
- struct sched_rt_entity *rt_se;
-
- int cpu = cpu_of(rq);
-
- rt_se = rt_rq->tg->rt_se[cpu];
-
- if (rt_rq->rt_nr_running) {
- if (!rt_se)
- enqueue_top_rt_rq(rt_rq);
- else if (!on_rt_rq(rt_se))
- enqueue_rt_entity(rt_se, 0);
-
- if (rt_rq->highest_prio.curr < donor->prio)
- resched_curr(rq);
- }
-}
-
-static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
-{
- struct sched_rt_entity *rt_se;
- int cpu = cpu_of(rq_of_rt_rq(rt_rq));
-
- rt_se = rt_rq->tg->rt_se[cpu];
-
- if (!rt_se) {
- dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
- /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
- cpufreq_update_util(rq_of_rt_rq(rt_rq), 0);
- }
- else if (on_rt_rq(rt_se))
- dequeue_rt_entity(rt_se, 0);
-}
-
-static inline int rt_rq_throttled(struct rt_rq *rt_rq)
-{
- return rt_rq->rt_throttled && !rt_rq->rt_nr_boosted;
-}
-
-static int rt_se_boosted(struct sched_rt_entity *rt_se)
-{
- struct rt_rq *rt_rq = group_rt_rq(rt_se);
- struct task_struct *p;
-
- if (rt_rq)
- return !!rt_rq->rt_nr_boosted;
-
- p = rt_task_of(rt_se);
- return p->prio != p->normal_prio;
-}
-
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return this_rq()->rd->span;
-}
-
-static inline
-struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
-{
- return container_of(rt_b, struct task_group, rt_bandwidth)->rt_rq[cpu];
-}
-
-static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
-{
- return &rt_rq->tg->rt_bandwidth;
-}
-
-bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
-{
- struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
-
- return (hrtimer_active(&rt_b->rt_period_timer) ||
- rt_rq->rt_time < rt_b->rt_runtime);
-}
-
-/*
- * We ran out of runtime, see if we can borrow some from our neighbours.
- */
-static void do_balance_runtime(struct rt_rq *rt_rq)
-{
- struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
- struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;
- int i, weight;
- u64 rt_period;
-
- weight = cpumask_weight(rd->span);
-
- raw_spin_lock(&rt_b->rt_runtime_lock);
- rt_period = ktime_to_ns(rt_b->rt_period);
- for_each_cpu(i, rd->span) {
- struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
- s64 diff;
-
- if (iter == rt_rq)
- continue;
-
- raw_spin_lock(&iter->rt_runtime_lock);
- /*
- * Either all rqs have inf runtime and there's nothing to steal
- * or __disable_runtime() below sets a specific rq to inf to
- * indicate its been disabled and disallow stealing.
- */
- if (iter->rt_runtime == RUNTIME_INF)
- goto next;
-
- /*
- * From runqueues with spare time, take 1/n part of their
- * spare time, but no more than our period.
- */
- diff = iter->rt_runtime - iter->rt_time;
- if (diff > 0) {
- diff = div_u64((u64)diff, weight);
- if (rt_rq->rt_runtime + diff > rt_period)
- diff = rt_period - rt_rq->rt_runtime;
- iter->rt_runtime -= diff;
- rt_rq->rt_runtime += diff;
- if (rt_rq->rt_runtime == rt_period) {
- raw_spin_unlock(&iter->rt_runtime_lock);
- break;
- }
- }
-next:
- raw_spin_unlock(&iter->rt_runtime_lock);
- }
- raw_spin_unlock(&rt_b->rt_runtime_lock);
-}
-
-/*
- * Ensure this RQ takes back all the runtime it lend to its neighbours.
- */
-static void __disable_runtime(struct rq *rq)
-{
- struct root_domain *rd = rq->rd;
- rt_rq_iter_t iter;
- struct rt_rq *rt_rq;
-
- if (unlikely(!scheduler_running))
- return;
-
- for_each_rt_rq(rt_rq, iter, rq) {
- struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
- s64 want;
- int i;
-
- raw_spin_lock(&rt_b->rt_runtime_lock);
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- /*
- * Either we're all inf and nobody needs to borrow, or we're
- * already disabled and thus have nothing to do, or we have
- * exactly the right amount of runtime to take out.
- */
- if (rt_rq->rt_runtime == RUNTIME_INF ||
- rt_rq->rt_runtime == rt_b->rt_runtime)
- goto balanced;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
-
- /*
- * Calculate the difference between what we started out with
- * and what we current have, that's the amount of runtime
- * we lend and now have to reclaim.
- */
- want = rt_b->rt_runtime - rt_rq->rt_runtime;
-
- /*
- * Greedy reclaim, take back as much as we can.
- */
- for_each_cpu(i, rd->span) {
- struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
- s64 diff;
-
- /*
- * Can't reclaim from ourselves or disabled runqueues.
- */
- if (iter == rt_rq || iter->rt_runtime == RUNTIME_INF)
- continue;
-
- raw_spin_lock(&iter->rt_runtime_lock);
- if (want > 0) {
- diff = min_t(s64, iter->rt_runtime, want);
- iter->rt_runtime -= diff;
- want -= diff;
- } else {
- iter->rt_runtime -= want;
- want -= want;
- }
- raw_spin_unlock(&iter->rt_runtime_lock);
-
- if (!want)
- break;
- }
-
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- /*
- * We cannot be left wanting - that would mean some runtime
- * leaked out of the system.
- */
- WARN_ON_ONCE(want);
-balanced:
- /*
- * Disable all the borrow logic by pretending we have inf
- * runtime - in which case borrowing doesn't make sense.
- */
- rt_rq->rt_runtime = RUNTIME_INF;
- rt_rq->rt_throttled = 0;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
- raw_spin_unlock(&rt_b->rt_runtime_lock);
-
- /* Make rt_rq available for pick_next_task() */
- sched_rt_rq_enqueue(rt_rq);
- }
-}
-
-static void __enable_runtime(struct rq *rq)
-{
- rt_rq_iter_t iter;
- struct rt_rq *rt_rq;
-
- if (unlikely(!scheduler_running))
- return;
-
- /*
- * Reset each runqueue's bandwidth settings
- */
- for_each_rt_rq(rt_rq, iter, rq) {
- struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
-
- raw_spin_lock(&rt_b->rt_runtime_lock);
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- rt_rq->rt_runtime = rt_b->rt_runtime;
- rt_rq->rt_time = 0;
- rt_rq->rt_throttled = 0;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
- raw_spin_unlock(&rt_b->rt_runtime_lock);
- }
-}
-
-static void balance_runtime(struct rt_rq *rt_rq)
-{
- if (!sched_feat(RT_RUNTIME_SHARE))
- return;
-
- if (rt_rq->rt_time > rt_rq->rt_runtime) {
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
- do_balance_runtime(rt_rq);
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- }
-}
-
-static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
-{
- int i, idle = 1, throttled = 0;
- const struct cpumask *span;
-
- span = sched_rt_period_mask();
-
- /*
- * FIXME: isolated CPUs should really leave the root task group,
- * whether they are isolcpus or were isolated via cpusets, lest
- * the timer run on a CPU which does not service all runqueues,
- * potentially leaving other CPUs indefinitely throttled. If
- * isolation is really required, the user will turn the throttle
- * off to kill the perturbations it causes anyway. Meanwhile,
- * this maintains functionality for boot and/or troubleshooting.
- */
- if (rt_b == &root_task_group.rt_bandwidth)
- span = cpu_online_mask;
-
- for_each_cpu(i, span) {
- int enqueue = 0;
- struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
- struct rq *rq = rq_of_rt_rq(rt_rq);
- struct rq_flags rf;
- int skip;
-
- /*
- * When span == cpu_online_mask, taking each rq->lock
- * can be time-consuming. Try to avoid it when possible.
- */
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- if (!sched_feat(RT_RUNTIME_SHARE) && rt_rq->rt_runtime != RUNTIME_INF)
- rt_rq->rt_runtime = rt_b->rt_runtime;
- skip = !rt_rq->rt_time && !rt_rq->rt_nr_running;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
- if (skip)
- continue;
-
- rq_lock(rq, &rf);
- update_rq_clock(rq);
-
- if (rt_rq->rt_time) {
- u64 runtime;
-
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- if (rt_rq->rt_throttled)
- balance_runtime(rt_rq);
- runtime = rt_rq->rt_runtime;
- rt_rq->rt_time -= min(rt_rq->rt_time, overrun*runtime);
- if (rt_rq->rt_throttled && rt_rq->rt_time < runtime) {
- rt_rq->rt_throttled = 0;
- enqueue = 1;
-
- /*
- * When we're idle and a woken (rt) task is
- * throttled wakeup_preempt() will set
- * skip_update and the time between the wakeup
- * and this unthrottle will get accounted as
- * 'runtime'.
- */
- if (rt_rq->rt_nr_running && rq->curr == rq->idle)
- rq_clock_cancel_skipupdate(rq);
- }
- if (rt_rq->rt_time || rt_rq->rt_nr_running)
- idle = 0;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
- } else if (rt_rq->rt_nr_running) {
- idle = 0;
- if (!rt_rq_throttled(rt_rq))
- enqueue = 1;
- }
- if (rt_rq->rt_throttled)
- throttled = 1;
-
- if (enqueue)
- sched_rt_rq_enqueue(rt_rq);
- rq_unlock(rq, &rf);
- }
-
- if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
- return 1;
-
- return idle;
-}
-
-static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
-{
- u64 runtime = sched_rt_runtime(rt_rq);
-
- if (rt_rq->rt_throttled)
- return rt_rq_throttled(rt_rq);
-
- if (runtime >= sched_rt_period(rt_rq))
- return 0;
-
- balance_runtime(rt_rq);
- runtime = sched_rt_runtime(rt_rq);
- if (runtime == RUNTIME_INF)
- return 0;
-
- if (rt_rq->rt_time > runtime) {
- struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
-
- /*
- * Don't actually throttle groups that have no runtime assigned
- * but accrue some time due to boosting.
- */
- if (likely(rt_b->rt_runtime)) {
- rt_rq->rt_throttled = 1;
- printk_deferred_once("sched: RT throttling activated\n");
- } else {
- /*
- * In case we did anyway, make it go away,
- * replenishment is a joke, since it will replenish us
- * with exactly 0 ns.
- */
- rt_rq->rt_time = 0;
- }
-
- if (rt_rq_throttled(rt_rq)) {
- sched_rt_rq_dequeue(rt_rq);
- return 1;
- }
- }
-
- return 0;
-}
-
-#else /* !CONFIG_RT_GROUP_SCHED: */
+#else /* !CONFIG_RT_GROUP_SCHED */
typedef struct rt_rq *rt_rq_iter_t;
#define for_each_rt_rq(rt_rq, iter, rq) \
for ((void) iter, rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
-#define for_each_sched_rt_entity(rt_se) \
- for (; rt_se; rt_se = NULL)
-
-static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se)
-{
- return NULL;
-}
-
-static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
-{
- struct rq *rq = rq_of_rt_rq(rt_rq);
-
- if (!rt_rq->rt_nr_running)
- return;
-
- enqueue_top_rt_rq(rt_rq);
- resched_curr(rq);
-}
-
-static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
-{
- dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
-}
-
-static inline int rt_rq_throttled(struct rt_rq *rt_rq)
-{
- return false;
-}
-
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_online_mask;
-}
-
-static inline
-struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
-{
- return &cpu_rq(cpu)->rt;
-}
-
-static void __enable_runtime(struct rq *rq) { }
-static void __disable_runtime(struct rq *rq) { }
-
-#endif /* !CONFIG_RT_GROUP_SCHED */
+#endif /* CONFIG_RT_GROUP_SCHED */
static inline int rt_se_prio(struct sched_rt_entity *rt_se)
{
-#ifdef CONFIG_RT_GROUP_SCHED
- struct rt_rq *rt_rq = group_rt_rq(rt_se);
-
- if (rt_rq)
- return rt_rq->highest_prio.curr;
-#endif
-
return rt_task_of(rt_se)->prio;
}
@@ -931,67 +334,8 @@ static void update_curr_rt(struct rq *rq)
if (unlikely(delta_exec <= 0))
return;
-#ifdef CONFIG_RT_GROUP_SCHED
- struct sched_rt_entity *rt_se = &donor->rt;
-
if (!rt_bandwidth_enabled())
return;
-
- for_each_sched_rt_entity(rt_se) {
- struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
- int exceeded;
-
- if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- rt_rq->rt_time += delta_exec;
- exceeded = sched_rt_runtime_exceeded(rt_rq);
- if (exceeded)
- resched_curr(rq);
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
- if (exceeded)
- do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
- }
- }
-#endif /* CONFIG_RT_GROUP_SCHED */
-}
-
-static void
-dequeue_top_rt_rq(struct rt_rq *rt_rq, unsigned int count)
-{
- struct rq *rq = rq_of_rt_rq(rt_rq);
-
- BUG_ON(&rq->rt != rt_rq);
-
- if (!rt_rq->rt_queued)
- return;
-
- BUG_ON(!rq->nr_running);
-
- sub_nr_running(rq, count);
- rt_rq->rt_queued = 0;
-
-}
-
-static void
-enqueue_top_rt_rq(struct rt_rq *rt_rq)
-{
- struct rq *rq = rq_of_rt_rq(rt_rq);
-
- BUG_ON(&rq->rt != rt_rq);
-
- if (rt_rq->rt_queued)
- return;
-
- if (rt_rq_throttled(rt_rq))
- return;
-
- if (rt_rq->rt_nr_running) {
- add_nr_running(rq, rt_rq->rt_nr_running);
- rt_rq->rt_queued = 1;
- }
-
- /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
- cpufreq_update_util(rq, 0);
}
static void
@@ -1062,58 +406,17 @@ dec_rt_prio(struct rt_rq *rt_rq, int prio)
dec_rt_prio_smp(rt_rq, prio, prev_prio);
}
-#ifdef CONFIG_RT_GROUP_SCHED
-
-static void
-inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
-{
- if (rt_se_boosted(rt_se))
- rt_rq->rt_nr_boosted++;
-
- start_rt_bandwidth(&rt_rq->tg->rt_bandwidth);
-}
-
-static void
-dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
-{
- if (rt_se_boosted(rt_se))
- rt_rq->rt_nr_boosted--;
-
- WARN_ON(!rt_rq->rt_nr_running && rt_rq->rt_nr_boosted);
-}
-
-#else /* !CONFIG_RT_GROUP_SCHED: */
-
-static void
-inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
-{
-}
-
-static inline
-void dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) {}
-
-#endif /* !CONFIG_RT_GROUP_SCHED */
-
static inline
unsigned int rt_se_nr_running(struct sched_rt_entity *rt_se)
{
- struct rt_rq *group_rq = group_rt_rq(rt_se);
-
- if (group_rq)
- return group_rq->rt_nr_running;
- else
- return 1;
+ return 1;
}
static inline
unsigned int rt_se_rr_nr_running(struct sched_rt_entity *rt_se)
{
- struct rt_rq *group_rq = group_rt_rq(rt_se);
struct task_struct *tsk;
- if (group_rq)
- return group_rq->rr_nr_running;
-
tsk = rt_task_of(rt_se);
return (tsk->policy == SCHED_RR) ? 1 : 0;
@@ -1122,26 +425,21 @@ unsigned int rt_se_rr_nr_running(struct sched_rt_entity *rt_se)
static inline
void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
- int prio = rt_se_prio(rt_se);
-
- WARN_ON(!rt_prio(prio));
+ WARN_ON(!rt_prio(rt_se_prio(rt_se)));
rt_rq->rt_nr_running += rt_se_nr_running(rt_se);
rt_rq->rr_nr_running += rt_se_rr_nr_running(rt_se);
- inc_rt_prio(rt_rq, prio);
- inc_rt_group(rt_se, rt_rq);
+ inc_rt_prio(rt_rq, rt_se_prio(rt_se));
}
static inline
void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
WARN_ON(!rt_prio(rt_se_prio(rt_se)));
- WARN_ON(!rt_rq->rt_nr_running);
rt_rq->rt_nr_running -= rt_se_nr_running(rt_se);
rt_rq->rr_nr_running -= rt_se_rr_nr_running(rt_se);
dec_rt_prio(rt_rq, rt_se_prio(rt_se));
- dec_rt_group(rt_se, rt_rq);
}
/*
@@ -1170,10 +468,6 @@ static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_arr
static inline struct sched_statistics *
__schedstats_from_rt_se(struct sched_rt_entity *rt_se)
{
- /* schedstats is not supported for rt group. */
- if (!rt_entity_is_task(rt_se))
- return NULL;
-
return &rt_task_of(rt_se)->stats;
}
@@ -1186,9 +480,7 @@ update_stats_wait_start_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
if (!schedstat_enabled())
return;
- if (rt_entity_is_task(rt_se))
- p = rt_task_of(rt_se);
-
+ p = rt_task_of(rt_se);
stats = __schedstats_from_rt_se(rt_se);
if (!stats)
return;
@@ -1205,9 +497,7 @@ update_stats_enqueue_sleeper_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_
if (!schedstat_enabled())
return;
- if (rt_entity_is_task(rt_se))
- p = rt_task_of(rt_se);
-
+ p = rt_task_of(rt_se);
stats = __schedstats_from_rt_se(rt_se);
if (!stats)
return;
@@ -1235,9 +525,7 @@ update_stats_wait_end_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
if (!schedstat_enabled())
return;
- if (rt_entity_is_task(rt_se))
- p = rt_task_of(rt_se);
-
+ p = rt_task_of(rt_se);
stats = __schedstats_from_rt_se(rt_se);
if (!stats)
return;
@@ -1254,9 +542,7 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
if (!schedstat_enabled())
return;
- if (rt_entity_is_task(rt_se))
- p = rt_task_of(rt_se);
-
+ p = rt_task_of(rt_se);
if ((flags & DEQUEUE_SLEEP) && p) {
unsigned int state;
@@ -1275,21 +561,8 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
struct rt_prio_array *array = &rt_rq->active;
- struct rt_rq *group_rq = group_rt_rq(rt_se);
struct list_head *queue = array->queue + rt_se_prio(rt_se);
- /*
- * Don't enqueue the group if its throttled, or when empty.
- * The latter is a consequence of the former when a child group
- * get throttled and the current group doesn't have any other
- * active members.
- */
- if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running)) {
- if (rt_se->on_list)
- __delist_rt_entity(rt_se, array);
- return;
- }
-
if (move_entity(flags)) {
WARN_ON_ONCE(rt_se->on_list);
if (flags & ENQUEUE_HEAD)
@@ -1319,57 +592,18 @@ static void __dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
dec_rt_tasks(rt_se, rt_rq);
}
-/*
- * Because the prio of an upper entry depends on the lower
- * entries, we must remove entries top - down.
- */
-static void dequeue_rt_stack(struct sched_rt_entity *rt_se, unsigned int flags)
-{
- struct sched_rt_entity *back = NULL;
- unsigned int rt_nr_running;
-
- for_each_sched_rt_entity(rt_se) {
- rt_se->back = back;
- back = rt_se;
- }
-
- rt_nr_running = rt_rq_of_se(back)->rt_nr_running;
-
- for (rt_se = back; rt_se; rt_se = rt_se->back) {
- if (on_rt_rq(rt_se))
- __dequeue_rt_entity(rt_se, flags);
- }
-
- dequeue_top_rt_rq(rt_rq_of_se(back), rt_nr_running);
-}
-
static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
- struct rq *rq = rq_of_rt_se(rt_se);
-
update_stats_enqueue_rt(rt_rq_of_se(rt_se), rt_se, flags);
- dequeue_rt_stack(rt_se, flags);
- for_each_sched_rt_entity(rt_se)
- __enqueue_rt_entity(rt_se, flags);
- enqueue_top_rt_rq(&rq->rt);
+ __enqueue_rt_entity(rt_se, flags);
}
static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
- struct rq *rq = rq_of_rt_se(rt_se);
-
update_stats_dequeue_rt(rt_rq_of_se(rt_se), rt_se, flags);
- dequeue_rt_stack(rt_se, flags);
-
- for_each_sched_rt_entity(rt_se) {
- struct rt_rq *rt_rq = group_rt_rq(rt_se);
-
- if (rt_rq && rt_rq->rt_nr_running)
- __enqueue_rt_entity(rt_se, flags);
- }
- enqueue_top_rt_rq(&rq->rt);
+ __dequeue_rt_entity(rt_se, flags);
}
/*
@@ -1432,10 +666,8 @@ static void requeue_task_rt(struct rq *rq, struct task_struct *p, int head)
struct sched_rt_entity *rt_se = &p->rt;
struct rt_rq *rt_rq;
- for_each_sched_rt_entity(rt_se) {
- rt_rq = rt_rq_of_se(rt_se);
- requeue_rt_entity(rt_rq, rt_se, head);
- }
+ rt_rq = rt_rq_of_se(rt_se);
+ requeue_rt_entity(rt_rq, rt_se, head);
}
static void yield_task_rt(struct rq *rq)
@@ -1632,17 +864,7 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rt_rq *rt_rq)
static struct task_struct *_pick_next_task_rt(struct rq *rq)
{
- struct sched_rt_entity *rt_se;
- struct rt_rq *rt_rq = &rq->rt;
-
- do {
- rt_se = pick_next_rt_entity(rt_rq);
- if (unlikely(!rt_se))
- return NULL;
- rt_rq = group_rt_rq(rt_se);
- } while (rt_rq);
-
- return rt_task_of(rt_se);
+ return NULL;
}
static struct task_struct *pick_task_rt(struct rq *rq)
@@ -2311,8 +1533,6 @@ static void rq_online_rt(struct rq *rq)
if (rq->rt.overloaded)
rt_set_overload(rq);
- __enable_runtime(rq);
-
cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio.curr);
}
@@ -2322,8 +1542,6 @@ static void rq_offline_rt(struct rq *rq)
if (rq->rt.overloaded)
rt_clear_overload(rq);
- __disable_runtime(rq);
-
cpupri_set(&rq->rd->cpupri, rq->cpu, CPUPRI_INVALID);
}
@@ -2481,12 +1699,12 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
* Requeue to the end of queue if we (and all of our ancestors) are not
* the only element on the queue
*/
- for_each_sched_rt_entity(rt_se) {
- if (rt_se->run_list.prev != rt_se->run_list.next) {
- requeue_task_rt(rq, p, 0);
- resched_curr(rq);
- return;
- }
+ if (rt_se->run_list.prev != rt_se->run_list.next) {
+ requeue_task_rt(rq, p, 0);
+ resched_curr(rq);
+ // set_tsk_need_resched(p);
+
+ return;
}
}
@@ -2504,16 +1722,7 @@ static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
#ifdef CONFIG_SCHED_CORE
static int task_is_throttled_rt(struct task_struct *p, int cpu)
{
- struct rt_rq *rt_rq;
-
-#ifdef CONFIG_RT_GROUP_SCHED // XXX maybe add task_rt_rq(), see also sched_rt_period_rt_rq
- rt_rq = task_group(p)->rt_rq[cpu];
- WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
-#else
- rt_rq = &cpu_rq(cpu)->rt;
-#endif
-
- return rt_rq_throttled(rt_rq);
+ return 0;
}
#endif /* CONFIG_SCHED_CORE */
@@ -2761,13 +1970,7 @@ long sched_group_rt_period(struct task_group *tg)
#ifdef CONFIG_SYSCTL
static int sched_rt_global_constraints(void)
{
- int ret = 0;
-
- mutex_lock(&rt_constraints_mutex);
- ret = __rt_schedulable(NULL, 0, 0);
- mutex_unlock(&rt_constraints_mutex);
-
- return ret;
+ return 0;
}
#endif /* CONFIG_SYSCTL */
@@ -2802,10 +2005,6 @@ static int sched_rt_global_validate(void)
return 0;
}
-static void sched_rt_do_global(void)
-{
-}
-
static int sched_rt_handler(const struct ctl_table *table, int write, void *buffer,
size_t *lenp, loff_t *ppos)
{
@@ -2833,7 +2032,6 @@ static int sched_rt_handler(const struct ctl_table *table, int write, void *buff
if (ret)
goto undo;
- sched_rt_do_global();
sched_dl_do_global();
}
if (0) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a1e6d2852ca..2f9035cb9e5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -789,7 +789,7 @@ struct scx_rq {
static inline int rt_bandwidth_enabled(void)
{
- return sysctl_sched_rt_runtime >= 0;
+ return 0;
}
/* RT IPI pull logic requires IRQ_WORK */
@@ -829,7 +829,7 @@ struct rt_rq {
static inline bool rt_rq_is_runnable(struct rt_rq *rt_rq)
{
- return rt_rq->rt_queued && rt_rq->rt_nr_running;
+ return rt_rq->rt_nr_running;
}
/* Deadline class' related fields in a runqueue */
@@ -2545,7 +2545,7 @@ static inline bool sched_dl_runnable(struct rq *rq)
static inline bool sched_rt_runnable(struct rq *rq)
{
- return rq->rt.rt_queued > 0;
+ return rq->rt.rt_nr_running > 0;
}
static inline bool sched_fair_runnable(struct rq *rq)
@@ -2656,9 +2656,6 @@ extern void resched_curr(struct rq *rq);
extern void resched_curr_lazy(struct rq *rq);
extern void resched_cpu(int cpu);
-extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
-extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
-
extern void init_dl_entity(struct sched_dl_entity *dl_se);
#define BW_SHIFT 20
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 77ae87f36e8..93a9c03b28e 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -626,19 +626,6 @@ int __sched_setscheduler(struct task_struct *p,
change:
if (user) {
-#ifdef CONFIG_RT_GROUP_SCHED
- /*
- * Do not allow real-time tasks into groups that have no runtime
- * assigned.
- */
- if (rt_group_sched_enabled() &&
- rt_bandwidth_enabled() && rt_policy(policy) &&
- task_group(p)->rt_bandwidth.rt_runtime == 0 &&
- !task_group_is_autogroup(task_group(p))) {
- retval = -EPERM;
- goto unlock;
- }
-#endif /* CONFIG_RT_GROUP_SCHED */
if (dl_bandwidth_enabled() && dl_policy(policy) &&
!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
cpumask_t *span = rq->rd->span;
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC PATCH v2 07/25] sched/rt: Introduce HCBS specific structs in task_group
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (5 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 06/25] sched/rt: Disable RT_GROUP_SCHED Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-08-14 12:51 ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 08/25] sched/deadline: Account rt-cgroups bandwidth in deadline tasks schedulability tests Yuri Andriaccio
` (18 subsequent siblings)
25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Each task_group manages a number of new objects (see the sketch below):
- a sched_dl_entity/dl_server for each CPU
- a dl_bandwidth object to keep track of its allocated bandwidth
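To make the layout concrete, here is how the new fields relate on a given CPU
(a sketch only, built from the members added by the hunks below; tg and cpu
are assumed to come from the surrounding context):

	/* the dl_server acting on behalf of tg on this CPU */
	struct sched_dl_entity *srv = tg->dl_se[cpu];

	/* bandwidth parameters shared by all of tg's per-CPU servers */
	u64 runtime = tg->dl_bandwidth.dl_runtime;
	u64 period  = tg->dl_bandwidth.dl_period;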
Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
kernel/sched/sched.h | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2f9035cb9e5..2a7601d400c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -319,6 +319,13 @@ struct rt_bandwidth {
unsigned int rt_period_active;
};
+struct dl_bandwidth {
+ raw_spinlock_t dl_runtime_lock;
+ u64 dl_runtime;
+ u64 dl_period;
+};
+
+
static inline int dl_bandwidth_enabled(void)
{
return sysctl_sched_rt_runtime >= 0;
@@ -467,9 +474,15 @@ struct task_group {
#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity **rt_se;
+ /*
+ * The scheduling entities for the task group are managed as a single
+ * sched_dl_entity, each of them sharing the same dl_bandwidth.
+ */
+ struct sched_dl_entity **dl_se;
struct rt_rq **rt_rq;
struct rt_bandwidth rt_bandwidth;
+ struct dl_bandwidth dl_bandwidth;
#endif
#ifdef CONFIG_EXT_GROUP_SCHED
@@ -819,12 +832,12 @@ struct rt_rq {
raw_spinlock_t rt_runtime_lock;
unsigned int rt_nr_boosted;
-
- struct rq *rq; /* this is always top-level rq, cache? */
#endif
#ifdef CONFIG_CGROUP_SCHED
struct task_group *tg; /* this tg has "this" rt_rq on given CPU for runnable entities */
#endif
+
+ struct rq *rq; /* cgroup's runqueue if the rt_rq entity belongs to a cgroup, otherwise top-level rq */
};
static inline bool rt_rq_is_runnable(struct rt_rq *rt_rq)
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v2 07/25] sched/rt: Introduce HCBS specific structs in task_group
2025-07-31 10:55 ` [RFC PATCH v2 07/25] sched/rt: Introduce HCBS specific structs in task_group Yuri Andriaccio
@ 2025-08-14 12:51 ` Juri Lelli
0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 12:51 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
Hi!
On 31/07/25 12:55, Yuri Andriaccio wrote:
...
> @@ -467,9 +474,15 @@ struct task_group {
>
> #ifdef CONFIG_RT_GROUP_SCHED
> struct sched_rt_entity **rt_se;
> + /*
> + * The scheduling entities for the task group are managed as a single
> + * sched_dl_entity, each of them sharing the same dl_bandwidth.
> + */
> + struct sched_dl_entity **dl_se;
This is actually one sched_dl_entity per CPU, right? If that is the case,
the comment is a little misleading I am afraid (it mentions a single
entity).
Thanks,
Juri
^ permalink raw reply [flat|nested] 38+ messages in thread
* [RFC PATCH v2 08/25] sched/deadline: Account rt-cgroups bandwidth in deadline tasks schedulability tests.
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (6 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 07/25] sched/rt: Introduce HCBS specific structs in task_group Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-08-14 13:03 ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 09/25] sched/deadline: Account rt-cgroups bandwidth in sched_dl_global_validate Yuri Andriaccio
` (17 subsequent siblings)
25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Account the rt-cgroups hierarchy's reserved bandwidth in the schedulability
test of deadline entities. This mechanism allows a portion of the rt-bandwidth
to be reserved entirely for rt-cgroups even if they do not use all of it.
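To make the effect concrete, a worked example with hypothetical numbers
(single CPU at full capacity, so cap_scale() is a no-op, and admission
control enabled, i.e. dl_b->bw != -1; the 20% hierarchy reservation is just
an example configuration):

  global rt bandwidth   950000/1000000  ->  dl_b->bw        ~ 0.95
  rt-cgroup hierarchy   200000/1000000  ->  dl_groups_root  ~ 0.20

With the change below, a reservation of bandwidth new_bw is admitted only
while

  dl_b->total_bw - old_bw + new_bw + 0.20   does not exceed   0.95

i.e. at most ~75% of the CPU remains available to SCHED_DEADLINE
reservations, with 20% kept aside for the rt-cgroup hierarchy even when it
is unused.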
Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
kernel/sched/deadline.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0640d0ca45b..43af48038b9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -231,8 +231,15 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
static inline bool
__dl_overflow(struct dl_bw *dl_b, unsigned long cap, u64 old_bw, u64 new_bw)
{
+ u64 dl_groups_root = 0;
+
+#ifdef CONFIG_RT_GROUP_SCHED
+ dl_groups_root = to_ratio(root_task_group.dl_bandwidth.dl_period,
+ root_task_group.dl_bandwidth.dl_runtime);
+#endif
return dl_b->bw != -1 &&
- cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
+ cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw
+ + cap_scale(dl_groups_root, cap);
}
static inline
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v2 08/25] sched/deadline: Account rt-cgroups bandwidth in deadline tasks schedulability tests.
2025-07-31 10:55 ` [RFC PATCH v2 08/25] sched/deadline: Account rt-cgroups bandwidth in deadline tasks schedulability tests Yuri Andriaccio
@ 2025-08-14 13:03 ` Juri Lelli
0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 13:03 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
Hi!
On 31/07/25 12:55, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
>
> Account the rt-cgroups hierarchy's reserved bandwidth in the schedulability
> test of deadline entities. This mechanism allows a portion of the rt-bandwidth
> to be reserved entirely for rt-cgroups even if they do not use all of it.
>
> Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
> Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
> Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
> Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
> Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
> ---
I wonder if, from a series flow perspective, it would make sense to move
this one and the following after the changes that deal with group
initialization. Also, please consider a little bit of squashing.
Thanks,
Juri
^ permalink raw reply [flat|nested] 38+ messages in thread
* [RFC PATCH v2 09/25] sched/deadline: Account rt-cgroups bandwidth in sched_dl_global_validate
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (7 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 08/25] sched/deadline: Account rt-cgroups bandwidth in deadline tasks schedulability tests Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 10/25] sched/core: Initialize root_task_group Yuri Andriaccio
` (16 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
Similarly to the previous patch, sched_dl_global_validate() must take the
rt-cgroups' reserved bandwidth into account.
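Continuing the hypothetical example from the previous patch:

  dl_b->total_bw  ~ 0.50   (already admitted SCHED_DEADLINE reservations)
  dl_groups_root  ~ 0.20   (rt-cgroup hierarchy reservation)

Lowering sysctl_sched_rt_runtime so that the resulting new_bw drops below
~0.70 now makes sched_dl_global_validate() fail with -EBUSY, whereas before
only total_bw was taken into account.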
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/deadline.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 43af48038b9..55b7f883815 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3098,11 +3098,17 @@ int sched_dl_global_validate(void)
u64 period = global_rt_period();
u64 new_bw = to_ratio(period, runtime);
u64 cookie = ++dl_cookie;
+ u64 dl_groups_root = 0;
u64 fair_bw;
struct dl_bw *dl_b;
int cpu, ret = 0;
unsigned long cap, flags;
+#ifdef CONFIG_RT_GROUP_SCHED
+ dl_groups_root = to_ratio(root_task_group.dl_bandwidth.dl_period,
+ root_task_group.dl_bandwidth.dl_runtime);
+#endif
+
/*
* Here we want to check the bandwidth not being set to some
* value smaller than the currently allocated bandwidth in
@@ -3119,7 +3125,7 @@ int sched_dl_global_validate(void)
fair_bw = dl_bw_fair(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- if (cap_scale(new_bw, cap) < dl_b->total_bw)
+ if (cap_scale(new_bw, cap) < dl_b->total_bw + cap_scale(dl_groups_root, cap))
ret = -EBUSY;
if (cap_scale(new_bw, cap) + fair_bw > cap_scale(BW_UNIT, cap))
ret = -EBUSY;
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC PATCH v2 10/25] sched/core: Initialize root_task_group
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (8 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 09/25] sched/deadline: Account rt-cgroups bandwidth in sched_dl_global_validate Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 11/25] sched/deadline: Add dl_init_tg Yuri Andriaccio
` (15 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Add the initialization function for task_group's dl_servers.
Initialize the default bandwidth for rt-cgroups and the root control group.
Add utility functions to check (and get) whether an rt_rq entity is connected
to a real-time cgroup.
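A minimal usage sketch of the two helpers (illustrative only; the real call
sites are added by later patches in the series, and rt_rq is assumed to come
from the surrounding context):

	if (is_dl_group(rt_rq)) {
		/* rt_rq belongs to an rt-cgroup: fetch its dl_server */
		struct sched_dl_entity *dl_se = dl_group_of(rt_rq);

		/* e.g. kick the server when the group's first task arrives */
		if (rt_rq->rt_nr_running == 0)
			dl_server_start(dl_se);
	}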
Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
kernel/sched/autogroup.c | 4 ++--
kernel/sched/core.c | 9 +++++++--
kernel/sched/deadline.c | 8 ++++++++
kernel/sched/rt.c | 18 ++++++++----------
kernel/sched/sched.h | 32 +++++++++++++++++++++++++++++---
5 files changed, 54 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index cdea931aae3..017eadc0a0a 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -52,7 +52,7 @@ static inline void autogroup_destroy(struct kref *kref)
#ifdef CONFIG_RT_GROUP_SCHED
/* We've redirected RT tasks to the root task group... */
- ag->tg->rt_se = NULL;
+ ag->tg->dl_se = NULL;
ag->tg->rt_rq = NULL;
#endif
sched_release_group(ag->tg);
@@ -109,7 +109,7 @@ static inline struct autogroup *autogroup_create(void)
* the policy change to proceed.
*/
free_rt_sched_group(tg);
- tg->rt_se = root_task_group.rt_se;
+ tg->dl_se = root_task_group.dl_se;
tg->rt_rq = root_task_group.rt_rq;
#endif /* CONFIG_RT_GROUP_SCHED */
tg->autogroup = ag;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42587a3c71f..3a69cb906c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8703,7 +8703,7 @@ void __init sched_init(void)
scx_tg_init(&root_task_group);
#endif /* CONFIG_EXT_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
- root_task_group.rt_se = (struct sched_rt_entity **)ptr;
+ root_task_group.dl_se = (struct sched_dl_entity **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
root_task_group.rt_rq = (struct rt_rq **)ptr;
@@ -8714,6 +8714,11 @@ void __init sched_init(void)
init_defrootdomain();
+#ifdef CONFIG_RT_GROUP_SCHED
+ init_dl_bandwidth(&root_task_group.dl_bandwidth,
+ global_rt_period(), global_rt_runtime());
+#endif /* CONFIG_RT_GROUP_SCHED */
+
#ifdef CONFIG_CGROUP_SCHED
task_group_cache = KMEM_CACHE(task_group, 0);
@@ -8765,7 +8770,7 @@ void __init sched_init(void)
* starts working after scheduler_running, which is not the case
* yet.
*/
- init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
+ init_tg_rt_entry(&root_task_group, rq, NULL, i, NULL);
#endif
rq->sd = NULL;
rq->rd = NULL;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 55b7f883815..b8228f553fe 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -538,6 +538,14 @@ static inline int is_leftmost(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq
static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq);
+void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
+{
+ raw_spin_lock_init(&dl_b->dl_runtime_lock);
+ dl_b->dl_period = period;
+ dl_b->dl_runtime = runtime;
+}
+
+
void init_dl_bw(struct dl_bw *dl_b)
{
raw_spin_lock_init(&dl_b->lock);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a6282784978..38178003184 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -97,18 +97,16 @@ void free_rt_sched_group(struct task_group *tg)
return;
}
-void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
- struct sched_rt_entity *rt_se, int cpu,
- struct sched_rt_entity *parent)
+void init_tg_rt_entry(struct task_group *tg, struct rq *served_rq,
+ struct sched_dl_entity *dl_se, int cpu,
+ struct sched_dl_entity *parent)
{
- struct rq *rq = cpu_rq(cpu);
+ served_rq->rt.highest_prio.curr = MAX_RT_PRIO-1;
+ served_rq->rt.rq = cpu_rq(cpu);
+ served_rq->rt.tg = tg;
- rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
- rt_rq->rq = rq;
- rt_rq->tg = tg;
-
- tg->rt_rq[cpu] = rt_rq;
- tg->rt_se[cpu] = rt_se;
+ tg->rt_rq[cpu] = &served_rq->rt;
+ tg->dl_se[cpu] = dl_se;
}
int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2a7601d400c..3283d824859 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -577,9 +577,9 @@ extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
extern bool cfs_task_bw_constrained(struct task_struct *p);
-extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
- struct sched_rt_entity *rt_se, int cpu,
- struct sched_rt_entity *parent);
+extern void init_tg_rt_entry(struct task_group *tg, struct rq *s_rq,
+ struct sched_dl_entity *rt_se, int cpu,
+ struct sched_dl_entity *parent);
extern int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us);
extern int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us);
extern long sched_group_rt_runtime(struct task_group *tg);
@@ -2669,6 +2669,7 @@ extern void resched_curr(struct rq *rq);
extern void resched_curr_lazy(struct rq *rq);
extern void resched_cpu(int cpu);
+void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
extern void init_dl_entity(struct sched_dl_entity *dl_se);
#define BW_SHIFT 20
@@ -3077,6 +3078,21 @@ static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
return rt_rq->rq;
}
+
+static inline int is_dl_group(struct rt_rq *rt_rq)
+{
+ return rt_rq->tg != &root_task_group;
+}
+
+/*
+ * Return the scheduling entity of this group of tasks.
+ */
+static inline struct sched_dl_entity *dl_group_of(struct rt_rq *rt_rq)
+{
+ BUG_ON(!is_dl_group(rt_rq));
+
+ return rt_rq->tg->dl_se[cpu_of(rt_rq->rq)];
+}
#else
static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
{
@@ -3101,6 +3117,16 @@ static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
return &rq->rt;
}
+
+static inline int is_dl_group(struct rt_rq *rt_rq)
+{
+ return 0;
+}
+
+static inline struct sched_dl_entity *dl_group_of(struct rt_rq *rt_rq)
+{
+ return NULL;
+}
#endif
DEFINE_LOCK_GUARD_2(double_rq_lock, struct rq,
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC PATCH v2 11/25] sched/deadline: Add dl_init_tg
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (9 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 10/25] sched/core: Initialize root_task_group Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-08-14 13:10 ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 12/25] sched/rt: Add {alloc/free}_rt_sched_group and dl_server specific functions Yuri Andriaccio
` (14 subsequent siblings)
25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
This function is used to initialize and/or update an rt-cgroup's dl_server,
also accounting for the allocated bandwidth.
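A sketch of the intended usage (hypothetical caller; the actual hook-up to
the cgroup interface and to the group-freeing path comes with later patches;
tg, rt_runtime and rt_period are assumed to be given):

	/* propagate new parameters to every per-CPU dl_server of the group */
	for_each_possible_cpu(cpu)
		dl_init_tg(tg->dl_se[cpu], rt_runtime, rt_period);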
Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 34 insertions(+)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b8228f553fe..264838c4a85 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -365,6 +365,39 @@ void cancel_inactive_timer(struct sched_dl_entity *dl_se)
cancel_dl_timer(dl_se, &dl_se->inactive_timer);
}
+#ifdef CONFIG_RT_GROUP_SCHED
+void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
+{
+ struct rq *rq = container_of(dl_se->dl_rq, struct rq, dl);
+ int is_active;
+ u64 new_bw;
+
+ raw_spin_rq_lock_irq(rq);
+ is_active = dl_se->my_q->rt.rt_nr_running > 0;
+
+ update_rq_clock(rq);
+ dl_server_stop(dl_se);
+
+ new_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+ dl_rq_change_utilization(rq, dl_se, new_bw);
+
+ dl_se->dl_runtime = rt_runtime;
+ dl_se->dl_deadline = rt_period;
+ dl_se->dl_period = rt_period;
+
+ dl_se->runtime = 0;
+ dl_se->deadline = 0;
+
+ dl_se->dl_bw = new_bw;
+ dl_se->dl_density = new_bw;
+
+ if (is_active)
+ dl_server_start(dl_se);
+
+ raw_spin_rq_unlock_irq(rq);
+}
+#endif
+
static void dl_change_utilization(struct task_struct *p, u64 new_bw)
{
WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3283d824859..611e3757fea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -394,6 +394,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick_task);
extern void sched_init_dl_servers(void);
+extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
extern void dl_server_update_idle_time(struct rq *rq,
struct task_struct *p);
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v2 11/25] sched/deadline: Add dl_init_tg
2025-07-31 10:55 ` [RFC PATCH v2 11/25] sched/deadline: Add dl_init_tg Yuri Andriaccio
@ 2025-08-14 13:10 ` Juri Lelli
0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 13:10 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
Hi!
On 31/07/25 12:55, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
>
> This function is used to initialize and/or update a rt-cgroup dl_server, also
> accounting for the allocated bandwidth.
Changelog wording like "This function"/"This patch" is usually frowned
upon [1].
Thanks,
Juri
1 - https://elixir.bootlin.com/linux/v6.16/source/Documentation/process/submitting-patches.rst#L94
^ permalink raw reply [flat|nested] 38+ messages in thread
* [RFC PATCH v2 12/25] sched/rt: Add {alloc/free}_rt_sched_group and dl_server specific functions
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (10 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 11/25] sched/deadline: Add dl_init_tg Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-08-14 13:42 ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 13/25] sched/rt: Add HCBS related checks and operations for rt tasks Yuri Andriaccio
` (13 subsequent siblings)
25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Add allocation and deallocation code for rt-cgroups. Add the rt-specific
dl_server callbacks that pick the next eligible task to run.
Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
kernel/sched/rt.c | 107 ++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 104 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 38178003184..9c4ac6875a2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -93,8 +93,39 @@ void unregister_rt_sched_group(struct task_group *tg)
void free_rt_sched_group(struct task_group *tg)
{
+ int i;
+
if (!rt_group_sched_enabled())
return;
+
+ for_each_possible_cpu(i) {
+ if (tg->dl_se) {
+ unsigned long flags;
+
+ /*
+ * Since the dl timer is going to be cancelled,
+ * we risk to never decrease the running bw...
+ * Fix this issue by changing the group runtime
+ * to 0 immediately before freeing it.
+ */
+ dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
+ raw_spin_rq_lock_irqsave(cpu_rq(i), flags);
+ BUG_ON(tg->rt_rq[i]->rt_nr_running);
+ raw_spin_rq_unlock_irqrestore(cpu_rq(i), flags);
+
+ hrtimer_cancel(&tg->dl_se[i]->dl_timer);
+ kfree(tg->dl_se[i]);
+ }
+ if (tg->rt_rq) {
+ struct rq *served_rq;
+
+ served_rq = container_of(tg->rt_rq[i], struct rq, rt);
+ kfree(served_rq);
+ }
+ }
+
+ kfree(tg->rt_rq);
+ kfree(tg->dl_se);
}
void init_tg_rt_entry(struct task_group *tg, struct rq *served_rq,
@@ -109,12 +140,77 @@ void init_tg_rt_entry(struct task_group *tg, struct rq *served_rq,
tg->dl_se[cpu] = dl_se;
}
+static bool rt_server_has_tasks(struct sched_dl_entity *dl_se)
+{
+ return !!dl_se->my_q->rt.rt_nr_running;
+}
+
+static struct task_struct *_pick_next_task_rt(struct rt_rq *rt_rq);
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first);
+static struct task_struct *rt_server_pick(struct sched_dl_entity *dl_se)
+{
+ struct rt_rq *rt_rq = &dl_se->my_q->rt;
+ struct rq *rq = rq_of_rt_rq(rt_rq);
+ struct task_struct *p;
+
+ if (dl_se->my_q->rt.rt_nr_running == 0)
+ return NULL;
+
+ p = _pick_next_task_rt(rt_rq);
+ set_next_task_rt(rq, p, true);
+
+ return p;
+}
+
int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
{
+ struct rq *s_rq;
+ struct sched_dl_entity *dl_se;
+ int i;
+
if (!rt_group_sched_enabled())
return 1;
+ tg->rt_rq = kcalloc(nr_cpu_ids, sizeof(struct rt_rq *), GFP_KERNEL);
+ if (!tg->rt_rq)
+ goto err;
+ tg->dl_se = kcalloc(nr_cpu_ids, sizeof(dl_se), GFP_KERNEL);
+ if (!tg->dl_se)
+ goto err;
+
+ init_dl_bandwidth(&tg->dl_bandwidth, 0, 0);
+
+ for_each_possible_cpu(i) {
+ s_rq = kzalloc_node(sizeof(struct rq),
+ GFP_KERNEL, cpu_to_node(i));
+ if (!s_rq)
+ goto err;
+
+ dl_se = kzalloc_node(sizeof(struct sched_dl_entity),
+ GFP_KERNEL, cpu_to_node(i));
+ if (!dl_se)
+ goto err_free_rq;
+
+ init_rt_rq(&s_rq->rt);
+ init_dl_entity(dl_se);
+ dl_se->dl_runtime = tg->dl_bandwidth.dl_runtime;
+ dl_se->dl_period = tg->dl_bandwidth.dl_period;
+ dl_se->dl_deadline = dl_se->dl_period;
+ dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+ dl_se->dl_density = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+ dl_se->dl_server = 1;
+
+ dl_server_init(dl_se, &cpu_rq(i)->dl, s_rq, rt_server_has_tasks, rt_server_pick);
+
+ init_tg_rt_entry(tg, s_rq, dl_se, i, parent->dl_se[i]);
+ }
+
return 1;
+
+err_free_rq:
+ kfree(s_rq);
+err:
+ return 0;
}
#else /* !CONFIG_RT_GROUP_SCHED: */
@@ -860,9 +956,14 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rt_rq *rt_rq)
return next;
}
-static struct task_struct *_pick_next_task_rt(struct rq *rq)
+static struct task_struct *_pick_next_task_rt(struct rt_rq *rt_rq)
{
- return NULL;
+ struct sched_rt_entity *rt_se;
+
+ rt_se = pick_next_rt_entity(rt_rq);
+ BUG_ON(!rt_se);
+
+ return rt_task_of(rt_se);
}
static struct task_struct *pick_task_rt(struct rq *rq)
@@ -872,7 +973,7 @@ static struct task_struct *pick_task_rt(struct rq *rq)
if (!sched_rt_runnable(rq))
return NULL;
- p = _pick_next_task_rt(rq);
+ p = _pick_next_task_rt(&rq->rt);
return p;
}
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v2 12/25] sched/rt: Add {alloc/free}_rt_sched_group and dl_server specific functions
2025-07-31 10:55 ` [RFC PATCH v2 12/25] sched/rt: Add {alloc/free}_rt_sched_group and dl_server specific functions Yuri Andriaccio
@ 2025-08-14 13:42 ` Juri Lelli
0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 13:42 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
Hi!
On 31/07/25 12:55, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
>
> Add allocation and deallocation code for rt-cgroups. Add the rt-specific
> dl_server callbacks that pick the next eligible task to run.
>
> Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
> Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
> Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
> Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
> Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
> ---
> kernel/sched/rt.c | 107 ++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 104 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 38178003184..9c4ac6875a2 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -93,8 +93,39 @@ void unregister_rt_sched_group(struct task_group *tg)
>
> void free_rt_sched_group(struct task_group *tg)
> {
> + int i;
> +
> if (!rt_group_sched_enabled())
> return;
> +
> + for_each_possible_cpu(i) {
> + if (tg->dl_se) {
> + unsigned long flags;
> +
> + /*
> + * Since the dl timer is going to be cancelled,
> + * we risk to never decrease the running bw...
> + * Fix this issue by changing the group runtime
> + * to 0 immediately before freeing it.
> + */
> + dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
> + raw_spin_rq_lock_irqsave(cpu_rq(i), flags);
> + BUG_ON(tg->rt_rq[i]->rt_nr_running);
> + raw_spin_rq_unlock_irqrestore(cpu_rq(i), flags);
So here we always call dl_init_tg for cpu 0, is it correct?
Also I wonder if the lock shouldn't cover both init and subsequent
check.
> +
> + hrtimer_cancel(&tg->dl_se[i]->dl_timer);
> + kfree(tg->dl_se[i]);
> + }
> + if (tg->rt_rq) {
> + struct rq *served_rq;
> +
> + served_rq = container_of(tg->rt_rq[i], struct rq, rt);
> + kfree(served_rq);
> + }
> + }
> +
> + kfree(tg->rt_rq);
> + kfree(tg->dl_se);
> }
>
> void init_tg_rt_entry(struct task_group *tg, struct rq *served_rq,
> @@ -109,12 +140,77 @@ void init_tg_rt_entry(struct task_group *tg, struct rq *served_rq,
> tg->dl_se[cpu] = dl_se;
> }
>
> +static bool rt_server_has_tasks(struct sched_dl_entity *dl_se)
> +{
> + return !!dl_se->my_q->rt.rt_nr_running;
> +}
> +
> +static struct task_struct *_pick_next_task_rt(struct rt_rq *rt_rq);
> +static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first);
> +static struct task_struct *rt_server_pick(struct sched_dl_entity *dl_se)
> +{
> + struct rt_rq *rt_rq = &dl_se->my_q->rt;
> + struct rq *rq = rq_of_rt_rq(rt_rq);
> + struct task_struct *p;
> +
> + if (dl_se->my_q->rt.rt_nr_running == 0)
> + return NULL;
> +
> + p = _pick_next_task_rt(rt_rq);
> + set_next_task_rt(rq, p, true);
> +
> + return p;
> +}
> +
> int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
> {
> + struct rq *s_rq;
> + struct sched_dl_entity *dl_se;
> + int i;
> +
> if (!rt_group_sched_enabled())
> return 1;
>
> + tg->rt_rq = kcalloc(nr_cpu_ids, sizeof(struct rt_rq *), GFP_KERNEL);
> + if (!tg->rt_rq)
> + goto err;
> + tg->dl_se = kcalloc(nr_cpu_ids, sizeof(dl_se), GFP_KERNEL);
> + if (!tg->dl_se)
> + goto err;
Don't we need to free the array allocated above if we fail here?
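Something along these lines, perhaps (untested sketch on top of this hunk):

	tg->dl_se = kcalloc(nr_cpu_ids, sizeof(dl_se), GFP_KERNEL);
	if (!tg->dl_se) {
		kfree(tg->rt_rq);
		tg->rt_rq = NULL;
		goto err;
	}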
Thanks,
Juri
^ permalink raw reply [flat|nested] 38+ messages in thread
* [RFC PATCH v2 13/25] sched/rt: Add HCBS related checks and operations for rt tasks
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (11 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 12/25] sched/rt: Add {alloc/free}_rt_sched_group and dl_server specific functions Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-08-14 14:01 ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 14/25] sched/rt: Update rt-cgroup schedulability checks Yuri Andriaccio
` (12 subsequent siblings)
25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Add checks for whether a task belongs to the root cgroup or to an rt-cgroup,
since HCBS reuses the rt class's scheduler, and operate accordingly where
needed.
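The recurring pattern throughout the diff below is (an illustrative summary,
not an additional hunk; p and delta_exec are assumed from context):

	struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);

	if (is_dl_group(rt_rq)) {
		/* p runs inside an rt-cgroup: charge its dl_server */
		struct sched_dl_entity *dl_se = dl_group_of(rt_rq);

		dl_server_update(dl_se, delta_exec);
	} else {
		/* p runs in the root group: keep the existing behaviour */
	}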
Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
kernel/sched/core.c | 3 +
kernel/sched/deadline.c | 16 ++++-
kernel/sched/rt.c | 147 +++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 6 +-
kernel/sched/syscalls.c | 13 ++++
5 files changed, 171 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3a69cb906c3..6173684a02b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2196,6 +2196,9 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
{
struct task_struct *donor = rq->donor;
+ if (is_dl_group(rt_rq_of_se(&p->rt)) && task_has_rt_policy(p))
+ resched_curr(rq);
+
if (p->sched_class == donor->sched_class)
donor->sched_class->wakeup_preempt(rq, p, flags);
else if (sched_class_above(p->sched_class, donor->sched_class))
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 264838c4a85..b948000f29f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1866,7 +1866,13 @@ void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
u64 deadline = dl_se->deadline;
dl_rq->dl_nr_running++;
- add_nr_running(rq_of_dl_rq(dl_rq), 1);
+ if (!dl_server(dl_se) || dl_se == &rq_of_dl_rq(dl_rq)->fair_server) {
+ add_nr_running(rq_of_dl_rq(dl_rq), 1);
+ } else {
+ struct rt_rq *rt_rq = &dl_se->my_q->rt;
+
+ add_nr_running(rq_of_dl_rq(dl_rq), rt_rq->rt_nr_running);
+ }
inc_dl_deadline(dl_rq, deadline);
}
@@ -1876,7 +1882,13 @@ void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
WARN_ON(!dl_rq->dl_nr_running);
dl_rq->dl_nr_running--;
- sub_nr_running(rq_of_dl_rq(dl_rq), 1);
+ if (!dl_server(dl_se) || dl_se == &rq_of_dl_rq(dl_rq)->fair_server) {
+ sub_nr_running(rq_of_dl_rq(dl_rq), 1);
+ } else {
+ struct rt_rq *rt_rq = &dl_se->my_q->rt;
+
+ sub_nr_running(rq_of_dl_rq(dl_rq), rt_rq->rt_nr_running);
+ }
dec_dl_deadline(dl_rq, dl_se->deadline);
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9c4ac6875a2..83695e11db4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -419,6 +419,7 @@ static inline int rt_se_prio(struct sched_rt_entity *rt_se)
static void update_curr_rt(struct rq *rq)
{
struct task_struct *donor = rq->donor;
+ struct rt_rq *rt_rq;
s64 delta_exec;
if (donor->sched_class != &rt_sched_class)
@@ -428,8 +429,18 @@ static void update_curr_rt(struct rq *rq)
if (unlikely(delta_exec <= 0))
return;
- if (!rt_bandwidth_enabled())
+ if (!rt_group_sched_enabled())
return;
+
+ if (!dl_bandwidth_enabled())
+ return;
+
+ rt_rq = rt_rq_of_se(&donor->rt);
+ if (is_dl_group(rt_rq)) {
+ struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+ dl_server_update(dl_se, delta_exec);
+ }
}
static void
@@ -440,7 +451,7 @@ inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
/*
* Change rq's cpupri only if rt_rq is the top queue.
*/
- if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && &rq->rt != rt_rq)
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq))
return;
if (rq->online && prio < prev_prio)
@@ -455,7 +466,7 @@ dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
/*
* Change rq's cpupri only if rt_rq is the top queue.
*/
- if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && &rq->rt != rt_rq)
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq))
return;
if (rq->online && rt_rq->highest_prio.curr != prev_prio)
@@ -524,6 +535,15 @@ void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
rt_rq->rr_nr_running += rt_se_rr_nr_running(rt_se);
inc_rt_prio(rt_rq, rt_se_prio(rt_se));
+
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq)) {
+ struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+ if (!dl_se->dl_throttled)
+ add_nr_running(rq_of_rt_rq(rt_rq), 1);
+ } else {
+ add_nr_running(rq_of_rt_rq(rt_rq), 1);
+ }
}
static inline
@@ -534,6 +554,15 @@ void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
rt_rq->rr_nr_running -= rt_se_rr_nr_running(rt_se);
dec_rt_prio(rt_rq, rt_se_prio(rt_se));
+
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq)) {
+ struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+ if (!dl_se->dl_throttled)
+ sub_nr_running(rq_of_rt_rq(rt_rq), 1);
+ } else {
+ sub_nr_running(rq_of_rt_rq(rt_rq), 1);
+ }
}
/*
@@ -715,6 +744,14 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
check_schedstat_required();
update_stats_wait_start_rt(rt_rq_of_se(rt_se), rt_se);
+ /* Task arriving in an idle group of tasks. */
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) &&
+ is_dl_group(rt_rq) && rt_rq->rt_nr_running == 0) {
+ struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+ dl_server_start(dl_se);
+ }
+
enqueue_rt_entity(rt_se, flags);
if (task_is_blocked(p))
@@ -734,6 +771,14 @@ static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
dequeue_pushable_task(rt_rq, p);
+ /* Last task of the task group. */
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) &&
+ is_dl_group(rt_rq) && rt_rq->rt_nr_running == 0) {
+ struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+ dl_server_stop(dl_se);
+ }
+
return true;
}
@@ -891,6 +936,34 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct task_struct *donor = rq->donor;
+ if (!rt_group_sched_enabled())
+ goto no_group_sched;
+
+ if (is_dl_group(rt_rq_of_se(&p->rt)) &&
+ is_dl_group(rt_rq_of_se(&rq->curr->rt))) {
+ struct sched_dl_entity *dl_se, *curr_dl_se;
+
+ dl_se = dl_group_of(rt_rq_of_se(&p->rt));
+ curr_dl_se = dl_group_of(rt_rq_of_se(&rq->curr->rt));
+
+ if (dl_entity_preempt(dl_se, curr_dl_se)) {
+ resched_curr(rq);
+ return;
+ } else if (!dl_entity_preempt(curr_dl_se, dl_se)) {
+ if (p->prio < rq->curr->prio) {
+ resched_curr(rq);
+ return;
+ }
+ }
+ return;
+ } else if (is_dl_group(rt_rq_of_se(&p->rt))) {
+ resched_curr(rq);
+ return;
+ } else if (is_dl_group(rt_rq_of_se(&rq->curr->rt))) {
+ return;
+ }
+
+no_group_sched:
if (p->prio < donor->prio) {
resched_curr(rq);
return;
@@ -1609,12 +1682,36 @@ static void pull_rt_task(struct rq *this_rq)
resched_curr(this_rq);
}
+#ifdef CONFIG_RT_GROUP_SCHED
+static int group_push_rt_task(struct rt_rq *rt_rq)
+{
+ struct rq *rq = rq_of_rt_rq(rt_rq);
+
+ if (is_dl_group(rt_rq))
+ return 0;
+
+ return push_rt_task(rq, false);
+}
+
+static void group_push_rt_tasks(struct rt_rq *rt_rq)
+{
+ while (group_push_rt_task(rt_rq))
+ ;
+}
+#else
+static void group_push_rt_tasks(struct rt_rq *rt_rq)
+{
+ push_rt_tasks(rq_of_rt_rq(rt_rq));
+}
+#endif
+
/*
* If we are not running and we are not going to reschedule soon, we should
* try to push tasks away now
*/
static void task_woken_rt(struct rq *rq, struct task_struct *p)
{
+ struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
bool need_to_push = !task_on_cpu(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
p->nr_cpus_allowed > 1 &&
@@ -1623,7 +1720,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
rq->donor->prio <= p->prio);
if (need_to_push)
- push_rt_tasks(rq);
+ group_push_rt_tasks(rt_rq);
}
/* Assumes rq->lock is held */
@@ -1632,6 +1729,7 @@ static void rq_online_rt(struct rq *rq)
if (rq->rt.overloaded)
rt_set_overload(rq);
+ /*FIXME: Enable the dl server! */
cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio.curr);
}
@@ -1641,6 +1739,7 @@ static void rq_offline_rt(struct rq *rq)
if (rq->rt.overloaded)
rt_clear_overload(rq);
+ /* FIXME: Disable the dl server! */
cpupri_set(&rq->rd->cpupri, rq->cpu, CPUPRI_INVALID);
}
@@ -1650,6 +1749,8 @@ static void rq_offline_rt(struct rq *rq)
*/
static void switched_from_rt(struct rq *rq, struct task_struct *p)
{
+ struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
+
/*
* If there are other RT tasks then we will reschedule
* and the scheduling of the other RT tasks will handle
@@ -1657,10 +1758,11 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
* we may need to handle the pulling of RT tasks
* now.
*/
- if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+ if (!task_on_rq_queued(p) || rt_rq->rt_nr_running)
return;
- rt_queue_pull_task(rq);
+ if (!IS_ENABLED(CONFIG_RT_GROUP_SCHED))
+ rt_queue_pull_task(rq);
}
void __init init_sched_rt_class(void)
@@ -1695,8 +1797,17 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
* then see if we can move to another run queue.
*/
if (task_on_rq_queued(p)) {
+
+#ifndef CONFIG_RT_GROUP_SCHED
if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
rt_queue_push_tasks(rq);
+#else
+ if (rt_rq_of_se(&p->rt)->overloaded) {
+ } else {
+ if (p->prio < rq->curr->prio)
+ resched_curr(rq);
+ }
+#endif
if (p->prio < rq->donor->prio && cpu_online(cpu_of(rq)))
resched_curr(rq);
}
@@ -1709,6 +1820,8 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
static void
prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
{
+ struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
+
if (!task_on_rq_queued(p))
return;
@@ -1717,16 +1830,25 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
* If our priority decreases while running, we
* may need to pull tasks to this runqueue.
*/
- if (oldprio < p->prio)
+ if (!IS_ENABLED(CONFIG_RT_GROUP_SCHED) && oldprio < p->prio)
rt_queue_pull_task(rq);
/*
* If there's a higher priority task waiting to run
* then reschedule.
*/
- if (p->prio > rq->rt.highest_prio.curr)
+ if (p->prio > rt_rq->highest_prio.curr)
resched_curr(rq);
} else {
+ /*
+ * This task is not running, thus we check against the currently
+ * running task for preemption. We can preempt only if both tasks are
+ * in the same cgroup or on the global runqueue.
+ */
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) &&
+ rt_rq_of_se(&p->rt)->tg != rt_rq_of_se(&rq->curr->rt)->tg)
+ return;
+
/*
* This task is not running, but if it is
* greater than the current running task
@@ -1821,7 +1943,16 @@ static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
#ifdef CONFIG_SCHED_CORE
static int task_is_throttled_rt(struct task_struct *p, int cpu)
{
+#ifdef CONFIG_RT_GROUP_SCHED
+ struct rt_rq *rt_rq;
+
+ rt_rq = task_group(p)->rt_rq[cpu];
+ WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
+
+ return dl_group_of(rt_rq)->dl_throttled;
+#else
return 0;
+#endif
}
#endif /* CONFIG_SCHED_CORE */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 611e3757fea..8bf8af7064f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2171,7 +2171,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
if (!rt_group_sched_enabled())
tg = &root_task_group;
p->rt.rt_rq = tg->rt_rq[cpu];
- p->rt.parent = tg->rt_se[cpu];
+ p->dl.dl_rq = &cpu_rq(cpu)->dl;
#endif /* CONFIG_RT_GROUP_SCHED */
}
@@ -2727,6 +2727,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
static inline void sub_nr_running(struct rq *rq, unsigned count)
{
+ BUG_ON(rq->nr_running < count);
rq->nr_running -= count;
if (trace_sched_update_nr_running_tp_enabled()) {
call_trace_sched_update_nr_running(rq, -count);
@@ -3057,9 +3058,6 @@ extern bool sched_smp_initialized;
#ifdef CONFIG_RT_GROUP_SCHED
static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
{
-#ifdef CONFIG_SCHED_DEBUG
- WARN_ON_ONCE(rt_se->my_q);
-#endif
return container_of(rt_se, struct task_struct, rt);
}
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 93a9c03b28e..7c1f7649477 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -626,6 +626,19 @@ int __sched_setscheduler(struct task_struct *p,
change:
if (user) {
+#ifdef CONFIG_RT_GROUP_SCHED
+ /*
+ * Do not allow real-time tasks into groups that have no runtime
+ * assigned.
+ */
+ if (rt_group_sched_enabled() &&
+ dl_bandwidth_enabled() && rt_policy(policy) &&
+ task_group(p)->dl_bandwidth.dl_runtime == 0 &&
+ !task_group_is_autogroup(task_group(p))) {
+ retval = -EPERM;
+ goto unlock;
+ }
+#endif
if (dl_bandwidth_enabled() && dl_policy(policy) &&
!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
cpumask_t *span = rq->rd->span;
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v2 13/25] sched/rt: Add HCBS related checks and operations for rt tasks
2025-07-31 10:55 ` [RFC PATCH v2 13/25] sched/rt: Add HCBS related checks and operations for rt tasks Yuri Andriaccio
@ 2025-08-14 14:01 ` Juri Lelli
0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 14:01 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
Hi!
On 31/07/25 12:55, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
>
> Add checks on whether a task belongs to the root cgroup or to an rt-cgroup, since
> HCBS reuses the rt class's scheduler, and operate accordingly where needed.
>
> Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
> Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
> Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
> Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
> Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
> ---
> kernel/sched/core.c | 3 +
> kernel/sched/deadline.c | 16 ++++-
> kernel/sched/rt.c | 147 +++++++++++++++++++++++++++++++++++++---
> kernel/sched/sched.h | 6 +-
> kernel/sched/syscalls.c | 13 ++++
> 5 files changed, 171 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3a69cb906c3..6173684a02b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2196,6 +2196,9 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
> {
> struct task_struct *donor = rq->donor;
>
> + if (is_dl_group(rt_rq_of_se(&p->rt)) && task_has_rt_policy(p))
> + resched_curr(rq);
Why this unconditional resched for tasks in groups?
...
> @@ -715,6 +744,14 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
> check_schedstat_required();
> update_stats_wait_start_rt(rt_rq_of_se(rt_se), rt_se);
>
> + /* Task arriving in an idle group of tasks. */
> + if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) &&
> + is_dl_group(rt_rq) && rt_rq->rt_nr_running == 0) {
> + struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
> +
> + dl_server_start(dl_se);
> + }
> +
> enqueue_rt_entity(rt_se, flags);
Is it OK to start the server before the first task is enqueued in an
idle group?
...
> @@ -891,6 +936,34 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
> {
> struct task_struct *donor = rq->donor;
>
> + if (!rt_group_sched_enabled())
> + goto no_group_sched;
> +
I think the below deserves a comment detailing the rules of preemption
(inside/outside groups, etc.).
> + if (is_dl_group(rt_rq_of_se(&p->rt)) &&
> + is_dl_group(rt_rq_of_se(&rq->curr->rt))) {
> + struct sched_dl_entity *dl_se, *curr_dl_se;
> +
> + dl_se = dl_group_of(rt_rq_of_se(&p->rt));
> + curr_dl_se = dl_group_of(rt_rq_of_se(&rq->curr->rt));
> +
> + if (dl_entity_preempt(dl_se, curr_dl_se)) {
> + resched_curr(rq);
> + return;
> + } else if (!dl_entity_preempt(curr_dl_se, dl_se)) {
Isn't this condition implied by the above?
...
> +#ifdef CONFIG_RT_GROUP_SCHED
> +static int group_push_rt_task(struct rt_rq *rt_rq)
> +{
> + struct rq *rq = rq_of_rt_rq(rt_rq);
> +
> + if (is_dl_group(rt_rq))
> + return 0;
> +
> + return push_rt_task(rq, false);
> +}
> +
> +static void group_push_rt_tasks(struct rt_rq *rt_rq)
> +{
> + while (group_push_rt_task(rt_rq))
> + ;
> +}
> +#else
> +static void group_push_rt_tasks(struct rt_rq *rt_rq)
> +{
> + push_rt_tasks(rq_of_rt_rq(rt_rq));
> +}
> +#endif
> +
Hummm, maybe too much for a single patch? I am a little lost at this
point. :))
Thanks,
Juri
^ permalink raw reply [flat|nested] 38+ messages in thread
* [RFC PATCH v2 14/25] sched/rt: Update rt-cgroup schedulability checks
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (12 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 13/25] sched/rt: Add HCBS related checks and operations for rt tasks Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 15/25] sched/rt: Remove old RT_GROUP_SCHED data structures Yuri Andriaccio
` (11 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Update schedulability checks and setup of runtime/period for rt-cgroups.
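For reference, the schedulability condition enforced by these checks boils down
to a per-group bandwidth comparison in the deadline class' fixed-point
convention. A minimal userspace sketch follows; to_ratio() and BW_UNIT here only
mimic the kernel helpers, and the runtime/period numbers are made up for
illustration:

	#include <stdio.h>
	#include <stdint.h>

	#define BW_UNIT (1ULL << 20)	/* a bandwidth of 1.0, as in the kernel */

	/* Illustrative stand-in for the kernel's to_ratio(): runtime/period scaled by BW_UNIT. */
	static uint64_t to_ratio(uint64_t period_ns, uint64_t runtime_ns)
	{
		if (!period_ns)
			return 0;
		return runtime_ns * BW_UNIT / period_ns;
	}

	int main(void)
	{
		/* Hypothetical parent group: 400us of runtime every 1ms. */
		uint64_t parent_bw = to_ratio(1000000, 400000);
		/* Two hypothetical children: 100us/1ms and 250us/1ms. */
		uint64_t children_sum = to_ratio(1000000, 100000) + to_ratio(1000000, 250000);

		printf("parent bw = %llu, children sum = %llu\n",
		       (unsigned long long)parent_bw, (unsigned long long)children_sum);
		/* Schedulable only if the children's total does not exceed the parent's bandwidth. */
		printf("schedulable: %s\n", children_sum <= parent_bw ? "yes" : "no");
		return 0;
	}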
Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
kernel/sched/core.c | 6 ++++
kernel/sched/deadline.c | 46 ++++++++++++++++++++++----
kernel/sched/rt.c | 72 +++++++++++++++++++++++------------------
kernel/sched/sched.h | 1 +
4 files changed, 88 insertions(+), 37 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6173684a02b..63cb9271052 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9277,6 +9277,12 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
return &root_task_group.css;
}
+ /* Do not allow cpu_cgroup hierarchies with depth greater than 2. */
+#ifdef CONFIG_RT_GROUP_SCHED
+ if (parent != &root_task_group)
+ return ERR_PTR(-EINVAL);
+#endif
+
tg = sched_create_group(parent);
if (IS_ERR(tg))
return ERR_PTR(-ENOMEM);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b948000f29f..7b131630743 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -365,7 +365,47 @@ void cancel_inactive_timer(struct sched_dl_entity *dl_se)
cancel_dl_timer(dl_se, &dl_se->inactive_timer);
}
+/*
+ * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
+ * sched_domains_mutex.
+ */
+u64 dl_cookie;
+
#ifdef CONFIG_RT_GROUP_SCHED
+int dl_check_tg(unsigned long total)
+{
+ unsigned long flags;
+ int which_cpu;
+ int cpus;
+ struct dl_bw *dl_b;
+ u64 gen = ++dl_cookie;
+
+ for_each_possible_cpu(which_cpu) {
+ rcu_read_lock_sched();
+
+ if (!dl_bw_visited(which_cpu, gen)) {
+ cpus = dl_bw_cpus(which_cpu);
+ dl_b = dl_bw_of(which_cpu);
+
+ raw_spin_lock_irqsave(&dl_b->lock, flags);
+
+ if (dl_b->bw != -1 &&
+ dl_b->bw * cpus < dl_b->total_bw + total * cpus) {
+ raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+ rcu_read_unlock_sched();
+
+ return 0;
+ }
+
+ raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+ }
+
+ rcu_read_unlock_sched();
+ }
+
+ return 1;
+}
+
void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
{
struct rq *rq = container_of(dl_se->dl_rq, struct rq, dl);
@@ -3139,12 +3179,6 @@ DEFINE_SCHED_CLASS(dl) = {
#endif
};
-/*
- * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
- * sched_domains_mutex.
- */
-u64 dl_cookie;
-
int sched_dl_global_validate(void)
{
u64 runtime = global_rt_runtime();
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 83695e11db4..bd11f4a03f7 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1996,11 +1996,6 @@ DEFINE_SCHED_CLASS(rt) = {
};
#ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
static inline int tg_has_rt_tasks(struct task_group *tg)
{
struct task_struct *task;
@@ -2034,8 +2029,8 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
unsigned long total, sum = 0;
u64 period, runtime;
- period = ktime_to_ns(tg->rt_bandwidth.rt_period);
- runtime = tg->rt_bandwidth.rt_runtime;
+ period = tg->dl_bandwidth.dl_period;
+ runtime = tg->dl_bandwidth.dl_runtime;
if (tg == d->tg) {
period = d->rt_period;
@@ -2051,8 +2046,7 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
/*
* Ensure we don't starve existing RT tasks if runtime turns zero.
*/
- if (rt_bandwidth_enabled() && !runtime &&
- tg->rt_bandwidth.rt_runtime && tg_has_rt_tasks(tg))
+ if (dl_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
return -EBUSY;
if (WARN_ON(!rt_group_sched_enabled() && tg != &root_task_group))
@@ -2066,12 +2060,17 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
if (total > to_ratio(global_rt_period(), global_rt_runtime()))
return -EINVAL;
+ if (tg == &root_task_group) {
+ if (!dl_check_tg(total))
+ return -EBUSY;
+ }
+
/*
* The sum of our children's runtime should not exceed our own.
*/
list_for_each_entry_rcu(child, &tg->children, siblings) {
- period = ktime_to_ns(child->rt_bandwidth.rt_period);
- runtime = child->rt_bandwidth.rt_runtime;
+ period = child->dl_bandwidth.dl_period;
+ runtime = child->dl_bandwidth.dl_runtime;
if (child == d->tg) {
period = d->rt_period;
@@ -2097,6 +2096,20 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
.rt_runtime = runtime,
};
+ /*
+ * Since we truncate DL_SCALE bits, make sure we're at least
+ * that big.
+ */
+ if (runtime != 0 && runtime < (1ULL << DL_SCALE))
+ return -EINVAL;
+
+ /*
+ * Since we use the MSB for wrap-around and sign issues, make
+ * sure it's not set (mind that period can be equal to zero).
+ */
+ if (period & (1ULL << 63))
+ return -EINVAL;
+
rcu_read_lock();
ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
rcu_read_unlock();
@@ -2107,6 +2120,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
static int tg_set_rt_bandwidth(struct task_group *tg,
u64 rt_period, u64 rt_runtime)
{
+ static DEFINE_MUTEX(rt_constraints_mutex);
int i, err = 0;
/*
@@ -2126,34 +2140,30 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
if (rt_runtime != RUNTIME_INF && rt_runtime > max_rt_runtime)
return -EINVAL;
- mutex_lock(&rt_constraints_mutex);
+ guard(mutex)(&rt_constraints_mutex);
err = __rt_schedulable(tg, rt_period, rt_runtime);
if (err)
- goto unlock;
+ return err;
- raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock);
- tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);
- tg->rt_bandwidth.rt_runtime = rt_runtime;
+ guard(raw_spinlock_irq)(&tg->dl_bandwidth.dl_runtime_lock);
+ tg->dl_bandwidth.dl_period = rt_period;
+ tg->dl_bandwidth.dl_runtime = rt_runtime;
- for_each_possible_cpu(i) {
- struct rt_rq *rt_rq = tg->rt_rq[i];
+ if (tg == &root_task_group)
+ return 0;
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- rt_rq->rt_runtime = rt_runtime;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
+ for_each_possible_cpu(i) {
+ dl_init_tg(tg->dl_se[i], rt_runtime, rt_period);
}
- raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
-unlock:
- mutex_unlock(&rt_constraints_mutex);
- return err;
+ return 0;
}
int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
{
u64 rt_runtime, rt_period;
- rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);
+ rt_period = tg->dl_bandwidth.dl_period;
rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;
if (rt_runtime_us < 0)
rt_runtime = RUNTIME_INF;
@@ -2167,10 +2177,10 @@ long sched_group_rt_runtime(struct task_group *tg)
{
u64 rt_runtime_us;
- if (tg->rt_bandwidth.rt_runtime == RUNTIME_INF)
+ if (tg->dl_bandwidth.dl_runtime == RUNTIME_INF)
return -1;
- rt_runtime_us = tg->rt_bandwidth.rt_runtime;
+ rt_runtime_us = tg->dl_bandwidth.dl_runtime;
do_div(rt_runtime_us, NSEC_PER_USEC);
return rt_runtime_us;
}
@@ -2183,7 +2193,7 @@ int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us)
return -EINVAL;
rt_period = rt_period_us * NSEC_PER_USEC;
- rt_runtime = tg->rt_bandwidth.rt_runtime;
+ rt_runtime = tg->dl_bandwidth.dl_runtime;
return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
}
@@ -2192,7 +2202,7 @@ long sched_group_rt_period(struct task_group *tg)
{
u64 rt_period_us;
- rt_period_us = ktime_to_ns(tg->rt_bandwidth.rt_period);
+ rt_period_us = tg->dl_bandwidth.dl_period;
do_div(rt_period_us, NSEC_PER_USEC);
return rt_period_us;
}
@@ -2207,7 +2217,7 @@ static int sched_rt_global_constraints(void)
int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
{
/* Don't accept real-time tasks when there is no way for them to run */
- if (rt_group_sched_enabled() && rt_task(tsk) && tg->rt_bandwidth.rt_runtime == 0)
+ if (rt_group_sched_enabled() && rt_task(tsk) && tg->dl_bandwidth.dl_runtime == 0)
return 0;
return 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8bf8af7064f..9f235df4bf1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -394,6 +394,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick_task);
extern void sched_init_dl_servers(void);
+extern int dl_check_tg(unsigned long total);
extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
extern void dl_server_update_idle_time(struct rq *rq,
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC PATCH v2 15/25] sched/rt: Remove old RT_GROUP_SCHED data structures
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (13 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 14/25] sched/rt: Update rt-cgroup schedulability checks Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 16/25] sched/core: Cgroup v2 support Yuri Andriaccio
` (10 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Completely remove the old RT_GROUP_SCHED's functions and data structures.
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
include/linux/sched.h | 4 ----
kernel/sched/rt.c | 1 -
kernel/sched/sched.h | 26 --------------------------
3 files changed, 31 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f0c8229afd1..343e8ef5ba1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -621,13 +621,9 @@ struct sched_rt_entity {
unsigned short on_rq;
unsigned short on_list;
- struct sched_rt_entity *back;
#ifdef CONFIG_RT_GROUP_SCHED
- struct sched_rt_entity *parent;
/* rq on which this entity is (to be) queued: */
struct rt_rq *rt_rq;
- /* rq "owned" by this entity/group: */
- struct rt_rq *my_q;
#endif
} __randomize_layout;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index bd11f4a03f7..f37ac9100d1 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1,4 +1,3 @@
-#pragma GCC diagnostic ignored "-Wunused-function"
// SPDX-License-Identifier: GPL-2.0
/*
* Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f235df4bf1..4a1bbda3720 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -310,15 +310,6 @@ struct rt_prio_array {
struct list_head queue[MAX_RT_PRIO];
};
-struct rt_bandwidth {
- /* nests inside the rq lock: */
- raw_spinlock_t rt_runtime_lock;
- ktime_t rt_period;
- u64 rt_runtime;
- struct hrtimer rt_period_timer;
- unsigned int rt_period_active;
-};
-
struct dl_bandwidth {
raw_spinlock_t dl_runtime_lock;
u64 dl_runtime;
@@ -483,7 +474,6 @@ struct task_group {
struct sched_dl_entity **dl_se;
struct rt_rq **rt_rq;
- struct rt_bandwidth rt_bandwidth;
struct dl_bandwidth dl_bandwidth;
#endif
@@ -802,11 +792,6 @@ struct scx_rq {
};
#endif /* CONFIG_SCHED_CLASS_EXT */
-static inline int rt_bandwidth_enabled(void)
-{
- return 0;
-}
-
/* RT IPI pull logic requires IRQ_WORK */
#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
# define HAVE_RT_PUSH_IPI
@@ -824,17 +809,6 @@ struct rt_rq {
bool overloaded;
struct plist_head pushable_tasks;
- int rt_queued;
-
-#ifdef CONFIG_RT_GROUP_SCHED
- int rt_throttled;
- u64 rt_time; /* consumed RT time, goes up in update_curr_rt */
- u64 rt_runtime; /* allotted RT time, "slice" from rt_bandwidth, RT sharing/balancing */
- /* Nests inside the rq lock: */
- raw_spinlock_t rt_runtime_lock;
-
- unsigned int rt_nr_boosted;
-#endif
#ifdef CONFIG_CGROUP_SCHED
struct task_group *tg; /* this tg has "this" rt_rq on given CPU for runnable entities */
#endif
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC PATCH v2 16/25] sched/core: Cgroup v2 support
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (14 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 15/25] sched/rt: Remove old RT_GROUP_SCHED data structures Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 17/25] sched/rt: Remove support for cgroups-v1 Yuri Andriaccio
` (9 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Make the rt_runtime_us and rt_period_us virtual files accessible to the cgroup
v2 controller as well, effectively enabling the RT_GROUP_SCHED mechanism for
cgroups v2.
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/core.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 63cb9271052..465f44d7235 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10300,6 +10300,18 @@ static struct cftype cpu_files[] = {
.write = cpu_uclamp_max_write,
},
#endif /* CONFIG_UCLAMP_TASK_GROUP */
+#ifdef CONFIG_RT_GROUP_SCHED
+ {
+ .name = "rt_runtime_us",
+ .read_s64 = cpu_rt_runtime_read,
+ .write_s64 = cpu_rt_runtime_write,
+ },
+ {
+ .name = "rt_period_us",
+ .read_u64 = cpu_rt_period_read_uint,
+ .write_u64 = cpu_rt_period_write_uint,
+ },
+#endif /* CONFIG_RT_GROUP_SCHED */
{ } /* terminate */
};
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC PATCH v2 17/25] sched/rt: Remove support for cgroups-v1
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (15 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 16/25] sched/core: Cgroup v2 support Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 18/25] sched/rt: Zero rt-cgroups default bandwidth Yuri Andriaccio
` (8 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
Disable the control files for cgroups-v1, allowing only cgroups-v2. This should
simplify maintaining the code, especially since cgroups-v1 are deprecated.
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/core.c | 18 ------------------
1 file changed, 18 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 465f44d7235..85950e10bb1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10039,20 +10039,6 @@ static struct cftype cpu_legacy_files[] = {
};
#ifdef CONFIG_RT_GROUP_SCHED
-static struct cftype rt_group_files[] = {
- {
- .name = "rt_runtime_us",
- .read_s64 = cpu_rt_runtime_read,
- .write_s64 = cpu_rt_runtime_write,
- },
- {
- .name = "rt_period_us",
- .read_u64 = cpu_rt_period_read_uint,
- .write_u64 = cpu_rt_period_write_uint,
- },
- { } /* Terminate */
-};
-
# ifdef CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED
DEFINE_STATIC_KEY_FALSE(rt_group_sched);
# else
@@ -10078,10 +10064,6 @@ __setup("rt_group_sched=", setup_rt_group_sched);
static int __init cpu_rt_group_init(void)
{
- if (!rt_group_sched_enabled())
- return 0;
-
- WARN_ON(cgroup_add_legacy_cftypes(&cpu_cgrp_subsys, rt_group_files));
return 0;
}
subsys_initcall(cpu_rt_group_init);
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC PATCH v2 18/25] sched/rt: Zero rt-cgroups default bandwidth
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (16 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 17/25] sched/rt: Remove support for cgroups-v1 Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-08-14 14:13 ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 19/25] sched/deadline: Allow deeper hierarchies of RT cgroups Yuri Andriaccio
` (7 subsequent siblings)
25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
Set the default rt-cgroup runtime to zero; otherwise a cgroup-v1 kernel will
not be able to start SCHED_DEADLINE tasks. The bandwidth for rt-cgroups must
then be assigned manually after the kernel boots.
Allow zeroing the runtime of the root control group. This runtime only affects
the available bandwidth of the rt-cgroup hierarchy but not the SCHED_FIFO /
SCHED_RR tasks on the global runqueue.
Notes:
Disabling the root control group's bandwidth should not cause any side effects, as
SCHED_FIFO / SCHED_RR tasks no longer depend on it since the introduction of
fair-servers.
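The intended attach semantics can be summarized with the following toy sketch
(illustrative userspace code with made-up types, not the kernel implementation
in the diff below): the root group always accepts RT tasks, while any other
group must have a non-zero runtime assigned first.

	#include <stdbool.h>
	#include <stdio.h>

	struct toy_group {
		const char *name;
		unsigned long long dl_runtime;	/* ns per period, 0 = no bandwidth */
		bool is_root;
	};

	static bool can_attach_rt(const struct toy_group *tg)
	{
		if (tg->is_root)
			return true;		/* global runqueue, throttled by fair-servers */
		return tg->dl_runtime != 0;	/* needs bandwidth assigned after boot */
	}

	int main(void)
	{
		struct toy_group root = { "/", 0, true };
		struct toy_group g = { "/rtgroup", 0, false };

		printf("root: %d, /rtgroup (runtime=0): %d\n",
		       can_attach_rt(&root), can_attach_rt(&g));
		g.dl_runtime = 100000;	/* e.g. 100us assigned via rt_runtime_us */
		printf("/rtgroup (runtime=100us): %d\n", can_attach_rt(&g));
		return 0;
	}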
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/core.c | 4 ++--
kernel/sched/rt.c | 13 +++++--------
kernel/sched/syscalls.c | 2 +-
3 files changed, 8 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 85950e10bb1..3ac65c6af70 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8719,7 +8719,7 @@ void __init sched_init(void)
#ifdef CONFIG_RT_GROUP_SCHED
init_dl_bandwidth(&root_task_group.dl_bandwidth,
- global_rt_period(), global_rt_runtime());
+ global_rt_period(), 0);
#endif /* CONFIG_RT_GROUP_SCHED */
#ifdef CONFIG_CGROUP_SCHED
@@ -9348,7 +9348,7 @@ static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
goto scx_check;
cgroup_taskset_for_each(task, css, tset) {
- if (!sched_rt_can_attach(css_tg(css), task))
+ if (rt_task(task) && !sched_rt_can_attach(css_tg(css), task))
return -EINVAL;
}
scx_check:
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f37ac9100d1..75a6860c2e2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2122,13 +2122,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
static DEFINE_MUTEX(rt_constraints_mutex);
int i, err = 0;
- /*
- * Disallowing the root group RT runtime is BAD, it would disallow the
- * kernel creating (and or operating) RT threads.
- */
- if (tg == &root_task_group && rt_runtime == 0)
- return -EINVAL;
-
/* No period doesn't make any sense. */
if (rt_period == 0)
return -EINVAL;
@@ -2215,8 +2208,12 @@ static int sched_rt_global_constraints(void)
int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
{
+ /* Allow executing in the root cgroup regardless of allowed bandwidth */
+ if (tg == &root_task_group)
+ return 1;
+
/* Don't accept real-time tasks when there is no way for them to run */
- if (rt_group_sched_enabled() && rt_task(tsk) && tg->dl_bandwidth.dl_runtime == 0)
+ if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0)
return 0;
return 1;
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 7c1f7649477..71f20be6f29 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -633,7 +633,7 @@ int __sched_setscheduler(struct task_struct *p,
*/
if (rt_group_sched_enabled() &&
dl_bandwidth_enabled() && rt_policy(policy) &&
- task_group(p)->dl_bandwidth.dl_runtime == 0 &&
+ !sched_rt_can_attach(task_group(p), p) &&
!task_group_is_autogroup(task_group(p))) {
retval = -EPERM;
goto unlock;
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v2 18/25] sched/rt: Zero rt-cgroups default bandwidth
2025-07-31 10:55 ` [RFC PATCH v2 18/25] sched/rt: Zero rt-cgroups default bandwidth Yuri Andriaccio
@ 2025-08-14 14:13 ` Juri Lelli
0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 14:13 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
Hi!
On 31/07/25 12:55, Yuri Andriaccio wrote:
> Set the default rt-cgroups runtime to zero, otherwise a cgroup-v1 kernel will
> not be able to start SCHED_DEADLINE tasks. The bandwidth for rt-cgroups must
Well, we disabled v1 support at this point with the previous patch,
didn't we? :)
> then be manually assigned after the kernel boots.
>
> Allow zeroing the runtime of the root control group. This runtime only affects
> the available bandwidth of the rt-cgroup hierarchy but not the SCHED_FIFO /
> SCHED_RR tasks on the global runqueue.
>
> Notes:
> Disabling the root control group bandwidth should not cause any side effect, as
> SCHED_FIFO / SCHED_RR tasks do not depend on it since the introduction of
> fair_servers.
I believe this deserves proper documentation.
Thanks,
Juri
^ permalink raw reply [flat|nested] 38+ messages in thread
* [RFC PATCH v2 19/25] sched/deadline: Allow deeper hierarchies of RT cgroups
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (17 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 18/25] sched/rt: Zero rt-cgroups default bandwidth Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-08-14 14:29 ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 20/25] sched/rt: Add rt-cgroup migration Yuri Andriaccio
` (6 subsequent siblings)
25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Allow the creation of cgroup hierarchies with depth greater than two.
Add a check to prevent attaching tasks to a child cgroup of an active cgroup (i.e.
one with a running FIFO/RR task).
Add a check to prevent attaching tasks to cgroups which have children with
non-zero runtime.
Update the rt-cgroups' allocated-bandwidth accounting for nested cgroup hierarchies.
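The combined effect of the checks above can be summarized by the following toy
sketch (illustrative userspace code with made-up types; the real checks live in
tg_set_rt_bandwidth() and sched_rt_can_attach() in the diff below): RT tasks may
only sit in a group none of whose children has runtime allocated, and runtime
may only be given to a group whose parent is not already hosting RT tasks.

	#include <stdbool.h>
	#include <stdio.h>

	struct toy_group {
		unsigned long long runtime;	/* 0 = inactive */
		bool has_rt_tasks;
		struct toy_group *parent;
		struct toy_group *children[4];	/* toy fixed fan-out */
		int nr_children;
	};

	static bool may_attach_rt(const struct toy_group *tg)
	{
		for (int i = 0; i < tg->nr_children; i++)
			if (tg->children[i]->runtime)
				return false;	/* an active child exists */
		return tg->runtime != 0;
	}

	static bool may_set_runtime(const struct toy_group *tg)
	{
		/* Cannot activate a child of a group that already runs RT tasks. */
		return !(tg->parent && tg->parent->has_rt_tasks);
	}

	int main(void)
	{
		struct toy_group parent = { .runtime = 500000 };
		struct toy_group child = { .runtime = 0, .parent = &parent };

		parent.children[parent.nr_children++] = &child;
		printf("attach to parent: %d, give child runtime: %d\n",
		       may_attach_rt(&parent), may_set_runtime(&child));
		child.runtime = 100000;
		printf("attach to parent with active child: %d\n",
		       may_attach_rt(&parent));
		return 0;
	}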
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
kernel/sched/core.c | 6 -----
kernel/sched/deadline.c | 51 +++++++++++++++++++++++++++++++++++++----
kernel/sched/rt.c | 25 +++++++++++++++++---
kernel/sched/sched.h | 2 +-
4 files changed, 70 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ac65c6af70..eb9de8c7b1f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9277,12 +9277,6 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
return &root_task_group.css;
}
- /* Do not allow cpu_cgroup hierarchies with depth greater than 2. */
-#ifdef CONFIG_RT_GROUP_SCHED
- if (parent != &root_task_group)
- return ERR_PTR(-EINVAL);
-#endif
-
tg = sched_create_group(parent);
if (IS_ERR(tg))
return ERR_PTR(-ENOMEM);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7b131630743..e263abcdc04 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -406,11 +406,42 @@ int dl_check_tg(unsigned long total)
return 1;
}
-void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
+static inline bool is_active_sched_group(struct task_group *tg)
{
+ struct task_group *child;
+ bool is_active = 1;
+
+ // if there are no children, this is a leaf group, thus it is active
+ list_for_each_entry_rcu(child, &tg->children, siblings) {
+ if (child->dl_bandwidth.dl_runtime > 0) {
+ is_active = 0;
+ }
+ }
+ return is_active;
+}
+
+static inline bool sched_group_has_active_siblings(struct task_group *tg)
+{
+ struct task_group *child;
+ bool has_active_siblings = 0;
+
+ // if there are no children, this is a leaf group, thus it is active
+ list_for_each_entry_rcu(child, &tg->parent->children, siblings) {
+ if (child != tg && child->dl_bandwidth.dl_runtime > 0) {
+ has_active_siblings = 1;
+ }
+ }
+ return has_active_siblings;
+}
+
+void dl_init_tg(struct task_group *tg, int cpu, u64 rt_runtime, u64 rt_period)
+{
+ struct sched_dl_entity *dl_se = tg->dl_se[cpu];
struct rq *rq = container_of(dl_se->dl_rq, struct rq, dl);
- int is_active;
- u64 new_bw;
+ int is_active, is_active_group;
+ u64 old_runtime, new_bw;
+
+ is_active_group = is_active_sched_group(tg);
raw_spin_rq_lock_irq(rq);
is_active = dl_se->my_q->rt.rt_nr_running > 0;
@@ -418,8 +449,10 @@ void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
update_rq_clock(rq);
dl_server_stop(dl_se);
+ old_runtime = dl_se->dl_runtime;
new_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
- dl_rq_change_utilization(rq, dl_se, new_bw);
+ if (is_active_group)
+ dl_rq_change_utilization(rq, dl_se, new_bw);
dl_se->dl_runtime = rt_runtime;
dl_se->dl_deadline = rt_period;
@@ -431,6 +464,16 @@ void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
dl_se->dl_bw = new_bw;
dl_se->dl_density = new_bw;
+ // add/remove the parent's bw
+ if (tg->parent && tg->parent != &root_task_group)
+ {
+ if (rt_runtime == 0 && old_runtime != 0 && !sched_group_has_active_siblings(tg)) {
+ __add_rq_bw(tg->parent->dl_se[cpu]->dl_bw, dl_se->dl_rq);
+ } else if (rt_runtime != 0 && old_runtime == 0 && !sched_group_has_active_siblings(tg)) {
+ __sub_rq_bw(tg->parent->dl_se[cpu]->dl_bw, dl_se->dl_rq);
+ }
+ }
+
if (is_active)
dl_server_start(dl_se);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 75a6860c2e2..29b51251fdc 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -107,7 +107,8 @@ void free_rt_sched_group(struct task_group *tg)
* Fix this issue by changing the group runtime
* to 0 immediately before freeing it.
*/
- dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
+ if (tg->dl_se[i]->dl_runtime)
+ dl_init_tg(tg, i, 0, tg->dl_se[i]->dl_period);
raw_spin_rq_lock_irqsave(cpu_rq(i), flags);
BUG_ON(tg->rt_rq[i]->rt_nr_running);
raw_spin_rq_unlock_irqrestore(cpu_rq(i), flags);
@@ -2122,6 +2123,14 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
static DEFINE_MUTEX(rt_constraints_mutex);
int i, err = 0;
+ /*
+ * Do not allow to set a RT runtime > 0 if the parent has RT tasks
+ * (and is not the root group)
+ */
+ if (rt_runtime && (tg != &root_task_group) && (tg->parent != &root_task_group) && tg_has_rt_tasks(tg->parent)) {
+ return -EINVAL;
+ }
+
/* No period doesn't make any sense. */
if (rt_period == 0)
return -EINVAL;
@@ -2145,7 +2154,7 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
return 0;
for_each_possible_cpu(i) {
- dl_init_tg(tg->dl_se[i], rt_runtime, rt_period);
+ dl_init_tg(tg, i, rt_runtime, rt_period);
}
return 0;
@@ -2208,6 +2217,9 @@ static int sched_rt_global_constraints(void)
int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
{
+ struct task_group *child;
+ int can_attach = 1;
+
/* Allow executing in the root cgroup regardless of allowed bandwidth */
if (tg == &root_task_group)
return 1;
@@ -2216,7 +2228,14 @@ int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0)
return 0;
- return 1;
+ /* If one of the children has runtime > 0, cannot attach RT tasks! */
+ list_for_each_entry_rcu(child, &tg->children, siblings) {
+ if (child->dl_bandwidth.dl_runtime) {
+ can_attach = 0;
+ }
+ }
+
+ return can_attach;
}
#else /* !CONFIG_RT_GROUP_SCHED: */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a1bbda3720..3dd2ede6d35 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -386,7 +386,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
dl_server_pick_f pick_task);
extern void sched_init_dl_servers(void);
extern int dl_check_tg(unsigned long total);
-extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
+extern void dl_init_tg(struct task_group *tg, int cpu, u64 rt_runtime, u64 rt_period);
extern void dl_server_update_idle_time(struct rq *rq,
struct task_struct *p);
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v2 19/25] sched/deadline: Allow deeper hierarchies of RT cgroups
2025-07-31 10:55 ` [RFC PATCH v2 19/25] sched/deadline: Allow deeper hierarchies of RT cgroups Yuri Andriaccio
@ 2025-08-14 14:29 ` Juri Lelli
0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 14:29 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
Hi!
On 31/07/25 12:55, Yuri Andriaccio wrote:
...
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 7b131630743..e263abcdc04 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -406,11 +406,42 @@ int dl_check_tg(unsigned long total)
> return 1;
> }
>
> -void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
> +static inline bool is_active_sched_group(struct task_group *tg)
> {
> + struct task_group *child;
> + bool is_active = 1;
> +
> + // if there are no children, this is a leaf group, thus it is active
> + list_for_each_entry_rcu(child, &tg->children, siblings) {
> + if (child->dl_bandwidth.dl_runtime > 0) {
> + is_active = 0;
> + }
> + }
> + return is_active;
> +}
> +
> +static inline bool sched_group_has_active_siblings(struct task_group *tg)
> +{
> + struct task_group *child;
> + bool has_active_siblings = 0;
> +
> + // if there are no children, this is a leaf group, thus it is active
> + list_for_each_entry_rcu(child, &tg->parent->children, siblings) {
> + if (child != tg && child->dl_bandwidth.dl_runtime > 0) {
> + has_active_siblings = 1;
> + }
> + }
> + return has_active_siblings;
> +}
...
> @@ -2216,7 +2228,14 @@ int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
> if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0)
> return 0;
>
> - return 1;
> + /* If one of the children has runtime > 0, cannot attach RT tasks! */
> + list_for_each_entry_rcu(child, &tg->children, siblings) {
> + if (child->dl_bandwidth.dl_runtime) {
> + can_attach = 0;
> + }
> + }
Can we maybe reuse some of the methods above?
Thanks,
Juri
^ permalink raw reply [flat|nested] 38+ messages in thread
* [RFC PATCH v2 20/25] sched/rt: Add rt-cgroup migration
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (18 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 19/25] sched/deadline: Allow deeper hierarchies of RT cgroups Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 21/25] sched/rt: add HCBS migration related checks and function calls Yuri Andriaccio
` (5 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
When the runtime is exhausted in an RT cgroup, the scheduler checks for
another non-throttled runqueue and, if one is available, migrates the tasks there.
The bandwidth (runtime/period) chosen for a given cgroup is replicated on
every core of the system; therefore, in an SMP system with M cores, the
total available bandwidth is the configured runtime/period multiplied by M.
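As a quick worked example (illustrative numbers only, not taken from this
patch), with the per-CPU replication described above a group given 200us every
1ms on an 8-CPU system can consume up to 8 * 0.2 = 1.6 CPUs worth of time,
provided its tasks can migrate among the per-CPU servers:

	#include <stdio.h>

	int main(void)
	{
		const double runtime_us = 200.0, period_us = 1000.0;
		const int nr_cpus = 8;				/* hypothetical SMP system */
		double per_cpu_bw = runtime_us / period_us;	/* utilization of one per-CPU server */

		printf("per-CPU bandwidth: %.2f, total: %.2f CPUs\n",
		       per_cpu_bw, per_cpu_bw * nr_cpus);
		return 0;
	}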
Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
kernel/sched/rt.c | 471 ++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 5 +
2 files changed, 468 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 29b51251fdc..2fdb2657554 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1,3 +1,4 @@
+#pragma GCC diagnostic ignored "-Wunused-function"
// SPDX-License-Identifier: GPL-2.0
/*
* Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR
@@ -84,6 +85,8 @@ void init_rt_rq(struct rt_rq *rt_rq)
plist_head_init(&rt_rq->pushable_tasks);
}
+static void group_pull_rt_task(struct rt_rq *this_rt_rq);
+
#ifdef CONFIG_RT_GROUP_SCHED
void unregister_rt_sched_group(struct task_group *tg)
@@ -289,6 +292,45 @@ static inline void rt_queue_pull_task(struct rq *rq)
queue_balance_callback(rq, &per_cpu(rt_pull_head, rq->cpu), pull_rt_task);
}
+#ifdef CONFIG_RT_GROUP_SCHED
+static DEFINE_PER_CPU(struct balance_callback, rt_group_push_head);
+static DEFINE_PER_CPU(struct balance_callback, rt_group_pull_head);
+static void push_group_rt_tasks(struct rq *);
+static void pull_group_rt_tasks(struct rq *);
+
+static void rt_queue_push_from_group(struct rq *rq, struct rt_rq *rt_rq)
+{
+ BUG_ON(rt_rq == NULL);
+ BUG_ON(rt_rq->rq != rq);
+
+ if (rq->rq_to_push_from)
+ return;
+
+ rq->rq_to_push_from = container_of(rt_rq, struct rq, rt);
+ queue_balance_callback(rq, &per_cpu(rt_group_push_head, rq->cpu),
+ push_group_rt_tasks);
+}
+
+static void rt_queue_pull_to_group(struct rq *rq, struct rt_rq *rt_rq)
+{
+ struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+ BUG_ON(rt_rq == NULL);
+ BUG_ON(!is_dl_group(rt_rq));
+ BUG_ON(rt_rq->rq != rq);
+
+ if (dl_se->dl_throttled || rq->rq_to_pull_to)
+ return;
+
+ rq->rq_to_pull_to = container_of(rt_rq, struct rq, rt);
+ queue_balance_callback(rq, &per_cpu(rt_group_pull_head, rq->cpu),
+ pull_group_rt_tasks);
+}
+#else
+static inline void rt_queue_push_from_group(struct rq *rq, struct rt_rq *rt_rq) {};
+static inline void rt_queue_pull_to_group(struct rq *rq, struct rt_rq *rt_rq) {};
+#endif
+
static void enqueue_pushable_task(struct rt_rq *rt_rq, struct task_struct *p)
{
plist_del(&p->pushable_tasks, &rt_rq->pushable_tasks);
@@ -1277,6 +1319,8 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
*/
static int push_rt_task(struct rq *rq, bool pull)
{
+ BUG_ON(is_dl_group(&rq->rt));
+
struct task_struct *next_task;
struct rq *lowest_rq;
int ret = 0;
@@ -1573,6 +1617,8 @@ void rto_push_irq_work_func(struct irq_work *work)
static void pull_rt_task(struct rq *this_rq)
{
+ BUG_ON(is_dl_group(&this_rq->rt));
+
int this_cpu = this_rq->cpu, cpu;
bool resched = false;
struct task_struct *p, *push_task;
@@ -1683,27 +1729,436 @@ static void pull_rt_task(struct rq *this_rq)
}
#ifdef CONFIG_RT_GROUP_SCHED
-static int group_push_rt_task(struct rt_rq *rt_rq)
+/*
+ * Find the lowest priority runqueue among the runqueues of the same
+ * task group. Unlike find_lowest_rt(), this does not mean that the
+ * lowest priority cpu is running tasks from this runqueue.
+ */
+static int group_find_lowest_rt_rq(struct task_struct *task, struct rt_rq* task_rt_rq)
+{
+ struct sched_domain *sd;
+ struct cpumask mask, *lowest_mask = &mask;
+ struct sched_dl_entity *dl_se;
+ struct rt_rq *rt_rq;
+ int prio, lowest_prio;
+ int cpu, this_cpu = smp_processor_id();
+
+ BUG_ON(task->sched_task_group != task_rt_rq->tg);
+
+ if (task->nr_cpus_allowed == 1)
+ return -1; /* No other targets possible */
+
+ lowest_prio = task->prio - 1;
+ cpumask_clear(lowest_mask);
+ for_each_cpu_and(cpu, cpu_online_mask, task->cpus_ptr) {
+ dl_se = task_rt_rq->tg->dl_se[cpu];
+ rt_rq = &dl_se->my_q->rt;
+ prio = rt_rq->highest_prio.curr;
+
+ /*
+ * If we're on asym system ensure we consider the different capacities
+ * of the CPUs when searching for the lowest_mask.
+ */
+ if (dl_se->dl_throttled || !rt_task_fits_capacity(task, cpu))
+ continue;
+
+ if (prio >= lowest_prio) {
+ if (prio > lowest_prio) {
+ cpumask_clear(lowest_mask);
+ lowest_prio = prio;
+ }
+
+ cpumask_set_cpu(cpu, lowest_mask);
+ }
+ }
+
+ if (cpumask_empty(lowest_mask))
+ return -1;
+
+ /*
+ * At this point we have built a mask of CPUs representing the
+ * lowest priority tasks in the system. Now we want to elect
+ * the best one based on our affinity and topology.
+ *
+ * We prioritize the last CPU that the task executed on since
+ * it is most likely cache-hot in that location.
+ */
+ cpu = task_cpu(task);
+ if (cpumask_test_cpu(cpu, lowest_mask))
+ return cpu;
+
+ /*
+ * Otherwise, we consult the sched_domains span maps to figure
+ * out which CPU is logically closest to our hot cache data.
+ */
+ if (!cpumask_test_cpu(this_cpu, lowest_mask))
+ this_cpu = -1; /* Skip this_cpu opt if not among lowest */
+
+ rcu_read_lock();
+ for_each_domain(cpu, sd) {
+ if (sd->flags & SD_WAKE_AFFINE) {
+ int best_cpu;
+
+ /*
+ * "this_cpu" is cheaper to preempt than a
+ * remote processor.
+ */
+ if (this_cpu != -1 &&
+ cpumask_test_cpu(this_cpu, sched_domain_span(sd))) {
+ rcu_read_unlock();
+ return this_cpu;
+ }
+
+ best_cpu = cpumask_any_and_distribute(lowest_mask,
+ sched_domain_span(sd));
+ if (best_cpu < nr_cpu_ids) {
+ rcu_read_unlock();
+ return best_cpu;
+ }
+ }
+ }
+ rcu_read_unlock();
+
+ /*
+ * And finally, if there were no matches within the domains
+ * just give the caller *something* to work with from the compatible
+ * locations.
+ */
+ if (this_cpu != -1)
+ return this_cpu;
+
+ cpu = cpumask_any_distribute(lowest_mask);
+ if (cpu < nr_cpu_ids)
+ return cpu;
+
+ return -1;
+}
+
+/*
+ * Find and lock the lowest priority runqueue among the runqueues
+ * of the same task group. Unlike find_lock_lowest_rt(), this does not
+ * mean that the lowest priority cpu is running tasks from this runqueue.
+ */
+static struct rt_rq* group_find_lock_lowest_rt_rq(struct task_struct *task, struct rt_rq *rt_rq)
+{
+ struct rq *rq = rq_of_rt_rq(rt_rq);
+ struct rq *lowest_rq;
+ struct rt_rq *lowest_rt_rq = NULL;
+ struct sched_dl_entity *lowest_dl_se;
+ int tries, cpu;
+
+ BUG_ON(task->sched_task_group != rt_rq->tg);
+
+ for (tries = 0; tries < RT_MAX_TRIES; tries++) {
+ cpu = group_find_lowest_rt_rq(task, rt_rq);
+
+ if ((cpu == -1) || (cpu == rq->cpu))
+ break;
+
+ lowest_dl_se = rt_rq->tg->dl_se[cpu];
+ lowest_rt_rq = &lowest_dl_se->my_q->rt;
+ lowest_rq = cpu_rq(cpu);
+
+ if (lowest_rt_rq->highest_prio.curr <= task->prio) {
+ /*
+ * Target rq has tasks of equal or higher priority,
+ * retrying does not release any lock and is unlikely
+ * to yield a different result.
+ */
+ lowest_rt_rq = NULL;
+ break;
+ }
+
+ /* if the prio of this runqueue changed, try again */
+ if (double_lock_balance(rq, lowest_rq)) {
+ /*
+ * We had to unlock the run queue. In
+ * the mean time, task could have
+ * migrated already or had its affinity changed.
+ * Also make sure that it wasn't scheduled on its rq.
+ * It is possible the task was scheduled, set
+ * "migrate_disabled" and then got preempted, so we must
+ * check the task migration disable flag here too.
+ */
+ if (unlikely(is_migration_disabled(task) ||
+ lowest_dl_se->dl_throttled ||
+ !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
+ task != pick_next_pushable_task(rt_rq))) {
+
+ double_unlock_balance(rq, lowest_rq);
+ lowest_rt_rq = NULL;
+ break;
+ }
+ }
+
+ /* If this rq is still suitable use it. */
+ if (lowest_rt_rq->highest_prio.curr > task->prio)
+ break;
+
+ /* try again */
+ double_unlock_balance(rq, lowest_rq);
+ lowest_rt_rq = NULL;
+ }
+
+ return lowest_rt_rq;
+}
+
+static int group_push_rt_task(struct rt_rq *rt_rq, bool pull)
{
+ BUG_ON(!is_dl_group(rt_rq));
+
struct rq *rq = rq_of_rt_rq(rt_rq);
+ struct task_struct *next_task;
+ struct rq *lowest_rq;
+ struct rt_rq *lowest_rt_rq;
+ int ret = 0;
+
+ if (!rt_rq->overloaded)
+ return 0;
+
+ next_task = pick_next_pushable_task(rt_rq);
+ if (!next_task)
+ return 0;
+
+retry:
+ if (is_migration_disabled(next_task)) {
+ struct task_struct *push_task = NULL;
+ int cpu;
+
+ if (!pull || rq->push_busy)
+ return 0;
+
+ /*
+ * If the current task does not belong to the same task group
+ * we cannot push it away.
+ */
+ if (rq->curr->sched_task_group != rt_rq->tg)
+ return 0;
+
+ /*
+ * Invoking group_find_lowest_rt_rq() on anything but an RT task doesn't
+ * make sense. Per the above priority check, curr has to
+ * be of higher priority than next_task, so no need to
+ * reschedule when bailing out.
+ *
+ * Note that the stoppers are masqueraded as SCHED_FIFO
+ * (cf. sched_set_stop_task()), so we can't rely on rt_task().
+ */
+ if (rq->curr->sched_class != &rt_sched_class)
+ return 0;
+
+ cpu = group_find_lowest_rt_rq(rq->curr, rt_rq);
+ if (cpu == -1 || cpu == rq->cpu)
+ return 0;
+
+ /*
+ * Given we found a CPU with lower priority than @next_task,
+ * therefore it should be running. However we cannot migrate it
+ * to this other CPU, instead attempt to push the current
+ * running task on this CPU away.
+ */
+ push_task = get_push_task(rq);
+ if (push_task) {
+ preempt_disable();
+ raw_spin_rq_unlock(rq);
+ stop_one_cpu_nowait(rq->cpu, push_cpu_stop,
+ push_task, &rq->push_work);
+ preempt_enable();
+ raw_spin_rq_lock(rq);
+ }
- if (is_dl_group(rt_rq))
return 0;
+ }
+
+ if (WARN_ON(next_task == rq->curr))
+ return 0;
+
+ /* We might release rq lock */
+ get_task_struct(next_task);
+
+ /* group_find_lock_lowest_rq locks the rq if found */
+ lowest_rt_rq = group_find_lock_lowest_rt_rq(next_task, rt_rq);
+ if (!lowest_rt_rq) {
+ struct task_struct *task;
+ /*
+ * group_find_lock_lowest_rt_rq releases rq->lock
+ * so it is possible that next_task has migrated.
+ *
+ * We need to make sure that the task is still on the same
+ * run-queue and is also still the next task eligible for
+ * pushing.
+ */
+ task = pick_next_pushable_task(rt_rq);
+ if (task == next_task) {
+ /*
+ * The task hasn't migrated, and is still the next
+ * eligible task, but we failed to find a run-queue
+ * to push it to. Do not retry in this case, since
+ * other CPUs will pull from us when ready.
+ */
+ goto out;
+ }
+
+ if (!task)
+ /* No more tasks, just exit */
+ goto out;
+
+ /*
+ * Something has shifted, try again.
+ */
+ put_task_struct(next_task);
+ next_task = task;
+ goto retry;
+ }
+
+ lowest_rq = rq_of_rt_rq(lowest_rt_rq);
+
+ move_queued_task_locked(rq, lowest_rq, next_task);
+ resched_curr(lowest_rq);
+ ret = 1;
+
+ double_unlock_balance(rq, lowest_rq);
+out:
+ put_task_struct(next_task);
+
+ return ret;
+}
+
+static void group_pull_rt_task(struct rt_rq *this_rt_rq)
+{
+ BUG_ON(!is_dl_group(this_rt_rq));
+
+ struct rq *this_rq = rq_of_rt_rq(this_rt_rq);
+ int this_cpu = this_rq->cpu, cpu;
+ bool resched = false;
+ struct task_struct *p, *push_task = NULL;
+ struct rt_rq *src_rt_rq;
+ struct rq *src_rq;
+ struct sched_dl_entity *src_dl_se;
+
+ for_each_online_cpu(cpu) {
+ if (this_cpu == cpu)
+ continue;
- return push_rt_task(rq, false);
+ src_dl_se = this_rt_rq->tg->dl_se[cpu];
+ src_rt_rq = &src_dl_se->my_q->rt;
+
+ if (src_rt_rq->rt_nr_running <= 1 && !src_dl_se->dl_throttled)
+ continue;
+
+ src_rq = rq_of_rt_rq(src_rt_rq);
+
+ /*
+ * Don't bother taking the src_rq->lock if the next highest
+ * task is known to be lower-priority than our current task.
+ * This may look racy, but if this value is about to go
+ * logically higher, the src_rq will push this task away.
+ * And if its going logically lower, we do not care
+ */
+ if (src_rt_rq->highest_prio.next >=
+ this_rt_rq->highest_prio.curr)
+ continue;
+
+ /*
+ * We can potentially drop this_rq's lock in
+ * double_lock_balance, and another CPU could
+ * alter this_rq
+ */
+ push_task = NULL;
+ double_lock_balance(this_rq, src_rq);
+
+ /*
+ * We can pull only a task, which is pushable
+ * on its rq, and no others.
+ */
+ p = pick_highest_pushable_task(src_rt_rq, this_cpu);
+
+ /*
+ * Do we have an RT task that preempts
+ * the to-be-scheduled task?
+ */
+ if (p && (p->prio < this_rt_rq->highest_prio.curr)) {
+ WARN_ON(p == src_rq->curr);
+ WARN_ON(!task_on_rq_queued(p));
+
+ /*
+ * There's a chance that p is higher in priority
+ * than what's currently running on its CPU.
+ * This is just that p is waking up and hasn't
+ * had a chance to schedule. We only pull
+ * p if it is lower in priority than the
+ * current task on the run queue
+ */
+ if (p->prio < src_rq->curr->prio)
+ goto skip;
+
+ if (is_migration_disabled(p)) {
+ /*
+ * If the current task does not belong to the same task group
+ * we cannot push it away.
+ */
+ if (src_rq->curr->sched_task_group != this_rt_rq->tg)
+ goto skip;
+
+ push_task = get_push_task(src_rq);
+ } else {
+ move_queued_task_locked(src_rq, this_rq, p);
+ resched = true;
+ }
+ /*
+ * We continue with the search, just in
+ * case there's an even higher prio task
+ * in another runqueue. (low likelihood
+ * but possible)
+ */
+ }
+skip:
+ double_unlock_balance(this_rq, src_rq);
+
+ if (push_task) {
+ preempt_disable();
+ raw_spin_rq_unlock(this_rq);
+ stop_one_cpu_nowait(src_rq->cpu, push_cpu_stop,
+ push_task, &src_rq->push_work);
+ preempt_enable();
+ raw_spin_rq_lock(this_rq);
+ }
+ }
+
+ if (resched)
+ resched_curr(this_rq);
}
static void group_push_rt_tasks(struct rt_rq *rt_rq)
{
- while (group_push_rt_task(rt_rq))
+ while (group_push_rt_task(rt_rq, false))
;
}
-#else
-static void group_push_rt_tasks(struct rt_rq *rt_rq)
+
+static void push_group_rt_tasks(struct rq *rq)
{
- push_rt_tasks(rq_of_rt_rq(rt_rq));
+ BUG_ON(rq->rq_to_push_from == NULL);
+
+ if ((rq->rq_to_push_from->rt.rt_nr_running > 1) ||
+ (dl_group_of(&rq->rq_to_push_from->rt)->dl_throttled == 1)) {
+ group_push_rt_task(&rq->rq_to_push_from->rt, false);
+ }
+
+ rq->rq_to_push_from = NULL;
}
-#endif
+
+static void pull_group_rt_tasks(struct rq *rq)
+{
+ BUG_ON(rq->rq_to_pull_to == NULL);
+ BUG_ON(rq->rq_to_pull_to->rt.rq != rq);
+
+ group_pull_rt_task(&rq->rq_to_pull_to->rt);
+ rq->rq_to_pull_to = NULL;
+}
+#else /* CONFIG_RT_GROUP_SCHED */
+static void group_pull_rt_task(struct rt_rq *this_rt_rq) { }
+static void group_push_rt_tasks(struct rt_rq *rt_rq) { }
+#endif /* CONFIG_RT_GROUP_SCHED */
/*
* If we are not running and we are not going to reschedule soon, we should
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3dd2ede6d35..10e29f37f9b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1282,6 +1282,11 @@ struct rq {
call_single_data_t cfsb_csd;
struct list_head cfsb_csd_list;
#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+ struct rq *rq_to_push_from;
+ struct rq *rq_to_pull_to;
+#endif
};
#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC PATCH v2 21/25] sched/rt: add HCBS migration related checks and function calls
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (19 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 20/25] sched/rt: Add rt-cgroup migration Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 22/25] sched/deadline: Make rt-cgroup's servers pull tasks on timer replenishment Yuri Andriaccio
` (4 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Add HCBS-related checks and operations to allow rt-task migration,
differentiating between tasks inside cgroups and tasks that run on the global
runqueue.
Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
kernel/sched/rt.c | 63 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 48 insertions(+), 15 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2fdb2657554..677ab9e8aa4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -865,6 +865,11 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
struct rq *rq;
bool test;
+ /* Just return the task_cpu for processes inside task groups */
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) &&
+ is_dl_group(rt_rq_of_se(&p->rt)))
+ goto out;
+
/* For anything but wake ups, just return the task_cpu */
if (!(flags & (WF_TTWU | WF_FORK)))
goto out;
@@ -964,7 +969,10 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
* not yet started the picking loop.
*/
rq_unpin_lock(rq, rf);
- pull_rt_task(rq);
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq_of_se(&p->rt)))
+ group_pull_rt_task(rt_rq_of_se(&p->rt));
+ else
+ pull_rt_task(rq);
rq_repin_lock(rq, rf);
}
@@ -1050,7 +1058,10 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
if (rq->donor->sched_class != &rt_sched_class)
update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
- rt_queue_push_tasks(rq);
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq))
+ rt_queue_push_from_group(rq, rt_rq);
+ else
+ rt_queue_push_tasks(rq);
}
static struct sched_rt_entity *pick_next_rt_entity(struct rt_rq *rt_rq)
@@ -1113,6 +1124,13 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_s
*/
if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
enqueue_pushable_task(rt_rq, p);
+
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq)) {
+ struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+ if (dl_se->dl_throttled)
+ rt_queue_push_from_group(rq, rt_rq);
+ }
}
/* Only try algorithms three times */
@@ -2174,8 +2192,13 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
(rq->curr->nr_cpus_allowed < 2 ||
rq->donor->prio <= p->prio);
- if (need_to_push)
+ if (!need_to_push)
+ return;
+
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq))
group_push_rt_tasks(rt_rq);
+ else
+ push_rt_tasks(rq);
}
/* Assumes rq->lock is held */
@@ -2216,7 +2239,9 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
if (!task_on_rq_queued(p) || rt_rq->rt_nr_running)
return;
- if (!IS_ENABLED(CONFIG_RT_GROUP_SCHED))
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq))
+ rt_queue_pull_to_group(rq, rt_rq);
+ else
rt_queue_pull_task(rq);
}
@@ -2243,6 +2268,13 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
*/
if (task_current(rq, p)) {
update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq_of_se(&p->rt))) {
+ struct sched_dl_entity *dl_se = dl_group_of(rt_rq_of_se(&p->rt));
+
+ p->dl_server = dl_se;
+ }
+
return;
}
@@ -2252,17 +2284,14 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
* then see if we can move to another run queue.
*/
if (task_on_rq_queued(p)) {
-
-#ifndef CONFIG_RT_GROUP_SCHED
- if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
+ if (!is_dl_group(rt_rq_of_se(&p->rt)) && p->nr_cpus_allowed > 1 && rq->rt.overloaded) {
rt_queue_push_tasks(rq);
-#else
- if (rt_rq_of_se(&p->rt)->overloaded) {
- } else {
- if (p->prio < rq->curr->prio)
- resched_curr(rq);
+ return;
+ } else if (is_dl_group(rt_rq_of_se(&p->rt)) && rt_rq_of_se(&p->rt)->overloaded) {
+ rt_queue_push_from_group(rq, rt_rq_of_se(&p->rt));
+ return;
}
-#endif
+
if (p->prio < rq->donor->prio && cpu_online(cpu_of(rq)))
resched_curr(rq);
}
@@ -2285,8 +2314,12 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
* If our priority decreases while running, we
* may need to pull tasks to this runqueue.
*/
- if (!IS_ENABLED(CONFIG_RT_GROUP_SCHED) && oldprio < p->prio)
- rt_queue_pull_task(rq);
+ if (oldprio < p->prio) {
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq))
+ rt_queue_pull_to_group(rq, rt_rq);
+ else
+ rt_queue_pull_task(rq);
+ }
/*
* If there's a higher priority task waiting to run
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC PATCH v2 22/25] sched/deadline: Make rt-cgroup's servers pull tasks on timer replenishment
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (20 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 21/25] sched/rt: add HCBS migration related checks and function calls Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 23/25] sched/deadline: Fix HCBS migrations on server stop Yuri Andriaccio
` (3 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/deadline.c | 6 +++++-
kernel/sched/rt.c | 6 +++++-
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e263abcdc04..021d7349897 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1308,6 +1308,7 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
{
struct rq *rq = rq_of_dl_se(dl_se);
u64 fw;
+ bool has_tasks;
scoped_guard (rq_lock, rq) {
struct rq_flags *rf = &scope.rf;
@@ -1321,7 +1322,10 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
if (!dl_se->dl_runtime)
return HRTIMER_NORESTART;
- if (!dl_se->server_has_tasks(dl_se)) {
+ rq_unpin_lock(rq, rf);
+ has_tasks = dl_se->server_has_tasks(dl_se);
+ rq_repin_lock(rq, rf);
+ if (!has_tasks) {
replenish_dl_entity(dl_se);
dl_server_stopped(dl_se);
return HRTIMER_NORESTART;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 677ab9e8aa4..116fa0422b9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1,4 +1,3 @@
-#pragma GCC diagnostic ignored "-Wunused-function"
// SPDX-License-Identifier: GPL-2.0
/*
* Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR
@@ -145,6 +144,11 @@ void init_tg_rt_entry(struct task_group *tg, struct rq *served_rq,
static bool rt_server_has_tasks(struct sched_dl_entity *dl_se)
{
+#ifdef CONFIG_SMP
+ struct rt_rq *rt_rq = &dl_se->my_q->rt;
+
+ group_pull_rt_task(rt_rq);
+#endif
return !!dl_se->my_q->rt.rt_nr_running;
}
--
2.50.1
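This patch carries no changelog, so here is a hedged reading plus a minimal,
self-contained sketch in plain user-space C (all names below, such as
toy_server and toy_pull(), are invented for illustration and are not kernel
APIs). The diff makes rt_server_has_tasks(), called from the dl_server
replenishment timer, first pull eligible RT tasks from other CPUs via
group_pull_rt_task(), so a server whose tasks were all migrated away while it
was throttled is not treated as idle right after replenishment. The rq lock is
unpinned around the callback, presumably because pulling may need to drop and
re-take runqueue locks, which is not allowed while the lock is pinned.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of a per-CPU rt-cgroup server; not the kernel's data structures. */
struct toy_server {
	int nr_running;          /* RT tasks currently queued locally */
	int nr_pullable_remote;  /* RT tasks that could be pulled from other CPUs */
};

/* Rough stand-in for group_pull_rt_task(): migrate eligible remote tasks here. */
static void toy_pull(struct toy_server *s)
{
	s->nr_running += s->nr_pullable_remote;
	s->nr_pullable_remote = 0;
}

/* Stand-in for rt_server_has_tasks() after this patch: pull first, then check. */
static bool toy_server_has_tasks(struct toy_server *s)
{
	toy_pull(s);
	return s->nr_running > 0;
}

int main(void)
{
	/* The server was emptied by migrations while it was throttled... */
	struct toy_server s = { .nr_running = 0, .nr_pullable_remote = 2 };

	/* ...yet on timer replenishment it pulls work back instead of stopping. */
	printf("has tasks after replenishment: %s\n",
	       toy_server_has_tasks(&s) ? "yes" : "no");
	return 0;
}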
* [RFC PATCH v2 23/25] sched/deadline: Fix HCBS migrations on server stop
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (21 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 22/25] sched/deadline: Make rt-cgroup's servers pull tasks on timer replenishment Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 24/25] sched/core: Execute enqueued balance callbacks when changing allowed CPUs Yuri Andriaccio
` (2 subsequent siblings)
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Do not unthrottle a non-fair-server in dl_server_stop(), since such a server
ends up being stopped while throttled (we try to migrate all the RT tasks
away from it).
Notes:
This is a temporary workaround; it will hopefully be removed in favor of
less invasive code.
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
kernel/sched/deadline.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 021d7349897..d9ab209a492 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1723,9 +1723,11 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
return;
dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
- hrtimer_try_to_cancel(&dl_se->dl_timer);
+ if (dl_se == &rq_of_dl_se(dl_se)->fair_server) {
+ hrtimer_try_to_cancel(&dl_se->dl_timer);
+ dl_se->dl_throttled = 0;
+ }
dl_se->dl_defer_armed = 0;
- dl_se->dl_throttled = 0;
dl_se->dl_server_active = 0;
}
--
2.50.1
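A minimal, self-contained model of the state handling above, in plain
user-space C (struct toy_dl_server and toy_server_stop() are invented names,
not kernel code): only the fair server cancels its timer and clears the
throttled flag on stop, while an rt-cgroup server stopped while throttled
keeps both, so its replenishment timer can still fire (and, per the previous
patch, pull tasks back).

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the dl_server_stop() change; all names are invented. */
struct toy_dl_server {
	bool is_fair_server;
	bool throttled;
	bool timer_armed;
	bool active;
};

static void toy_server_stop(struct toy_dl_server *se)
{
	if (!se->active)
		return;

	if (se->is_fair_server) {
		/* The fair server is fully quiesced, as before this patch. */
		se->timer_armed = false;
		se->throttled = false;
	}
	/* Group servers keep timer and throttled state so replenishment still runs. */
	se->active = false;
}

int main(void)
{
	struct toy_dl_server group_server = {
		.is_fair_server = false,
		.throttled = true,
		.timer_armed = true,
		.active = true,
	};

	toy_server_stop(&group_server);
	printf("group server: throttled=%d timer_armed=%d active=%d\n",
	       group_server.throttled, group_server.timer_armed,
	       group_server.active);
	return 0;
}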
* [RFC PATCH v2 24/25] sched/core: Execute enqueued balance callbacks when changing allowed CPUs
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (22 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 23/25] sched/deadline: Fix HCBS migrations on server stop Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 25/25] sched/core: Execute enqueued balance callbacks when migrating tasks between cgroups Yuri Andriaccio
2025-08-13 14:06 ` [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Juri Lelli
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
From: luca abeni <luca.abeni@santannapisa.it>
Execute balancing callbacks when setting the affinity of a task, since the HCBS
scheduler may request balancing of throttled dl_servers to fully utilize the
server's bandwidth.
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/core.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eb9de8c7b1f..c8763c46030 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2947,6 +2947,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) ||
(task_current_donor(rq, p) && !task_current(rq, p))) {
struct task_struct *push_task = NULL;
+ struct balance_callback *head;
if ((flags & SCA_MIGRATE_ENABLE) &&
(p->migration_flags & MDF_PUSH) && !rq->push_busy) {
@@ -2965,11 +2966,13 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
}
preempt_disable();
+ head = splice_balance_callbacks(rq);
task_rq_unlock(rq, p, rf);
if (push_task) {
stop_one_cpu_nowait(rq->cpu, push_cpu_stop,
p, &rq->push_work);
}
+ balance_callbacks(rq, head);
preempt_enable();
if (complete)
@@ -3024,6 +3027,8 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
}
if (task_on_cpu(rq, p) || READ_ONCE(p->__state) == TASK_WAKING) {
+ struct balance_callback *head;
+
/*
* MIGRATE_ENABLE gets here because 'p == current', but for
* anything else we cannot do is_migration_disabled(), punt
@@ -3037,16 +3042,19 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
p->migration_flags &= ~MDF_PUSH;
preempt_disable();
+ head = splice_balance_callbacks(rq);
task_rq_unlock(rq, p, rf);
if (!stop_pending) {
stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
&pending->arg, &pending->stop_work);
}
+ balance_callbacks(rq, head);
preempt_enable();
if (flags & SCA_MIGRATE_ENABLE)
return 0;
} else {
+ struct balance_callback *head;
if (!is_migration_disabled(p)) {
if (task_on_rq_queued(p))
@@ -3057,7 +3065,12 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
complete = true;
}
}
+
+ preempt_disable();
+ head = splice_balance_callbacks(rq);
task_rq_unlock(rq, p, rf);
+ balance_callbacks(rq, head);
+ preempt_enable();
if (complete)
complete_all(&pending->done);
--
2.50.1
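The pattern the hunks above repeat is: queue balance work on the rq while its
lock is held, detach ("splice") the pending callbacks still under the lock,
release the lock, then run the callbacks with preemption disabled. Below is a
toy user-space analogue of that ordering (a pthread mutex stands in for the rq
lock; toy_rq, toy_splice() and toy_run() are invented names, and the
preempt_disable()/preempt_enable() bracketing has no user-space equivalent
here).

#include <pthread.h>
#include <stdio.h>

struct toy_cb {
	struct toy_cb *next;
	void (*func)(void);
};

struct toy_rq {
	pthread_mutex_t lock;
	struct toy_cb *callbacks;
};

/* Detach the pending list; must be called with rq->lock held. */
static struct toy_cb *toy_splice(struct toy_rq *rq)
{
	struct toy_cb *head = rq->callbacks;

	rq->callbacks = NULL;
	return head;
}

/* Run the detached callbacks; called after rq->lock has been dropped. */
static void toy_run(struct toy_cb *head)
{
	for (; head; head = head->next)
		head->func();
}

static void say_balanced(void)
{
	puts("balance callback ran after unlock");
}

int main(void)
{
	struct toy_cb cb = { NULL, say_balanced };
	struct toy_rq rq = { PTHREAD_MUTEX_INITIALIZER, &cb };
	struct toy_cb *head;

	pthread_mutex_lock(&rq.lock);
	/* ...an affinity change queues balance work on the rq... */
	head = toy_splice(&rq);		/* detach while still locked */
	pthread_mutex_unlock(&rq.lock);
	toy_run(head);			/* run without holding rq.lock */
	return 0;
}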
* [RFC PATCH v2 25/25] sched/core: Execute enqueued balance callbacks when migrating tasks between cgroups
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (23 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 24/25] sched/core: Execute enqueued balance callbacks when changing allowed CPUs Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
2025-08-13 14:06 ` [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Juri Lelli
25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
Execute balancing callbacks when migrating tasks between cgroups since, as in
the previous patch, the HCBS scheduler may request balancing of throttled
dl_servers to fully utilize the server's bandwidth.
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/core.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c8763c46030..65896f46e50 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9247,10 +9247,11 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
{
int queued, running, queue_flags =
DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+ struct balance_callback *head;
struct rq *rq;
+ struct rq_flags rf;
- CLASS(task_rq_lock, rq_guard)(tsk);
- rq = rq_guard.rq;
+ rq = task_rq_lock(tsk, &rf);
update_rq_clock(rq);
@@ -9277,6 +9278,12 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
*/
resched_curr(rq);
}
+
+ preempt_disable();
+ head = splice_balance_callbacks(rq);
+ task_rq_unlock(rq, tsk, &rf);
+ balance_callbacks(rq, head);
+ preempt_enable();
}
static struct cgroup_subsys_state *
--
2.50.1
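A reading of the locking change above (not stated in the changelog, so treat
it as an assumption): the scope-based CLASS(task_rq_lock, ...) guard is
replaced by explicit task_rq_lock()/task_rq_unlock() because the spliced
callbacks must run after the unlock yet still inside the preempt-disabled
window, an ordering a guard that unlocks only at end of scope cannot express.
In toy form, reusing the invented toy_rq/toy_splice()/toy_run() helpers from
the sketch after patch 24 (again, there is no user-space stand-in for
preempt_disable() here):

	pthread_mutex_lock(&rq.lock);	/* task_rq_lock() */
	/* ... move the task to its new group, possibly queueing balance work ... */
	head = toy_splice(&rq);		/* splice_balance_callbacks(), lock still held */
	pthread_mutex_unlock(&rq.lock);	/* task_rq_unlock() */
	toy_run(head);			/* balance_callbacks() after the unlock */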
* Re: [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
` (24 preceding siblings ...)
2025-07-31 10:55 ` [RFC PATCH v2 25/25] sched/core: Execute enqueued balance callbacks when migrating tasks between cgroups Yuri Andriaccio
@ 2025-08-13 14:06 ` Juri Lelli
2025-08-13 14:22 ` Yuri Andriaccio
25 siblings, 1 reply; 38+ messages in thread
From: Juri Lelli @ 2025-08-13 14:06 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
Hi,
On 31/07/25 12:55, Yuri Andriaccio wrote:
> Hello,
>
> This is the v2 for Hierarchical Constant Bandwidth Server, aiming at replacing
> the current RT_GROUP_SCHED mechanism with something more robust and
> theoretically sound. The patchset has been presented at OSPM25
> (https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
> be found at https://lwn.net/Articles/1021332/ . You can find the v1 of this
> patchset at the bottom of the page, which describes in more detail what this
> patchset is all about and how it is implemented.
>
> The big update for this v2 version is the addition of migration code, which
> allows migrating tasks between different CPUs (while of course honoring
> affinity settings).
>
> As requested, we've split the big patches into smaller chunks to improve
> readability. Additionally, the series has been rebased on the latest tip/master to
> keep up with the latest scheduler updates and new features of dl_servers.
>
> Last but not least, the first patch, which has been presented separately at
> https://lore.kernel.org/all/20250725164412.35912-1-yurand2000@gmail.com/ , is
> necessary to fully utilize the deadline bandwidth while keeping the fair-servers
> active. You can refer to the aforementioned link for details. The issue
> addressed in that patch also affects HCBS: in the current version of the
> kernel, by default, 5% of the realtime bandwidth is reserved for fair-servers,
> 5% is not usable, and only the remaining 90% can be used by deadline tasks or,
> in our case, by HCBS dl_servers. The first patch addresses this issue and
> allows fully utilizing the default 95% of bandwidth for rt-tasks/servers.
>
> Summary of the patches:
> 1) Account fair-servers bw separately from other dl tasks and servers bw.
> 2-5) Preparation patches, so that the RT classes' code can be used both
> for normal and cgroup scheduling.
> 6-15) Implementation of HCBS, no migration and only one level hierarchy.
> The old RT_GROUP_SCHED code is removed.
> 16-18) Remove cgroups v1 in favour of v2.
> 19) Add support for deeper hierarchies.
> 20-25) Add support for tasks migration.
>
> Updates from v1:
> - Rebase to tip/master.
Would you mind sharing the baseline sha this set applies to? It looks
like it doesn't apply cleanly anymore.
Thanks!
Juri
* Re: [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server
2025-08-13 14:06 ` [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Juri Lelli
@ 2025-08-13 14:22 ` Yuri Andriaccio
0 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-08-13 14:22 UTC (permalink / raw)
To: juri.lelli
Cc: bsegall, dietmar.eggemann, linux-kernel, luca.abeni, mgorman,
mingo, peterz, rostedt, vincent.guittot, vschneid,
yuri.andriaccio
Hi,
> Would you mind sharing the baseline sha this set applies to? It looks
> like it doesn't apply cleanly anymore.
The patchset should apply cleanly on top of commit "Merge tag 'sysctl-6.17-rc1'
of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl", with sha
4b290aae788e06561754b28c6842e4080957d3f7.
Have a nice day,
Yuri