* [PATCH v4] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting
@ 2025-09-03 11:44 Yuri Andriaccio
2025-09-03 14:12 ` Juri Lelli
0 siblings, 1 reply; 2+ messages in thread
From: Yuri Andriaccio @ 2025-09-03 11:44 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
Fair-servers are currently used in place of the old RT_THROTTLING mechanism to
prevent the starvation of SCHED_OTHER (and other lower priority) tasks when
real-time FIFO/RR processes are trying to fully utilize the CPU. To allow the
RT_THROTTLING mechanism, the maximum allocatable bandwidth for real-time tasks
has been limited to 95% of the CPU-time.
The RT_THROTTLING mechanism is now removed in favor of fair-servers, which are
currently set to use, as expected, 5% of the CPU-time. Still, they share the
same bandwidth that allows running real-time tasks, and which is still set to
95% of the total CPU-time. This means that by removing the RT_THROTTLING
mechanism, the remaining bandwidth for real-time SCHED_DEADLINE tasks and other
dl-servers (FIFO/RR are not affected) is only 90%.
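
The 90% figure can be checked with a small userspace sketch of the fixed-point arithmetic. It assumes BW_SHIFT is 20, as in kernel/sched/sched.h, and re-implements to_ratio() as a stand-in for the kernel helper. The runtime/period pairs used in the note below are illustrative values chosen to give 95% and 5%, and dl_bw_left_shared() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

#define BW_SHIFT 20
#define BW_UNIT  (1ULL << BW_SHIFT)

/*
 * Userspace stand-in for the kernel's to_ratio(): runtime/period
 * expressed in BW_UNIT fixed point (BW_UNIT == 100% of one CPU).
 */
static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;
	return (runtime << BW_SHIFT) / period;
}

/*
 * Hypothetical helper: bandwidth left for SCHED_DEADLINE tasks on one
 * CPU when the fair-server is charged against the same real-time cap,
 * which is the behavior before this patch.
 */
static uint64_t dl_bw_left_shared(uint64_t max_rt_bw, uint64_t fair_bw)
{
	return max_rt_bw > fair_bw ? max_rt_bw - fair_bw : 0;
}
```

With a 950000/1000000 us real-time cap and a 50/1000 ms fair-server, dl_bw_left_shared() returns 943719 out of BW_UNIT = 1048576, i.e. roughly 90% of the CPU-time. Once the fair-server is no longer charged against the cap, the full 996147 (95%) is available to deadline tasks.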
This patch reclaims the 5% lost CPU-time, which is definitely reserved for
SCHED_OTHER tasks, but should not be accounted together with the other real-time
tasks. More generally, the fair-servers' bandwidth must not be accounted with
other real-time tasks.
Updates:
- Make the fair-servers' bandwidth not be accounted into the total allocated
  bandwidth for real-time tasks.
- Remove the admission control test when allocating a fair-server.
- Do not account for fair-servers in the GRUB's bandwidth reclaiming mechanism.
- Limit the max bandwidth to (BW_UNIT - max_rt_bw) when changing the parameters
  of a fair-server, preventing overcommitment.
- Update admission tests (in sched_dl_global_validate) when changing the
  maximum allocatable bandwidth for real-time tasks, preventing overcommitment.
- Update admission tests (in dl_bw_manage) when offlining a CPU.
Since a fair-server's bandwidth can be changed through debugfs, it is not
enforced that a fair-server's bandwidth is always equal to (BW_UNIT -
max_rt_bw); rather, it must be less than or equal to this value. This allows
retaining the fair-servers' settings changed through debugfs when changing
max_rt_bw.

This also means that, in order to increase the maximum bandwidth for real-time
tasks, the bw of the fair-servers must first be decreased through debugfs,
otherwise admission tests will fail. Vice versa, to increase the bw of the
fair-servers, the bw of real-time tasks must be reduced beforehand.
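
The ordering constraint above can be illustrated with a minimal userspace sketch of the per-CPU condition. The condition mirrors the `new_bw > BW_UNIT - dl_b->bw` test in dl_server_apply_params() and the `new_bw + fair_bw > BW_UNIT` test in sched_dl_global_validate(). The helpers pct() and bw_pair_fits() are hypothetical names used only for illustration, and the sketch assumes max_rt_bw never exceeds BW_UNIT.

```c
#include <stdbool.h>
#include <stdint.h>

#define BW_SHIFT 20
#define BW_UNIT  (1ULL << BW_SHIFT)

/* Hypothetical helper: p percent of one CPU in BW_UNIT fixed point. */
static uint64_t pct(unsigned int p)
{
	return BW_UNIT * p / 100;
}

/*
 * Hypothetical helper: true if the real-time cap and one fair-server's
 * bandwidth fit together on a single CPU. The same condition guards
 * both directions: changing the fair-server and changing max_rt_bw.
 * Assumes max_rt_bw <= BW_UNIT.
 */
static bool bw_pair_fits(uint64_t max_rt_bw, uint64_t fair_bw)
{
	return fair_bw <= BW_UNIT - max_rt_bw;
}
```

Starting from a 95% cap and a 5% fair-server, bw_pair_fits(pct(97), pct(5)) is false, so raising the cap to 97% is rejected. After lowering the fair-server to 3%, bw_pair_fits(pct(97), pct(3)) is true and the same change is admitted.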
This v4 removes dl_bw_fair, which is no longer needed since each fair-server's
bw is now checked individually rather than globally. This is necessary
because different fair-servers can have different runtimes. The bandwidth
assignment is sound only if each CPU's rt-bw + fair-server-bw is less than or
equal to 1; it is not enough to compute the total and check that it is less
than or equal to the number of CPUs. The check on deadline tasks can instead be
done globally (on a root-domain basis), as dl tasks are allowed to migrate
between cores. This new version fixes the error reported here:
https://lore.kernel.org/all/aLa3zdmyKuRMy3bm@jlelli-thinkpadt14gen4.remote.csb/
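
A small userspace sketch shows why the aggregate test is not sufficient. Both functions below are hypothetical illustrations, not kernel code: aggregate_fits() sums everything over the root domain, while per_cpu_fits() applies the per-CPU condition used by the new loop in sched_dl_global_validate(). The two-CPU configuration in the note is an invented example.

```c
#include <stdbool.h>
#include <stdint.h>

#define BW_SHIFT 20
#define BW_UNIT  (1ULL << BW_SHIFT)

/* Hypothetical helper: p percent of one CPU in BW_UNIT fixed point. */
static uint64_t pct(unsigned int p)
{
	return BW_UNIT * p / 100;
}

/* Aggregate test: total rt-bw plus total fair-server bw vs. number of CPUs. */
static bool aggregate_fits(uint64_t rt_bw, const uint64_t *fair_bw, int ncpus)
{
	uint64_t total = (uint64_t)ncpus * rt_bw;

	for (int i = 0; i < ncpus; i++)
		total += fair_bw[i];
	return total <= (uint64_t)ncpus * BW_UNIT;
}

/* Per-CPU test: every CPU must satisfy rt-bw + fair-server-bw <= 1. */
static bool per_cpu_fits(uint64_t rt_bw, const uint64_t *fair_bw, int ncpus)
{
	for (int i = 0; i < ncpus; i++)
		if (rt_bw + fair_bw[i] > BW_UNIT)
			return false;
	return true;
}
```

With a 95% cap and two fair-servers set to 2% and 8%, aggregate_fits() returns true because the total is just under two CPUs, yet per_cpu_fits() returns false because the second CPU would be committed to 103%.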
v1: https://lore.kernel.org/all/20250721111131.309388-1-yurand2000@gmail.com/
v2: https://lore.kernel.org/all/20250725164412.35912-1-yurand2000@gmail.com/
v3: https://lore.kernel.org/all/20250901113103.601085-1-yurand2000@gmail.com/
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/deadline.c | 81 +++++++++++++----------------------------
kernel/sched/sched.h | 1 -
kernel/sched/topology.c | 8 ----
3 files changed, 26 insertions(+), 64 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f25301267e4..35bcd360329 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1659,48 +1659,22 @@ void sched_init_dl_servers(void)
}
}
-void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
-{
- u64 new_bw = dl_se->dl_bw;
- int cpu = cpu_of(rq);
- struct dl_bw *dl_b;
-
- dl_b = dl_bw_of(cpu_of(rq));
- guard(raw_spinlock)(&dl_b->lock);
-
- if (!dl_bw_cpus(cpu))
- return;
-
- __dl_add(dl_b, new_bw, dl_bw_cpus(cpu));
-}
-
int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
{
- u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
u64 new_bw = to_ratio(period, runtime);
struct rq *rq = dl_se->rq;
int cpu = cpu_of(rq);
struct dl_bw *dl_b;
- unsigned long cap;
- int retval = 0;
- int cpus;
dl_b = dl_bw_of(cpu);
guard(raw_spinlock)(&dl_b->lock);
- cpus = dl_bw_cpus(cpu);
- cap = dl_bw_capacity(cpu);
-
- if (__dl_overflow(dl_b, cap, old_bw, new_bw))
+ if (new_bw > BW_UNIT - dl_b->bw)
return -EBUSY;
if (init) {
__add_rq_bw(new_bw, &rq->dl);
- __dl_add(dl_b, new_bw, cpus);
} else {
- __dl_sub(dl_b, dl_se->dl_bw, cpus);
- __dl_add(dl_b, new_bw, cpus);
-
dl_rq_change_utilization(rq, dl_se, new_bw);
}
@@ -1714,7 +1688,7 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
- return retval;
+ return 0;
}
/*
@@ -2945,17 +2919,6 @@ void dl_clear_root_domain(struct root_domain *rd)
rd->dl_bw.total_bw = 0;
for_each_cpu(i, rd->span)
cpu_rq(i)->dl.extra_bw = cpu_rq(i)->dl.max_bw;
-
- /*
- * dl_servers are not tasks. Since dl_add_task_root_domain ignores
- * them, we need to account for them here explicitly.
- */
- for_each_cpu(i, rd->span) {
- struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
-
- if (dl_server(dl_se) && cpu_active(i))
- __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
- }
}
void dl_clear_root_domain_cpu(int cpu)
@@ -3139,9 +3102,10 @@ int sched_dl_global_validate(void)
u64 period = global_rt_period();
u64 new_bw = to_ratio(period, runtime);
u64 cookie = ++dl_cookie;
+ u64 fair_bw;
struct dl_bw *dl_b;
- int cpu, cpus, ret = 0;
- unsigned long flags;
+ int i, cpu, ret = 0;
+ unsigned long cap, flags;
/*
* Here we want to check the bandwidth not being set to some
@@ -3155,11 +3119,28 @@ int sched_dl_global_validate(void)
goto next;
dl_b = dl_bw_of(cpu);
- cpus = dl_bw_cpus(cpu);
+ cap = dl_bw_capacity(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- if (new_bw * cpus < dl_b->total_bw)
+ /* Check if the whole root domain can support the active dl tasks */
+ if (cap_scale(new_bw, cap) < dl_b->total_bw) {
ret = -EBUSY;
+ goto unlock;
+ }
+
+ /*
+ * For each cpu in the root domain, check if there is enough
+ * bandwidth for the fair server.
+ */
+ for_each_cpu_and(i, cpu_rq(cpu)->rd->span, cpu_active_mask) {
+ fair_bw = cpu_rq(i)->fair_server.dl_bw;
+
+ if (new_bw + fair_bw > BW_UNIT) {
+ ret = -EBUSY;
+ goto unlock;
+ }
+ }
+unlock:
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
next:
@@ -3444,7 +3425,6 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
unsigned long flags, cap;
struct dl_bw *dl_b;
bool overflow = 0;
- u64 fair_server_bw = 0;
rcu_read_lock_sched();
dl_b = dl_bw_of(cpu);
@@ -3476,28 +3456,19 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
*/
cap -= arch_scale_cpu_capacity(cpu);
- /*
- * cpu is going offline and NORMAL tasks will be moved away
- * from it. We can thus discount dl_server bandwidth
- * contribution as it won't need to be servicing tasks after
- * the cpu is off.
- */
- if (cpu_rq(cpu)->fair_server.dl_server)
- fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw;
-
/*
* Not much to check if no DEADLINE bandwidth is present.
* dl_servers we can discount, as tasks will be moved out the
* offlined CPUs anyway.
*/
- if (dl_b->total_bw - fair_server_bw > 0) {
+ if (dl_b->total_bw > 0) {
/*
* Leaving at least one CPU for DEADLINE tasks seems a
* wise thing to do. As said above, cpu is not offline
* yet, so account for that.
*/
if (dl_bw_cpus(cpu) - 1)
- overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0);
+ overflow = __dl_overflow(dl_b, cap, 0, 0);
else
overflow = 1;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f..01afa7424f8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -390,7 +390,6 @@ extern void sched_init_dl_servers(void);
extern void dl_server_update_idle_time(struct rq *rq,
struct task_struct *p);
extern void fair_server_init(struct rq *rq);
-extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
u64 runtime, u64 period, bool init);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 977e133bb8a..4ea3365984a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -500,14 +500,6 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
set_rq_online(rq);
- /*
- * Because the rq is not a task, dl_add_task_root_domain() did not
- * move the fair server bw to the rd if it already started.
- * Add it now.
- */
- if (rq->fair_server.dl_server)
- __dl_server_attach_root(&rq->fair_server, rq);
-
rq_unlock_irqrestore(rq, &rf);
if (old_rd)
base-commit: 5c3b3264e5858813632031ba58bcd6e1eeb3b214
--
2.51.0
* Re: [PATCH v4] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting
2025-09-03 11:44 [PATCH v4] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting Yuri Andriaccio
@ 2025-09-03 14:12 ` Juri Lelli
0 siblings, 0 replies; 2+ messages in thread
From: Juri Lelli @ 2025-09-03 14:12 UTC
To: Yuri Andriaccio
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
Hi!
On 03/09/25 13:44, Yuri Andriaccio wrote:
> Fair-servers are currently used in place of the old RT_THROTTLING mechanism to
> prevent the starvation of SCHED_OTHER (and other lower priority) tasks when
> real-time FIFO/RR processes are trying to fully utilize the CPU. To allow the
> RT_THROTTLING mechanism, the maximum allocatable bandwidth for real-time tasks
> has been limited to 95% of the CPU-time.
>
> The RT_THROTTLING mechanism is now removed in favor of fair-servers, which are
> currently set to use, as expected, 5% of the CPU-time. Still, they share the
> same bandwidth that allows running real-time tasks, and which is still set to
> 95% of the total CPU-time. This means that by removing the RT_THROTTLING
> mechanism, the remaining bandwidth for real-time SCHED_DEADLINE tasks and other
> dl-servers (FIFO/RR are not affected) is only 90%.
>
> This patch reclaims the 5% lost CPU-time, which is definitely reserved for
> SCHED_OTHER tasks, but should not be accounted together with the other real-time
> tasks. More generally, the fair-servers' bandwidth must not be accounted with
> other real-time tasks.
>
> Updates:
> - Make the fair-servers' bandwidth not be accounted into the total allocated
> bandwidth for real-time tasks.
> - Remove the admission control test when allocating a fair-server.
> - Do not account for fair-servers in the GRUB's bandwidth reclaiming mechanism.
However, it looks like running_bw and this_bw still account for
fair-servers? I just checked with tools/sched/dl_bw_dump.py and can see
their contribution showing up.
running_bw, however, also influences schedutil decisions, so keeping the
fair-servers in it might be required, as tasks could maybe still be starved
if the cpu is running too slow? Not sure about this last point.
> - Limit the max bandwidth to (BW_UNIT - max_rt_bw) when changing the parameters
> of a fair-server, preventing overcommitment.
> - Update admission tests (in sched_dl_global_validate) when changing the
> maximum allocatable bandwidth for real-time tasks, preventing overcommitment.
> - Update admission tests (in dl_bw_manage) when offlining a CPU.
>
> Since the fair-server's bandwidth can be changed through debugfs, it has not
> been enforced that a fair-server's bandwidth must be always equal to (BW_UNIT -
> max_rt_bw), rather it must be less or equal to this value. This allows retaining
> the fair-servers' settings changed through the debugfs when changing the
> max_rt_bw.
>
> This also means that in order to increase the maximum bandwidth for real-time
> tasks, the bw of fair-servers must be first decreased through debugfs otherwise
> admission tests will fail, and vice versa, to increase the bw of fair-servers,
> the bw of real-time tasks must be reduced beforehand.
>
> This v4 version removes dl_bw_fair, as it is not needed anymore since each fair
> server's bw is now checked individually rather than globally. This is necessary
> because different fair-servers can have different runtimes. The bandwidth
> assignment is sound only if each CPU's rt-bw + fair-server-bw is less than or
> equal to 1, rather than computing the total and checking if it is less than or
> equal to the number of CPUs. The check on deadline tasks can instead be done
> globally (on a root-domain basis) as dl tasks are allowed to migrate between
> cores. This new version fixes the error reported here:
> https://lore.kernel.org/all/aLa3zdmyKuRMy3bm@jlelli-thinkpadt14gen4.remote.csb/
Thanks for looking into it. It seems to be working correctly now.
Best,
Juri