From: Yuri Andriaccio <yurand2000@gmail.com>
To: Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>
Cc: linux-kernel@vger.kernel.org,
Luca Abeni <luca.abeni@santannapisa.it>,
Yuri Andriaccio <yuri.andriaccio@santannapisa.it>
Subject: [PATCH] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting
Date: Mon, 21 Jul 2025 13:11:31 +0200 [thread overview]
Message-ID: <20250721111131.309388-1-yurand2000@gmail.com> (raw)
Fair-servers are currently used in place of the old RT_THROTTLING mechanism to
prevent the starvation of SCHED_OTHER (and other lower priority) tasks when
real-time FIFO/RR processes are trying to fully utilize the CPU. To allow the
RT_THROTTLING mechanism, the maximum allocatable bandwidth for real-time tasks
has been limited to 95% of the CPU-time.
The RT_THROTTLING mechanism is now removed in favor of fair-servers, which are
currently set to use, as expected, 5% of the CPU-time. Still, they share the
same bandwidth that allows to run real-time tasks, and which is still set to 95%
of the total CPU-time. This means that by removing the RT_THROTTLING mechanism,
the bandwidth remaning for real-time SCHED_DEADLINE tasks and other dl-servers
(FIFO/RR are not affected) is only 90%.
To demonstrate this, I've run the following script on the latest tip/master:
----------
PROCS=$(nproc)
echo "Allocate 95% bw per task/cpu"
for ((i = 0; i < $PROCS; i++))
do
sleep infinity &
chrt -d -T 95000000 -P 100000000 -D 100000000 -p 0 $!
done
killall sleep
echo "Allocate 90% bw per task/cpu"
for ((i = 0; i < $PROCS; i++))
do
sleep infinity &
chrt -d -T 90000000 -P 100000000 -D 100000000 -p 0 $!
done
killall sleep
----------
First-off we try to fully utilize the 95% rt-bandwidth by allocating #CPU
SCHED_DEADLINE tasks, requesting 95/100ms each. This will fail, because, as
mentioned, fair-servers are also accounted in the 95% realtime bw. With the
second allocation, it is possible to show that 90% bandwidth is instead allowed
by the scheduler. By playing with the numbers and chrt(s), it is possible to see
that the allocatable bandwidth for SCHED_DEADLINE tasks is exactly 90%, while it
is possible to see through stress-tests that on CPU-hog caused by FIFO/RR tasks
(of course SCHED_DEADLINE tasks are limited by the admission test, which as
mentioned fails at 90% total bw utilization), the fair-servers only allocate at
most 5% of the CPU-time to SCHED_OTHER tasks. There is clearly a 5% of CPU-time
lost somewhere.
This patch reclaims the 5% lost SCHED_DEADLINE CPU-time (FIFO/RR are not
affected, there is no admission test there to perform), by accounting the
fair-server's bandwidth separately. After this patch, the above script runs
successfully also when allocating 95% bw per task/cpu.
Changes:
- Make the fair-servers' bandwidth not be accounted into the total allocated
bandwidth for real-time tasks.
- Do not account for fair-servers in the GRUB's bandwidth reclaiming mechanism.
- Remove the admission control test when allocating a fair-server, as its
bandwidth is accounted differently.
- Limit the max bandwidth to (BW_UNIT - max_rt_bw) when changing the parameters
of a fair-server, preventing overcommitment.
- Add dl_bw_fair, which computes the total allocated bandwidth of the
fair-servers in the given root-domain.
- Update admission tests (in sched_dl_global_validate) when changing the
maximum allocatable bandwidth for real-time tasks, preventing overcommitment.
Notes:
Since the fair-server's bandwidth can be changed through debugfs, it has not
been enforced that a fair-server's bw must be always equal to (BW_UNIT -
max_rt_bw), rather it must be less or equal to this value. This allows retaining
the fair-servers' settings changed through the debugfs when changing the
maximum realtime bandwidth.
This also means that in order to increase the maximum bandwidth for real-time
tasks, the bw of fair-servers must be first decreased through debugfs otherwise
admission tests will fail, and viceversa, to increase the bw of fair-servers,
the bw of real-time tasks must be reduced beforehand.
Testing:
This patch has been tested with basic regression tests, by checking that it is
not possible to overcommit the bandwidth of fair-servers and that SCHED_OTHER
tasks do use at least the specified amount of bw (also varying the ratio of
rt/non-rt bandwidth).
Additionally it has also been tested on top of this fix, ensuring that the
warning mentioned in the bug report is not re-triggered:
https://lore.kernel.org/all/aHpf4LfMtB2V9uNb@jlelli-thinkpadt14gen4.remote.csb/
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/deadline.c | 66 ++++++++++++++++++-----------------------
kernel/sched/sched.h | 1 -
kernel/sched/topology.c | 8 -----
3 files changed, 29 insertions(+), 46 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1af06e48227..e97a7feb59d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -141,6 +141,24 @@ static inline int dl_bw_cpus(int i)
return cpus;
}
+static inline u64 dl_bw_fair(int i)
+{
+ struct root_domain *rd = cpu_rq(i)->rd;
+ u64 fair_server_bw = 0;
+
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+ "sched RCU must be held");
+
+ if (cpumask_subset(rd->span, cpu_active_mask))
+ i = cpumask_first(rd->span);
+
+ for_each_cpu_and(i, rd->span, cpu_active_mask) {
+ fair_server_bw += cpu_rq(i)->fair_server.dl_bw;
+ }
+
+ return fair_server_bw;
+}
+
static inline unsigned long __dl_bw_capacity(const struct cpumask *mask)
{
unsigned long cap = 0;
@@ -1657,25 +1675,9 @@ void sched_init_dl_servers(void)
}
}
-void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
-{
- u64 new_bw = dl_se->dl_bw;
- int cpu = cpu_of(rq);
- struct dl_bw *dl_b;
-
- dl_b = dl_bw_of(cpu_of(rq));
- guard(raw_spinlock)(&dl_b->lock);
-
- if (!dl_bw_cpus(cpu))
- return;
-
- __dl_add(dl_b, new_bw, dl_bw_cpus(cpu));
-}
-
int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
{
- u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
- u64 new_bw = to_ratio(period, runtime);
+ u64 max_bw, new_bw = to_ratio(period, runtime);
struct rq *rq = dl_se->rq;
int cpu = cpu_of(rq);
struct dl_bw *dl_b;
@@ -1688,17 +1690,14 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
cpus = dl_bw_cpus(cpu);
cap = dl_bw_capacity(cpu);
+ max_bw = cap_scale(BW_UNIT - dl_b->bw, cap) / cpus;
- if (__dl_overflow(dl_b, cap, old_bw, new_bw))
+ if (new_bw > max_bw)
return -EBUSY;
if (init) {
__add_rq_bw(new_bw, &rq->dl);
- __dl_add(dl_b, new_bw, cpus);
} else {
- __dl_sub(dl_b, dl_se->dl_bw, cpus);
- __dl_add(dl_b, new_bw, cpus);
-
dl_rq_change_utilization(rq, dl_se, new_bw);
}
@@ -2932,17 +2931,6 @@ void dl_clear_root_domain(struct root_domain *rd)
rd->dl_bw.total_bw = 0;
for_each_cpu(i, rd->span)
cpu_rq(i)->dl.extra_bw = cpu_rq(i)->dl.max_bw;
-
- /*
- * dl_servers are not tasks. Since dl_add_task_root_domain ignores
- * them, we need to account for them here explicitly.
- */
- for_each_cpu(i, rd->span) {
- struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
-
- if (dl_server(dl_se) && cpu_active(i))
- __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
- }
}
void dl_clear_root_domain_cpu(int cpu)
@@ -3126,9 +3114,10 @@ int sched_dl_global_validate(void)
u64 period = global_rt_period();
u64 new_bw = to_ratio(period, runtime);
u64 cookie = ++dl_cookie;
+ u64 fair_bw;
struct dl_bw *dl_b;
- int cpu, cpus, ret = 0;
- unsigned long flags;
+ int cpu, ret = 0;
+ unsigned long cap, flags;
/*
* Here we want to check the bandwidth not being set to some
@@ -3142,10 +3131,13 @@ int sched_dl_global_validate(void)
goto next;
dl_b = dl_bw_of(cpu);
- cpus = dl_bw_cpus(cpu);
+ cap = dl_bw_capacity(cpu);
+ fair_bw = dl_bw_fair(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- if (new_bw * cpus < dl_b->total_bw)
+ if (cap_scale(new_bw, cap) < dl_b->total_bw)
+ ret = -EBUSY;
+ if (cap_scale(new_bw, cap) + fair_bw > cap_scale(BW_UNIT, cap))
ret = -EBUSY;
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ac953fad8c2..42b5d024dce 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -390,7 +390,6 @@ extern void sched_init_dl_servers(void);
extern void dl_server_update_idle_time(struct rq *rq,
struct task_struct *p);
extern void fair_server_init(struct rq *rq);
-extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
u64 runtime, u64 period, bool init);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 977e133bb8a..4ea3365984a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -500,14 +500,6 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
set_rq_online(rq);
- /*
- * Because the rq is not a task, dl_add_task_root_domain() did not
- * move the fair server bw to the rd if it already started.
- * Add it now.
- */
- if (rq->fair_server.dl_server)
- __dl_server_attach_root(&rq->fair_server, rq);
-
rq_unlock_irqrestore(rq, &rf);
if (old_rd)
--
2.50.1
next reply other threads:[~2025-07-21 11:11 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-21 11:11 Yuri Andriaccio [this message]
2025-07-22 0:07 ` [PATCH] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting kernel test robot
2025-07-25 11:44 ` Matteo Martelli
2025-07-25 15:28 ` Yuri Andriaccio
2025-07-25 17:22 ` Matteo Martelli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250721111131.309388-1-yurand2000@gmail.com \
--to=yurand2000@gmail.com \
--cc=bsegall@google.com \
--cc=dietmar.eggemann@arm.com \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=luca.abeni@santannapisa.it \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
--cc=yuri.andriaccio@santannapisa.it \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.