* [PATCH 0/5] sched/deadline: Fix GRUB accounting
@ 2025-06-27 11:51 Juri Lelli
2025-06-27 11:51 ` [PATCH 1/5] sched/deadline: Initialize dl_servers after SMP Juri Lelli
` (6 more replies)
0 siblings, 7 replies; 18+ messages in thread
From: Juri Lelli @ 2025-06-27 11:51 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Waiman Long
Cc: linux-kernel, Marcel Ziswiler, Luca Abeni, Juri Lelli
Hi All,
This patch series addresses a significant regression observed in
`SCHED_DEADLINE` performance, specifically when `SCHED_FLAG_RECLAIM`
(Greedy Reclamation of Unused Bandwidth - GRUB) is enabled alongside
overrunning jobs. This issue was reported by Marcel [1].
Marcel's team runs extensive real-time scheduler (`SCHED_DEADLINE`) tests on
mainline Linux kernels (amd64-based Intel NUCs and aarch64-based RADXA
ROCK5Bs), which typically show zero deadline misses for 5ms granularity
tasks. However, with reclaim mode enabled and the same two overrunning jobs
in the mix, they observed a dramatic increase in deadline misses: 43
million on NUC and 600 thousand on ROCK5B. This highlights a critical
accounting issue within `SCHED_DEADLINE` when reclaim is active.
This series fixes the issue by doing the following.
- 1/5: sched/deadline: Initialize dl_servers after SMP
Currently, `dl-servers` are initialized too early during boot, before
all CPUs are online. This results in an incorrect calculation of
per-runqueue `DEADLINE` variables, such as `extra_bw`, which rely on a
stable CPU count. This patch moves the `dl-server` initialization to a
later stage, after SMP initialization, ensuring all CPUs are online and
correct `extra_bw` values can be computed from the start.
- 2/5: sched/deadline: Reset extra_bw to max_bw when clearing root domains
The `dl_clear_root_domain()` function was found to not properly account
for the fact that per-runqueue `extra_bw` variables retained stale
values computed before root domain changes. This led to broken
accounting. This patch fixes the issue by resetting `extra_bw` to
`max_bw` before restoring `dl-server` contributions, ensuring a clean
state.
- 3/5: sched/deadline: Fix accounting after global limits change
Changes to global `SCHED_DEADLINE` limits (handled by
`sched_rt_handler()` logic) were found to leave stale or incorrect
values in various accounting-related variables, including `extra_bw`.
This patch properly cleans up per-runqueue variables before implementing
the global limit change and then rebuilds the scheduling domains. This
ensures that the accounting is correctly restored and maintained after
such global limit adjustments.
- 4/5 and 5/5 are simple drgn scripts I put together to help debug
this issue. I have the impression that they might be useful to have
around for the future.
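To make the accounting problem behind 1/5 and 2/5 concrete, here is a toy Python model (not kernel code). It assumes, as a simplification, that admitting a bandwidth `bw` subtracts `bw / cpus` from the `extra_bw` of every CPU currently in the root domain span, that dl-servers use the 50ms/1000ms parameters from 1/5, and that the global limit is the default 95%. The helper names and the one-CPU-at-a-time boot order are illustrative assumptions, not taken from the patches.

```python
# Toy model of per-runqueue extra_bw accounting; a sketch, not kernel code.
BW_SHIFT = 20  # bandwidth is a fixed-point fraction scaled by 2**20
NSEC_PER_MSEC = 1_000_000


def to_ratio(period, runtime):
    """runtime/period as a fixed-point value (assumed to mirror to_ratio())."""
    return (runtime << BW_SHIFT) // period


def add_bw(extra_bw, span, bw):
    """Admit bw: each CPU in span gives up an equal share of its extra_bw."""
    share = bw // len(span)
    for cpu in span:
        extra_bw[cpu] -= share


def rebuild(extra_bw, span, max_bw, server_bw, reset):
    """Re-add one dl-server per CPU, as after a root domain change."""
    if reset:  # what 2/5 adds: start again from max_bw
        for cpu in span:
            extra_bw[cpu] = max_bw
    for _ in span:
        add_bw(extra_bw, span, server_bw)


cpus = [0, 1, 2, 3]
max_bw = to_ratio(1_000_000, 950_000)  # assumed default: 95% global limit
server_bw = to_ratio(1000 * NSEC_PER_MSEC, 50 * NSEC_PER_MSEC)

# Before 1/5 (assumed order): servers are admitted while CPUs come up one by one.
early = {cpu: max_bw for cpu in cpus}
for n in range(1, len(cpus) + 1):
    add_bw(early, cpus[:n], server_bw)

# After 1/5: all CPUs are online when the servers are admitted.
late = {cpu: max_bw for cpu in cpus}
for _ in cpus:
    add_bw(late, cpus, server_bw)

# Root domain rebuild without and with the reset from 2/5.
stale = dict(late)
rebuild(stale, cpus, max_bw, server_bw, reset=False)
clean = dict(late)
rebuild(clean, cpus, max_bw, server_bw, reset=True)

# Print how much each CPU's extra_bw sits below max_bw in every scenario.
for name, values in (("early", early), ("late", late),
                     ("stale", stale), ("clean", clean)):
    print(f"{name:5}: {[max_bw - v for v in values.values()]}")
```

In this model the early variant leaves CPU 0 with noticeably less reclaimable bandwidth than CPU 3, and skipping the reset makes every rebuild subtract the dl-server contributions a second time.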
Please review and test.
The set is also available at
git@github.com:jlelli/linux.git upstream/fix-grub-tip
1 - https://lore.kernel.org/lkml/ce8469c4fb2f3e2ada74add22cce4bfe61fd5bab.camel@codethink.co.uk/
Thanks,
Juri
Juri Lelli (5):
sched/deadline: Initialize dl_servers after SMP
sched/deadline: Reset extra_bw to max_bw when clearing root domains
sched/deadline: Fix accounting after global limits change
tools/sched: Add root_domains_dump.py which dumps root domains info
tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info
MAINTAINERS | 1 +
kernel/sched/core.c | 2 +
kernel/sched/deadline.c | 61 +++++++++++++++++++---------
kernel/sched/rt.c | 6 +++
kernel/sched/sched.h | 1 +
tools/sched/dl_bw_dump.py | 57 ++++++++++++++++++++++++++
tools/sched/root_domains_dump.py | 68 ++++++++++++++++++++++++++++++++
7 files changed, 177 insertions(+), 19 deletions(-)
create mode 100755 tools/sched/dl_bw_dump.py
create mode 100755 tools/sched/root_domains_dump.py
--
2.49.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 1/5] sched/deadline: Initialize dl_servers after SMP
2025-06-27 11:51 [PATCH 0/5] sched/deadline: Fix GRUB accounting Juri Lelli
@ 2025-06-27 11:51 ` Juri Lelli
[not found] ` <1e39c473-d161-4ad0-bfdc-8a306f57135f@redhat.com>
2025-07-14 9:10 ` [tip: sched/core] " tip-bot2 for Juri Lelli
2025-06-27 11:51 ` [PATCH 2/5] sched/deadline: Reset extra_bw to max_bw when clearing root domains Juri Lelli
` (5 subsequent siblings)
6 siblings, 2 replies; 18+ messages in thread
From: Juri Lelli @ 2025-06-27 11:51 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Waiman Long
Cc: linux-kernel, Marcel Ziswiler, Luca Abeni, Juri Lelli
dl-servers are currently initialized too early at boot when CPUs are not
fully up (only boot CPU is). This results in miscalculation of per
runqueue DEADLINE variables like extra_bw (which needs a stable CPU
count).
Move initialization of dl-servers later on after SMP has been
initialized and CPUs are all online, so that CPU count is stable and
DEADLINE variables can be computed correctly.
Fixes: d741f297bceaf ("sched/fair: Fair server interface")
Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
kernel/sched/core.c | 2 ++
kernel/sched/deadline.c | 50 ++++++++++++++++++++++++++---------------
kernel/sched/sched.h | 1 +
3 files changed, 35 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f8caa9db78d5..89b3ed637465b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8371,6 +8371,8 @@ void __init sched_init_smp(void)
init_sched_rt_class();
init_sched_dl_class();
+ sched_init_dl_servers();
+
sched_smp_initialized = true;
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0f30697ad7956..c1f223f372968 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -761,6 +761,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rq *rq = rq_of_dl_rq(dl_rq);
+ update_rq_clock(rq);
+
WARN_ON(is_dl_boosted(dl_se));
WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
@@ -1580,23 +1582,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
{
struct rq *rq = dl_se->rq;
- /*
- * XXX: the apply do not work fine at the init phase for the
- * fair server because things are not yet set. We need to improve
- * this before getting generic.
- */
- if (!dl_server(dl_se)) {
- u64 runtime = 50 * NSEC_PER_MSEC;
- u64 period = 1000 * NSEC_PER_MSEC;
-
- dl_server_apply_params(dl_se, runtime, period, 1);
-
- dl_se->dl_server = 1;
- dl_se->dl_defer = 1;
- setup_new_dl_entity(dl_se);
- }
-
- if (!dl_se->dl_runtime)
+ if (!dl_server(dl_se))
return;
dl_se->dl_server_active = 1;
@@ -1607,7 +1593,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
void dl_server_stop(struct sched_dl_entity *dl_se)
{
- if (!dl_se->dl_runtime)
+ if (!dl_server(dl_se) || !dl_server_active(dl_se))
return;
dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
@@ -1626,6 +1612,32 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_se->server_pick_task = pick_task;
}
+void sched_init_dl_servers(void)
+{
+ int cpu;
+ struct rq *rq;
+ struct sched_dl_entity *dl_se;
+
+ for_each_online_cpu(cpu) {
+ u64 runtime = 50 * NSEC_PER_MSEC;
+ u64 period = 1000 * NSEC_PER_MSEC;
+
+ rq = cpu_rq(cpu);
+
+ guard(rq_lock_irq)(rq);
+
+ dl_se = &rq->fair_server;
+
+ WARN_ON(dl_server(dl_se));
+
+ dl_server_apply_params(dl_se, runtime, period, 1);
+
+ dl_se->dl_server = 1;
+ dl_se->dl_defer = 1;
+ setup_new_dl_entity(dl_se);
+ }
+}
+
void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
{
u64 new_bw = dl_se->dl_bw;
@@ -1652,6 +1664,8 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
int retval = 0;
int cpus;
+ guard(rcu)();
+
dl_b = dl_bw_of(cpu);
guard(raw_spinlock)(&dl_b->lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 105190b180203..3058fb6246dab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -385,6 +385,7 @@ extern void dl_server_stop(struct sched_dl_entity *dl_se);
extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick_task);
+extern void sched_init_dl_servers(void);
extern void dl_server_update_idle_time(struct rq *rq,
struct task_struct *p);
--
2.49.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 2/5] sched/deadline: Reset extra_bw to max_bw when clearing root domains
2025-06-27 11:51 [PATCH 0/5] sched/deadline: Fix GRUB accounting Juri Lelli
2025-06-27 11:51 ` [PATCH 1/5] sched/deadline: Initialize dl_servers after SMP Juri Lelli
@ 2025-06-27 11:51 ` Juri Lelli
2025-07-14 9:10 ` [tip: sched/core] " tip-bot2 for Juri Lelli
2025-06-27 11:51 ` [PATCH 3/5] sched/deadline: Fix accounting after global limits change Juri Lelli
` (4 subsequent siblings)
6 siblings, 1 reply; 18+ messages in thread
From: Juri Lelli @ 2025-06-27 11:51 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Waiman Long
Cc: linux-kernel, Marcel Ziswiler, Luca Abeni, Juri Lelli
dl_clear_root_domain() doesn't take into account the fact that per-rq
extra_bw variables retain values computed before root domain changes,
resulting in broken accounting.
Fix it by resetting extra_bw to max_bw before restoring dl-server
contributions.
Fixes: 2ff899e351643 ("sched/deadline: Rebuild root domain accounting after every update")
Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
kernel/sched/deadline.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c1f223f372968..7a3b556d45a99 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2907,7 +2907,14 @@ void dl_clear_root_domain(struct root_domain *rd)
int i;
guard(raw_spinlock_irqsave)(&rd->dl_bw.lock);
+
+ /*
+ * Reset total_bw to zero and extra_bw to max_bw so that next
+ * loop will add dl-servers contributions back properly.
+ */
rd->dl_bw.total_bw = 0;
+ for_each_cpu(i, rd->span)
+ cpu_rq(i)->dl.extra_bw = cpu_rq(i)->dl.max_bw;
/*
* dl_servers are not tasks. Since dl_add_task_root_domain ignores
--
2.49.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 3/5] sched/deadline: Fix accounting after global limits change
2025-06-27 11:51 [PATCH 0/5] sched/deadline: Fix GRUB accounting Juri Lelli
2025-06-27 11:51 ` [PATCH 1/5] sched/deadline: Initialize dl_servers after SMP Juri Lelli
2025-06-27 11:51 ` [PATCH 2/5] sched/deadline: Reset extra_bw to max_bw when clearing root domains Juri Lelli
@ 2025-06-27 11:51 ` Juri Lelli
2025-07-14 8:59 ` Peter Zijlstra
2025-07-14 9:10 ` [tip: sched/core] " tip-bot2 for Juri Lelli
2025-06-27 11:51 ` [PATCH 4/5] tools/sched: Add root_domains_dump.py which dumps root domains info Juri Lelli
` (3 subsequent siblings)
6 siblings, 2 replies; 18+ messages in thread
From: Juri Lelli @ 2025-06-27 11:51 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Waiman Long
Cc: linux-kernel, Marcel Ziswiler, Luca Abeni, Juri Lelli
A global limits change (sched_rt_handler() logic) currently leaves stale
and/or incorrect values in variables related to accounting (e.g.
extra_bw).
Properly clean up per runqueue variables before implementing the change
and rebuild scheduling domains (so that accounting is also properly
restored) after such a change is complete.
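As a rough illustration of what has to be recomputed, the following Python sketch derives the per-runqueue values from the global limits. It is modeled on how init_dl_rq_bw_ratio() and to_ratio() are understood to work (20-bit fixed point for bandwidths, 8-bit for bw_ratio); the function name `dl_rq_bw_from_limits` and the dictionary layout are made up for the example.

```python
# Sketch of per-runqueue values derived from the global RT limits.
BW_SHIFT = 20     # fixed-point shift for bandwidths
RATIO_SHIFT = 8   # fixed-point shift for bw_ratio
RUNTIME_INF = -1  # sched_rt_runtime_us == -1 means "no limit"


def to_ratio(period, runtime):
    """runtime/period as a fixed-point value scaled by 2**BW_SHIFT."""
    return (runtime << BW_SHIFT) // period


def dl_rq_bw_from_limits(period_us, runtime_us):
    """Values assumed to match what init_dl_rq_bw_ratio() computes."""
    if runtime_us == RUNTIME_INF:
        max_bw = 1 << BW_SHIFT
        bw_ratio = 1 << RATIO_SHIFT
    else:
        max_bw = to_ratio(period_us, runtime_us)
        # Inverse of the limit (period/runtime), in 8-bit fixed point.
        bw_ratio = to_ratio(runtime_us, period_us) >> (BW_SHIFT - RATIO_SHIFT)
    # extra_bw restarts from max_bw; dl-server and task contributions are
    # only subtracted again once the scheduling domains are rebuilt.
    return {"max_bw": max_bw, "extra_bw": max_bw, "bw_ratio": bw_ratio}


default = dl_rq_bw_from_limits(1_000_000, 950_000)
halved = dl_rq_bw_from_limits(1_000_000, 500_000)
print(default)
print(halved)
```

Because extra_bw is reset to max_bw here, the reset alone drops all dl-server contributions, which is why the rebuild afterwards is needed to restore them.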
Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
kernel/sched/deadline.c | 4 +++-
kernel/sched/rt.c | 6 ++++++
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7a3b556d45a99..187f324565f92 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3166,6 +3166,9 @@ void sched_dl_do_global(void)
if (global_rt_runtime() != RUNTIME_INF)
new_bw = to_ratio(global_rt_period(), global_rt_runtime());
+ for_each_possible_cpu(cpu)
+ init_dl_rq_bw_ratio(&cpu_rq(cpu)->dl);
+
for_each_possible_cpu(cpu) {
rcu_read_lock_sched();
@@ -3181,7 +3184,6 @@ void sched_dl_do_global(void)
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
rcu_read_unlock_sched();
- init_dl_rq_bw_ratio(&cpu_rq(cpu)->dl);
}
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 15d5855c542cb..be6e9bcbe82b6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2886,6 +2886,12 @@ static int sched_rt_handler(const struct ctl_table *table, int write, void *buff
sched_domains_mutex_unlock();
mutex_unlock(&mutex);
+ /*
+ * After changing maximum available bandwidth for DEADLINE, we need to
+ * recompute per root domain and per cpus variables accordingly.
+ */
+ rebuild_sched_domains();
+
return ret;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 4/5] tools/sched: Add root_domains_dump.py which dumps root domains info
2025-06-27 11:51 [PATCH 0/5] sched/deadline: Fix GRUB accounting Juri Lelli
` (2 preceding siblings ...)
2025-06-27 11:51 ` [PATCH 3/5] sched/deadline: Fix accounting after global limits change Juri Lelli
@ 2025-06-27 11:51 ` Juri Lelli
2025-07-14 9:10 ` [tip: sched/core] " tip-bot2 for Juri Lelli
2025-06-27 11:51 ` [PATCH 5/5] tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info Juri Lelli
` (2 subsequent siblings)
6 siblings, 1 reply; 18+ messages in thread
From: Juri Lelli @ 2025-06-27 11:51 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Waiman Long
Cc: linux-kernel, Marcel Ziswiler, Luca Abeni, Juri Lelli
Root domains information is somewhat hard to access at runtime. Even
with sched_debug and sched_verbose, such information is only printed
on kernel console when domains are modified.
Add a simple drgn script to more easily retrieve root domains
information at runtime.
Since tools/sched is a new directory, add it to MAINTAINERS as well.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
MAINTAINERS | 1 +
tools/sched/root_domains_dump.py | 68 ++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+)
create mode 100755 tools/sched/root_domains_dump.py
diff --git a/MAINTAINERS b/MAINTAINERS
index a92290fffa163..b986a49383c9c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22165,6 +22165,7 @@ F: include/linux/wait.h
F: include/uapi/linux/sched.h
F: kernel/fork.c
F: kernel/sched/
+F: tools/sched/
SCHEDULER - SCHED_EXT
R: Tejun Heo <tj@kernel.org>
diff --git a/tools/sched/root_domains_dump.py b/tools/sched/root_domains_dump.py
new file mode 100755
index 0000000000000..56dc91f017b20
--- /dev/null
+++ b/tools/sched/root_domains_dump.py
@@ -0,0 +1,68 @@
+#!/usr/bin/env drgn
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2025 Juri Lelli <juri.lelli@redhat.com>
+# Copyright (C) 2025 Red Hat, Inc.
+
+desc = """
+This is a drgn script to show the current root domains configuration. For more
+info on drgn, visit https://github.com/osandov/drgn.
+
+Root domains are only printed once, as multiple CPUs might be attached to the
+same root domain.
+"""
+
+import os
+import argparse
+
+import drgn
+from drgn import FaultError
+from drgn.helpers.common import *
+from drgn.helpers.linux import *
+
+def print_root_domains_info():
+
+    # To store unique root domains found
+    seen_root_domains = set()
+
+    print("Retrieving (unique) Root Domain Information:")
+
+    runqueues = prog['runqueues']
+    def_root_domain = prog['def_root_domain']
+
+    for cpu_id in for_each_possible_cpu(prog):
+        try:
+            rq = per_cpu(runqueues, cpu_id)
+
+            root_domain = rq.rd
+
+            # Check if we've already processed this root domain to avoid duplicates
+            # Use the memory address of the root_domain as a unique identifier
+            root_domain_cast = int(root_domain)
+            if root_domain_cast in seen_root_domains:
+                continue
+            seen_root_domains.add(root_domain_cast)
+
+            if root_domain_cast == int(def_root_domain.address_):
+                print(f"\n--- Root Domain @ def_root_domain ---")
+            else:
+                print(f"\n--- Root Domain @ 0x{root_domain_cast:x} ---")
+
+            print(f" From CPU: {cpu_id}") # This CPU belongs to this root domain
+
+            # Access and print relevant fields from struct root_domain
+            print(f" Span : {cpumask_to_cpulist(root_domain.span[0])}")
+            print(f" Online : {cpumask_to_cpulist(root_domain.online[0])}")
+
+        except drgn.FaultError as fe:
+            print(f" (CPU {cpu_id}: Fault accessing kernel memory: {fe})")
+        except AttributeError as ae:
+            print(f" (CPU {cpu_id}: Missing attribute for root_domain (kernel struct change?): {ae})")
+        except Exception as e:
+            print(f" (CPU {cpu_id}: An unexpected error occurred: {e})")
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description=desc,
+                                     formatter_class=argparse.RawTextHelpFormatter)
+    args = parser.parse_args()
+
+    print_root_domains_info()
--
2.49.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 5/5] tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info
2025-06-27 11:51 [PATCH 0/5] sched/deadline: Fix GRUB accounting Juri Lelli
` (3 preceding siblings ...)
2025-06-27 11:51 ` [PATCH 4/5] tools/sched: Add root_domains_dump.py which dumps root domains info Juri Lelli
@ 2025-06-27 11:51 ` Juri Lelli
2025-07-14 9:10 ` [tip: sched/core] " tip-bot2 for Juri Lelli
2025-06-30 6:01 ` [PATCH 0/5] sched/deadline: Fix GRUB accounting Marcel Ziswiler
2025-07-11 10:05 ` Marcel Ziswiler
6 siblings, 1 reply; 18+ messages in thread
From: Juri Lelli @ 2025-06-27 11:51 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Waiman Long
Cc: linux-kernel, Marcel Ziswiler, Luca Abeni, Juri Lelli
dl_rq bandwidth accounting information is crucial for the correct
functioning of SCHED_DEADLINE.
Add a drgn script for accessing that information at runtime, so that
it's easier to check and debug issues related to it.
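The script prints the raw fixed-point values. A small helper like the one below can turn them into percentages when reading the output; it assumes bandwidths are scaled by 2**20 (BW_SHIFT), and the sample numbers are hypothetical, picked to resemble a CPU with a 95% limit and one 5% dl-server accounted for.

```python
BW_SHIFT = 20  # assumed fixed-point scale of running_bw, this_bw, extra_bw, max_bw


def bw_to_percent(bw):
    """Convert a fixed-point bandwidth into a percentage of one CPU."""
    return 100.0 * bw / (1 << BW_SHIFT)


# Hypothetical values, not taken from a real run of dl_bw_dump.py.
sample = {"running_bw": 0, "this_bw": 0, "extra_bw": 943719, "max_bw": 996147}

for field, raw in sample.items():
    print(f"{field:10} : {raw:8} ({bw_to_percent(raw):6.2f}%)")
```

bw_ratio is left out on purpose, since it is understood to use a different scale (2**8) and is a ratio rather than a bandwidth.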
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
tools/sched/dl_bw_dump.py | 57 +++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)
create mode 100755 tools/sched/dl_bw_dump.py
diff --git a/tools/sched/dl_bw_dump.py b/tools/sched/dl_bw_dump.py
new file mode 100755
index 0000000000000..aae4e42b17690
--- /dev/null
+++ b/tools/sched/dl_bw_dump.py
@@ -0,0 +1,57 @@
+#!/usr/bin/env drgn
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2025 Juri Lelli <juri.lelli@redhat.com>
+# Copyright (C) 2025 Red Hat, Inc.
+
+desc = """
+This is a drgn script to show dl_rq bandwidth accounting information. For more
+info on drgn, visit https://github.com/osandov/drgn.
+
+Only online CPUs are reported.
+"""
+
+import os
+import argparse
+
+import drgn
+from drgn import FaultError
+from drgn.helpers.common import *
+from drgn.helpers.linux import *
+
+def print_dl_bws_info():
+
+    print("Retrieving dl_rq bandwidth accounting information:")
+
+    runqueues = prog['runqueues']
+
+    for cpu_id in for_each_possible_cpu(prog):
+        try:
+            rq = per_cpu(runqueues, cpu_id)
+
+            if rq.online == 0:
+                continue
+
+            dl_rq = rq.dl
+
+            print(f" From CPU: {cpu_id}")
+
+            # Access and print relevant fields from struct dl_rq
+            print(f" running_bw : {dl_rq.running_bw}")
+            print(f" this_bw : {dl_rq.this_bw}")
+            print(f" extra_bw : {dl_rq.extra_bw}")
+            print(f" max_bw : {dl_rq.max_bw}")
+            print(f" bw_ratio : {dl_rq.bw_ratio}")
+
+        except drgn.FaultError as fe:
+            print(f" (CPU {cpu_id}: Fault accessing kernel memory: {fe})")
+        except AttributeError as ae:
+            print(f" (CPU {cpu_id}: Missing attribute for dl_rq (kernel struct change?): {ae})")
+        except Exception as e:
+            print(f" (CPU {cpu_id}: An unexpected error occurred: {e})")
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description=desc,
+                                     formatter_class=argparse.RawTextHelpFormatter)
+    args = parser.parse_args()
+
+    print_dl_bws_info()
--
2.49.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH 1/5] sched/deadline: Initialize dl_servers after SMP
[not found] ` <1e39c473-d161-4ad0-bfdc-8a306f57135f@redhat.com>
@ 2025-06-29 23:08 ` Waiman Long
2025-06-30 10:21 ` Juri Lelli
1 sibling, 0 replies; 18+ messages in thread
From: Waiman Long @ 2025-06-29 23:08 UTC (permalink / raw)
To: Waiman Long, Juri Lelli, Ingo Molnar, Peter Zijlstra,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider
Cc: linux-kernel, Marcel Ziswiler, Luca Abeni
Resend again.
On 6/29/25 6:48 PM, Waiman Long wrote:
> On 6/27/25 7:51 AM, Juri Lelli wrote:
>> dl-servers are currently initialized too early at boot when CPUs are not
>> fully up (only boot CPU is). This results in miscalculation of per
>> runqueue DEADLINE variables like extra_bw (which needs a stable CPU
>> count).
>>
>> Move initialization of dl-servers later on after SMP has been
>> initialized and CPUs are all online, so that CPU count is stable and
>> DEADLINE variables can be computed correctly.
>>
>> Fixes: d741f297bceaf ("sched/fair: Fair server interface")
>> Reported-by: Marcel Ziswiler<marcel.ziswiler@codethink.co.uk>
>> Signed-off-by: Juri Lelli<juri.lelli@redhat.com>
>> ---
>> kernel/sched/core.c | 2 ++
>> kernel/sched/deadline.c | 50 ++++++++++++++++++++++++++---------------
>> kernel/sched/sched.h | 1 +
>> 3 files changed, 35 insertions(+), 18 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2f8caa9db78d5..89b3ed637465b 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8371,6 +8371,8 @@ void __init sched_init_smp(void)
>> init_sched_rt_class();
>> init_sched_dl_class();
>>
>> + sched_init_dl_servers();
>> +
>> sched_smp_initialized = true;
>> }
>>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 0f30697ad7956..c1f223f372968 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -761,6 +761,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> struct rq *rq = rq_of_dl_rq(dl_rq);
>>
>> + update_rq_clock(rq);
>> +
>> WARN_ON(is_dl_boosted(dl_se));
>> WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
>>
>> @@ -1580,23 +1582,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
>> {
>> struct rq *rq = dl_se->rq;
>>
>> - /*
>> - * XXX: the apply do not work fine at the init phase for the
>> - * fair server because things are not yet set. We need to improve
>> - * this before getting generic.
>> - */
>> - if (!dl_server(dl_se)) {
>> - u64 runtime = 50 * NSEC_PER_MSEC;
>> - u64 period = 1000 * NSEC_PER_MSEC;
>> -
>> - dl_server_apply_params(dl_se, runtime, period, 1);
>> -
>> - dl_se->dl_server = 1;
>> - dl_se->dl_defer = 1;
>> - setup_new_dl_entity(dl_se);
>> - }
>> -
>> - if (!dl_se->dl_runtime)
>> + if (!dl_server(dl_se))
>> return;
>>
>> dl_se->dl_server_active = 1;
>> @@ -1607,7 +1593,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
>>
>> void dl_server_stop(struct sched_dl_entity *dl_se)
>> {
>> - if (!dl_se->dl_runtime)
>> + if (!dl_server(dl_se) || !dl_server_active(dl_se))
>> return;
>>
>> dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
>> @@ -1626,6 +1612,32 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
>> dl_se->server_pick_task = pick_task;
>> }
>>
>> +void sched_init_dl_servers(void)
>> +{
>> + int cpu;
>> + struct rq *rq;
>> + struct sched_dl_entity *dl_se;
>> +
>> + for_each_online_cpu(cpu) {
>> + u64 runtime = 50 * NSEC_PER_MSEC;
>> + u64 period = 1000 * NSEC_PER_MSEC;
>> +
>> + rq = cpu_rq(cpu);
>> +
>> + guard(rq_lock_irq)(rq);
>> +
>> + dl_se = &rq->fair_server;
>> +
>> + WARN_ON(dl_server(dl_se));
>> +
>> + dl_server_apply_params(dl_se, runtime, period, 1);
>> +
>> + dl_se->dl_server = 1;
>> + dl_se->dl_defer = 1;
>> + setup_new_dl_entity(dl_se);
>> + }
>> +}
>> +
>> void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
>> {
>> u64 new_bw = dl_se->dl_bw;
>> @@ -1652,6 +1664,8 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
>> int retval = 0;
>> int cpus;
>>
>> + guard(rcu)();
>> +
>
> Your patch doesn't explain why an RCU guard is needed here.
> sched_init_dl_servers() is the changed caller, but it is called with
> rq_lock_irq held, which should imply an RCU read-side critical section as
> IRQs are disabled.
>
> Cheers, Longman
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] sched/deadline: Fix GRUB accounting
2025-06-27 11:51 [PATCH 0/5] sched/deadline: Fix GRUB accounting Juri Lelli
` (4 preceding siblings ...)
2025-06-27 11:51 ` [PATCH 5/5] tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info Juri Lelli
@ 2025-06-30 6:01 ` Marcel Ziswiler
2025-07-11 10:05 ` Marcel Ziswiler
6 siblings, 0 replies; 18+ messages in thread
From: Marcel Ziswiler @ 2025-06-30 6:01 UTC (permalink / raw)
To: Juri Lelli, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Waiman Long
Cc: linux-kernel, Luca Abeni
Hi Juri
On Fri, 2025-06-27 at 13:51 +0200, Juri Lelli wrote:
> Hi All,
>
> This patch series addresses a significant regression observed in
> `SCHED_DEADLINE` performance, specifically when `SCHED_FLAG_RECLAIM`
> (Greedy Reclamation of Unused Bandwidth - GRUB) is enabled alongside
> overrunning jobs. This issue was reported by Marcel [1].
>
> Marcel's team runs extensive real-time scheduler (`SCHED_DEADLINE`) tests on
> mainline Linux kernels (amd64-based Intel NUCs and aarch64-based RADXA
> ROCK5Bs), which typically show zero deadline misses for 5ms granularity
> tasks. However, with reclaim mode enabled and the same two overrunning jobs
> in the mix, they observed a dramatic increase in deadline misses: 43
> million on NUC and 600 thousand on ROCK5B. This highlights a critical
> accounting issue within `SCHED_DEADLINE` when reclaim is active.
>
> This series fixes the issue by doing the following.
>
> - 1/5: sched/deadline: Initialize dl_servers after SMP
> Currently, `dl-servers` are initialized too early during boot, before
> all CPUs are online. This results in an incorrect calculation of
> per-runqueue `DEADLINE` variables, such as `extra_bw`, which rely on a
> stable CPU count. This patch moves the `dl-server` initialization to a
> later stage, after SMP initialization, ensuring all CPUs are online and
> correct `extra_bw` values can be computed from the start.
>
> - 2/5: sched/deadline: Reset extra_bw to max_bw when clearing root domains
> The `dl_clear_root_domain()` function was found to not properly account
> for the fact that per-runqueue `extra_bw` variables retained stale
> values computed before root domain changes. This led to broken
> accounting. This patch fixes the issue by resetting `extra_bw` to
> `max_bw` before restoring `dl-server` contributions, ensuring a clean
> state.
>
> - 3/5: sched/deadline: Fix accounting after global limits change
> Changes to global `SCHED_DEADLINE` limits (handled by
> `sched_rt_handler()` logic) were found to leave stale or incorrect
> values in various accounting-related variables, including `extra_bw`.
> This patch properly cleans up per-runqueue variables before implementing
> the global limit change and then rebuilds the scheduling domains. This
> ensures that the accounting is correctly restored and maintained after
> such global limit adjustments.
>
> - 4/5 and 5/5 are simple drgn scripts I put together to help debug
> this issue. I have the impression that they might be useful to have
> around for the future.
>
> Please review and test.
Over the weekend I ran 312 million test runs on the NUC and 231 million on the ROCK5B without a single deadline miss.
Therefore,
for the whole series:
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b
Thanks!
> The set is also available at
>
> git@github.com:jlelli/linux.git upstream/fix-grub-tip
>
> 1 - https://lore.kernel.org/lkml/ce8469c4fb2f3e2ada74add22cce4bfe61fd5bab.camel@codethink.co.uk/
>
> Thanks,
> Juri
>
> Juri Lelli (5):
> sched/deadline: Initialize dl_servers after SMP
> sched/deadline: Reset extra_bw to max_bw when clearing root domains
> sched/deadline: Fix accounting after global limits change
> tools/sched: Add root_domains_dump.py which dumps root domains info
> tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info
>
> MAINTAINERS | 1 +
> kernel/sched/core.c | 2 +
> kernel/sched/deadline.c | 61 +++++++++++++++++++---------
> kernel/sched/rt.c | 6 +++
> kernel/sched/sched.h | 1 +
> tools/sched/dl_bw_dump.py | 57 ++++++++++++++++++++++++++
> tools/sched/root_domains_dump.py | 68 ++++++++++++++++++++++++++++++++
> 7 files changed, 177 insertions(+), 19 deletions(-)
> create mode 100755 tools/sched/dl_bw_dump.py
> create mode 100755 tools/sched/root_domains_dump.py
Cheers
Marcel
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/5] sched/deadline: Initialize dl_servers after SMP
[not found] ` <1e39c473-d161-4ad0-bfdc-8a306f57135f@redhat.com>
2025-06-29 23:08 ` Waiman Long
@ 2025-06-30 10:21 ` Juri Lelli
2025-06-30 17:04 ` Waiman Long
1 sibling, 1 reply; 18+ messages in thread
From: Juri Lelli @ 2025-06-30 10:21 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Marcel Ziswiler, Luca Abeni
On 29/06/25 18:48, Waiman Long wrote:
> On 6/27/25 7:51 AM, Juri Lelli wrote:
...
> > @@ -1652,6 +1664,8 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
> > int retval = 0;
> > int cpus;
> > + guard(rcu)();
> > +
>
> Your patch doesn't explain why an RCU guard is needed here.
> sched_init_dl_servers() is the changed caller, but it is called with
> rq_lock_irq held, which should imply an RCU read-side critical section as
> IRQs are disabled.
Yeah, looks like it's not required. Will remove. Thanks for spotting it!
Best,
Juri
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/5] sched/deadline: Initialize dl_servers after SMP
2025-06-30 10:21 ` Juri Lelli
@ 2025-06-30 17:04 ` Waiman Long
0 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2025-06-30 17:04 UTC (permalink / raw)
To: Juri Lelli, Waiman Long
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Marcel Ziswiler, Luca Abeni
On 6/30/25 6:21 AM, Juri Lelli wrote:
> On 29/06/25 18:48, Waiman Long wrote:
>> On 6/27/25 7:51 AM, Juri Lelli wrote:
> ...
>
>>> @@ -1652,6 +1664,8 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
>>> int retval = 0;
>>> int cpus;
>>> + guard(rcu)();
>>> +
>> Your patch doesn't explain why an RCU guard is needed here.
>> sched_init_dl_servers() is the changed caller, but it is called with
>> rq_lock_irq held, which should imply an RCU read-side critical section as
>> IRQs are disabled.
> Yeah, looks like it's not required. Will remove. Thanks for spotting it!
Other than this minor nit, the patch series looks good to me, given my
limited understanding of the DL scheduler.
Acked-by: Waiman Long <longman@redhat.com>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] sched/deadline: Fix GRUB accounting
2025-06-27 11:51 [PATCH 0/5] sched/deadline: Fix GRUB accounting Juri Lelli
` (5 preceding siblings ...)
2025-06-30 6:01 ` [PATCH 0/5] sched/deadline: Fix GRUB accounting Marcel Ziswiler
@ 2025-07-11 10:05 ` Marcel Ziswiler
6 siblings, 0 replies; 18+ messages in thread
From: Marcel Ziswiler @ 2025-07-11 10:05 UTC (permalink / raw)
To: Juri Lelli, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Waiman Long
Cc: linux-kernel, Luca Abeni
Hi everybody
On Fri, 2025-06-27 at 13:51 +0200, Juri Lelli wrote:
> Hi All,
Any more progress on this?
As this is a bug, it would be really nice to land a fix sooner rather than later : )
Thanks!
> This patch series addresses a significant regression observed in
> `SCHED_DEADLINE` performance, specifically when `SCHED_FLAG_RECLAIM`
> (Greedy Reclamation of Unused Bandwidth - GRUB) is enabled alongside
> overrunning jobs. This issue was reported by Marcel [1].
>
> Marcel's team's extensive real-time scheduler (`SCHED_DEADLINE`) tests on
> mainline Linux kernels (amd64-based Intel NUCs and aarch64-based RADXA
> ROCK5Bs) typically show zero deadline misses for 5ms granularity tasks.
> However, with reclaim mode enabled and the same two overrunning jobs in
> the mix, they observed a dramatic increase in deadline misses: 43
> million on NUC and 600 thousand on ROCK5B. This highlights a critical
> accounting issue within `SCHED_DEADLINE` when reclaim is active.
>
> This series fixes the issue by doing the following.
>
> - 1/5: sched/deadline: Initialize dl_servers after SMP
> Currently, `dl-servers` are initialized too early during boot, before
> all CPUs are online. This results in an incorrect calculation of
> per-runqueue `DEADLINE` variables, such as `extra_bw`, which rely on a
> stable CPU count. This patch moves the `dl-server` initialization to a
> later stage, after SMP initialization, ensuring all CPUs are online and
> correct `extra_bw` values can be computed from the start.
>
> - 2/5: sched/deadline: Reset extra_bw to max_bw when clearing root domains
> The `dl_clear_root_domain()` function was found to not properly account
> for the fact that per-runqueue `extra_bw` variables retained stale
> values computed before root domain changes. This led to broken
> accounting. This patch fixes the issue by resetting `extra_bw` to
> `max_bw` before restoring `dl-server` contributions, ensuring a clean
> state.
>
> - 3/5: sched/deadline: Fix accounting after global limits change
> Changes to global `SCHED_DEADLINE` limits (handled by
> `sched_rt_handler()` logic) were found to leave stale or incorrect
> values in various accounting-related variables, including `extra_bw`.
> This patch properly cleans up per-runqueue variables before implementing
> the global limit change and then rebuilds the scheduling domains. This
> ensures that the accounting is correctly restored and maintained after
> such global limit adjustments.
>
> - 4/5 and 5/5 are simple drgn scripts I put together to help debug
> this issue. I have the impression that they might be useful to have
> around for the future.
>
> Please review and test.
>
> The set is also available at
>
> git@github.com:jlelli/linux.git upstream/fix-grub-tip
>
> 1 - https://lore.kernel.org/lkml/ce8469c4fb2f3e2ada74add22cce4bfe61fd5bab.camel@codethink.co.uk/
>
> Thanks,
> Juri
>
> Juri Lelli (5):
> sched/deadline: Initialize dl_servers after SMP
> sched/deadline: Reset extra_bw to max_bw when clearing root domains
> sched/deadline: Fix accounting after global limits change
> tools/sched: Add root_domains_dump.py which dumps root domains info
> tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info
>
> MAINTAINERS | 1 +
> kernel/sched/core.c | 2 +
> kernel/sched/deadline.c | 61 +++++++++++++++++++---------
> kernel/sched/rt.c | 6 +++
> kernel/sched/sched.h | 1 +
> tools/sched/dl_bw_dump.py | 57 ++++++++++++++++++++++++++
> tools/sched/root_domains_dump.py | 68 ++++++++++++++++++++++++++++++++
> 7 files changed, 177 insertions(+), 19 deletions(-)
> create mode 100755 tools/sched/dl_bw_dump.py
> create mode 100755 tools/sched/root_domains_dump.py
Cheers
Marcel
* Re: [PATCH 3/5] sched/deadline: Fix accounting after global limits change
2025-06-27 11:51 ` [PATCH 3/5] sched/deadline: Fix accounting after global limits change Juri Lelli
@ 2025-07-14 8:59 ` Peter Zijlstra
2025-07-15 10:07 ` Juri Lelli
2025-07-14 9:10 ` [tip: sched/core] " tip-bot2 for Juri Lelli
1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2025-07-14 8:59 UTC (permalink / raw)
To: Juri Lelli
Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Waiman Long,
linux-kernel, Marcel Ziswiler, Luca Abeni
On Fri, Jun 27, 2025 at 01:51:16PM +0200, Juri Lelli wrote:
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 15d5855c542cb..be6e9bcbe82b6 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2886,6 +2886,12 @@ static int sched_rt_handler(const struct ctl_table *table, int write, void *buff
> sched_domains_mutex_unlock();
> mutex_unlock(&mutex);
>
> + /*
> + * After changing maximum available bandwidth for DEADLINE, we need to
> + * recompute per root domain and per cpus variables accordingly.
> + */
> + rebuild_sched_domains();
> +
> return ret;
> }
So I'll merge these patches since correctness first etc. But the above
is quite terrible. It would be really good not to have to rebuild the
sched domains for every rt change. Surely we can iterate the existing
domains and update stuff?
* [tip: sched/core] tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info
2025-06-27 11:51 ` [PATCH 5/5] tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info Juri Lelli
@ 2025-07-14 9:10 ` tip-bot2 for Juri Lelli
0 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Juri Lelli @ 2025-07-14 9:10 UTC (permalink / raw)
To: linux-tip-commits
Cc: Juri Lelli, Peter Zijlstra (Intel), Marcel Ziswiler, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 634c24068abf8f325e520e663250e4a32a95ea0e
Gitweb: https://git.kernel.org/tip/634c24068abf8f325e520e663250e4a32a95ea0e
Author: Juri Lelli <juri.lelli@redhat.com>
AuthorDate: Fri, 27 Jun 2025 13:51:18 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 14 Jul 2025 10:59:33 +02:00
tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info
dl_rq bandwidth accounting information is crucial for the correct
functioning of SCHED_DEADLINE.
Add a drgn script for accessing that information at runtime, so that
it's easier to check and debug issues related to it.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b
Link: https://lore.kernel.org/r/20250627115118.438797-6-juri.lelli@redhat.com
---
tools/sched/dl_bw_dump.py | 57 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 57 insertions(+)
create mode 100644 tools/sched/dl_bw_dump.py
diff --git a/tools/sched/dl_bw_dump.py b/tools/sched/dl_bw_dump.py
new file mode 100644
index 0000000..aae4e42
--- /dev/null
+++ b/tools/sched/dl_bw_dump.py
@@ -0,0 +1,57 @@
+#!/usr/bin/env drgn
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2025 Juri Lelli <juri.lelli@redhat.com>
+# Copyright (C) 2025 Red Hat, Inc.
+
+desc = """
+This is a drgn script to show dl_rq bandwidth accounting information. For more
+info on drgn, visit https://github.com/osandov/drgn.
+
+Only online CPUs are reported.
+"""
+
+import os
+import argparse
+
+import drgn
+from drgn import FaultError
+from drgn.helpers.common import *
+from drgn.helpers.linux import *
+
+def print_dl_bws_info():
+
+ print("Retrieving dl_rq bandwidth accounting information:")
+
+ runqueues = prog['runqueues']
+
+ for cpu_id in for_each_possible_cpu(prog):
+ try:
+ rq = per_cpu(runqueues, cpu_id)
+
+ if rq.online == 0:
+ continue
+
+ dl_rq = rq.dl
+
+ print(f" From CPU: {cpu_id}")
+
+ # Access and print relevant fields from struct dl_rq
+ print(f" running_bw : {dl_rq.running_bw}")
+ print(f" this_bw : {dl_rq.this_bw}")
+ print(f" extra_bw : {dl_rq.extra_bw}")
+ print(f" max_bw : {dl_rq.max_bw}")
+ print(f" bw_ratio : {dl_rq.bw_ratio}")
+
+ except drgn.FaultError as fe:
+ print(f" (CPU {cpu_id}: Fault accessing kernel memory: {fe})")
+ except AttributeError as ae:
+ print(f" (CPU {cpu_id}: Missing attribute for dl_rq (kernel struct change?): {ae})")
+ except Exception as e:
+ print(f" (CPU {cpu_id}: An unexpected error occurred: {e})")
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(description=desc,
+ formatter_class=argparse.RawTextHelpFormatter)
+ args = parser.parse_args()
+
+ print_dl_bws_info()
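When reading the script's output, it may help to remember that the bandwidth fields are fixed-point fractions of a CPU rather than percentages. The sketch below approximates the kernel's `to_ratio()` arithmetic in plain Python, assuming `BW_SHIFT` is 20 as in kernel/sched/sched.h and the default `sched_rt_runtime_us`/`sched_rt_period_us` values of 950000/1000000. The Python helpers `to_ratio` and `bw_to_percent` are illustrative and not part of the patch.

```python
# Illustrative only: mirrors the kernel's fixed-point bandwidth math in
# plain Python. BW_SHIFT = 20 is an assumption taken from kernel/sched/sched.h.
BW_SHIFT = 20
BW_UNIT = 1 << BW_SHIFT
NSEC_PER_MSEC = 1_000_000


def to_ratio(period, runtime):
    """Return runtime/period as a BW_SHIFT fixed-point integer."""
    if period == 0:
        return 0
    return (runtime << BW_SHIFT) // period


def bw_to_percent(bw):
    """Convert a raw bandwidth value (as printed by the script) to percent."""
    return 100.0 * bw / BW_UNIT


# Fair server parameters used by sched_init_dl_servers(): 50ms every 1000ms.
fair_server_bw = to_ratio(1000 * NSEC_PER_MSEC, 50 * NSEC_PER_MSEC)

# Global limit, assuming the default 950000us runtime per 1000000us period.
default_max_bw = to_ratio(1_000_000, 950_000)

print("fair server bw:", fair_server_bw, f"({bw_to_percent(fair_server_bw):.2f}%)")
print("default max_bw:", default_max_bw, f"({bw_to_percent(default_max_bw):.2f}%)")
```

Under these assumptions, a `max_bw` or `extra_bw` value can be divided by `1 << 20` to obtain the fraction of a CPU it represents.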
* [tip: sched/core] tools/sched: Add root_domains_dump.py which dumps root domains info
2025-06-27 11:51 ` [PATCH 4/5] tools/sched: Add root_domains_dump.py which dumps root domains info Juri Lelli
@ 2025-07-14 9:10 ` tip-bot2 for Juri Lelli
0 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Juri Lelli @ 2025-07-14 9:10 UTC (permalink / raw)
To: linux-tip-commits
Cc: Juri Lelli, Peter Zijlstra (Intel), Marcel Ziswiler, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 9fdb12c88e9ba75e2d831fb397dd27f03a534968
Gitweb: https://git.kernel.org/tip/9fdb12c88e9ba75e2d831fb397dd27f03a534968
Author: Juri Lelli <juri.lelli@redhat.com>
AuthorDate: Fri, 27 Jun 2025 13:51:17 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 14 Jul 2025 10:59:33 +02:00
tools/sched: Add root_domains_dump.py which dumps root domains info
Root domains information is somewhat hard to access at runtime. Even
with sched_debug and sched_verbose, such information is only printed
on kernel console when domains are modified.
Add a simple drgn script to more easily retrieve root domains
information at runtime.
Since tools/sched is a new directory, add it to MAINTAINERS as well.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b
Link: https://lore.kernel.org/r/20250627115118.438797-5-juri.lelli@redhat.com
---
MAINTAINERS | 1 +-
tools/sched/root_domains_dump.py | 68 +++++++++++++++++++++++++++++++-
2 files changed, 69 insertions(+)
create mode 100644 tools/sched/root_domains_dump.py
diff --git a/MAINTAINERS b/MAINTAINERS
index a92290f..b986a49 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22165,6 +22165,7 @@ F: include/linux/wait.h
F: include/uapi/linux/sched.h
F: kernel/fork.c
F: kernel/sched/
+F: tools/sched/
SCHEDULER - SCHED_EXT
R: Tejun Heo <tj@kernel.org>
diff --git a/tools/sched/root_domains_dump.py b/tools/sched/root_domains_dump.py
new file mode 100644
index 0000000..56dc91f
--- /dev/null
+++ b/tools/sched/root_domains_dump.py
@@ -0,0 +1,68 @@
+#!/usr/bin/env drgn
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2025 Juri Lelli <juri.lelli@redhat.com>
+# Copyright (C) 2025 Red Hat, Inc.
+
+desc = """
+This is a drgn script to show the current root domains configuration. For more
+info on drgn, visit https://github.com/osandov/drgn.
+
+Root domains are only printed once, as multiple CPUs might be attached to the
+same root domain.
+"""
+
+import os
+import argparse
+
+import drgn
+from drgn import FaultError
+from drgn.helpers.common import *
+from drgn.helpers.linux import *
+
+def print_root_domains_info():
+
+ # To store unique root domains found
+ seen_root_domains = set()
+
+ print("Retrieving (unique) Root Domain Information:")
+
+ runqueues = prog['runqueues']
+ def_root_domain = prog['def_root_domain']
+
+ for cpu_id in for_each_possible_cpu(prog):
+ try:
+ rq = per_cpu(runqueues, cpu_id)
+
+ root_domain = rq.rd
+
+ # Check if we've already processed this root domain to avoid duplicates
+ # Use the memory address of the root_domain as a unique identifier
+ root_domain_cast = int(root_domain)
+ if root_domain_cast in seen_root_domains:
+ continue
+ seen_root_domains.add(root_domain_cast)
+
+ if root_domain_cast == int(def_root_domain.address_):
+ print(f"\n--- Root Domain @ def_root_domain ---")
+ else:
+ print(f"\n--- Root Domain @ 0x{root_domain_cast:x} ---")
+
+ print(f" From CPU: {cpu_id}") # This CPU belongs to this root domain
+
+ # Access and print relevant fields from struct root_domain
+ print(f" Span : {cpumask_to_cpulist(root_domain.span[0])}")
+ print(f" Online : {cpumask_to_cpulist(root_domain.online[0])}")
+
+ except drgn.FaultError as fe:
+ print(f" (CPU {cpu_id}: Fault accessing kernel memory: {fe})")
+ except AttributeError as ae:
+ print(f" (CPU {cpu_id}: Missing attribute for root_domain (kernel struct change?): {ae})")
+ except Exception as e:
+ print(f" (CPU {cpu_id}: An unexpected error occurred: {e})")
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(description=desc,
+ formatter_class=argparse.RawTextHelpFormatter)
+ args = parser.parse_args()
+
+ print_root_domains_info()
* [tip: sched/core] sched/deadline: Fix accounting after global limits change
2025-06-27 11:51 ` [PATCH 3/5] sched/deadline: Fix accounting after global limits change Juri Lelli
2025-07-14 8:59 ` Peter Zijlstra
@ 2025-07-14 9:10 ` tip-bot2 for Juri Lelli
1 sibling, 0 replies; 18+ messages in thread
From: tip-bot2 for Juri Lelli @ 2025-07-14 9:10 UTC (permalink / raw)
To: linux-tip-commits
Cc: Marcel Ziswiler, Juri Lelli, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 440989c10f4e32620e9e2717ca52c3ed7ae11048
Gitweb: https://git.kernel.org/tip/440989c10f4e32620e9e2717ca52c3ed7ae11048
Author: Juri Lelli <juri.lelli@redhat.com>
AuthorDate: Fri, 27 Jun 2025 13:51:16 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 14 Jul 2025 10:59:33 +02:00
sched/deadline: Fix accounting after global limits change
A global limits change (sched_rt_handler() logic) currently leaves stale
and/or incorrect values in variables related to accounting (e.g.
extra_bw).
Properly clean up per runqueue variables before implementing the change
and rebuild scheduling domains (so that accounting is also properly
restored) after such a change is complete.
Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b
Link: https://lore.kernel.org/r/20250627115118.438797-4-juri.lelli@redhat.com
---
kernel/sched/deadline.c | 4 +++-
kernel/sched/rt.c | 6 ++++++
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0abffe3..9c7d952 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3183,6 +3183,9 @@ void sched_dl_do_global(void)
if (global_rt_runtime() != RUNTIME_INF)
new_bw = to_ratio(global_rt_period(), global_rt_runtime());
+ for_each_possible_cpu(cpu)
+ init_dl_rq_bw_ratio(&cpu_rq(cpu)->dl);
+
for_each_possible_cpu(cpu) {
rcu_read_lock_sched();
@@ -3198,7 +3201,6 @@ void sched_dl_do_global(void)
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
rcu_read_unlock_sched();
- init_dl_rq_bw_ratio(&cpu_rq(cpu)->dl);
}
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 15d5855..be6e9bc 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2886,6 +2886,12 @@ undo:
sched_domains_mutex_unlock();
mutex_unlock(&mutex);
+ /*
+ * After changing maximum available bandwidth for DEADLINE, we need to
+ * recompute per root domain and per cpus variables accordingly.
+ */
+ rebuild_sched_domains();
+
return ret;
}
* [tip: sched/core] sched/deadline: Reset extra_bw to max_bw when clearing root domains
2025-06-27 11:51 ` [PATCH 2/5] sched/deadline: Reset extra_bw to max_bw when clearing root domains Juri Lelli
@ 2025-07-14 9:10 ` tip-bot2 for Juri Lelli
0 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Juri Lelli @ 2025-07-14 9:10 UTC (permalink / raw)
To: linux-tip-commits
Cc: Marcel Ziswiler, Juri Lelli, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: fcc9276c4d331cd1fe9319d793e80b02e09727f5
Gitweb: https://git.kernel.org/tip/fcc9276c4d331cd1fe9319d793e80b02e09727f5
Author: Juri Lelli <juri.lelli@redhat.com>
AuthorDate: Fri, 27 Jun 2025 13:51:15 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 14 Jul 2025 10:59:32 +02:00
sched/deadline: Reset extra_bw to max_bw when clearing root domains
dl_clear_root_domain() doesn't take into account the fact that per-rq
extra_bw variables retain values computed before root domain changes,
resulting in broken accounting.
Fix it by resetting extra_bw to max_bw before restoring dl-server
contributions.
Fixes: 2ff899e351643 ("sched/deadline: Rebuild root domain accounting after every update")
Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b
Link: https://lore.kernel.org/r/20250627115118.438797-3-juri.lelli@redhat.com
---
kernel/sched/deadline.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0d25553..0abffe3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2924,7 +2924,14 @@ void dl_clear_root_domain(struct root_domain *rd)
int i;
guard(raw_spinlock_irqsave)(&rd->dl_bw.lock);
+
+ /*
+ * Reset total_bw to zero and extra_bw to max_bw so that next
+ * loop will add dl-servers contributions back properly.
+ */
rd->dl_bw.total_bw = 0;
+ for_each_cpu(i, rd->span)
+ cpu_rq(i)->dl.extra_bw = cpu_rq(i)->dl.max_bw;
/*
* dl_servers are not tasks. Since dl_add_task_root_domain ignores
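The effect of the missing reset can be illustrated with a toy model in Python. This is only a sketch: the class and function names are hypothetical, and it assumes that each bandwidth contribution added to a root domain lowers every CPU's `extra_bw` by `bw / cpus`. It compares repeated root domain rebuilds with and without resetting `extra_bw` to `max_bw`.

```python
# Toy model only: names mirror the kernel fields, but the arithmetic is a
# simplified assumption about how a bandwidth contribution is spread over
# the CPUs of a root domain.
class ToyRootDomain:
    def __init__(self, ncpus, max_bw):
        self.total_bw = 0
        self.max_bw = max_bw
        self.extra_bw = [max_bw] * ncpus  # per-CPU reclaimable bandwidth

    def add_bw(self, bw):
        """Account one entity: every CPU gives up bw / ncpus of extra_bw."""
        self.total_bw += bw
        share = bw // len(self.extra_bw)
        self.extra_bw = [e - share for e in self.extra_bw]

    def clear(self, reset_extra_bw):
        """Model the clear step before (False) and after (True) the fix."""
        self.total_bw = 0
        if reset_extra_bw:
            self.extra_bw = [self.max_bw] * len(self.extra_bw)


def rebuild(rd, server_bw, reset_extra_bw):
    """Clear the domain, then add one dl-server contribution per CPU back."""
    rd.clear(reset_extra_bw)
    for _ in range(len(rd.extra_bw)):
        rd.add_bw(server_bw)
    return rd.extra_bw[0]


MAX_BW, SERVER_BW, NCPUS = 996147, 52428, 4

stale = ToyRootDomain(NCPUS, MAX_BW)
fixed = ToyRootDomain(NCPUS, MAX_BW)
for _ in range(3):  # three consecutive root domain rebuilds
    stale_extra = rebuild(stale, SERVER_BW, reset_extra_bw=False)
    fixed_extra = rebuild(fixed, SERVER_BW, reset_extra_bw=True)

print("extra_bw without reset:", stale_extra)
print("extra_bw with reset   :", fixed_extra)
```

In this model `total_bw` ends up identical in both cases, while `extra_bw` keeps shrinking with every rebuild unless it is reset first.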
* [tip: sched/core] sched/deadline: Initialize dl_servers after SMP
2025-06-27 11:51 ` [PATCH 1/5] sched/deadline: Initialize dl_servers after SMP Juri Lelli
[not found] ` <1e39c473-d161-4ad0-bfdc-8a306f57135f@redhat.com>
@ 2025-07-14 9:10 ` tip-bot2 for Juri Lelli
1 sibling, 0 replies; 18+ messages in thread
From: tip-bot2 for Juri Lelli @ 2025-07-14 9:10 UTC (permalink / raw)
To: linux-tip-commits
Cc: Marcel Ziswiler, Juri Lelli, Peter Zijlstra (Intel), Waiman Long,
x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 9f239df55546ee1d28f0976130136ffd1cad0fd7
Gitweb: https://git.kernel.org/tip/9f239df55546ee1d28f0976130136ffd1cad0fd7
Author: Juri Lelli <juri.lelli@redhat.com>
AuthorDate: Fri, 27 Jun 2025 13:51:14 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 14 Jul 2025 10:59:32 +02:00
sched/deadline: Initialize dl_servers after SMP
dl-servers are currently initialized too early at boot when CPUs are not
fully up (only boot CPU is). This results in miscalculation of per
runqueue DEADLINE variables like extra_bw (which needs a stable CPU
count).
Move initialization of dl-servers later on after SMP has been
initialized and CPUs are all online, so that CPU count is stable and
DEADLINE variables can be computed correctly.
Fixes: d741f297bceaf ("sched/fair: Fair server interface")
Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b
Link: https://lore.kernel.org/r/20250627115118.438797-2-juri.lelli@redhat.com
---
kernel/sched/core.c | 2 ++-
kernel/sched/deadline.c | 48 +++++++++++++++++++++++++---------------
kernel/sched/sched.h | 1 +-
3 files changed, 33 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f8caa9..89b3ed6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8371,6 +8371,8 @@ void __init sched_init_smp(void)
init_sched_rt_class();
init_sched_dl_class();
+ sched_init_dl_servers();
+
sched_smp_initialized = true;
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 23668fc..0d25553 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -761,6 +761,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rq *rq = rq_of_dl_rq(dl_rq);
+ update_rq_clock(rq);
+
WARN_ON(is_dl_boosted(dl_se));
WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
@@ -1585,23 +1587,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
{
struct rq *rq = dl_se->rq;
- /*
- * XXX: the apply do not work fine at the init phase for the
- * fair server because things are not yet set. We need to improve
- * this before getting generic.
- */
- if (!dl_server(dl_se)) {
- u64 runtime = 50 * NSEC_PER_MSEC;
- u64 period = 1000 * NSEC_PER_MSEC;
-
- dl_server_apply_params(dl_se, runtime, period, 1);
-
- dl_se->dl_server = 1;
- dl_se->dl_defer = 1;
- setup_new_dl_entity(dl_se);
- }
-
- if (!dl_se->dl_runtime || dl_se->dl_server_active)
+ if (!dl_server(dl_se) || dl_se->dl_server_active)
return;
dl_se->dl_server_active = 1;
@@ -1612,7 +1598,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
void dl_server_stop(struct sched_dl_entity *dl_se)
{
- if (!dl_se->dl_runtime)
+ if (!dl_server(dl_se) || !dl_server_active(dl_se))
return;
dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
@@ -1645,6 +1631,32 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_se->server_pick_task = pick_task;
}
+void sched_init_dl_servers(void)
+{
+ int cpu;
+ struct rq *rq;
+ struct sched_dl_entity *dl_se;
+
+ for_each_online_cpu(cpu) {
+ u64 runtime = 50 * NSEC_PER_MSEC;
+ u64 period = 1000 * NSEC_PER_MSEC;
+
+ rq = cpu_rq(cpu);
+
+ guard(rq_lock_irq)(rq);
+
+ dl_se = &rq->fair_server;
+
+ WARN_ON(dl_server(dl_se));
+
+ dl_server_apply_params(dl_se, runtime, period, 1);
+
+ dl_se->dl_server = 1;
+ dl_se->dl_defer = 1;
+ setup_new_dl_entity(dl_se);
+ }
+}
+
void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
{
u64 new_bw = dl_se->dl_bw;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 105190b..3058fb6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -385,6 +385,7 @@ extern void dl_server_stop(struct sched_dl_entity *dl_se);
extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick_task);
+extern void sched_init_dl_servers(void);
extern void dl_server_update_idle_time(struct rq *rq,
struct task_struct *p);
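Why a stable CPU count matters can be sketched with a small Python model. The "early" ordering below is a hypothetical illustration, not a trace of the real boot sequence. It assumes that a contribution accounted while only some CPUs are visible is divided by that CPU count and applied only to those CPUs. All names are made up for the example.

```python
# Toy model only. The "early" ordering below is a hypothetical illustration,
# not a trace of the real boot sequence.
MAX_BW, SERVER_BW, NCPUS = 996147, 52428, 4


def account(extra_bw, visible_cpus, bw):
    """Spread one contribution over the CPUs visible at accounting time."""
    share = bw // visible_cpus
    for cpu in range(visible_cpus):
        extra_bw[cpu] -= share


def init_early():
    """Each server is accounted as soon as its CPU appears (count unstable)."""
    extra_bw = [MAX_BW] * NCPUS
    for online in range(1, NCPUS + 1):
        account(extra_bw, online, SERVER_BW)
    return extra_bw


def init_after_smp():
    """All servers are accounted once every CPU is online (count stable)."""
    extra_bw = [MAX_BW] * NCPUS
    for _ in range(NCPUS):
        account(extra_bw, NCPUS, SERVER_BW)
    return extra_bw


print("early    :", init_early())
print("after SMP:", init_after_smp())
```

With a stable count every CPU ends up with the same `extra_bw`. With the unstable count the values differ per CPU, which is the kind of inconsistency that `dl_bw_dump.py` makes visible.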
* Re: [PATCH 3/5] sched/deadline: Fix accounting after global limits change
2025-07-14 8:59 ` Peter Zijlstra
@ 2025-07-15 10:07 ` Juri Lelli
0 siblings, 0 replies; 18+ messages in thread
From: Juri Lelli @ 2025-07-15 10:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Waiman Long,
linux-kernel, Marcel Ziswiler, Luca Abeni
On 14/07/25 10:59, Peter Zijlstra wrote:
> On Fri, Jun 27, 2025 at 01:51:16PM +0200, Juri Lelli wrote:
>
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index 15d5855c542cb..be6e9bcbe82b6 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2886,6 +2886,12 @@ static int sched_rt_handler(const struct ctl_table *table, int write, void *buff
> > sched_domains_mutex_unlock();
> > mutex_unlock(&mutex);
> >
> > + /*
> > + * After changing maximum available bandwidth for DEADLINE, we need to
> > + * recompute per root domain and per cpus variables accordingly.
> > + */
> > + rebuild_sched_domains();
> > +
> > return ret;
> > }
>
> So I'll merge these patches since correctness first etc. But the above
Thanks!
> is quite terrible. It would be really good not to have to rebuild the
> sched domains for every rt change. Surely we can iterate the existing
> domains and update stuff?
Yeah, I agree. I tried doing an update at first, but the locking involved
and the not-so-pleasant code I could come up with made me opt for the
big hammer, also because this should be a very infrequent operation
anyway.
But I will try again fairly soon.
Thanks,
Juri
Thread overview: 18+ messages
2025-06-27 11:51 [PATCH 0/5] sched/deadline: Fix GRUB accounting Juri Lelli
2025-06-27 11:51 ` [PATCH 1/5] sched/deadline: Initialize dl_servers after SMP Juri Lelli
[not found] ` <1e39c473-d161-4ad0-bfdc-8a306f57135f@redhat.com>
2025-06-29 23:08 ` Waiman Long
2025-06-30 10:21 ` Juri Lelli
2025-06-30 17:04 ` Waiman Long
2025-07-14 9:10 ` [tip: sched/core] " tip-bot2 for Juri Lelli
2025-06-27 11:51 ` [PATCH 2/5] sched/deadline: Reset extra_bw to max_bw when clearing root domains Juri Lelli
2025-07-14 9:10 ` [tip: sched/core] " tip-bot2 for Juri Lelli
2025-06-27 11:51 ` [PATCH 3/5] sched/deadline: Fix accounting after global limits change Juri Lelli
2025-07-14 8:59 ` Peter Zijlstra
2025-07-15 10:07 ` Juri Lelli
2025-07-14 9:10 ` [tip: sched/core] " tip-bot2 for Juri Lelli
2025-06-27 11:51 ` [PATCH 4/5] tools/sched: Add root_domains_dump.py which dumps root domains info Juri Lelli
2025-07-14 9:10 ` [tip: sched/core] " tip-bot2 for Juri Lelli
2025-06-27 11:51 ` [PATCH 5/5] tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info Juri Lelli
2025-07-14 9:10 ` [tip: sched/core] " tip-bot2 for Juri Lelli
2025-06-30 6:01 ` [PATCH 0/5] sched/deadline: Fix GRUB accounting Marcel Ziswiler
2025-07-11 10:05 ` Marcel Ziswiler