* [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations
2026-05-26 16:42 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
@ 2026-05-26 16:42 ` Andrea Righi
2026-05-26 17:14 ` sashiko-bot
2026-05-28 11:36 ` Peter Zijlstra
2026-05-26 16:42 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
` (2 subsequent siblings)
3 siblings, 2 replies; 12+ messages in thread
From: Andrea Righi @ 2026-05-26 16:42 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
Koba Ko, Joel Fernandes, Richard Cheng, Cheng-Yang Chou,
sched-ext, linux-kernel
Commit cd959a3562050d ("sched_ext: Add a DL server for sched_ext tasks")
introduced an ext_server deadline server to protect sched_ext tasks from
fair/RT starvation, mirroring the existing fair_server.
Currently, both servers reserve their 50ms/1000ms bandwidth at boot,
regardless of whether a BPF scheduler is loaded. Unused bandwidth is
still reclaimed at runtime by other classes, but the static reservation
prevents the RT class from implicitly using that headroom when one of
the two classes is guaranteed to be empty.
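
For a sense of scale, the 50ms/1000ms reservation can be worked out with a small userspace model of the fixed-point ratio used by the deadline code. This is only a sketch: `model_to_ratio()` is a hypothetical stand-in for the kernel's to_ratio() (it assumes a 20-bit shift and leaves out the RUNTIME_INF special case), not the kernel function itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point shift for bandwidth ratios (1.0 == 1 << 20). */
#define MODEL_BW_SHIFT 20

/*
 * Hypothetical userspace model of to_ratio(): express runtime/period
 * as a fixed-point fraction, truncating toward zero.
 */
static uint64_t model_to_ratio(uint64_t period_ns, uint64_t runtime_ns)
{
	if (period_ns == 0)
		return 0;

	/* The shift below must not overflow 64 bits in this model. */
	assert(runtime_ns < (1ULL << (64 - MODEL_BW_SHIFT)));

	return (runtime_ns << MODEL_BW_SHIFT) / period_ns;
}
```

Calling `model_to_ratio(1000000000, 50000000)` yields 52428, i.e. roughly 5% of one CPU per server. Because the division truncates, doubling the runtime gives 104857 rather than exactly twice 52428.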
A sysadmin can work around this by writing
/sys/kernel/debug/sched/{fair,ext}_server/cpu*/runtime, but that
requires manual action and not all systems expose debugfs.
A better approach is to make server bandwidth reservations dynamic: only
the scheduling policy that is currently active should register its
reservation, while the inactive one should not artificially hold
capacity (keeping both reservations only when the BPF scheduler is
running in partial mode):
+---------------------------------------------+-------------+------------+
| BPF scheduler state | fair server | ext server |
+---------------------------------------------+-------------+------------+
| not loaded (default boot) | reserved | none |
| loaded full mode (!SCX_OPS_SWITCH_PARTIAL) | none | reserved |
| loaded partial mode (SCX_OPS_SWITCH_PARTIAL)| reserved | reserved |
+---------------------------------------------+-------------+------------+
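
The table above can be read as a pure function from scheduler state to the two attach flags. The following is a minimal sketch of that mapping; the enum, struct and function names are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three rows of the table. */
enum scx_state_model {
	SCX_MODEL_NOT_LOADED,	/* default boot */
	SCX_MODEL_FULL,		/* loaded, !SCX_OPS_SWITCH_PARTIAL */
	SCX_MODEL_PARTIAL,	/* loaded, SCX_OPS_SWITCH_PARTIAL */
};

/* Which servers contribute their bandwidth to total_bw. */
struct reservation_model {
	bool fair_attached;
	bool ext_attached;
};

static struct reservation_model reservations_for(enum scx_state_model state)
{
	/* Boot default: only the fair server holds a reservation. */
	struct reservation_model r = { .fair_attached = true, .ext_attached = false };

	assert((int)state >= 0 && (int)state <= SCX_MODEL_PARTIAL);

	switch (state) {
	case SCX_MODEL_NOT_LOADED:
		break;
	case SCX_MODEL_FULL:
		/* No task runs in the fair class: swap the reservations. */
		r.fair_attached = false;
		r.ext_attached = true;
		break;
	case SCX_MODEL_PARTIAL:
		/* Both classes may have tasks: keep both reservations. */
		r.ext_attached = true;
		break;
	}

	return r;
}
```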
To achieve this, introduce an "attached/detached" state for each
deadline server, so the kernel can decide whether a server's bandwidth
should be accounted in global bandwidth tracking.
At boot, the system starts with only the fair server contributing to
bandwidth accounting. When a BPF scheduler is enabled, the ext server is
attached and may replace or complement the fair server depending on
whether full or partial mode is used. When sched_ext is disabled, the
system restores the previous deadline bandwidth values and behavior.
The transition logic ensures that switching between scheduling modes is
consistent and reversible, without losing runtime configuration or
requiring manual intervention.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
include/linux/sched.h | 6 ++
kernel/sched/deadline.c | 204 ++++++++++++++++++++++++++++++++++++++--
kernel/sched/ext.c | 71 ++++++++++++++
kernel/sched/sched.h | 4 +
4 files changed, 278 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee06cba5c6f53..7acceb80628b0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -702,6 +702,11 @@ struct sched_dl_entity {
* running, skipping the defer phase.
*
* @dl_defer_idle tracks idle state
+ *
+ * @dl_bw_attached tells if this server's bandwidth currently
+ * contributes to the root domain's total_bw. Only meaningful for server
+ * entities (@dl_server == 1). Allows toggling the reservation on/off
+ * without losing the configured @dl_runtime/@dl_period.
*/
unsigned int dl_throttled : 1;
unsigned int dl_yielded : 1;
@@ -713,6 +718,7 @@ struct sched_dl_entity {
unsigned int dl_defer_armed : 1;
unsigned int dl_defer_running : 1;
unsigned int dl_defer_idle : 1;
+ unsigned int dl_bw_attached : 1;
/*
* Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7db4c87df83b0..5672f9c583982 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1793,7 +1793,8 @@ void dl_server_start(struct sched_dl_entity *dl_se)
struct rq *rq = dl_se->rq;
dl_se->dl_defer_idle = 0;
- if (!dl_server(dl_se) || dl_se->dl_server_active || !dl_se->dl_runtime)
+ if (!dl_server(dl_se) || dl_se->dl_server_active || !dl_se->dl_runtime ||
+ !dl_se->dl_bw_attached)
return;
/*
@@ -1868,6 +1869,13 @@ void sched_init_dl_servers(void)
dl_se->dl_server = 1;
dl_se->dl_defer = 1;
setup_new_dl_entity(dl_se);
+
+ /*
+ * No BPF scheduler is loaded at boot, so the ext_server has no
+ * tasks to protect. Detach its bandwidth reservation, it will
+ * be attached when a BPF scheduler is loaded.
+ */
+ dl_server_detach_bw(dl_se);
#endif
}
}
@@ -1878,6 +1886,9 @@ void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
int cpu = cpu_of(rq);
struct dl_bw *dl_b;
+ if (!dl_se->dl_bw_attached)
+ return;
+
dl_b = dl_bw_of(cpu_of(rq));
guard(raw_spinlock)(&dl_b->lock);
@@ -1889,7 +1900,8 @@ void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
{
- u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+ u64 old_bw = (init || !dl_se->dl_bw_attached) ? 0 :
+ to_ratio(dl_se->dl_period, dl_se->dl_runtime);
u64 new_bw = to_ratio(period, runtime);
struct rq *rq = dl_se->rq;
int cpu = cpu_of(rq);
@@ -1909,7 +1921,8 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
if (init) {
__add_rq_bw(new_bw, &rq->dl);
__dl_add(dl_b, new_bw, cpus);
- } else {
+ dl_se->dl_bw_attached = 1;
+ } else if (dl_se->dl_bw_attached) {
__dl_sub(dl_b, dl_se->dl_bw, cpus);
__dl_add(dl_b, new_bw, cpus);
@@ -1929,6 +1942,181 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
return 0;
}
+/*
+ * Add @dl_se's bw to the root-domain accounting.
+ *
+ * Return -EBUSY if attaching would overflow root domain capacity.
+ */
+static int __dl_server_attach_bw_locked(struct sched_dl_entity *dl_se,
+ struct dl_bw *dl_b, int cpus)
+{
+ struct rq *rq = dl_se->rq;
+ unsigned long cap;
+
+ /*
+ * Always update @rq->dl.this_bw, but only update @dl_b->total_bw
+ * (and run the overflow check it gates) while this CPU is active.
+ *
+ * This mirrors dl_server_add_bw() during root-domain rebuilds, which
+ * only publishes bandwidth from active CPUs into @dl_b.
+ */
+ if (cpu_active(cpu_of(rq))) {
+ cap = dl_bw_capacity(cpu_of(rq));
+ if (__dl_overflow(dl_b, cap, 0, dl_se->dl_bw))
+ return -EBUSY;
+ __dl_add(dl_b, dl_se->dl_bw, cpus);
+ }
+ __add_rq_bw(dl_se->dl_bw, &rq->dl);
+ dl_se->dl_bw_attached = 1;
+
+ return 0;
+}
+
+/*
+ * Drain @dl_se and remove its bw from the root-domain accounting.
+ */
+static void __dl_server_detach_bw_locked(struct sched_dl_entity *dl_se,
+ struct dl_bw *dl_b, int cpus)
+{
+ struct rq *rq = dl_se->rq;
+
+ /*
+ * If the server is still active (on_rq), dequeue it via
+ * dl_server_stop(); task_non_contending() will either subtract
+ * @dl_bw from running_bw immediately (0-lag passed) or set
+ * dl_non_contending and arm the inactive_timer.
+ */
+ if (dl_se->dl_server_active)
+ dl_server_stop(dl_se);
+
+ /*
+ * Drop @dl_se's contribution from this rq's bandwidth accounting,
+ * mirroring the __add_rq_bw() done at attach time.
+ */
+ dl_rq_change_utilization(rq, dl_se, 0);
+
+ /*
+ * Update @dl_b only while this CPU is active, matching
+ * dl_server_add_bw() during root-domain rebuilds.
+ *
+ * If this CPU is inactive, its bandwidth is not currently accounted in
+ * @dl_b->total_bw: either attach skipped adding it, or a rebuild
+ * already dropped it while re-publishing active CPUs only.
+ *
+ * In that case there is nothing to subtract from @dl_b. Just clear
+ * @dl_se->dl_bw_attached; if the CPU becomes active again, the next
+ * rebuild will re-publish its bandwidth.
+ */
+ if (cpu_active(cpu_of(rq)))
+ __dl_sub(dl_b, dl_se->dl_bw, cpus);
+ dl_se->dl_bw_attached = 0;
+}
+
+/*
+ * Attach @dl_se's bandwidth to the root domain's total_bw accounting.
+ *
+ * Use to dynamically register a dl_server's bandwidth reservation while
+ * preserving its configured @dl_runtime / @dl_period. No-op if @dl_se is
+ * already attached.
+ *
+ * Returns -EBUSY if attaching would overflow the root domain capacity.
+ */
+int dl_server_attach_bw(struct sched_dl_entity *dl_se)
+{
+ struct rq *rq = dl_se->rq;
+ int cpu = cpu_of(rq);
+ struct dl_bw *dl_b;
+ int cpus, ret;
+
+ if (dl_se->dl_bw_attached)
+ return 0;
+
+ scoped_guard (raw_spinlock, &dl_bw_of(cpu)->lock) {
+ dl_b = dl_bw_of(cpu);
+ cpus = dl_bw_cpus(cpu);
+ ret = __dl_server_attach_bw_locked(dl_se, dl_b, cpus);
+ }
+ if (ret)
+ return ret;
+
+ /*
+ * The natural 0->nr_running transition that triggers dl_server_start()
+ * may have happened while @dl_se was still detached (e.g., between
+ * scx_bypass(false) and the scx_enable() re-balance loop), so kick a
+ * start here.
+ *
+ * dl_server_start() bails out cleanly if there's nothing to schedule or
+ * it's already active. Skip if @cpu is offline; the server will be
+ * started naturally on the first enqueue once @cpu comes back.
+ */
+ if (cpu_online(cpu))
+ dl_server_start(dl_se);
+
+ return 0;
+}
+
+/*
+ * Detach @dl_se's bandwidth from the root domain's total_bw accounting.
+ *
+ * Use to dynamically unregister a dl_server's bandwidth reservation while
+ * preserving its configured @dl_runtime / @dl_period. No-op if @dl_se is
+ * not currently attached.
+ */
+void dl_server_detach_bw(struct sched_dl_entity *dl_se)
+{
+ int cpu = cpu_of(dl_se->rq);
+ struct dl_bw *dl_b;
+ int cpus;
+
+ if (!dl_se->dl_bw_attached)
+ return;
+
+ dl_b = dl_bw_of(cpu);
+ guard(raw_spinlock)(&dl_b->lock);
+ cpus = dl_bw_cpus(cpu);
+ __dl_server_detach_bw_locked(dl_se, dl_b, cpus);
+}
+
+/*
+ * Atomically detach @detach_se and attach @attach_se on the same rq, holding
+ * @dl_b->lock across both operations so a concurrent sched_setattr() cannot
+ * steal the bandwidth freed by the detach before the attach can claim it.
+ *
+ * Both entities must live on the same rq (same root domain). Returns the
+ * result of the attach: -EBUSY if attaching @attach_se would overflow root
+ * domain capacity (in which case both servers end up detached).
+ */
+int dl_server_swap_bw(struct sched_dl_entity *detach_se,
+ struct sched_dl_entity *attach_se)
+{
+ struct rq *rq = detach_se->rq;
+ int cpu = cpu_of(rq);
+ struct dl_bw *dl_b;
+ int cpus, ret;
+
+ WARN_ON_ONCE(attach_se->rq != rq);
+
+ scoped_guard (raw_spinlock, &dl_bw_of(cpu)->lock) {
+ dl_b = dl_bw_of(cpu);
+ cpus = dl_bw_cpus(cpu);
+
+ if (detach_se->dl_bw_attached)
+ __dl_server_detach_bw_locked(detach_se, dl_b, cpus);
+
+ if (attach_se->dl_bw_attached)
+ ret = 0;
+ else
+ ret = __dl_server_attach_bw_locked(attach_se, dl_b, cpus);
+ }
+ if (ret)
+ return ret;
+
+ if (cpu_online(cpu))
+ dl_server_start(attach_se);
+
+ return 0;
+}
+
/*
* Update the current task's runtime statistics (provided it is still
* a -deadline task and has not been removed from the dl_rq).
@@ -3236,12 +3424,12 @@ static void dl_server_add_bw(struct root_domain *rd, int cpu)
struct sched_dl_entity *dl_se;
dl_se = &cpu_rq(cpu)->fair_server;
- if (dl_server(dl_se) && cpu_active(cpu))
+ if (dl_server(dl_se) && dl_se->dl_bw_attached && cpu_active(cpu))
__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
#ifdef CONFIG_SCHED_CLASS_EXT
dl_se = &cpu_rq(cpu)->ext_server;
- if (dl_server(dl_se) && cpu_active(cpu))
+ if (dl_server(dl_se) && dl_se->dl_bw_attached && cpu_active(cpu))
__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
#endif
}
@@ -3250,11 +3438,13 @@ static u64 dl_server_read_bw(int cpu)
{
u64 dl_bw = 0;
- if (cpu_rq(cpu)->fair_server.dl_server)
+ if (cpu_rq(cpu)->fair_server.dl_server &&
+ cpu_rq(cpu)->fair_server.dl_bw_attached)
dl_bw += cpu_rq(cpu)->fair_server.dl_bw;
#ifdef CONFIG_SCHED_CLASS_EXT
- if (cpu_rq(cpu)->ext_server.dl_server)
+ if (cpu_rq(cpu)->ext_server.dl_server &&
+ cpu_rq(cpu)->ext_server.dl_bw_attached)
dl_bw += cpu_rq(cpu)->ext_server.dl_bw;
#endif
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 83272acf17637..2330657bd66f3 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6112,6 +6112,7 @@ static void scx_root_disable(struct scx_sched *sch)
{
struct scx_task_iter sti;
struct task_struct *p;
+ bool was_switched_all;
int cpu;
/* guarantee forward progress and wait for descendants to be disabled */
@@ -6138,6 +6139,8 @@ static void scx_root_disable(struct scx_sched *sch)
*/
mutex_lock(&scx_enable_mutex);
+ was_switched_all = scx_switched_all();
+
static_branch_disable(&__scx_switched_all);
WRITE_ONCE(scx_switching_all, false);
@@ -6187,10 +6190,34 @@ static void scx_root_disable(struct scx_sched *sch)
/*
* Invalidate all the rq clocks to prevent getting outdated
* rq clocks from a previous scx scheduler.
+ *
+ * Also re-balance the dl_server bandwidth reservations: detach
+ * ext_server (no more sched_ext tasks) and reinstate fair_server if it
+ * was previously detached because we were running in full mode.
+ *
+ * Unlike the enable path, this runs on a recovery path that cannot
+ * fail, so we use dl_server_swap_bw() to atomically free ext_server's
+ * bandwidth and reclaim it for fair_server under the same dl_b lock.
+ *
+ * The swap can still fail with -EBUSY if someone bumped ext_server's
+ * runtime via debugfs between enable and disable; in that narrow case
+ * both servers end up detached and we just WARN.
*/
for_each_possible_cpu(cpu) {
struct rq *rq = cpu_rq(cpu);
+
scx_rq_clock_invalidate(rq);
+
+ scoped_guard(rq_lock_irqsave, rq) {
+ update_rq_clock(rq);
+ if (was_switched_all) {
+ if (WARN_ON_ONCE(dl_server_swap_bw(&rq->ext_server,
+ &rq->fair_server)))
+ pr_warn("failed to re-attach fair_server on CPU %d\n", cpu);
+ } else {
+ dl_server_detach_bw(&rq->ext_server);
+ }
+ }
}
/* no task is on scx, turn off all the switches and flush in-progress calls */
@@ -7233,6 +7260,31 @@ static void scx_root_enable_workfn(struct kthread_work *work)
if (ret)
goto err_disable;
+ /*
+ * Attach the ext_server bandwidth reservation before anything is
+ * committed so that we can fail the enable if the root domain cannot
+ * accommodate it. The matching fair_server detach is deferred to the
+ * tail of this function, after the switch is fully committed and can no
+ * longer fail.
+ *
+ * On failure, err_disable funnels into scx_root_disable() which
+ * detaches ext_server, so partially-attached state is cleaned up
+ * automatically.
+ */
+ for_each_possible_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+
+ scoped_guard(rq_lock_irqsave, rq) {
+ update_rq_clock(rq);
+ ret = dl_server_attach_bw(&rq->ext_server);
+ }
+ if (ret) {
+ pr_warn("sched_ext: failed to attach ext_server on CPU %d (%d)\n",
+ cpu, ret);
+ goto err_disable;
+ }
+ }
+
/*
* Once __scx_enabled is set, %current can be switched to SCX anytime.
* This can lead to stalls as some BPF schedulers (e.g. userspace
@@ -7387,6 +7439,25 @@ static void scx_root_enable_workfn(struct kthread_work *work)
if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL))
static_branch_enable(&__scx_switched_all);
+ /*
+ * Detach the fair_server bandwidth reservation now that the switch
+ * is fully committed. In full mode (!SCX_OPS_SWITCH_PARTIAL) no
+ * task will ever run in the fair class, so give that bandwidth
+ * back to the RT class. The matching ext_server attach already
+ * happened earlier; this only releases bandwidth and cannot fail.
+ *
+ * In partial mode keep fair_server attached.
+ */
+ if (scx_switched_all()) {
+ for_each_possible_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+
+ guard(rq_lock_irqsave)(rq);
+ update_rq_clock(rq);
+ dl_server_detach_bw(&rq->fair_server);
+ }
+ }
+
pr_info("sched_ext: BPF scheduler \"%s\" enabled%s\n",
sch->ops.name, scx_switched_all() ? "" : " (partial)");
kobject_uevent(&sch->kobj, KOBJ_ADD);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d1..6d7eae73a6c00 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -421,6 +421,10 @@ extern void ext_server_init(struct rq *rq);
extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
u64 runtime, u64 period, bool init);
+extern int dl_server_attach_bw(struct sched_dl_entity *dl_se);
+extern void dl_server_detach_bw(struct sched_dl_entity *dl_se);
+extern int dl_server_swap_bw(struct sched_dl_entity *detach_se,
+ struct sched_dl_entity *attach_se);
static inline bool dl_server_active(struct sched_dl_entity *dl_se)
{
--
2.54.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations
2026-05-26 16:42 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
@ 2026-05-26 17:14 ` sashiko-bot
2026-05-28 11:36 ` Peter Zijlstra
1 sibling, 0 replies; 12+ messages in thread
From: sashiko-bot @ 2026-05-26 17:14 UTC (permalink / raw)
To: Andrea Righi; +Cc: sched-ext
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
Pre-existing issues:
- [High] Modifying a deadline server's parameters via debugfs while its CPU is inactive (offline) corrupts the root domain's bandwidth accounting and can trigger a divide-by-zero kernel panic.
--
commit 81a64c3f8b34ea0965999ebcdc61714354b79af3
Author: Andrea Righi <arighi@nvidia.com>
sched_ext: Auto-register/unregister dl_server reservations
This commit makes deadline server bandwidth reservations dynamic by attaching
or detaching them based on whether a BPF scheduler is loaded. This prevents
inactive scheduling classes from artificially holding capacity.
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index ddfd6bc63ab1d..90f6589297ac5 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
[ ... ]
> @@ -1893,7 +1904,8 @@ void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
>
> int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
> {
> - u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> + u64 old_bw = (init || !dl_se->dl_bw_attached) ? 0 :
> + to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> u64 new_bw = to_ratio(period, runtime);
> struct rq *rq = dl_se->rq;
> int cpu = cpu_of(rq);
> @@ -1913,7 +1925,8 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
> if (init) {
> __add_rq_bw(new_bw, &rq->dl);
> __dl_add(dl_b, new_bw, cpus);
> - } else {
> + } else if (dl_se->dl_bw_attached) {
> __dl_sub(dl_b, dl_se->dl_bw, cpus);
[Severity: High]
This is a pre-existing issue, but can modifying a deadline server's
parameters via debugfs while its CPU is offline trigger a divide-by-zero
panic here?
Looking at dl_server_apply_params(), it retrieves the number of active CPUs
via dl_bw_cpus(cpu) and then passes that to __dl_sub() and __dl_add() when
the server is attached:
dl_server_apply_params() {
...
cpus = dl_bw_cpus(cpu);
...
if (init) {
...
} else if (dl_se->dl_bw_attached) {
__dl_sub(dl_b, dl_se->dl_bw, cpus);
__dl_add(dl_b, new_bw, cpus);
...
}
If the CPU is offline, dl_bw_cpus() can return 0. The helpers __dl_sub()
and __dl_add() perform a division by cpus:
__dl_sub() {
...
__dl_update(dl_b, (s32)tsk_bw / cpus);
}
Wouldn't this result in a direct divide-by-zero if the debugfs file for an
inactive CPU is written to?
Additionally, even if cpus > 0, since inactive CPUs are skipped during
root domain bandwidth rebuilds, does subtracting dl_se->dl_bw from
dl_b->total_bw without a cpu_active() check subtract a value that was
never added, resulting in accounting corruption?
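
To make the two concerns above concrete, here is a userspace sketch of a subtraction helper that refuses to run when the CPU is inactive or when no CPUs are counted. It is an illustration under stated assumptions, not a proposed fix: `struct dl_bw_model` and `model_dl_sub_checked()` are hypothetical, and the `per_cpu_share` field merely stands in for whatever __dl_update() distributes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the root domain's struct dl_bw. */
struct dl_bw_model {
	uint64_t total_bw;	/* sum of admitted bandwidth */
	int64_t per_cpu_share;	/* stand-in for the value passed to __dl_update() */
};

/*
 * Subtract @bw only when it is safe to do so.
 * Returns 0 on success, -1 if the update was skipped.
 */
static int model_dl_sub_checked(struct dl_bw_model *b, uint64_t bw,
				int cpus, int cpu_is_active)
{
	assert(b);

	/* An inactive CPU never published its bandwidth: nothing to remove. */
	if (!cpu_is_active)
		return -1;

	/* Avoid the division by zero when no CPU is counted. */
	if (cpus <= 0)
		return -1;

	b->total_bw -= bw;
	b->per_cpu_share += (int32_t)bw / cpus;

	return 0;
}
```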
> __dl_add(dl_b, new_bw, cpus);
>
> dl_rq_change_utilization(rq, dl_se, new_bw);
> }
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260526164420.638711-1-arighi@nvidia.com?part=1
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations
2026-05-26 16:42 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
2026-05-26 17:14 ` sashiko-bot
@ 2026-05-28 11:36 ` Peter Zijlstra
2026-05-28 16:13 ` Andrea Righi
1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2026-05-28 11:36 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, K Prateek Nayak, Christian Loehle,
Phil Auld, Koba Ko, Joel Fernandes, Richard Cheng,
Cheng-Yang Chou, sched-ext, linux-kernel
On Tue, May 26, 2026 at 06:42:48PM +0200, Andrea Righi wrote:
> @@ -6187,10 +6190,34 @@ static void scx_root_disable(struct scx_sched *sch)
> /*
> * Invalidate all the rq clocks to prevent getting outdated
> * rq clocks from a previous scx scheduler.
> + *
> + * Also re-balance the dl_server bandwidth reservations: detach
> + * ext_server (no more sched_ext tasks) and reinstate fair_server if it
> + * was previously detached because we were running in full mode.
> + *
> + * Unlike the enable path, this runs on a recovery path that cannot
> + * fail, so we use dl_server_swap_bw() to atomically free ext_server's
> + * bandwidth and reclaim it for fair_server under the same dl_b lock.
> + *
> + * The swap can still fail with -EBUSY if someone bumped ext_server's
> + * runtime via debugfs between enable and disable; in that narrow case
> + * both servers end up detached and we just WARN.
> */
> for_each_possible_cpu(cpu) {
> struct rq *rq = cpu_rq(cpu);
> +
> scx_rq_clock_invalidate(rq);
> +
> + scoped_guard(rq_lock_irqsave, rq) {
> + update_rq_clock(rq);
> + if (was_switched_all) {
> + if (WARN_ON_ONCE(dl_server_swap_bw(&rq->ext_server,
> + &rq->fair_server)))
> + pr_warn("failed to re-attach fair_server on CPU %d\n", cpu);
One option here, with the swap, is to reduce the fair server's bandwidth
to match the outgoing ext server. Then at least you end up with the fair
server running, rather than having it completely stopped.
But this is going to be a rather rare occurrence, and people will have
to go poke at the debugfs controls anyway if this happens, so maybe
that's just not worth the effort.
But I wanted to mention it...
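
For illustration only, the fallback described above could look roughly like the following userspace model. Everything here is an assumption: the types and `model_swap_with_fallback()` are hypothetical, the overflow check is a plain sum against a capacity rather than the kernel's __dl_overflow(), and clamping overwrites the configured fair bandwidth, which a real implementation might prefer to preserve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified root-domain accounting. */
struct rd_model {
	uint64_t cap_bw;	/* maximum admissible bandwidth */
	uint64_t total_bw;	/* bandwidth currently admitted */
};

struct server_model {
	uint64_t bw;
	bool attached;
};

static bool model_overflow(const struct rd_model *rd, uint64_t new_bw)
{
	return rd->total_bw + new_bw > rd->cap_bw;
}

/*
 * Detach @ext and attach @fair. If @fair no longer fits, clamp it to the
 * bandwidth just freed by @ext instead of leaving it detached.
 * Returns the bandwidth attached for @fair (0 if nothing could be attached).
 */
static uint64_t model_swap_with_fallback(struct rd_model *rd,
					 struct server_model *ext,
					 struct server_model *fair)
{
	uint64_t freed = 0, want;

	assert(rd && ext && fair);

	if (ext->attached) {
		assert(ext->bw <= rd->total_bw);
		rd->total_bw -= ext->bw;
		ext->attached = false;
		freed = ext->bw;
	}

	if (fair->attached)
		return fair->bw;

	want = fair->bw;
	if (model_overflow(rd, want))
		want = fair->bw < freed ? fair->bw : freed;	/* fallback */

	if (want == 0 || model_overflow(rd, want))
		return 0;

	rd->total_bw += want;
	fair->bw = want;
	fair->attached = true;

	return want;
}
```

In this model the fallback always fits whenever the ext server was attached, because the clamped value was released under the same (modelled) critical section.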
> + } else {
> + dl_server_detach_bw(&rq->ext_server);
> + }
> + }
> }
>
> /* no task is on scx, turn off all the switches and flush in-progress calls */
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations
2026-05-28 11:36 ` Peter Zijlstra
@ 2026-05-28 16:13 ` Andrea Righi
0 siblings, 0 replies; 12+ messages in thread
From: Andrea Righi @ 2026-05-28 16:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, K Prateek Nayak, Christian Loehle,
Phil Auld, Koba Ko, Joel Fernandes, Richard Cheng,
Cheng-Yang Chou, sched-ext, linux-kernel
Hi Peter,
On Thu, May 28, 2026 at 01:36:21PM +0200, Peter Zijlstra wrote:
> On Tue, May 26, 2026 at 06:42:48PM +0200, Andrea Righi wrote:
> > @@ -6187,10 +6190,34 @@ static void scx_root_disable(struct scx_sched *sch)
> > /*
> > * Invalidate all the rq clocks to prevent getting outdated
> > * rq clocks from a previous scx scheduler.
> > + *
> > + * Also re-balance the dl_server bandwidth reservations: detach
> > + * ext_server (no more sched_ext tasks) and reinstate fair_server if it
> > + * was previously detached because we were running in full mode.
> > + *
> > + * Unlike the enable path, this runs on a recovery path that cannot
> > + * fail, so we use dl_server_swap_bw() to atomically free ext_server's
> > + * bandwidth and reclaim it for fair_server under the same dl_b lock.
> > + *
> > + * The swap can still fail with -EBUSY if someone bumped ext_server's
> > + * runtime via debugfs between enable and disable; in that narrow case
> > + * both servers end up detached and we just WARN.
> > */
> > for_each_possible_cpu(cpu) {
> > struct rq *rq = cpu_rq(cpu);
> > +
> > scx_rq_clock_invalidate(rq);
> > +
> > + scoped_guard(rq_lock_irqsave, rq) {
> > + update_rq_clock(rq);
> > + if (was_switched_all) {
> > + if (WARN_ON_ONCE(dl_server_swap_bw(&rq->ext_server,
> > + &rq->fair_server)))
> > + pr_warn("failed to re-attach fair_server on CPU %d\n", cpu);
>
> One option here, with the swap, is to reduce the fair server's bandwidth
> to match the outgoing ext server. Then at least you end up with the fair
> server running, rather than having it completely stopped.
>
> But this is going to be a rather rare occurrence, and people will have
> to go poke at the debugfs controls anyway if this happens, so maybe
> that's just not worth the effort.
>
> But I wanted to mention it...
Yeah, it'd be safer to at least have "some" bandwidth attached if
dl_server_swap_bw() fails, so that fair isn't left completely unprotected.
On top of that we could even try to opportunistically restore the original
bandwidth whenever DL bw is released, but as you say, this is probably a rare
scenario, maybe it could be a later follow-up improvement?
>
> > + } else {
> > + dl_server_detach_bw(&rq->ext_server);
> > + }
> > + }
> > }
> >
> > /* no task is on scx, turn off all the switches and flush in-progress calls */
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test
2026-05-26 16:42 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
2026-05-26 16:42 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
@ 2026-05-26 16:42 ` Andrea Righi
2026-05-26 17:33 ` sashiko-bot
2026-05-27 12:36 ` [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Juri Lelli
2026-05-28 15:53 ` Tejun Heo
3 siblings, 1 reply; 12+ messages in thread
From: Andrea Righi @ 2026-05-26 16:42 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
Koba Ko, Joel Fernandes, Richard Cheng, Cheng-Yang Chou,
sched-ext, linux-kernel
Extend the total_bw selftest to validate the fair/ext dl_server
auto-attach/detach operations.
After the existing consistency checks, the test now doubles the
fair_server's runtime on every CPU via debugfs and verifies that:
1. total_bw grew after the customization (proves fair_server was
attached and apply_params() honored the dl_bw_attached flag),
2. with the minimal BPF scheduler loaded, total_bw drops back to the
baseline value (proves fair_server was detached and ext_server was
attached at its own default runtime),
3. after unload total_bw matches the doubled value from step 1 (proves
fair_server was re-attached with the runtime customization preserved
across the load/unload cycle).
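
The three checks above can be modelled for a single CPU as follows. This is a sketch with hypothetical names, assuming total_bw is simply the sum of the attached servers' bandwidth and that a full-mode scheduler is loaded; it does not read debugfs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-server state: configured bandwidth plus attach flag. */
struct srv_model {
	uint64_t bw;
	bool attached;
};

struct cpu_model {
	struct srv_model fair;
	struct srv_model ext;
};

/* Only attached servers contribute; detached ones keep their configured bw. */
static uint64_t model_total_bw(const struct cpu_model *c)
{
	assert(c);
	return (c->fair.attached ? c->fair.bw : 0) +
	       (c->ext.attached ? c->ext.bw : 0);
}

/* Full-mode load: ext takes over the reservation from fair. */
static void model_load_full(struct cpu_model *c)
{
	assert(c);
	c->ext.attached = true;
	c->fair.attached = false;
}

/* Unload: fair comes back with whatever bw it was configured with. */
static void model_unload(struct cpu_model *c)
{
	assert(c);
	c->ext.attached = false;
	c->fair.attached = true;
}
```

Starting from equal default bandwidths, raising `fair.bw` increases the total, `model_load_full()` brings it back to the baseline, and `model_unload()` returns it to the raised value.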
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
tools/testing/selftests/sched_ext/total_bw.c | 201 ++++++++++++++++++-
1 file changed, 200 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
index 5b0a619bab86e..2af01cee90cc0 100644
--- a/tools/testing/selftests/sched_ext/total_bw.c
+++ b/tools/testing/selftests/sched_ext/total_bw.c
@@ -100,6 +100,98 @@ static int read_total_bw_values(long *bw_values, int max_cpus)
return cpu_count;
}
+/*
+ * Read a per-CPU dl_server param (runtime or period) from debugfs.
+ * Returns the value in nanoseconds, or -1 on failure.
+ */
+static long read_server_param(const char *server, const char *param, int cpu)
+{
+ char path[128];
+ long value = -1;
+ FILE *fp;
+
+ snprintf(path, sizeof(path),
+ "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
+ server, cpu, param);
+ fp = fopen(path, "r");
+ if (!fp)
+ return -1;
+ if (fscanf(fp, "%ld", &value) != 1)
+ value = -1;
+ fclose(fp);
+
+ return value;
+}
+
+/*
+ * Write a per-CPU dl_server param to debugfs. Returns 0 on success.
+ */
+static int write_server_param(const char *server, const char *param,
+ int cpu, long value)
+{
+ char path[128];
+ FILE *fp;
+ int ret = 0;
+
+ snprintf(path, sizeof(path),
+ "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
+ server, cpu, param);
+ fp = fopen(path, "w");
+ if (!fp)
+ return -1;
+ if (fprintf(fp, "%ld", value) < 0)
+ ret = -1;
+ if (fclose(fp) != 0)
+ ret = -1;
+
+ return ret;
+}
+
+static int read_fair_runtime_all(int nr_cpus, long *runtimes)
+{
+ int i;
+
+ for (i = 0; i < nr_cpus; i++) {
+ runtimes[i] = read_server_param("fair", "runtime", i);
+ if (runtimes[i] <= 0)
+ return -1;
+ }
+
+ return 0;
+}
+
+static int write_fair_runtime_all(int nr_cpus, long value)
+{
+ int i;
+
+ for (i = 0; i < nr_cpus; i++) {
+ if (write_server_param("fair", "runtime", i, value) < 0) {
+ SCX_ERR("Failed to write fair_server runtime on CPU %d", i);
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
+/*
+ * Restore per-CPU fair_server runtimes.
+ */
+static int restore_fair_runtime_all(int nr_cpus, const long *runtimes)
+{
+ int ret = 0;
+ int i;
+
+ for (i = 0; i < nr_cpus; i++) {
+ if (write_server_param("fair", "runtime", i, runtimes[i]) < 0) {
+ SCX_ERR("Failed to restore fair_server runtime on CPU %d", i);
+ ret = -1;
+ }
+ }
+
+ return ret;
+}
+
static bool verify_total_bw_consistency(long *bw_values, int count)
{
int i;
@@ -217,6 +309,9 @@ static enum scx_test_status run(void *ctx)
struct bpf_link *link;
long loaded_bw[MAX_CPUS];
long unloaded_bw[MAX_CPUS];
+ long doubled_bw[MAX_CPUS];
+ long original_runtime[MAX_CPUS], doubled_runtime;
+ enum scx_test_status ret;
int i;
/* Test scenario 2: BPF program loaded */
@@ -257,7 +352,111 @@ static enum scx_test_status run(void *ctx)
}
fprintf(stderr, "All total_bw values are consistent across all scenarios\n");
- return SCX_TEST_PASS;
+
+ /*
+ * Validate auto-register/unregister of dl_server bandwidth reservations.
+ *
+ * Doubling fair_server's runtime doubles its bw contribution. With a
+ * full-mode BPF scheduler (minimal_ops), the kernel should detach
+ * fair_server and attach ext_server, dropping total_bw back to its
+ * pre-customization (default ext_server-only) value. On unload, the
+ * fair_server reservation should come back with its customized runtime
+ * preserved, so total_bw doubles again.
+ */
+ if (read_fair_runtime_all(test_ctx->nr_cpus, original_runtime) < 0) {
+ fprintf(stderr, "Skipping attach/detach validation: debugfs not accessible\n");
+ return SCX_TEST_PASS;
+ }
+ doubled_runtime = original_runtime[0] * 2;
+
+ fprintf(stderr,
+ "Setting fair_server runtime to %ld ns on all CPUs (orig %ld)\n",
+ doubled_runtime, original_runtime[0]);
+
+ if (write_fair_runtime_all(test_ctx->nr_cpus, doubled_runtime) < 0) {
+ ret = SCX_TEST_FAIL;
+ goto restore;
+ }
+
+ if (fetch_verify_total_bw(doubled_bw, test_ctx->nr_cpus) < 0) {
+ SCX_ERR("Failed to get stable values after doubling fair runtime");
+ ret = SCX_TEST_FAIL;
+ goto restore;
+ }
+
+ /*
+ * After doubling the runtime, fair_server's bw contribution must grow.
+ * We don't assert exactly 2x, because the kernel's to_ratio() truncates
+ * the value, so 2 * to_ratio(period, runtime) and
+ * to_ratio(period, 2 * runtime) can differ.
+ */
+ for (i = 0; i < test_ctx->nr_cpus; i++) {
+ if (doubled_bw[i] <= test_ctx->baseline_bw[i]) {
+ SCX_ERR("CPU%d: fair did not increase total_bw (baseline=%ld, doubled=%ld)",
+ i, test_ctx->baseline_bw[i], doubled_bw[i]);
+ ret = SCX_TEST_FAIL;
+ goto restore;
+ }
+ }
+
+ link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops);
+ if (!link) {
+ SCX_ERR("Failed to attach scheduler for detach test");
+ ret = SCX_TEST_FAIL;
+ goto restore;
+ }
+
+ if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) {
+ SCX_ERR("Failed to get stable values with BPF loaded (detach test)");
+ bpf_link__destroy(link);
+ ret = SCX_TEST_FAIL;
+ goto restore;
+ }
+
+ /*
+ * In full mode the customized fair_server is detached and ext_server is
+ * attached at its default runtime, so total_bw must match baseline.
+ */
+ for (i = 0; i < test_ctx->nr_cpus; i++) {
+ if (loaded_bw[i] != test_ctx->baseline_bw[i]) {
+ SCX_ERR("CPU%d: expected bw %ld (fair detached, ext default), got %ld",
+ i, test_ctx->baseline_bw[i], loaded_bw[i]);
+ bpf_link__destroy(link);
+ ret = SCX_TEST_FAIL;
+ goto restore;
+ }
+ }
+
+ bpf_link__destroy(link);
+
+ if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) {
+ SCX_ERR("Failed to get stable values after BPF unload (detach test)");
+ ret = SCX_TEST_FAIL;
+ goto restore;
+ }
+
+ /*
+ * After unload, fair_server is re-attached with its preserved 2x
+ * runtime, so total_bw should return to the doubled value.
+ */
+ for (i = 0; i < test_ctx->nr_cpus; i++) {
+ if (unloaded_bw[i] != doubled_bw[i]) {
+ SCX_ERR("CPU%d: BPF unloaded: expected %ld (fair restored at 2x), got %ld",
+ i, doubled_bw[i], unloaded_bw[i]);
+ ret = SCX_TEST_FAIL;
+ goto restore;
+ }
+ }
+
+ fprintf(stderr,
+ "dl_server attach/detach with customized fair runtime verified\n");
+ ret = SCX_TEST_PASS;
+
+restore:
+ if (restore_fair_runtime_all(test_ctx->nr_cpus, original_runtime) < 0)
+ SCX_ERR("Failed to fully restore per-CPU fair_server runtimes");
+
+ return ret;
}
static void cleanup(void *ctx)
--
2.54.0
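The truncation mentioned in the test comment above (2 * to_ratio(period,
runtime) versus to_ratio(period, 2 * runtime)) can be illustrated with a small
userspace model. This is only a sketch: to_ratio_model() and
doubling_is_inexact() are hypothetical names, and the model assumes the kernel
computes (runtime << BW_SHIFT) / period with BW_SHIFT equal to 20, which is not
shown in this patch.

```c
#include <assert.h>
#include <stdint.h>

#define BW_SHIFT_MODEL 20	/* assumption: mirrors the kernel's BW_SHIFT */

/* Hypothetical userspace model of to_ratio(): truncating fixed-point division. */
static uint64_t to_ratio_model(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;
	return (runtime << BW_SHIFT_MODEL) / period;
}

/* Returns 1 if doubling the runtime is not the same as doubling the ratio. */
static int doubling_is_inexact(uint64_t period, uint64_t runtime)
{
	uint64_t twice_ratio = 2 * to_ratio_model(period, runtime);
	uint64_t ratio_of_twice = to_ratio_model(period, 2 * runtime);

	/* Truncation can only lose bandwidth, never add it. */
	assert(twice_ratio <= ratio_of_twice);
	return twice_ratio != ratio_of_twice;
}
```

Under this model, the default 50ms/1000ms reservation gives 52428 for the
original runtime and 104857 for the doubled one, while twice the original ratio
is 104856. That off-by-one is why the test only checks that total_bw increased.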
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test
2026-05-26 16:42 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
@ 2026-05-26 17:33 ` sashiko-bot
0 siblings, 0 replies; 12+ messages in thread
From: sashiko-bot @ 2026-05-26 17:33 UTC (permalink / raw)
To: Andrea Righi; +Cc: sched-ext
Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:
- [Medium] The dl_server runtime and period parameters are parsed and written as signed `long` instead of `u64` or `unsigned long long`, leading to potential integer overflow on 32-bit platforms.
--
commit 6475744500400f1c70cab62bc90b17ee9b9cbbea
Author: Andrea Righi <arighi@nvidia.com>
selftests/sched_ext: Validate dl_server attach/detach in total_bw test
Extend the total_bw selftest to validate the fair/ext dl_server
auto-attach/detach operations by doubling the fair_server's runtime
via debugfs and verifying total_bw changes.
> diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
> index 5b0a619bab86e..2af01cee90cc0 100644
> --- a/tools/testing/selftests/sched_ext/total_bw.c
> +++ b/tools/testing/selftests/sched_ext/total_bw.c
[ ... ]
> @@ -100,6 +100,98 @@ static int read_total_bw_values(long *bw_values, int max_cpus)
> return cpu_count;
> }
>
> +/*
> + * Read a per-CPU dl_server param (runtime or period) from debugfs.
> + * Returns the value in nanoseconds, or -1 on failure.
> + */
> +static long read_server_param(const char *server, const char *param, int cpu)
> +{
> + char path[128];
> + long value = -1;
[Severity: Medium]
Will parsing the dl_server runtime and period parameters as signed long
lead to integer overflows on 32-bit platforms?
In the kernel, dl_server parameters like runtime and period are represented
as u64 values (in nanoseconds). On 32-bit platforms, a signed long is 32
bits, which can only represent up to ~2.14 seconds in nanoseconds.
If a test system configures a fair_server runtime larger than ~1.07 seconds,
doubling it could cause a signed integer overflow. Would it be better to use
unsigned long long or u64 with %llu to avoid this?
> + FILE *fp;
> +
> + snprintf(path, sizeof(path),
> + "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
> + server, cpu, param);
> + fp = fopen(path, "r");
> + if (!fp)
> + return -1;
> + if (fscanf(fp, "%ld", &value) != 1)
> + value = -1;
> + fclose(fp);
> +
> + return value;
> +}
> +
> +/*
> + * Write a per-CPU dl_server param to debugfs. Returns 0 on success.
> + */
> +static int write_server_param(const char *server, const char *param,
> + int cpu, long value)
[Severity: Medium]
Similarly here, does taking long value as a parameter restrict the maximum
runtime that can be written on 32-bit systems?
> +{
> + char path[128];
> + FILE *fp;
> + int ret = 0;
> +
> + snprintf(path, sizeof(path),
> + "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
> + server, cpu, param);
> + fp = fopen(path, "w");
> + if (!fp)
> + return -1;
> + if (fprintf(fp, "%ld", value) < 0)
> + ret = -1;
> + if (fclose(fp) != 0)
> + ret = -1;
> +
> + return ret;
> +}
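The concern raised above can be sketched in userspace as follows. This is a
minimal illustration, not a proposed replacement for the helpers in the patch:
parse_ns() and double_ns() are hypothetical names, and the sketch assumes the
debugfs files contain a plain decimal nanosecond value. It keeps the value in a
64-bit unsigned type and refuses to double it if that would wrap.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a nanosecond value as 64-bit unsigned. */
static int parse_ns(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long v;

	assert(out != NULL);

	while (isspace((unsigned char)*s))
		s++;
	if (*s == '-' || *s == '\0')
		return -1;	/* strtoull() would silently negate "-1" */

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno || end == s)
		return -1;

	*out = v;
	return 0;
}

/* Hypothetical helper: double a value, refusing to wrap around. */
static int double_ns(uint64_t v, uint64_t *out)
{
	assert(out != NULL);

	if (v > UINT64_MAX / 2)
		return -1;
	*out = v * 2;
	return 0;
}
```

With these helpers a runtime of 3000000000 ns (3 seconds), which does not fit
in a 32-bit signed long, parses and doubles without overflow. When printing
such a value, a matching format such as PRIu64 from <inttypes.h> or %llu with
an unsigned long long cast would be needed.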
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260526164420.638711-1-arighi@nvidia.com?part=2
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
2026-05-26 16:42 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
2026-05-26 16:42 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
2026-05-26 16:42 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
@ 2026-05-27 12:36 ` Juri Lelli
2026-05-28 11:33 ` Peter Zijlstra
2026-05-28 15:53 ` Tejun Heo
3 siblings, 1 reply; 12+ messages in thread
From: Juri Lelli @ 2026-05-27 12:36 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
Christian Loehle, Phil Auld, Koba Ko, Joel Fernandes,
Richard Cheng, Cheng-Yang Chou, sched-ext, linux-kernel
Hi Andrea,
On 26/05/26 18:42, Andrea Righi wrote:
> Currently, a fixed bandwidth is reserved at boot for both the fair and ext
> deadline servers, and this reservation remains unchanged unless explicitly
> modified via debugfs. As a result, both servers permanently contribute to global
> bandwidth accounting, regardless of whether a BPF scheduler is active.
>
> While unused bandwidth can still be reclaimed at runtime by other classes, this
> static reservation prevents RT from fully utilizing available headroom in
> situations where one of the sched_ext or fair class is guaranteed to be inactive
> (for example, when no BPF scheduler is loaded, or when sched_ext runs in full
> mode and replaces fair).
>
> As discussed at the VIII OSPM summit in Cambridge [1], a better solution would
> be to dynamically register and unregister deadline server bandwidth based on the
> active sched_ext state. This allows the kernel to automatically enable bandwidth
> accounting only for the scheduling class that is currently active, while
> disabling it for inactive ones.
>
> This patch series implements this automatic register/unregister logic. Moreover,
> the sched_ext total_bw kselftest is also modified to validate the correct
> behavior across the different scheduling configurations and ensure that
> bandwidth accounting follows the expected state transitions.
>
> [1] https://retis.santannapisa.it/ospm-summit/
>
> Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git dl-server-bw-v3
>
> Changes in v3:
> - Don't bypass __dl_overflow() for detached servers in dl_server_apply_params()
> to reject oversized configs up front (reported by Sashiko)
> - A potential divide-by-zero in dl_server_apply_params() reported by Sashiko
> has been fixed in a separate patch (not introduced by this patch set):
> https://lore.kernel.org/all/20260526100502.575774-1-arighi@nvidia.com/
> - Link to v2: https://lore.kernel.org/all/20260526082954.550958-1-arighi@nvidia.com/
This now looks good to me.
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Thanks!
Juri
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
2026-05-27 12:36 ` [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Juri Lelli
@ 2026-05-28 11:33 ` Peter Zijlstra
2026-05-28 16:13 ` Andrea Righi
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2026-05-28 11:33 UTC (permalink / raw)
To: Juri Lelli
Cc: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, K Prateek Nayak, Christian Loehle,
Phil Auld, Koba Ko, Joel Fernandes, Richard Cheng,
Cheng-Yang Chou, sched-ext, linux-kernel
On Wed, May 27, 2026 at 02:36:18PM +0200, Juri Lelli wrote:
> Hi Andrea,
>
> On 26/05/26 18:42, Andrea Righi wrote:
> > Currently, a fixed bandwidth is reserved at boot for both the fair and ext
> > deadline servers, and this reservation remains unchanged unless explicitly
> > modified via debugfs. As a result, both servers permanently contribute to global
> > bandwidth accounting, regardless of whether a BPF scheduler is active.
> >
> > While unused bandwidth can still be reclaimed at runtime by other classes, this
> > static reservation prevents RT from fully utilizing available headroom in
> > situations where one of the sched_ext or fair class is guaranteed to be inactive
> > (for example, when no BPF scheduler is loaded, or when sched_ext runs in full
> > mode and replaces fair).
> >
> > As discussed at the VIII OSPM summit in Cambridge [1], a better solution would
> > be to dynamically register and unregister deadline server bandwidth based on the
> > active sched_ext state. This allows the kernel to automatically enable bandwidth
> > accounting only for the scheduling class that is currently active, while
> > disabling it for inactive ones.
> >
> > This patch series implements this automatic register/unregister logic. Moreover,
> > the sched_ext total_bw kselftest is also modified to validate the correct
> > behavior across the different scheduling configurations and ensure that
> > bandwidth accounting follows the expected state transitions.
> >
> > [1] https://retis.santannapisa.it/ospm-summit/
> >
> > Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git dl-server-bw-v3
> >
> > Changes in v3:
> > - Don't bypass __dl_overflow() for detached servers in dl_server_apply_params()
> > to reject oversized configs up front (reported by Sashiko)
> > - A potential divide-by-zero in dl_server_apply_params() reported by Sashiko
> > has been fixed in a separate patch (not introduced by this patch set):
> > https://lore.kernel.org/all/20260526100502.575774-1-arighi@nvidia.com/
> > - Link to v2: https://lore.kernel.org/all/20260526082954.550958-1-arighi@nvidia.com/
>
> This now looks good to me.
>
> Acked-by: Juri Lelli <juri.lelli@redhat.com>
Thanks! I've stuck them in queue:sched/core for the robots to chew on.
There was an absolutely trivial reject in ext.c that I fixed up, so
something moved around there.
There is one little nit, but I'll reply there and that can easily be
done on top if we decide it's worth it.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
2026-05-28 11:33 ` Peter Zijlstra
@ 2026-05-28 16:13 ` Andrea Righi
0 siblings, 0 replies; 12+ messages in thread
From: Andrea Righi @ 2026-05-28 16:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Juri Lelli, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, K Prateek Nayak, Christian Loehle,
Phil Auld, Koba Ko, Joel Fernandes, Richard Cheng,
Cheng-Yang Chou, sched-ext, linux-kernel
Hi Peter,
On Thu, May 28, 2026 at 01:33:17PM +0200, Peter Zijlstra wrote:
> On Wed, May 27, 2026 at 02:36:18PM +0200, Juri Lelli wrote:
> > Hi Andrea,
> >
> > On 26/05/26 18:42, Andrea Righi wrote:
> > > Currently, a fixed bandwidth is reserved at boot for both the fair and ext
> > > deadline servers, and this reservation remains unchanged unless explicitly
> > > modified via debugfs. As a result, both servers permanently contribute to global
> > > bandwidth accounting, regardless of whether a BPF scheduler is active.
> > >
> > > While unused bandwidth can still be reclaimed at runtime by other classes, this
> > > static reservation prevents RT from fully utilizing available headroom in
> > > situations where one of the sched_ext or fair class is guaranteed to be inactive
> > > (for example, when no BPF scheduler is loaded, or when sched_ext runs in full
> > > mode and replaces fair).
> > >
> > > As discussed at the VIII OSPM summit in Cambridge [1], a better solution would
> > > be to dynamically register and unregister deadline server bandwidth based on the
> > > active sched_ext state. This allows the kernel to automatically enable bandwidth
> > > accounting only for the scheduling class that is currently active, while
> > > disabling it for inactive ones.
> > >
> > > This patch series implements this automatic register/unregister logic. Moreover,
> > > the sched_ext total_bw kselftest is also modified to validate the correct
> > > behavior across the different scheduling configurations and ensure that
> > > bandwidth accounting follows the expected state transitions.
> > >
> > > [1] https://retis.santannapisa.it/ospm-summit/
> > >
> > > Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git dl-server-bw-v3
> > >
> > > Changes in v3:
> > > - Don't bypass __dl_overflow() for detached servers in dl_server_apply_params()
> > > to reject oversized configs up front (reported by Sashiko)
> > > - A potential divide-by-zero in dl_server_apply_params() reported by Sashiko
> > > has been fixed in a separate patch (not introduced by this patch set):
> > > https://lore.kernel.org/all/20260526100502.575774-1-arighi@nvidia.com/
> > > - Link to v2: https://lore.kernel.org/all/20260526082954.550958-1-arighi@nvidia.com/
> >
> > This now looks good to me.
> >
> > Acked-by: Juri Lelli <juri.lelli@redhat.com>
>
> Thanks! I've stuck them in queue:sched/core for the robots to chew on.
> There was an absolutely trivial reject in ext.c that I fixed up, so
> something moved around there.
FYI, I re-ran all my tests with queue:sched/core, everything looks good on my
side.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
2026-05-26 16:42 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
` (2 preceding siblings ...)
2026-05-27 12:36 ` [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Juri Lelli
@ 2026-05-28 15:53 ` Tejun Heo
2026-05-29 9:08 ` Peter Zijlstra
3 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2026-05-28 15:53 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
Christian Loehle, Phil Auld, Koba Ko, Joel Fernandes,
Richard Cheng, Cheng-Yang Chou, sched-ext, linux-kernel
Hello,
Peter, how do you want to route the patches? I'd be happy to take them
through sched_ext/for-7.2.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
2026-05-28 15:53 ` Tejun Heo
@ 2026-05-29 9:08 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2026-05-29 9:08 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrea Righi, David Vernet, Changwoo Min, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, K Prateek Nayak, Christian Loehle,
Phil Auld, Koba Ko, Joel Fernandes, Richard Cheng,
Cheng-Yang Chou, sched-ext, linux-kernel
On Thu, May 28, 2026 at 05:53:04AM -1000, Tejun Heo wrote:
> Hello,
>
> Peter, how do you want to route the patches? I'd be happy to take them
> through sched_ext/for-7.2.
I have them en-route to tip:sched/core.
^ permalink raw reply [flat|nested] 12+ messages in thread