Sched_ext development
* [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
@ 2026-05-26 16:42 Andrea Righi
  2026-05-26 16:42 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Andrea Righi @ 2026-05-26 16:42 UTC
  To: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Joel Fernandes, Richard Cheng, Cheng-Yang Chou,
	sched-ext, linux-kernel

Currently, a fixed bandwidth is reserved at boot for both the fair and ext
deadline servers, and this reservation remains unchanged unless explicitly
modified via debugfs. As a result, both servers permanently contribute to global
bandwidth accounting, regardless of whether a BPF scheduler is active.

While unused bandwidth can still be reclaimed at runtime by other classes, this
static reservation prevents RT from fully utilizing available headroom in
situations where either the sched_ext or the fair class is guaranteed to be
inactive (for example, when no BPF scheduler is loaded, or when sched_ext runs
in full mode and replaces fair).

As discussed at the VIII OSPM summit in Cambridge [1], a better solution would
be to dynamically register and unregister deadline server bandwidth based on the
active sched_ext state. This allows the kernel to automatically enable bandwidth
accounting only for the scheduling class that is currently active, while
disabling it for inactive ones.

This patch series implements this automatic register/unregister logic. It also
extends the sched_ext total_bw kselftest to validate the behavior across the
different scheduling configurations and to check that bandwidth accounting
follows the expected state transitions.

[1] https://retis.santannapisa.it/ospm-summit/

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git dl-server-bw-v3

Changes in v3:
 - Don't bypass __dl_overflow() for detached servers in dl_server_apply_params(),
   so that oversized configs are rejected up front (reported by Sashiko)
 - A potential divide-by-zero in dl_server_apply_params() reported by Sashiko
   has been fixed in a separate patch (not introduced by this patch set):
   https://lore.kernel.org/all/20260526100502.575774-1-arighi@nvidia.com/
 - Link to v2: https://lore.kernel.org/all/20260526082954.550958-1-arighi@nvidia.com/

Changes in v2:
 - Rework the sched_ext enable path as suggested by Peter: attach ext_server
   before committing the scheduler switch and fail the enable if admission
   control rejects the reservation; detach fair_server only after a successful
   full-mode switch.
 - Added dl_server_swap_bw() for the disable/recovery path so ext_server detach
   and fair_server reattach happen under the same dl_b->lock, closing the
   window where concurrent SCHED_DEADLINE admission could steal the freed
   bandwidth (reported by Sashiko).
 - Fixed the attach/detach accounting issue reported by Sashiko by updating
   rq->dl.this_bw together with root-domain total_bw, draining active or
   non-contending servers before detach and preventing detached servers from
   starting.
 - Reuse dl_rq_change_utilization() to drain the server, so the detach path goes
   through the same machinery as dl_server_apply_params()
 - Made root-domain accounting honor the same cpu_active() conditions used by
   root-domain rebuilds, while preserving runtime/period updates made while a
   server is detached.
 - Fixed the total_bw selftest issues reported by Sashiko: check fclose()
   errors for debugfs writes, preserve per-CPU fair_server runtime values, and
   restore all CPUs on cleanup even if one write fails.
 - Link to v1: https://lore.kernel.org/all/20260521174509.1534623-1-arighi@nvidia.com/

Andrea Righi (2):
      sched_ext: Auto-register/unregister dl_server reservations
      selftests/sched_ext: Validate dl_server attach/detach in total_bw test

 include/linux/sched.h                        |   6 +
 kernel/sched/deadline.c                      | 204 ++++++++++++++++++++++++++-
 kernel/sched/ext.c                           |  71 ++++++++++
 kernel/sched/sched.h                         |   4 +
 tools/testing/selftests/sched_ext/total_bw.c | 201 +++++++++++++++++++++++++-
 5 files changed, 478 insertions(+), 8 deletions(-)


* [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations
  2026-05-26 16:42 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
@ 2026-05-26 16:42 ` Andrea Righi
  2026-05-26 17:14   ` sashiko-bot
  2026-05-28 11:36   ` Peter Zijlstra
  2026-05-26 16:42 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 12+ messages in thread
From: Andrea Righi @ 2026-05-26 16:42 UTC
  To: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Joel Fernandes, Richard Cheng, Cheng-Yang Chou,
	sched-ext, linux-kernel

Commit cd959a3562050d ("sched_ext: Add a DL server for sched_ext tasks")
introduced an ext_server deadline server to protect sched_ext tasks from
fair/RT starvation, mirroring the existing fair_server.

Currently, both servers reserve their 50ms/1000ms bandwidth at boot,
regardless of whether a BPF scheduler is loaded. Unused bandwidth is
still reclaimed at runtime by other classes, but the static reservation
prevents the RT class from implicitly using that headroom when one of
the two classes is guaranteed to be empty.

A sysadmin can work around this by writing
/sys/kernel/debug/sched/{fair,ext}_server/cpu*/runtime, but that
requires manual action and not all systems expose debugfs.

A better approach is to make server bandwidth reservations dynamic: only
the scheduling class that is currently active should register its
reservation, while the inactive one should not artificially hold
capacity. Both reservations are kept only when the BPF scheduler is
running in partial mode:

 +---------------------------------------------+-------------+------------+
 | BPF scheduler state                         | fair server | ext server |
 +---------------------------------------------+-------------+------------+
 | not loaded (default boot)                   | reserved    | none       |
 | loaded full mode (!SCX_OPS_SWITCH_PARTIAL)  | none        | reserved   |
 | loaded partial mode (SCX_OPS_SWITCH_PARTIAL)| reserved    | reserved   |
 +---------------------------------------------+-------------+------------+
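
As a reading aid, the table above can be expressed as a small decision
function. This is a minimal user-space sketch, not code from the patch:
struct dl_server_policy and dl_server_policy_for() are hypothetical names
introduced here for illustration only.

```c
#include <stdbool.h>

/* Hypothetical type for illustration; not a kernel structure. */
struct dl_server_policy {
	bool fair_reserved;
	bool ext_reserved;
};

/*
 * Map the BPF scheduler state to the reservations listed in the table.
 * @loaded:  a BPF scheduler is currently enabled
 * @partial: it was loaded with SCX_OPS_SWITCH_PARTIAL
 */
static struct dl_server_policy dl_server_policy_for(bool loaded, bool partial)
{
	struct dl_server_policy p;

	/* fair keeps its reservation unless sched_ext fully replaced it */
	p.fair_reserved = !loaded || partial;
	/* ext only reserves bandwidth while a BPF scheduler is loaded */
	p.ext_reserved = loaded;

	return p;
}
```

With this reading, at least one of the two servers is reserved in every
state, and both are reserved only in partial mode.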

To achieve this, introduce an "attached/detached" state for each
deadline server, so the kernel can decide whether a server's bandwidth
should be accounted in global bandwidth tracking.

At boot, the system starts with only the fair server contributing to
bandwidth accounting. When a BPF scheduler is enabled, the ext server is
attached and may replace or complement the fair server depending on
whether full or partial mode is used. When sched_ext is disabled, the
system restores the previous deadline bandwidth values and behavior.

The transition logic ensures that switching between scheduling modes is
consistent and reversible, without losing runtime configuration or
requiring manual intervention.
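
The attached/detached idea can be illustrated with a toy user-space model.
This is only a sketch under simplifying assumptions: it ignores locking,
per-CPU accounting, cpu_active() handling and server draining, and every
name in it (toy_domain, toy_server, toy_attach(), toy_detach()) is
hypothetical rather than taken from the kernel.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the kernel structures; all names are hypothetical. */
struct toy_domain {
	uint64_t cap_bw;	/* admission limit for the domain */
	uint64_t total_bw;	/* sum of the attached reservations */
};

struct toy_server {
	uint64_t bw;		/* configured bandwidth, kept while detached */
	bool attached;		/* does bw currently count in total_bw? */
};

/* Account the server's bandwidth, or fail if the domain would overflow. */
static int toy_attach(struct toy_domain *d, struct toy_server *s)
{
	if (s->attached)
		return 0;
	if (d->total_bw + s->bw > d->cap_bw)
		return -EBUSY;
	d->total_bw += s->bw;
	s->attached = true;
	return 0;
}

/* Stop accounting the server's bandwidth; its configuration is untouched. */
static void toy_detach(struct toy_domain *d, struct toy_server *s)
{
	if (!s->attached)
		return;
	d->total_bw -= s->bw;
	s->attached = false;
}
```

In this model a detach followed by an attach brings total_bw back to its
earlier value, because the configured bw is never cleared. That is the
sense in which the transitions are reversible.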

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 include/linux/sched.h   |   6 ++
 kernel/sched/deadline.c | 204 ++++++++++++++++++++++++++++++++++++++--
 kernel/sched/ext.c      |  71 ++++++++++++++
 kernel/sched/sched.h    |   4 +
 4 files changed, 278 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee06cba5c6f53..7acceb80628b0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -702,6 +702,11 @@ struct sched_dl_entity {
 	 * running, skipping the defer phase.
 	 *
 	 * @dl_defer_idle tracks idle state
+	 *
+	 * @dl_bw_attached tells if this server's bandwidth currently
+	 * contributes to the root domain's total_bw. Only meaningful for server
+	 * entities (@dl_server == 1). Allows toggling the reservation on/off
+	 * without losing the configured @dl_runtime/@dl_period.
 	 */
 	unsigned int			dl_throttled      : 1;
 	unsigned int			dl_yielded        : 1;
@@ -713,6 +718,7 @@ struct sched_dl_entity {
 	unsigned int			dl_defer_armed	  : 1;
 	unsigned int			dl_defer_running  : 1;
 	unsigned int			dl_defer_idle     : 1;
+	unsigned int			dl_bw_attached    : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7db4c87df83b0..5672f9c583982 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1793,7 +1793,8 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 	struct rq *rq = dl_se->rq;
 
 	dl_se->dl_defer_idle = 0;
-	if (!dl_server(dl_se) || dl_se->dl_server_active || !dl_se->dl_runtime)
+	if (!dl_server(dl_se) || dl_se->dl_server_active || !dl_se->dl_runtime ||
+	    !dl_se->dl_bw_attached)
 		return;
 
 	/*
@@ -1868,6 +1869,13 @@ void sched_init_dl_servers(void)
 		dl_se->dl_server = 1;
 		dl_se->dl_defer = 1;
 		setup_new_dl_entity(dl_se);
+
+		/*
+		 * No BPF scheduler is loaded at boot, so the ext_server has no
+		 * tasks to protect. Detach its bandwidth reservation, it will
+		 * be attached when a BPF scheduler is loaded.
+		 */
+		dl_server_detach_bw(dl_se);
 #endif
 	}
 }
@@ -1878,6 +1886,9 @@ void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
 	int cpu = cpu_of(rq);
 	struct dl_bw *dl_b;
 
+	if (!dl_se->dl_bw_attached)
+		return;
+
 	dl_b = dl_bw_of(cpu_of(rq));
 	guard(raw_spinlock)(&dl_b->lock);
 
@@ -1889,7 +1900,8 @@ void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
 
 int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
 {
-	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+	u64 old_bw = (init || !dl_se->dl_bw_attached) ? 0 :
+		     to_ratio(dl_se->dl_period, dl_se->dl_runtime);
 	u64 new_bw = to_ratio(period, runtime);
 	struct rq *rq = dl_se->rq;
 	int cpu = cpu_of(rq);
@@ -1909,7 +1921,8 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 	if (init) {
 		__add_rq_bw(new_bw, &rq->dl);
 		__dl_add(dl_b, new_bw, cpus);
-	} else {
+		dl_se->dl_bw_attached = 1;
+	} else if (dl_se->dl_bw_attached) {
 		__dl_sub(dl_b, dl_se->dl_bw, cpus);
 		__dl_add(dl_b, new_bw, cpus);
 
@@ -1929,6 +1942,181 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 	return 0;
 }
 
+/*
+ * Add @dl_se's bw to the root-domain accounting.
+ *
+ * Return -EBUSY if attaching would overflow root domain capacity.
+ */
+static int __dl_server_attach_bw_locked(struct sched_dl_entity *dl_se,
+					struct dl_bw *dl_b, int cpus)
+{
+	struct rq *rq = dl_se->rq;
+	unsigned long cap;
+
+	/*
+	 * Always update @rq->dl.this_bw, but only update @dl_b->total_bw
+	 * (and run the overflow check it gates) while this CPU is active.
+	 *
+	 * This mirrors dl_server_add_bw() during root-domain rebuilds, which
+	 * only publishes bandwidth from active CPUs into @dl_b.
+	 */
+	if (cpu_active(cpu_of(rq))) {
+		cap = dl_bw_capacity(cpu_of(rq));
+		if (__dl_overflow(dl_b, cap, 0, dl_se->dl_bw))
+			return -EBUSY;
+		__dl_add(dl_b, dl_se->dl_bw, cpus);
+	}
+	__add_rq_bw(dl_se->dl_bw, &rq->dl);
+	dl_se->dl_bw_attached = 1;
+
+	return 0;
+}
+
+/*
+ * Drain @dl_se and remove its bw from the root-domain accounting.
+ */
+static void __dl_server_detach_bw_locked(struct sched_dl_entity *dl_se,
+					 struct dl_bw *dl_b, int cpus)
+{
+	struct rq *rq = dl_se->rq;
+
+	/*
+	 * If the server is still active (on_rq), dequeue it via
+	 * dl_server_stop(); task_non_contending() will either subtract
+	 * @dl_bw from running_bw immediately (0-lag passed) or set
+	 * dl_non_contending and arm the inactive_timer.
+	 */
+	if (dl_se->dl_server_active)
+		dl_server_stop(dl_se);
+
+	/*
+	 * Drop @dl_se's contribution from this rq's bandwidth accounting,
+	 * mirroring the __add_rq_bw() done at attach time.
+	 */
+	dl_rq_change_utilization(rq, dl_se, 0);
+
+	/*
+	 * Update @dl_b only while this CPU is active, matching
+	 * dl_server_add_bw() during root-domain rebuilds.
+	 *
+	 * If this CPU is inactive, its bandwidth is not currently accounted in
+	 * @dl_b->total_bw: either attach skipped adding it, or a rebuild
+	 * already dropped it while re-publishing active CPUs only.
+	 *
+	 * In that case there is nothing to subtract from @dl_b. Just clear
+	 * @dl_se->dl_bw_attached; if the CPU becomes active again, the next
+	 * rebuild will re-publish its bandwidth.
+	 */
+	if (cpu_active(cpu_of(rq)))
+		__dl_sub(dl_b, dl_se->dl_bw, cpus);
+	dl_se->dl_bw_attached = 0;
+}
+
+/*
+ * Attach @dl_se's bandwidth to the root domain's total_bw accounting.
+ *
+ * Use to dynamically register a dl_server's bandwidth reservation while
+ * preserving its configured @dl_runtime / @dl_period. No-op if @dl_se is
+ * already attached.
+ *
+ * Returns -EBUSY if attaching would overflow the root domain capacity.
+ */
+int dl_server_attach_bw(struct sched_dl_entity *dl_se)
+{
+	struct rq *rq = dl_se->rq;
+	int cpu = cpu_of(rq);
+	struct dl_bw *dl_b;
+	int cpus, ret;
+
+	if (dl_se->dl_bw_attached)
+		return 0;
+
+	scoped_guard (raw_spinlock, &dl_bw_of(cpu)->lock) {
+		dl_b = dl_bw_of(cpu);
+		cpus = dl_bw_cpus(cpu);
+		ret = __dl_server_attach_bw_locked(dl_se, dl_b, cpus);
+	}
+	if (ret)
+		return ret;
+
+	/*
+	 * The natural 0->nr_running transition that triggers dl_server_start()
+	 * may have happened while @dl_se was still detached (e.g., between
+	 * scx_bypass(false) and the scx_enable() re-balance loop), so kick a
+	 * start here.
+	 *
+	 * dl_server_start() bails out cleanly if there's nothing to schedule or
+	 * it's already active. Skip if @cpu is offline; the server will be
+	 * started naturally on the first enqueue once @cpu comes back.
+	 */
+	if (cpu_online(cpu))
+		dl_server_start(dl_se);
+
+	return 0;
+}
+
+/*
+ * Detach @dl_se's bandwidth from the root domain's total_bw accounting.
+ *
+ * Use to dynamically unregister a dl_server's bandwidth reservation while
+ * preserving its configured @dl_runtime / @dl_period. No-op if @dl_se is
+ * not currently attached.
+ */
+void dl_server_detach_bw(struct sched_dl_entity *dl_se)
+{
+	int cpu = cpu_of(dl_se->rq);
+	struct dl_bw *dl_b;
+	int cpus;
+
+	if (!dl_se->dl_bw_attached)
+		return;
+
+	dl_b = dl_bw_of(cpu);
+	guard(raw_spinlock)(&dl_b->lock);
+	cpus = dl_bw_cpus(cpu);
+	__dl_server_detach_bw_locked(dl_se, dl_b, cpus);
+}
+
+/*
+ * Atomically detach @detach_se and attach @attach_se on the same rq, holding
+ * @dl_b->lock across both operations so a concurrent sched_setattr() cannot
+ * steal the bandwidth freed by the detach before the attach can claim it.
+ *
+ * Both entities must live on the same rq (same root domain). Returns the
+ * result of the attach: -EBUSY if attaching @attach_se would overflow root
+ * domain capacity (in which case both servers end up detached).
+ */
+int dl_server_swap_bw(struct sched_dl_entity *detach_se,
+		      struct sched_dl_entity *attach_se)
+{
+	struct rq *rq = detach_se->rq;
+	int cpu = cpu_of(rq);
+	struct dl_bw *dl_b;
+	int cpus, ret;
+
+	WARN_ON_ONCE(attach_se->rq != rq);
+
+	scoped_guard (raw_spinlock, &dl_bw_of(cpu)->lock) {
+		dl_b = dl_bw_of(cpu);
+		cpus = dl_bw_cpus(cpu);
+
+		if (detach_se->dl_bw_attached)
+			__dl_server_detach_bw_locked(detach_se, dl_b, cpus);
+
+		if (attach_se->dl_bw_attached)
+			ret = 0;
+		else
+			ret = __dl_server_attach_bw_locked(attach_se, dl_b, cpus);
+	}
+	if (ret)
+		return ret;
+
+	if (cpu_online(cpu))
+		dl_server_start(attach_se);
+
+	return 0;
+}
+
 /*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
@@ -3236,12 +3424,12 @@ static void dl_server_add_bw(struct root_domain *rd, int cpu)
 	struct sched_dl_entity *dl_se;
 
 	dl_se = &cpu_rq(cpu)->fair_server;
-	if (dl_server(dl_se) && cpu_active(cpu))
+	if (dl_server(dl_se) && dl_se->dl_bw_attached && cpu_active(cpu))
 		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
 
 #ifdef CONFIG_SCHED_CLASS_EXT
 	dl_se = &cpu_rq(cpu)->ext_server;
-	if (dl_server(dl_se) && cpu_active(cpu))
+	if (dl_server(dl_se) && dl_se->dl_bw_attached && cpu_active(cpu))
 		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
 #endif
 }
@@ -3250,11 +3438,13 @@ static u64 dl_server_read_bw(int cpu)
 {
 	u64 dl_bw = 0;
 
-	if (cpu_rq(cpu)->fair_server.dl_server)
+	if (cpu_rq(cpu)->fair_server.dl_server &&
+	    cpu_rq(cpu)->fair_server.dl_bw_attached)
 		dl_bw += cpu_rq(cpu)->fair_server.dl_bw;
 
 #ifdef CONFIG_SCHED_CLASS_EXT
-	if (cpu_rq(cpu)->ext_server.dl_server)
+	if (cpu_rq(cpu)->ext_server.dl_server &&
+	    cpu_rq(cpu)->ext_server.dl_bw_attached)
 		dl_bw += cpu_rq(cpu)->ext_server.dl_bw;
 #endif
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 83272acf17637..2330657bd66f3 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6112,6 +6112,7 @@ static void scx_root_disable(struct scx_sched *sch)
 {
 	struct scx_task_iter sti;
 	struct task_struct *p;
+	bool was_switched_all;
 	int cpu;
 
 	/* guarantee forward progress and wait for descendants to be disabled */
@@ -6138,6 +6139,8 @@ static void scx_root_disable(struct scx_sched *sch)
 	 */
 	mutex_lock(&scx_enable_mutex);
 
+	was_switched_all = scx_switched_all();
+
 	static_branch_disable(&__scx_switched_all);
 	WRITE_ONCE(scx_switching_all, false);
 
@@ -6187,10 +6190,34 @@ static void scx_root_disable(struct scx_sched *sch)
 	/*
 	 * Invalidate all the rq clocks to prevent getting outdated
 	 * rq clocks from a previous scx scheduler.
+	 *
+	 * Also re-balance the dl_server bandwidth reservations: detach
+	 * ext_server (no more sched_ext tasks) and reinstate fair_server if it
+	 * was previously detached because we were running in full mode.
+	 *
+	 * Unlike the enable path, this runs on a recovery path that cannot
+	 * fail, so we use dl_server_swap_bw() to atomically free ext_server's
+	 * bandwidth and reclaim it for fair_server under the same dl_b lock.
+	 *
+	 * The swap can still fail with -EBUSY if someone bumped ext_server's
+	 * runtime via debugfs between enable and disable; in that narrow case
+	 * both servers end up detached and we just WARN.
 	 */
 	for_each_possible_cpu(cpu) {
 		struct rq *rq = cpu_rq(cpu);
+
 		scx_rq_clock_invalidate(rq);
+
+		scoped_guard(rq_lock_irqsave, rq) {
+			update_rq_clock(rq);
+			if (was_switched_all) {
+				if (WARN_ON_ONCE(dl_server_swap_bw(&rq->ext_server,
+								   &rq->fair_server)))
+					pr_warn("failed to re-attach fair_server on CPU %d\n", cpu);
+			} else {
+				dl_server_detach_bw(&rq->ext_server);
+			}
+		}
 	}
 
 	/* no task is on scx, turn off all the switches and flush in-progress calls */
@@ -7233,6 +7260,31 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 	if (ret)
 		goto err_disable;
 
+	/*
+	 * Attach the ext_server bandwidth reservation before anything is
+	 * committed so that we can fail the enable if the root domain cannot
+	 * accommodate it. The matching fair_server detach is deferred to the
+	 * tail of this function, after the switch is fully committed and can no
+	 * longer fail.
+	 *
+	 * On failure, err_disable funnels into scx_root_disable() which
+	 * detaches ext_server, so partially-attached state is cleaned up
+	 * automatically.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct rq *rq = cpu_rq(cpu);
+
+		scoped_guard(rq_lock_irqsave, rq) {
+			update_rq_clock(rq);
+			ret = dl_server_attach_bw(&rq->ext_server);
+		}
+		if (ret) {
+			pr_warn("sched_ext: failed to attach ext_server on CPU %d (%d)\n",
+				cpu, ret);
+			goto err_disable;
+		}
+	}
+
 	/*
 	 * Once __scx_enabled is set, %current can be switched to SCX anytime.
 	 * This can lead to stalls as some BPF schedulers (e.g. userspace
@@ -7387,6 +7439,25 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 	if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL))
 		static_branch_enable(&__scx_switched_all);
 
+	/*
+	 * Detach the fair_server bandwidth reservation now that the switch
+	 * is fully committed. In full mode (!SCX_OPS_SWITCH_PARTIAL) no
+	 * task will ever run in the fair class, so give that bandwidth
+	 * back to the RT class. The matching ext_server attach already
+	 * happened earlier; this only releases bandwidth and cannot fail.
+	 *
+	 * In partial mode keep fair_server attached.
+	 */
+	if (scx_switched_all()) {
+		for_each_possible_cpu(cpu) {
+			struct rq *rq = cpu_rq(cpu);
+
+			guard(rq_lock_irqsave)(rq);
+			update_rq_clock(rq);
+			dl_server_detach_bw(&rq->fair_server);
+		}
+	}
+
 	pr_info("sched_ext: BPF scheduler \"%s\" enabled%s\n",
 		sch->ops.name, scx_switched_all() ? "" : " (partial)");
 	kobject_uevent(&sch->kobj, KOBJ_ADD);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d1..6d7eae73a6c00 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -421,6 +421,10 @@ extern void ext_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
+extern int dl_server_attach_bw(struct sched_dl_entity *dl_se);
+extern void dl_server_detach_bw(struct sched_dl_entity *dl_se);
+extern int dl_server_swap_bw(struct sched_dl_entity *detach_se,
+			     struct sched_dl_entity *attach_se);
 
 static inline bool dl_server_active(struct sched_dl_entity *dl_se)
 {
-- 
2.54.0



* [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test
  2026-05-26 16:42 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
  2026-05-26 16:42 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
@ 2026-05-26 16:42 ` Andrea Righi
  2026-05-26 17:33   ` sashiko-bot
  2026-05-27 12:36 ` [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Juri Lelli
  2026-05-28 15:53 ` Tejun Heo
  3 siblings, 1 reply; 12+ messages in thread
From: Andrea Righi @ 2026-05-26 16:42 UTC
  To: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Joel Fernandes, Richard Cheng, Cheng-Yang Chou,
	sched-ext, linux-kernel

Extend the total_bw selftest to validate the fair/ext dl_server
auto-attach/detach operations.

After the existing consistency checks, the test now doubles the
fair_server's runtime on every CPU via debugfs and verifies that:
 1. total_bw grew after the customization (proves fair_server was
    attached and apply_params() honored the dl_bw_attached flag),
 2. with the minimal BPF scheduler loaded, total_bw drops back to the
    baseline value (proves fair_server was detached and ext_server was
    attached at its own default runtime),
 3. after unload, total_bw matches the doubled value from step 1 (proves
    fair_server was re-attached with the runtime customization preserved
    across the load/unload cycle).
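
Step 1 checks only that total_bw grew, not that it exactly doubled. The
sketch below shows why, assuming the bandwidth ratio is computed as a
truncating fixed-point division with a 20-bit shift. to_ratio_sketch() is
a hypothetical user-space approximation (it omits the kernel's handling of
an unlimited runtime), and the values correspond to the default
50ms/1000ms reservation.

```c
#include <stdint.h>

#define BW_SHIFT 20	/* fixed-point shift, assumed to match the kernel's */

/* User-space approximation of to_ratio(); hypothetical helper name. */
static uint64_t to_ratio_sketch(uint64_t period_ns, uint64_t runtime_ns)
{
	if (period_ns == 0)
		return 0;
	/* Integer division truncates the fractional part of the ratio. */
	return (runtime_ns << BW_SHIFT) / period_ns;
}
```

With a 1000ms period, a 50ms runtime gives 52428 (52428.8 truncated) and a
100ms runtime gives 104857 (104857.6 truncated), whereas twice the first
value is 104856. An exact 2x comparison would therefore fail by one unit.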

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 tools/testing/selftests/sched_ext/total_bw.c | 201 ++++++++++++++++++-
 1 file changed, 200 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
index 5b0a619bab86e..2af01cee90cc0 100644
--- a/tools/testing/selftests/sched_ext/total_bw.c
+++ b/tools/testing/selftests/sched_ext/total_bw.c
@@ -100,6 +100,98 @@ static int read_total_bw_values(long *bw_values, int max_cpus)
 	return cpu_count;
 }
 
+/*
+ * Read a per-CPU dl_server param (runtime or period) from debugfs.
+ * Returns the value in nanoseconds, or -1 on failure.
+ */
+static long read_server_param(const char *server, const char *param, int cpu)
+{
+	char path[128];
+	long value = -1;
+	FILE *fp;
+
+	snprintf(path, sizeof(path),
+		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
+		 server, cpu, param);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -1;
+	if (fscanf(fp, "%ld", &value) != 1)
+		value = -1;
+	fclose(fp);
+
+	return value;
+}
+
+/*
+ * Write a per-CPU dl_server param to debugfs. Returns 0 on success.
+ */
+static int write_server_param(const char *server, const char *param,
+			      int cpu, long value)
+{
+	char path[128];
+	FILE *fp;
+	int ret = 0;
+
+	snprintf(path, sizeof(path),
+		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
+		 server, cpu, param);
+	fp = fopen(path, "w");
+	if (!fp)
+		return -1;
+	if (fprintf(fp, "%ld", value) < 0)
+		ret = -1;
+	if (fclose(fp) != 0)
+		ret = -1;
+
+	return ret;
+}
+
+static int read_fair_runtime_all(int nr_cpus, long *runtimes)
+{
+	int i;
+
+	for (i = 0; i < nr_cpus; i++) {
+		runtimes[i] = read_server_param("fair", "runtime", i);
+		if (runtimes[i] <= 0)
+			return -1;
+	}
+
+	return 0;
+}
+
+static int write_fair_runtime_all(int nr_cpus, long value)
+{
+	int i;
+
+	for (i = 0; i < nr_cpus; i++) {
+		if (write_server_param("fair", "runtime", i, value) < 0) {
+			SCX_ERR("Failed to write fair_server runtime on CPU %d", i);
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Restore per-CPU fair_server runtimes.
+ */
+static int restore_fair_runtime_all(int nr_cpus, const long *runtimes)
+{
+	int ret = 0;
+	int i;
+
+	for (i = 0; i < nr_cpus; i++) {
+		if (write_server_param("fair", "runtime", i, runtimes[i]) < 0) {
+			SCX_ERR("Failed to restore fair_server runtime on CPU %d", i);
+			ret = -1;
+		}
+	}
+
+	return ret;
+}
+
 static bool verify_total_bw_consistency(long *bw_values, int count)
 {
 	int i;
@@ -217,6 +309,9 @@ static enum scx_test_status run(void *ctx)
 	struct bpf_link *link;
 	long loaded_bw[MAX_CPUS];
 	long unloaded_bw[MAX_CPUS];
+	long doubled_bw[MAX_CPUS];
+	long original_runtime[MAX_CPUS], doubled_runtime;
+	enum scx_test_status ret;
 	int i;
 
 	/* Test scenario 2: BPF program loaded */
@@ -257,7 +352,111 @@ static enum scx_test_status run(void *ctx)
 	}
 
 	fprintf(stderr, "All total_bw values are consistent across all scenarios\n");
-	return SCX_TEST_PASS;
+
+	/*
+	 * Validate auto-register/unregister of dl_server bandwidth reservations.
+	 *
+	 * Doubling fair_server's runtime doubles its bw contribution. With a
+	 * full-mode BPF scheduler (minimal_ops), the kernel should detach
+	 * fair_server and attach ext_server, dropping total_bw back to its
+	 * pre-customization (default ext_server-only) value. On unload, the
+	 * fair_server reservation should come back with its customized runtime
+	 * preserved, so total_bw doubles again.
+	 */
+	if (read_fair_runtime_all(test_ctx->nr_cpus, original_runtime) < 0) {
+		fprintf(stderr, "Skipping attach/detach validation: debugfs not accessible\n");
+		return SCX_TEST_PASS;
+	}
+	doubled_runtime = original_runtime[0] * 2;
+
+	fprintf(stderr,
+		"Setting fair_server runtime to %ld ns on all CPUs (orig %ld)\n",
+		doubled_runtime, original_runtime[0]);
+
+	if (write_fair_runtime_all(test_ctx->nr_cpus, doubled_runtime) < 0) {
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	if (fetch_verify_total_bw(doubled_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values after doubling fair runtime");
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	/*
+	 * After doubling the runtime, fair_server's bw contribution must grow.
+	 * We don't assert exactly 2x, because the kernel's to_ratio() truncates
+	 * the value, so 2 * to_ratio(period, runtime) and
+	 * to_ratio(period, 2 * runtime) can differ.
+	 */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (doubled_bw[i] <= test_ctx->baseline_bw[i]) {
+			SCX_ERR("CPU%d: fair did not increase total_bw (baseline=%ld, doubled=%ld)",
+				i, test_ctx->baseline_bw[i], doubled_bw[i]);
+			ret = SCX_TEST_FAIL;
+			goto restore;
+		}
+	}
+
+	link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops);
+	if (!link) {
+		SCX_ERR("Failed to attach scheduler for detach test");
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values with BPF loaded (detach test)");
+		bpf_link__destroy(link);
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	/*
+	 * In full mode the customized fair_server is detached and ext_server is
+	 * attached at its default runtime, total_bw must match baseline.
+	 */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (loaded_bw[i] != test_ctx->baseline_bw[i]) {
+			SCX_ERR("CPU%d: expected bw %ld (fair detached, ext default), got %ld",
+				i, test_ctx->baseline_bw[i], loaded_bw[i]);
+			bpf_link__destroy(link);
+			ret = SCX_TEST_FAIL;
+			goto restore;
+		}
+	}
+
+	bpf_link__destroy(link);
+
+	if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values after BPF unload (detach test)");
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	/*
+	 * After unload, fair_server is re-attached with its preserved 2x
+	 * runtime, so total_bw should return to the doubled value.
+	 */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (unloaded_bw[i] != doubled_bw[i]) {
+			SCX_ERR("CPU%d: BPF unloaded: expected %ld (fair restored at 2x), got %ld",
+				i, doubled_bw[i], unloaded_bw[i]);
+			ret = SCX_TEST_FAIL;
+			goto restore;
+		}
+	}
+
+	fprintf(stderr,
+		"dl_server attach/detach with customized fair runtime verified\n");
+	ret = SCX_TEST_PASS;
+
+restore:
+	if (restore_fair_runtime_all(test_ctx->nr_cpus, original_runtime) < 0)
+		SCX_ERR("Failed to fully restore per-CPU fair_server runtimes");
+
+	return ret;
 }
 
 static void cleanup(void *ctx)
-- 
2.54.0



* Re: [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations
  2026-05-26 16:42 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
@ 2026-05-26 17:14   ` sashiko-bot
  2026-05-28 11:36   ` Peter Zijlstra
  1 sibling, 0 replies; 12+ messages in thread
From: sashiko-bot @ 2026-05-26 17:14 UTC
  To: Andrea Righi; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Modifying a deadline server's parameters via debugfs while its CPU is inactive (offline) corrupts the root domain's bandwidth accounting and can trigger a divide-by-zero kernel panic.
--

commit 81a64c3f8b34ea0965999ebcdc61714354b79af3
Author: Andrea Righi <arighi@nvidia.com>

sched_ext: Auto-register/unregister dl_server reservations

This commit makes deadline server bandwidth reservations dynamic by attaching
or detaching them based on whether a BPF scheduler is loaded. This prevents
inactive scheduling classes from artificially holding capacity.

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index ddfd6bc63ab1d..90f6589297ac5 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
[ ... ]
> @@ -1893,7 +1904,8 @@ void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
>  
>  int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
>  {
> -	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> +	u64 old_bw = (init || !dl_se->dl_bw_attached) ? 0 :
> +		     to_ratio(dl_se->dl_period, dl_se->dl_runtime);
>  	u64 new_bw = to_ratio(period, runtime);
>  	struct rq *rq = dl_se->rq;
>  	int cpu = cpu_of(rq);
> @@ -1913,7 +1925,8 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
>  	if (init) {
>  		__add_rq_bw(new_bw, &rq->dl);
>  		__dl_add(dl_b, new_bw, cpus);
> -	} else {
> +	} else if (dl_se->dl_bw_attached) {
>  		__dl_sub(dl_b, dl_se->dl_bw, cpus);

[Severity: High]
This is a pre-existing issue, but can modifying a deadline server's
parameters via debugfs while its CPU is offline trigger a divide-by-zero
panic here?

Looking at dl_server_apply_params(), it retrieves the number of active CPUs
via dl_bw_cpus(cpu) and then passes that to __dl_sub() and __dl_add() when
the server is attached:

dl_server_apply_params() {
    ...
    cpus = dl_bw_cpus(cpu);
    ...
    if (init) {
        ...
    } else if (dl_se->dl_bw_attached) {
        __dl_sub(dl_b, dl_se->dl_bw, cpus);
        __dl_add(dl_b, new_bw, cpus);
        ...
}

If the CPU is offline, dl_bw_cpus() can return 0. The helpers __dl_sub()
and __dl_add() perform a division by cpus:

__dl_sub() {
    ...
    __dl_update(dl_b, (s32)tsk_bw / cpus);
}

Wouldn't this result in a direct divide-by-zero if the debugfs file for an
inactive CPU is written to?

Additionally, even if cpus > 0, since inactive CPUs are skipped during
root domain bandwidth rebuilds, does subtracting dl_se->dl_bw from
dl_b->total_bw without a cpu_active() check subtract a value that was
never added, resulting in accounting corruption?
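
To make the two questions above concrete, here is a minimal user-space sketch of the kind of guard being asked about. It is not the kernel code and not the fix from the separate patch: struct dl_bw_model, dl_sub_checked() and the extra_bw field are hypothetical stand-ins that only mirror the shape of __dl_sub() as quoted above.

```c
#include <stdint.h>

/* Hypothetical user-space model of a root-domain bandwidth pool. */
struct dl_bw_model {
	uint64_t total_bw;	/* sum of admitted bandwidth */
	int64_t extra_bw;	/* stand-in for the share adjusted by __dl_update() */
};

/*
 * Subtract tsk_bw from the pool, refusing to touch the accounting when
 * there are no active CPUs to divide by or when the value was never added.
 * Returns 0 on success, -1 if the update was rejected.
 */
static int dl_sub_checked(struct dl_bw_model *dl_b, uint64_t tsk_bw, int cpus)
{
	if (cpus <= 0)
		return -1;	/* the division below would be by zero */
	if (tsk_bw > dl_b->total_bw)
		return -1;	/* would subtract bandwidth that was never accounted */

	dl_b->total_bw -= tsk_bw;
	dl_b->extra_bw += (int32_t)tsk_bw / cpus;
	return 0;
}
```

With this shape, a caller that sees cpus == 0 gets an error back and the pool is left untouched, instead of reaching the division.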

>  		__dl_add(dl_b, new_bw, cpus);
>  
>  		dl_rq_change_utilization(rq, dl_se, new_bw);
>  	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260526164420.638711-1-arighi@nvidia.com?part=1


* Re: [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test
  2026-05-26 16:42 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
@ 2026-05-26 17:33   ` sashiko-bot
  0 siblings, 0 replies; 12+ messages in thread
From: sashiko-bot @ 2026-05-26 17:33 UTC (permalink / raw)
  To: Andrea Righi; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:
- [Medium] The dl_server runtime and period parameters are parsed and written as signed `long` instead of `u64` or `unsigned long long`, leading to potential integer overflow on 32-bit platforms.
--

commit 6475744500400f1c70cab62bc90b17ee9b9cbbea
Author: Andrea Righi <arighi@nvidia.com>

selftests/sched_ext: Validate dl_server attach/detach in total_bw test

Extend the total_bw selftest to validate the fair/ext dl_server
auto-attach/detach operations by doubling the fair_server's runtime
via debugfs and verifying total_bw changes.

> diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
> index 5b0a619bab86e..2af01cee90cc0 100644
> --- a/tools/testing/selftests/sched_ext/total_bw.c
> +++ b/tools/testing/selftests/sched_ext/total_bw.c
[ ... ]
> @@ -100,6 +100,98 @@ static int read_total_bw_values(long *bw_values, int max_cpus)
>  	return cpu_count;
>  }
>  
> +/*
> + * Read a per-CPU dl_server param (runtime or period) from debugfs.
> + * Returns the value in nanoseconds, or -1 on failure.
> + */
> +static long read_server_param(const char *server, const char *param, int cpu)
> +{
> +	char path[128];
> +	long value = -1;

[Severity: Medium]
Will parsing the dl_server runtime and period parameters as signed long
lead to integer overflows on 32-bit platforms?

In the kernel, dl_server parameters like runtime and period are represented
as u64 values (in nanoseconds). On 32-bit platforms, a signed long is 32
bits, which can only represent up to ~2.14 seconds in nanoseconds.

If a test system configures a fair_server runtime larger than ~1.07 seconds,
doubling it could cause a signed integer overflow. Would it be better to use
unsigned long long or u64 with %llu to avoid this?
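
For reference, a 32-bit signed long tops out at 2147483647, which is where the ~2.14 seconds figure comes from when the unit is nanoseconds. A rough sketch of the unsigned long long variant follows. The helper names read_ns_value(), write_ns_value() and double_ns_value() are hypothetical and take an already opened stream, so this is not a drop-in replacement for read_server_param() and write_server_param().

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: read one unsigned 64-bit nanosecond value from a stream. */
static int read_ns_value(FILE *fp, unsigned long long *out)
{
	if (!fp || !out)
		return -1;
	if (fscanf(fp, "%llu", out) != 1)
		return -1;
	return 0;
}

/* Hypothetical helper: write one unsigned 64-bit nanosecond value to a stream. */
static int write_ns_value(FILE *fp, unsigned long long value)
{
	if (!fp)
		return -1;
	if (fprintf(fp, "%llu", value) < 0)
		return -1;
	return 0;
}

/* Double a runtime, reporting overflow instead of silently wrapping. */
static int double_ns_value(unsigned long long value, unsigned long long *out)
{
	if (!out || value > ULLONG_MAX / 2)
		return -1;
	*out = value * 2;
	return 0;
}
```

Returning the status separately from the value also removes the need to reserve -1 as an in-band failure marker.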

> +	FILE *fp;
> +
> +	snprintf(path, sizeof(path),
> +		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
> +		 server, cpu, param);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -1;
> +	if (fscanf(fp, "%ld", &value) != 1)
> +		value = -1;
> +	fclose(fp);
> +
> +	return value;
> +}
> +
> +/*
> + * Write a per-CPU dl_server param to debugfs. Returns 0 on success.
> + */
> +static int write_server_param(const char *server, const char *param,
> +			      int cpu, long value)

[Severity: Medium]
Similarly here, does taking long value as a parameter restrict the maximum
runtime that can be written on 32-bit systems?

> +{
> +	char path[128];
> +	FILE *fp;
> +	int ret = 0;
> +
> +	snprintf(path, sizeof(path),
> +		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
> +		 server, cpu, param);
> +	fp = fopen(path, "w");
> +	if (!fp)
> +		return -1;
> +	if (fprintf(fp, "%ld", value) < 0)
> +		ret = -1;
> +	if (fclose(fp) != 0)
> +		ret = -1;
> +
> +	return ret;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260526164420.638711-1-arighi@nvidia.com?part=2


* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
  2026-05-26 16:42 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
  2026-05-26 16:42 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
  2026-05-26 16:42 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
@ 2026-05-27 12:36 ` Juri Lelli
  2026-05-28 11:33   ` Peter Zijlstra
  2026-05-28 15:53 ` Tejun Heo
  3 siblings, 1 reply; 12+ messages in thread
From: Juri Lelli @ 2026-05-27 12:36 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	Christian Loehle, Phil Auld, Koba Ko, Joel Fernandes,
	Richard Cheng, Cheng-Yang Chou, sched-ext, linux-kernel

Hi Andrea,

On 26/05/26 18:42, Andrea Righi wrote:
> Currently, a fixed bandwidth is reserved at boot for both the fair and ext
> deadline servers, and this reservation remains unchanged unless explicitly
> modified via debugfs. As a result, both servers permanently contribute to global
> bandwidth accounting, regardless of whether a BPF scheduler is active.
> 
> While unused bandwidth can still be reclaimed at runtime by other classes, this
> static reservation prevents RT from fully utilizing available headroom in
> situations where one of the sched_ext or fair class is guaranteed to be inactive
> (for example, when no BPF scheduler is loaded, or when sched_ext runs in full
> mode and replaces fair).
> 
> As discussed at the VIII OSPM summit in Cambridge [1], a better solution would
> be to dynamically register and unregister deadline server bandwidth based on the
> active sched_ext state. This allows the kernel to automatically enable bandwidth
> accounting only for the scheduling class that is currently active, while
> disabling it for inactive ones.
> 
> This patch series implements this automatic register/unregister logic. Moreover,
> the sched_ext total_bw kselftest is also modified to validate the correct
> behavior across the different scheduling configurations and ensure that
> bandwidth accounting follows the expected state transitions.
> 
> [1] https://retis.santannapisa.it/ospm-summit/
> 
> Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git dl-server-bw-v3
> 
> Changes in v3:
>  - Don't bypass __dl_overflow() for detached servers in dl_server_apply_params()
>    to reject oversized configs up front (reported by Sashiko)
>  - A potential divide-by-zero in dl_server_apply_params() reported by Sashiko
>    has been fixed in a separate patch (not introduced by this patch set):
>    https://lore.kernel.org/all/20260526100502.575774-1-arighi@nvidia.com/
>  - Link to v2: https://lore.kernel.org/all/20260526082954.550958-1-arighi@nvidia.com/

This now looks good to me.

Acked-by: Juri Lelli <juri.lelli@redhat.com>

Thanks!
Juri



* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
  2026-05-27 12:36 ` [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Juri Lelli
@ 2026-05-28 11:33   ` Peter Zijlstra
  2026-05-28 16:13     ` Andrea Righi
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2026-05-28 11:33 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, K Prateek Nayak, Christian Loehle,
	Phil Auld, Koba Ko, Joel Fernandes, Richard Cheng,
	Cheng-Yang Chou, sched-ext, linux-kernel

On Wed, May 27, 2026 at 02:36:18PM +0200, Juri Lelli wrote:
> Hi Andrea,
> 
> On 26/05/26 18:42, Andrea Righi wrote:
> > Currently, a fixed bandwidth is reserved at boot for both the fair and ext
> > deadline servers, and this reservation remains unchanged unless explicitly
> > modified via debugfs. As a result, both servers permanently contribute to global
> > bandwidth accounting, regardless of whether a BPF scheduler is active.
> > 
> > While unused bandwidth can still be reclaimed at runtime by other classes, this
> > static reservation prevents RT from fully utilizing available headroom in
> > situations where one of the sched_ext or fair class is guaranteed to be inactive
> > (for example, when no BPF scheduler is loaded, or when sched_ext runs in full
> > mode and replaces fair).
> > 
> > As discussed at the VIII OSPM summit in Cambridge [1], a better solution would
> > be to dynamically register and unregister deadline server bandwidth based on the
> > active sched_ext state. This allows the kernel to automatically enable bandwidth
> > accounting only for the scheduling class that is currently active, while
> > disabling it for inactive ones.
> > 
> > This patch series implements this automatic register/unregister logic. Moreover,
> > the sched_ext total_bw kselftest is also modified to validate the correct
> > behavior across the different scheduling configurations and ensure that
> > bandwidth accounting follows the expected state transitions.
> > 
> > [1] https://retis.santannapisa.it/ospm-summit/
> > 
> > Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git dl-server-bw-v3
> > 
> > Changes in v3:
> >  - Don't bypass __dl_overflow() for detached servers in dl_server_apply_params()
> >    to reject oversized configs up front (reported by Sashiko)
> >  - A potential divide-by-zero in dl_server_apply_params() reported by Sashiko
> >    has been fixed in a separate patch (not introduced by this patch set):
> >    https://lore.kernel.org/all/20260526100502.575774-1-arighi@nvidia.com/
> >  - Link to v2: https://lore.kernel.org/all/20260526082954.550958-1-arighi@nvidia.com/
> 
> This now looks good to me.
> 
> Acked-by: Juri Lelli <juri.lelli@redhat.com>

Thanks! I've stuck them in queue:sched/core for the robots to chew on.
There was an absolutely trivial reject in ext.c that I fixed up, so
something moved around there.

There is one little nit, but I'll reply there and that can easily be
done on top if we decide it's worth it.


* Re: [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations
  2026-05-26 16:42 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
  2026-05-26 17:14   ` sashiko-bot
@ 2026-05-28 11:36   ` Peter Zijlstra
  2026-05-28 16:13     ` Andrea Righi
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2026-05-28 11:36 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, K Prateek Nayak, Christian Loehle,
	Phil Auld, Koba Ko, Joel Fernandes, Richard Cheng,
	Cheng-Yang Chou, sched-ext, linux-kernel

On Tue, May 26, 2026 at 06:42:48PM +0200, Andrea Righi wrote:
> @@ -6187,10 +6190,34 @@ static void scx_root_disable(struct scx_sched *sch)
>  	/*
>  	 * Invalidate all the rq clocks to prevent getting outdated
>  	 * rq clocks from a previous scx scheduler.
> +	 *
> +	 * Also re-balance the dl_server bandwidth reservations: detach
> +	 * ext_server (no more sched_ext tasks) and reinstate fair_server if it
> +	 * was previously detached because we were running in full mode.
> +	 *
> +	 * Unlike the enable path, this runs on a recovery path that cannot
> +	 * fail, so we use dl_server_swap_bw() to atomically free ext_server's
> +	 * bandwidth and reclaim it for fair_server under the same dl_b lock.
> +	 *
> +	 * The swap can still fail with -EBUSY if someone bumped ext_server's
> +	 * runtime via debugfs between enable and disable; in that narrow case
> +	 * both servers end up detached and we just WARN.
>  	 */
>  	for_each_possible_cpu(cpu) {
>  		struct rq *rq = cpu_rq(cpu);
> +
>  		scx_rq_clock_invalidate(rq);
> +
> +		scoped_guard(rq_lock_irqsave, rq) {
> +			update_rq_clock(rq);
> +			if (was_switched_all) {
> +				if (WARN_ON_ONCE(dl_server_swap_bw(&rq->ext_server,
> +								   &rq->fair_server)))
> +					pr_warn("failed to re-attach fair_server on CPU %d\n", cpu);

One option here, with the swap, is to reduce the fair server's bandwidth
to match the outgoing ext server. Then at least you end up with the fair
server running, rather than having it completely stopped.

But this is going to be a rather rare occurrence, and people will have
to go poke at the debugfs controls anyway if this happens, so maybe
that's just not worth the effort.

But I wanted to mention it...
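
As a rough illustration of that fallback, here is a user-space sketch. It is not how dl_server_swap_bw() is implemented: struct bw_pool, struct server_model and swap_bw_with_fallback() are hypothetical names, and the admission test is a plain capacity comparison rather than __dl_overflow().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: a bandwidth pool with a hard capacity. */
struct bw_pool {
	uint64_t total_bw;	/* currently admitted bandwidth */
	uint64_t cap;		/* maximum admissible bandwidth */
};

/* Hypothetical model of one deadline server's reservation. */
struct server_model {
	uint64_t bw;		/* configured bandwidth */
	bool attached;		/* whether bw is counted in the pool */
};

/*
 * Move the reservation from src to dst. The sketch assumes src is
 * attached, dst is detached and total_bw <= cap on entry. If dst's
 * configured bandwidth does not fit once src's share is released, dst
 * is shrunk to the share src just released, which fits by construction.
 * Returns true if dst had to be shrunk.
 */
static bool swap_bw_with_fallback(struct bw_pool *pool,
				  struct server_model *src,
				  struct server_model *dst)
{
	uint64_t freed = 0;
	bool shrunk = false;

	if (src->attached) {
		freed = src->bw;
		pool->total_bw -= freed;
		src->attached = false;
	}

	if (pool->total_bw + dst->bw > pool->cap) {
		dst->bw = freed;	/* fall back to the outgoing server's share */
		shrunk = true;
	}

	pool->total_bw += dst->bw;
	dst->attached = true;
	return shrunk;
}
```

The return value tells the caller whether to warn that the server came back with less bandwidth than it was configured for.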

> +			} else {
> +				dl_server_detach_bw(&rq->ext_server);
> +			}
> +		}
>  	}
>  
>  	/* no task is on scx, turn off all the switches and flush in-progress calls */


* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
  2026-05-26 16:42 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
                   ` (2 preceding siblings ...)
  2026-05-27 12:36 ` [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Juri Lelli
@ 2026-05-28 15:53 ` Tejun Heo
  2026-05-29  9:08   ` Peter Zijlstra
  3 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2026-05-28 15:53 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	Christian Loehle, Phil Auld, Koba Ko, Joel Fernandes,
	Richard Cheng, Cheng-Yang Chou, sched-ext, linux-kernel

Hello,

Peter, how do you want to route the patches? I'd be happy to take them
through sched_ext/for-7.2.

Thanks.

-- 
tejun


* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
  2026-05-28 11:33   ` Peter Zijlstra
@ 2026-05-28 16:13     ` Andrea Righi
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Righi @ 2026-05-28 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, K Prateek Nayak, Christian Loehle,
	Phil Auld, Koba Ko, Joel Fernandes, Richard Cheng,
	Cheng-Yang Chou, sched-ext, linux-kernel

Hi Peter,

On Thu, May 28, 2026 at 01:33:17PM +0200, Peter Zijlstra wrote:
> On Wed, May 27, 2026 at 02:36:18PM +0200, Juri Lelli wrote:
> > Hi Andrea,
> > 
> > On 26/05/26 18:42, Andrea Righi wrote:
> > > Currently, a fixed bandwidth is reserved at boot for both the fair and ext
> > > deadline servers, and this reservation remains unchanged unless explicitly
> > > modified via debugfs. As a result, both servers permanently contribute to global
> > > bandwidth accounting, regardless of whether a BPF scheduler is active.
> > > 
> > > While unused bandwidth can still be reclaimed at runtime by other classes, this
> > > static reservation prevents RT from fully utilizing available headroom in
> > > situations where one of the sched_ext or fair class is guaranteed to be inactive
> > > (for example, when no BPF scheduler is loaded, or when sched_ext runs in full
> > > mode and replaces fair).
> > > 
> > > As discussed at the VIII OSPM summit in Cambridge [1], a better solution would
> > > be to dynamically register and unregister deadline server bandwidth based on the
> > > active sched_ext state. This allows the kernel to automatically enable bandwidth
> > > accounting only for the scheduling class that is currently active, while
> > > disabling it for inactive ones.
> > > 
> > > This patch series implements this automatic register/unregister logic. Moreover,
> > > the sched_ext total_bw kselftest is also modified to validate the correct
> > > behavior across the different scheduling configurations and ensure that
> > > bandwidth accounting follows the expected state transitions.
> > > 
> > > [1] https://retis.santannapisa.it/ospm-summit/
> > > 
> > > Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git dl-server-bw-v3
> > > 
> > > Changes in v3:
> > >  - Don't bypass __dl_overflow() for detached servers in dl_server_apply_params()
> > >    to reject oversized configs up front (reported by Sashiko)
> > >  - A potential divide-by-zero in dl_server_apply_params() reported by Sashiko
> > >    has been fixed in a separate patch (not introduced by this patch set):
> > >    https://lore.kernel.org/all/20260526100502.575774-1-arighi@nvidia.com/
> > >  - Link to v2: https://lore.kernel.org/all/20260526082954.550958-1-arighi@nvidia.com/
> > 
> > This now looks good to me.
> > 
> > Acked-by: Juri Lelli <juri.lelli@redhat.com>
> 
> Thanks! I've stuck them in queue:sched/core for the robots to chew on.
> There was an absolutely trivial reject in ext.c that I fixed up, so
> something moved around there.

FYI, I re-ran all my tests with queue:sched/core, everything looks good on my
side.

Thanks,
-Andrea


* Re: [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations
  2026-05-28 11:36   ` Peter Zijlstra
@ 2026-05-28 16:13     ` Andrea Righi
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Righi @ 2026-05-28 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, K Prateek Nayak, Christian Loehle,
	Phil Auld, Koba Ko, Joel Fernandes, Richard Cheng,
	Cheng-Yang Chou, sched-ext, linux-kernel

Hi Peter,

On Thu, May 28, 2026 at 01:36:21PM +0200, Peter Zijlstra wrote:
> On Tue, May 26, 2026 at 06:42:48PM +0200, Andrea Righi wrote:
> > @@ -6187,10 +6190,34 @@ static void scx_root_disable(struct scx_sched *sch)
> >  	/*
> >  	 * Invalidate all the rq clocks to prevent getting outdated
> >  	 * rq clocks from a previous scx scheduler.
> > +	 *
> > +	 * Also re-balance the dl_server bandwidth reservations: detach
> > +	 * ext_server (no more sched_ext tasks) and reinstate fair_server if it
> > +	 * was previously detached because we were running in full mode.
> > +	 *
> > +	 * Unlike the enable path, this runs on a recovery path that cannot
> > +	 * fail, so we use dl_server_swap_bw() to atomically free ext_server's
> > +	 * bandwidth and reclaim it for fair_server under the same dl_b lock.
> > +	 *
> > +	 * The swap can still fail with -EBUSY if someone bumped ext_server's
> > +	 * runtime via debugfs between enable and disable; in that narrow case
> > +	 * both servers end up detached and we just WARN.
> >  	 */
> >  	for_each_possible_cpu(cpu) {
> >  		struct rq *rq = cpu_rq(cpu);
> > +
> >  		scx_rq_clock_invalidate(rq);
> > +
> > +		scoped_guard(rq_lock_irqsave, rq) {
> > +			update_rq_clock(rq);
> > +			if (was_switched_all) {
> > +				if (WARN_ON_ONCE(dl_server_swap_bw(&rq->ext_server,
> > +								   &rq->fair_server)))
> > +					pr_warn("failed to re-attach fair_server on CPU %d\n", cpu);
> 
> One option here, with the swap, is to reduce the fair server's bandwidth
> to match the outgoing ext server. Then at least you end up with the fair
> server running, rather than having it completely stopped.
> 
> But this is going to be a rather rare occurrence, and people will have
> to go poke at the debugfs controls anyway if this happens, so maybe
> that's just not worth the effort.
> 
> But I wanted to mention it...

Yeah, it'd be safer to at least have "some" bandwidth attached if
dl_server_swap_bw() fails, so that fair isn't left completely unprotected.

On top of that we could even try to opportunistically restore the original
bandwidth whenever DL bw is released, but as you say, this is probably a rare
scenario, maybe it could be a later follow-up improvement?

> 
> > +			} else {
> > +				dl_server_detach_bw(&rq->ext_server);
> > +			}
> > +		}
> >  	}
> >  
> >  	/* no task is on scx, turn off all the switches and flush in-progress calls */

Thanks,
-Andrea


* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
  2026-05-28 15:53 ` Tejun Heo
@ 2026-05-29  9:08   ` Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2026-05-29  9:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrea Righi, David Vernet, Changwoo Min, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, K Prateek Nayak, Christian Loehle,
	Phil Auld, Koba Ko, Joel Fernandes, Richard Cheng,
	Cheng-Yang Chou, sched-ext, linux-kernel

On Thu, May 28, 2026 at 05:53:04AM -1000, Tejun Heo wrote:
> Hello,
> 
> Peter, how do you want to route the patches? I'd be happy to take them
> through sched_ext/for-7.2.

I have them en-route to tip:sched/core.


end of thread, other threads:[~2026-05-29  9:08 UTC | newest]

Thread overview: 12+ messages
2026-05-26 16:42 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
2026-05-26 16:42 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
2026-05-26 17:14   ` sashiko-bot
2026-05-28 11:36   ` Peter Zijlstra
2026-05-28 16:13     ` Andrea Righi
2026-05-26 16:42 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
2026-05-26 17:33   ` sashiko-bot
2026-05-27 12:36 ` [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Juri Lelli
2026-05-28 11:33   ` Peter Zijlstra
2026-05-28 16:13     ` Andrea Righi
2026-05-28 15:53 ` Tejun Heo
2026-05-29  9:08   ` Peter Zijlstra
