* [PATCHSET sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
@ 2026-05-21 17:33 Andrea Righi
  2026-05-21 17:33 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
  2026-05-21 17:33 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
  0 siblings, 2 replies; 10+ messages in thread
From: Andrea Righi @ 2026-05-21 17:33 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Joel Fernandes, Richard Cheng, Cheng-Yang Chou,
	sched-ext, linux-kernel

Currently, a fixed bandwidth is reserved at boot for both the fair and ext
deadline servers, and this reservation remains unchanged unless explicitly
modified via debugfs. As a result, both reservations stay accounted in the root
domain's deadline bandwidth pool, regardless of whether a BPF scheduler is
active.

While unused bandwidth can still be reclaimed at runtime by other classes, this
static reservation prevents RT from fully utilizing available headroom in
situations where either the sched_ext or fair class is guaranteed to be inactive
(for example, when no BPF scheduler is loaded, or when sched_ext runs in full
mode and replaces fair).

As discussed at the VIII OSPM summit in Cambridge [1], a better solution would
be to dynamically register and unregister deadline server bandwidth based on the
active sched_ext state. This allows the kernel to automatically enable bandwidth
accounting only for the scheduling class that is currently active, while
disabling it for inactive ones.

This patch series implements this automatic register/unregister logic. It also
extends the sched_ext total_bw kselftest to validate the behavior across the
different scheduling configurations and to ensure that bandwidth accounting
follows the expected state transitions.

[1] https://retis.santannapisa.it/ospm-summit/

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git dl-server-bw

Andrea Righi (2):
      sched_ext: Auto-register/unregister dl_server reservations
      selftests/sched_ext: Validate dl_server attach/detach in total_bw test

 include/linux/sched.h                        |   6 +
 kernel/sched/deadline.c                      | 109 +++++++++++++++--
 kernel/sched/ext.c                           |  43 +++++++
 kernel/sched/sched.h                         |   2 +
 tools/testing/selftests/sched_ext/total_bw.c | 171 ++++++++++++++++++++++++++-
 5 files changed, 323 insertions(+), 8 deletions(-)


* [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations
  2026-05-21 17:33 [PATCHSET sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
@ 2026-05-21 17:33 ` Andrea Righi
  2026-05-21 18:23   ` sashiko-bot
  2026-05-22  8:36   ` Peter Zijlstra
  2026-05-21 17:33 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
  1 sibling, 2 replies; 10+ messages in thread
From: Andrea Righi @ 2026-05-21 17:33 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Joel Fernandes, Richard Cheng, Cheng-Yang Chou,
	sched-ext, linux-kernel

Commit cd959a3562050d ("sched_ext: Add a DL server for sched_ext tasks")
introduced an ext_server deadline server to protect sched_ext tasks from
fair/RT starvation, mirroring the existing fair_server.

Currently, both servers reserve their 50ms/1000ms bandwidth at boot,
regardless of whether a BPF scheduler is loaded. Unused bandwidth is
still reclaimed at runtime by other classes, but the static reservation
prevents the RT class from implicitly using that headroom when one of
the two classes is guaranteed to be empty.

A sysadmin can work around this by writing
/sys/kernel/debug/sched/{fair,ext}_server/cpu*/runtime, but that
requires manual action and not all systems expose debugfs.

A better approach is to make server bandwidth reservations dynamic: only
the scheduling class that is currently active registers its reservation,
while the inactive one does not artificially hold capacity. Both
reservations are kept only when the BPF scheduler is running in partial
mode:

 +---------------------------------------------+-------------+------------+
 | BPF scheduler state                         | fair server | ext server |
 +---------------------------------------------+-------------+------------+
 | not loaded (default boot)                   | reserved    | none       |
 | loaded full mode (!SCX_OPS_SWITCH_PARTIAL)  | none        | reserved   |
 | loaded partial mode (SCX_OPS_SWITCH_PARTIAL)| reserved    | reserved   |
 +---------------------------------------------+-------------+------------+

To achieve this, introduce an "attached/detached" state for each
deadline server, so the kernel can decide whether a server's bandwidth
should be accounted in global bandwidth tracking.

At boot, the system starts with only the fair server contributing to
bandwidth accounting. When a BPF scheduler is enabled, the ext server is
attached and may replace or complement the fair server depending on
whether full or partial mode is used. When sched_ext is disabled, the
system restores the previous deadline bandwidth values and behavior.

The transition logic ensures that switching between scheduling modes is
consistent and reversible, without losing runtime configuration or
requiring manual intervention.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 include/linux/sched.h   |   6 +++
 kernel/sched/deadline.c | 109 +++++++++++++++++++++++++++++++++++++---
 kernel/sched/ext.c      |  43 ++++++++++++++++
 kernel/sched/sched.h    |   2 +
 4 files changed, 153 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 004e6d56a499a..6d55b2ed9651a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -702,6 +702,11 @@ struct sched_dl_entity {
 	 * running, skipping the defer phase.
 	 *
 	 * @dl_defer_idle tracks idle state
+	 *
+	 * @dl_bw_attached tells if this server's bandwidth currently
+	 * contributes to the root domain's total_bw. Only meaningful for server
+	 * entities (@dl_server == 1). Allows toggling the reservation on/off
+	 * without losing the configured @dl_runtime/@dl_period.
 	 */
 	unsigned int			dl_throttled      : 1;
 	unsigned int			dl_yielded        : 1;
@@ -713,6 +718,7 @@ struct sched_dl_entity {
 	unsigned int			dl_defer_armed	  : 1;
 	unsigned int			dl_defer_running  : 1;
 	unsigned int			dl_defer_idle     : 1;
+	unsigned int			dl_bw_attached    : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index edca7849b165d..08b8a8c23e80b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1868,6 +1868,13 @@ void sched_init_dl_servers(void)
 		dl_se->dl_server = 1;
 		dl_se->dl_defer = 1;
 		setup_new_dl_entity(dl_se);
+
+		/*
+		 * No BPF scheduler is loaded at boot, so the ext_server has no
+		 * tasks to protect. Detach its bandwidth reservation, it will
+		 * be re-attached when a BPF scheduler is loaded.
+		 */
+		dl_server_detach_bw(dl_se);
 #endif
 	}
 }
@@ -1878,6 +1885,9 @@ void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
 	int cpu = cpu_of(rq);
 	struct dl_bw *dl_b;
 
+	if (!dl_se->dl_bw_attached)
+		return;
+
 	dl_b = dl_bw_of(cpu_of(rq));
 	guard(raw_spinlock)(&dl_b->lock);
 
@@ -1903,16 +1913,21 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 	cpus = dl_bw_cpus(cpu);
 	cap = dl_bw_capacity(cpu);
 
-	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
-		return -EBUSY;
+	if (init || dl_se->dl_bw_attached) {
+		if (__dl_overflow(dl_b, cap, old_bw, new_bw))
+			return -EBUSY;
+	}
 
 	if (init) {
 		__add_rq_bw(new_bw, &rq->dl);
 		__dl_add(dl_b, new_bw, cpus);
-	} else {
+		dl_se->dl_bw_attached = 1;
+	} else if (dl_se->dl_bw_attached) {
 		__dl_sub(dl_b, dl_se->dl_bw, cpus);
 		__dl_add(dl_b, new_bw, cpus);
 
+		dl_rq_change_utilization(rq, dl_se, new_bw);
+	} else {
 		dl_rq_change_utilization(rq, dl_se, new_bw);
 	}
 
@@ -1929,6 +1944,84 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 	return 0;
 }
 
+/*
+ * Attach @dl_se's bandwidth to the root domain's total_bw accounting.
+ *
+ * Use to dynamically register a dl_server's bandwidth reservation while
+ * preserving its configured @dl_runtime / @dl_period. No-op if @dl_se is
+ * already attached.
+ *
+ * Returns -EBUSY if attaching would overflow the root domain capacity.
+ */
+int dl_server_attach_bw(struct sched_dl_entity *dl_se)
+{
+	struct rq *rq = dl_se->rq;
+	int cpu = cpu_of(rq);
+	struct dl_bw *dl_b;
+	unsigned long cap;
+	int cpus;
+
+	if (dl_se->dl_bw_attached)
+		return 0;
+
+	dl_b = dl_bw_of(cpu);
+	guard(raw_spinlock)(&dl_b->lock);
+
+	cpus = dl_bw_cpus(cpu);
+
+	/*
+	 * If there's no active CPU in this root domain (e.g., @cpu is offline),
+	 * just record the intent, so dl_server_add_bw() honors it when CPUs
+	 * come back.
+	 */
+	if (!cpus) {
+		dl_se->dl_bw_attached = 1;
+		return 0;
+	}
+
+	cap = dl_bw_capacity(cpu);
+
+	if (__dl_overflow(dl_b, cap, 0, dl_se->dl_bw))
+		return -EBUSY;
+
+	__dl_add(dl_b, dl_se->dl_bw, cpus);
+	dl_se->dl_bw_attached = 1;
+
+	return 0;
+}
+
+/*
+ * Detach @dl_se's bandwidth from the root domain's total_bw accounting.
+ *
+ * Use to dynamically unregister a dl_server's bandwidth reservation while
+ * preserving its configured @dl_runtime / @dl_period. No-op if @dl_se is
+ * not currently attached.
+ */
+void dl_server_detach_bw(struct sched_dl_entity *dl_se)
+{
+	struct rq *rq = dl_se->rq;
+	int cpu = cpu_of(rq);
+	struct dl_bw *dl_b;
+	int cpus;
+
+	if (!dl_se->dl_bw_attached)
+		return;
+
+	dl_b = dl_bw_of(cpu);
+	guard(raw_spinlock)(&dl_b->lock);
+
+	cpus = dl_bw_cpus(cpu);
+
+	/*
+	 * If no active CPUs in this root domain, the bandwidth isn't in
+	 * @dl_b right now; only clear the flag so dl_server_add_bw() skips
+	 * the server on the next root-domain rebuild.
+	 */
+	if (cpus)
+		__dl_sub(dl_b, dl_se->dl_bw, cpus);
+	dl_se->dl_bw_attached = 0;
+}
+
 /*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
@@ -3229,12 +3322,12 @@ static void dl_server_add_bw(struct root_domain *rd, int cpu)
 	struct sched_dl_entity *dl_se;
 
 	dl_se = &cpu_rq(cpu)->fair_server;
-	if (dl_server(dl_se) && cpu_active(cpu))
+	if (dl_server(dl_se) && dl_se->dl_bw_attached && cpu_active(cpu))
 		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
 
 #ifdef CONFIG_SCHED_CLASS_EXT
 	dl_se = &cpu_rq(cpu)->ext_server;
-	if (dl_server(dl_se) && cpu_active(cpu))
+	if (dl_server(dl_se) && dl_se->dl_bw_attached && cpu_active(cpu))
 		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
 #endif
 }
@@ -3243,11 +3336,13 @@ static u64 dl_server_read_bw(int cpu)
 {
 	u64 dl_bw = 0;
 
-	if (cpu_rq(cpu)->fair_server.dl_server)
+	if (cpu_rq(cpu)->fair_server.dl_server &&
+	    cpu_rq(cpu)->fair_server.dl_bw_attached)
 		dl_bw += cpu_rq(cpu)->fair_server.dl_bw;
 
 #ifdef CONFIG_SCHED_CLASS_EXT
-	if (cpu_rq(cpu)->ext_server.dl_server)
+	if (cpu_rq(cpu)->ext_server.dl_server &&
+	    cpu_rq(cpu)->ext_server.dl_bw_attached)
 		dl_bw += cpu_rq(cpu)->ext_server.dl_bw;
 #endif
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9c458552d14ff..15ba49fcba9af 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6061,6 +6061,7 @@ static void scx_root_disable(struct scx_sched *sch)
 {
 	struct scx_task_iter sti;
 	struct task_struct *p;
+	bool was_switched_all;
 	int cpu;
 
 	/* guarantee forward progress and wait for descendants to be disabled */
@@ -6087,6 +6088,13 @@ static void scx_root_disable(struct scx_sched *sch)
 	 */
 	mutex_lock(&scx_enable_mutex);
 
+	/*
+	 * Snapshot the full vs partial mode before clearing the static
+	 * branch, so the dl_server re-balance below knows whether the
+	 * fair_server reservation needs to be reinstated.
+	 */
+	was_switched_all = scx_switched_all();
+
 	static_branch_disable(&__scx_switched_all);
 	WRITE_ONCE(scx_switching_all, false);
 
@@ -6136,10 +6144,24 @@ static void scx_root_disable(struct scx_sched *sch)
 	/*
 	 * Invalidate all the rq clocks to prevent getting outdated
 	 * rq clocks from a previous scx scheduler.
+	 *
+	 * Also re-balance the dl_server bandwidth reservations: detach
+	 * ext_server (no more sched_ext tasks) and reinstate fair_server
+	 * if it was previously detached because we were running in full
+	 * mode. Detach before attach to avoid a transient overflow of the
+	 * root domain's bandwidth capacity.
 	 */
 	for_each_possible_cpu(cpu) {
 		struct rq *rq = cpu_rq(cpu);
+
 		scx_rq_clock_invalidate(rq);
+
+		scoped_guard(rq_lock_irqsave, rq) {
+			dl_server_detach_bw(&rq->ext_server);
+			if (was_switched_all &&
+			    WARN_ON_ONCE(dl_server_attach_bw(&rq->fair_server)))
+				pr_warn("failed to re-attach fair_server on CPU %d\n", cpu);
+		}
 	}
 
 	/* no task is on scx, turn off all the switches and flush in-progress calls */
@@ -7314,6 +7336,27 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 	if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL))
 		static_branch_enable(&__scx_switched_all);
 
+	/*
+	 * Re-balance the dl_server bandwidth reservations.
+	 *
+	 * In full mode (!SCX_OPS_SWITCH_PARTIAL) no task will ever run in
+	 * the fair class, so detach the fair_server reservation and give
+	 * that bandwidth back to the RT class. Always attach the
+	 * ext_server reservation since sched_ext tasks are now possible.
+	 *
+	 * Detach before attach to avoid a transient overflow of the root
+	 * domain's bandwidth capacity.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct rq *rq = cpu_rq(cpu);
+
+		guard(rq_lock_irqsave)(rq);
+		if (scx_switched_all())
+			dl_server_detach_bw(&rq->fair_server);
+		if (WARN_ON_ONCE(dl_server_attach_bw(&rq->ext_server)))
+			pr_warn("failed to attach ext_server on CPU %d\n", cpu);
+	}
+
 	pr_info("sched_ext: BPF scheduler \"%s\" enabled%s\n",
 		sch->ops.name, scx_switched_all() ? "" : " (partial)");
 	kobject_uevent(&sch->kobj, KOBJ_ADD);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d1..1e233618257ac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -421,6 +421,8 @@ extern void ext_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
+extern int dl_server_attach_bw(struct sched_dl_entity *dl_se);
+extern void dl_server_detach_bw(struct sched_dl_entity *dl_se);
 
 static inline bool dl_server_active(struct sched_dl_entity *dl_se)
 {
-- 
2.54.0



* Re: [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations
  2026-05-21 17:33 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
@ 2026-05-21 18:23   ` sashiko-bot
  2026-05-22  8:36   ` Peter Zijlstra
  1 sibling, 0 replies; 10+ messages in thread
From: sashiko-bot @ 2026-05-21 18:23 UTC (permalink / raw)
  To: Andrea Righi; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 3 potential issues to consider:
- [High] Deferring the ext_server bandwidth reservation to sched_ext load time allows deadline bandwidth overcommit, breaking real-time guarantees.
- [High] Dropping `dl_b->lock` between detaching and attaching deadline servers allows concurrent tasks to steal bandwidth, permanently breaking starvation protection.
- [High] `dl_server_detach_bw()` and `dl_server_attach_bw()` fail to update local runqueue bandwidth (`rq->dl.this_bw`), breaking GRUB reclamation and causing CPU overcommit.
--

commit c14e78e1decaa133b4b018a638661ae8c0968dc2
Author: Andrea Righi <arighi@nvidia.com>

sched_ext: Auto-register/unregister dl_server reservations

This commit changes the deadline server bandwidth reservations for sched_ext
and fair servers to be dynamic. Rather than statically reserving bandwidth
for both at boot, it attaches and detaches the bandwidth reservations
based on whether the BPF scheduler is loaded and operating in full or
partial mode.

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c

[ ... ]

> +int dl_server_attach_bw(struct sched_dl_entity *dl_se)
> +{
> +	struct rq *rq = dl_se->rq;
> +	int cpu = cpu_of(rq);
> +	struct dl_bw *dl_b;
> +	unsigned long cap;
> +	int cpus;
> +
> +	if (dl_se->dl_bw_attached)
> +		return 0;
> +
> +	dl_b = dl_bw_of(cpu);
> +	guard(raw_spinlock)(&dl_b->lock);
> +
> +	cpus = dl_bw_cpus(cpu);
> +
> +	/*
> +	 * If there's no active CPU in this root domain (e.g., @cpu is offline),
> +	 * just record the intent, so dl_server_add_bw() honors it when CPUs
> +	 * come back.
> +	 */
> +	if (!cpus) {
> +		dl_se->dl_bw_attached = 1;
> +		return 0;
> +	}
> +
> +	cap = dl_bw_capacity(cpu);
> +
> +	if (__dl_overflow(dl_b, cap, 0, dl_se->dl_bw))
> +		return -EBUSY;
> +
> +	__dl_add(dl_b, dl_se->dl_bw, cpus);
> +	dl_se->dl_bw_attached = 1;
> +
> +	return 0;
> +}
> +
> +/*
> + * Detach @dl_se's bandwidth from the root domain's total_bw accounting.
> + *
> + * Use to dynamically unregister a dl_server's bandwidth reservation while
> + * preserving its configured @dl_runtime / @dl_period. No-op if @dl_se is
> + * not currently attached.
> + */
> +void dl_server_detach_bw(struct sched_dl_entity *dl_se)
> +{
> +	struct rq *rq = dl_se->rq;
> +	int cpu = cpu_of(rq);
> +	struct dl_bw *dl_b;
> +	int cpus;
> +
> +	if (!dl_se->dl_bw_attached)
> +		return;
> +
> +	dl_b = dl_bw_of(cpu);
> +	guard(raw_spinlock)(&dl_b->lock);
> +
> +	cpus = dl_bw_cpus(cpu);
> +
> +	/*
> +	 * If no active CPUs in this root domain, the bandwidth isn't in
> +	 * @dl_b right now; only clear the flag so dl_server_add_bw() skips
> +	 * the server on the next root-domain rebuild.
> +	 */
> +	if (cpus)
> +		__dl_sub(dl_b, dl_se->dl_bw, cpus);
> +	dl_se->dl_bw_attached = 0;
> +}

[Severity: High]
Does this miss updating the local runqueue bandwidth?

While the global root domain bandwidth (dl_b->total_bw) is updated here
and in dl_server_attach_bw(), rq->dl.this_bw doesn't appear to be updated
via __sub_rq_bw() and __add_rq_bw(). 

During boot, ext_server is initialized and its bandwidth is added to
this_bw, but the subsequent dl_server_detach_bw() does not subtract it.

Could this permanently inflate this_bw, which then causes the GRUB
reclamation algorithm to incorrectly calculate inactive utilization and
allow active DL tasks to consume more CPU time than admitted?

> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c

[ ... ]

> @@ -6136,10 +6144,24 @@ static void scx_root_disable(struct scx_sched *sch)
>  	/*
>  	 * Invalidate all the rq clocks to prevent getting outdated
>  	 * rq clocks from a previous scx scheduler.
> +	 *
> +	 * Also re-balance the dl_server bandwidth reservations: detach
> +	 * ext_server (no more sched_ext tasks) and reinstate fair_server
> +	 * if it was previously detached because we were running in full
> +	 * mode. Detach before attach to avoid a transient overflow of the
> +	 * root domain's bandwidth capacity.
>  	 */
>  	for_each_possible_cpu(cpu) {
>  		struct rq *rq = cpu_rq(cpu);
> +
>  		scx_rq_clock_invalidate(rq);
> +
> +		scoped_guard(rq_lock_irqsave, rq) {
> +			dl_server_detach_bw(&rq->ext_server);
> +			if (was_switched_all &&
> +			    WARN_ON_ONCE(dl_server_attach_bw(&rq->fair_server)))

[Severity: High]
Can a concurrent task steal the bandwidth during this transition?

Because dl_server_detach_bw() and dl_server_attach_bw() each acquire and
release the root domain's dl_b->lock independently, the lock is dropped
between the two calls.

If another CPU executes sched_setattr(SCHED_DEADLINE) during this window,
can it consume the newly freed bandwidth, causing the attach operation to
fail and permanently leaving the CPU without its starvation protection
server?

> +				pr_warn("failed to re-attach fair_server on CPU %d\n", cpu);
> +		}
>  	}

[ ... ]

> @@ -7314,6 +7336,27 @@ static void scx_root_enable_workfn(struct kthread_work *work)
>  	if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL))
>  		static_branch_enable(&__scx_switched_all);
>  
> +	/*
> +	 * Re-balance the dl_server bandwidth reservations.
> +	 *
> +	 * In full mode (!SCX_OPS_SWITCH_PARTIAL) no task will ever run in
> +	 * the fair class, so detach the fair_server reservation and give
> +	 * that bandwidth back to the RT class. Always attach the
> +	 * ext_server reservation since sched_ext tasks are now possible.
> +	 *
> +	 * Detach before attach to avoid a transient overflow of the root
> +	 * domain's bandwidth capacity.
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		struct rq *rq = cpu_rq(cpu);
> +
> +		guard(rq_lock_irqsave)(rq);
> +		if (scx_switched_all())
> +			dl_server_detach_bw(&rq->fair_server);
> +		if (WARN_ON_ONCE(dl_server_attach_bw(&rq->ext_server)))

[Severity: High]
Does this allow deadline bandwidth overcommit?

In partial mode (!scx_switched_all()), the system attempts to attach
ext_server while fair_server is still attached. If user deadline tasks
have already consumed the remaining root domain bandwidth prior to loading
the BPF scheduler, dl_server_attach_bw() will fail with -EBUSY.

Since this error only triggers a warning and the scheduler is loaded
anyway, ext_server remains active and will throttle other DL tasks. Does this
silently bypass DL admission control and break real-time guarantees?

> +			pr_warn("failed to attach ext_server on CPU %d\n", cpu);
> +	}
> +
>  	pr_info("sched_ext: BPF scheduler \"%s\" enabled%s\n",
>  		sch->ops.name, scx_switched_all() ? "" : " (partial)");
>  	kobject_uevent(&sch->kobj, KOBJ_ADD);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260521174509.1534623-1-arighi@nvidia.com?part=1


* Re: [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations
  2026-05-21 17:33 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
  2026-05-21 18:23   ` sashiko-bot
@ 2026-05-22  8:36   ` Peter Zijlstra
  2026-05-22 10:02     ` Andrea Righi
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2026-05-22  8:36 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, K Prateek Nayak, Christian Loehle,
	Phil Auld, Koba Ko, Joel Fernandes, Richard Cheng,
	Cheng-Yang Chou, sched-ext, linux-kernel

On Thu, May 21, 2026 at 07:33:56PM +0200, Andrea Righi wrote:

> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 9c458552d14ff..15ba49fcba9af 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -6061,6 +6061,7 @@ static void scx_root_disable(struct scx_sched *sch)
>  {
>  	struct scx_task_iter sti;
>  	struct task_struct *p;
> +	bool was_switched_all;
>  	int cpu;
>  
>  	/* guarantee forward progress and wait for descendants to be disabled */
> @@ -6087,6 +6088,13 @@ static void scx_root_disable(struct scx_sched *sch)
>  	 */
>  	mutex_lock(&scx_enable_mutex);
>  
> +	/*
> +	 * Snapshot the full vs partial mode before clearing the static
> +	 * branch, so the dl_server re-balance below knows whether the
> +	 * fair_server reservation needs to be reinstated.
> +	 */
> +	was_switched_all = scx_switched_all();
> +
>  	static_branch_disable(&__scx_switched_all);
>  	WRITE_ONCE(scx_switching_all, false);
>  
> @@ -6136,10 +6144,24 @@ static void scx_root_disable(struct scx_sched *sch)
>  	/*
>  	 * Invalidate all the rq clocks to prevent getting outdated
>  	 * rq clocks from a previous scx scheduler.
> +	 *
> +	 * Also re-balance the dl_server bandwidth reservations: detach
> +	 * ext_server (no more sched_ext tasks) and reinstate fair_server
> +	 * if it was previously detached because we were running in full
> +	 * mode. Detach before attach to avoid a transient overflow of the
> +	 * root domain's bandwidth capacity.
>  	 */
>  	for_each_possible_cpu(cpu) {
>  		struct rq *rq = cpu_rq(cpu);
> +
>  		scx_rq_clock_invalidate(rq);
> +
> +		scoped_guard(rq_lock_irqsave, rq) {
> +			dl_server_detach_bw(&rq->ext_server);
> +			if (was_switched_all &&
> +			    WARN_ON_ONCE(dl_server_attach_bw(&rq->fair_server)))
> +				pr_warn("failed to re-attach fair_server on CPU %d\n", cpu);
> +		}
>  	}
>  
>  	/* no task is on scx, turn off all the switches and flush in-progress calls */
> @@ -7314,6 +7336,27 @@ static void scx_root_enable_workfn(struct kthread_work *work)
>  	if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL))
>  		static_branch_enable(&__scx_switched_all);
>  
> +	/*
> +	 * Re-balance the dl_server bandwidth reservations.
> +	 *
> +	 * In full mode (!SCX_OPS_SWITCH_PARTIAL) no task will ever run in
> +	 * the fair class, so detach the fair_server reservation and give
> +	 * that bandwidth back to the RT class. Always attach the
> +	 * ext_server reservation since sched_ext tasks are now possible.
> +	 *
> +	 * Detach before attach to avoid a transient overflow of the root
> +	 * domain's bandwidth capacity.
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		struct rq *rq = cpu_rq(cpu);
> +
> +		guard(rq_lock_irqsave)(rq);
> +		if (scx_switched_all())
> +			dl_server_detach_bw(&rq->fair_server);
> +		if (WARN_ON_ONCE(dl_server_attach_bw(&rq->ext_server)))
> +			pr_warn("failed to attach ext_server on CPU %d\n", cpu);
> +	}
> +
>  	pr_info("sched_ext: BPF scheduler \"%s\" enabled%s\n",
>  		sch->ops.name, scx_switched_all() ? "" : " (partial)");
>  	kobject_uevent(&sch->kobj, KOBJ_ADD);

For switching *to* scx, I think it makes sense to attach ext_server
early and fail the switch if the attach fails. And only after the
switch, conditionally detach fair_server.

Since switching back to fair is a recovery path, this isn't really an
option -- the only actual option is keeping the fair_server reservation,
but that isn't ideal either.




* Re: [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations
  2026-05-22  8:36   ` Peter Zijlstra
@ 2026-05-22 10:02     ` Andrea Righi
  0 siblings, 0 replies; 10+ messages in thread
From: Andrea Righi @ 2026-05-22 10:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, K Prateek Nayak, Christian Loehle,
	Phil Auld, Koba Ko, Joel Fernandes, Richard Cheng,
	Cheng-Yang Chou, sched-ext, linux-kernel

Hi Peter,

On Fri, May 22, 2026 at 10:36:55AM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2026 at 07:33:56PM +0200, Andrea Righi wrote:
> 
> For switching *to* scx, I think it makes sense to attach ext_server
> early and fail the switch if the attach fails. And only after the
> switch, conditionally detach fair_server.
> 
> Since switching back to fair is a recovery path, this isn't really an
> option -- the only actual option is keeping the fair_server reservation,
> but that isn't ideal either.

Makes sense, I'll restructure the enable path to attach ext_server early (before
any commit, failing with -EBUSY if needed) and defer the fair_server detach
until after the switch is fully committed.

I'll send a new version with this change, along with the fixes to the other
issues reported by Sashiko.

Thanks,
-Andrea


* [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test
  2026-05-21 17:33 [PATCHSET sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
  2026-05-21 17:33 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
@ 2026-05-21 17:33 ` Andrea Righi
  2026-05-21 18:31   ` sashiko-bot
  1 sibling, 1 reply; 10+ messages in thread
From: Andrea Righi @ 2026-05-21 17:33 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Joel Fernandes, Richard Cheng, Cheng-Yang Chou,
	sched-ext, linux-kernel

Extend the total_bw selftest to validate the fair/ext dl_server
auto-attach/detach operations.

After the existing consistency checks, the test now doubles the
fair_server's runtime on every CPU via debugfs and verifies that:
 1. total_bw grew after the customization (proves fair_server was
    attached and apply_params() honored the dl_bw_attached flag),
 2. with the minimal BPF scheduler loaded, total_bw drops back to the
    baseline value (proves fair_server was detached and ext_server was
    attached at its own default runtime),
 3. after unload total_bw matches the doubled value from step 1 (proves
    fair_server was re-attached with the runtime customization preserved
    across the load/unload cycle).
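
Step 1 checks only that total_bw grew, not that it exactly doubled. A small sketch of why, assuming the bandwidth ratio is computed as a truncating fixed-point division with a 20-bit shift: `to_ratio_model()` is a simplified stand-in written for this example (the real to_ratio() also handles special cases), and the 50 ms / 1 s values in the note below are illustrative inputs, not values read from a system.

```c
#include <assert.h>
#include <stdint.h>

#define BW_SHIFT_MODEL 20	/* assumed fixed-point shift */

/* Simplified model: floor(runtime * 2^20 / period). */
static uint64_t to_ratio_model(uint64_t period, uint64_t runtime)
{
	assert(period != 0);
	return (runtime << BW_SHIFT_MODEL) / period;
}
```

With a period of 1000000000 ns, a runtime of 50000000 ns gives 52428 (52428.8 truncated), while 100000000 ns gives 104857 (104857.6 truncated). Twice the first result is 104856, one less than the second, so an exact 2x comparison would fail in this model even though the accounting is correct.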

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 tools/testing/selftests/sched_ext/total_bw.c | 171 ++++++++++++++++++-
 1 file changed, 170 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
index 5b0a619bab86e..8eea12e8868ac 100644
--- a/tools/testing/selftests/sched_ext/total_bw.c
+++ b/tools/testing/selftests/sched_ext/total_bw.c
@@ -100,6 +100,66 @@ static int read_total_bw_values(long *bw_values, int max_cpus)
 	return cpu_count;
 }
 
+/*
+ * Read a per-CPU dl_server param (runtime or period) from debugfs.
+ * Returns the value in nanoseconds, or -1 on failure.
+ */
+static long read_server_param(const char *server, const char *param, int cpu)
+{
+	char path[128];
+	long value = -1;
+	FILE *fp;
+
+	snprintf(path, sizeof(path),
+		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
+		 server, cpu, param);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -1;
+	if (fscanf(fp, "%ld", &value) != 1)
+		value = -1;
+	fclose(fp);
+
+	return value;
+}
+
+/*
+ * Write a per-CPU dl_server param to debugfs. Returns 0 on success.
+ */
+static int write_server_param(const char *server, const char *param,
+			      int cpu, long value)
+{
+	char path[128];
+	FILE *fp;
+	int ret = 0;
+
+	snprintf(path, sizeof(path),
+		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
+		 server, cpu, param);
+	fp = fopen(path, "w");
+	if (!fp)
+		return -1;
+	if (fprintf(fp, "%ld", value) < 0)
+		ret = -1;
+	fclose(fp);
+
+	return ret;
+}
+
+static int write_fair_runtime_all(int nr_cpus, long value)
+{
+	int i;
+
+	for (i = 0; i < nr_cpus; i++) {
+		if (write_server_param("fair", "runtime", i, value) < 0) {
+			SCX_ERR("Failed to write fair_server runtime on CPU %d", i);
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
 static bool verify_total_bw_consistency(long *bw_values, int count)
 {
 	int i;
@@ -217,6 +277,9 @@ static enum scx_test_status run(void *ctx)
 	struct bpf_link *link;
 	long loaded_bw[MAX_CPUS];
 	long unloaded_bw[MAX_CPUS];
+	long doubled_bw[MAX_CPUS];
+	long original_runtime, doubled_runtime;
+	enum scx_test_status ret;
 	int i;
 
 	/* Test scenario 2: BPF program loaded */
@@ -257,7 +320,113 @@ static enum scx_test_status run(void *ctx)
 	}
 
 	fprintf(stderr, "All total_bw values are consistent across all scenarios\n");
-	return SCX_TEST_PASS;
+
+	/*
+	 * Validate auto-register/unregister of dl_server bandwidth reservations.
+	 *
+	 * Doubling fair_server's runtime roughly doubles its bw contribution
+	 * (to_ratio() truncates). With a full-mode BPF scheduler (minimal_ops),
+	 * the kernel should detach fair_server and attach ext_server, dropping
+	 * total_bw back to its pre-customization (default ext_server-only)
+	 * value. On unload, the fair_server reservation should come back with
+	 * its customized runtime preserved, restoring the doubled total_bw.
+	 */
+	original_runtime = read_server_param("fair", "runtime", 0);
+	if (original_runtime <= 0) {
+		fprintf(stderr, "Skipping attach/detach validation: debugfs not accessible\n");
+		return SCX_TEST_PASS;
+	}
+	doubled_runtime = original_runtime * 2;
+
+	fprintf(stderr,
+		"Setting fair_server runtime to %ld ns on all CPUs (orig %ld)\n",
+		doubled_runtime, original_runtime);
+
+	if (write_fair_runtime_all(test_ctx->nr_cpus, doubled_runtime) < 0) {
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	if (fetch_verify_total_bw(doubled_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values after doubling fair runtime");
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	/*
+	 * After doubling the runtime, fair_server's bw contribution must grow.
+	 * We don't assert exactly 2x, because the kernel's to_ratio() truncates
+	 * the value, so 2 * to_ratio(period, runtime) and
+	 * to_ratio(period, 2 * runtime) can differ.
+	 */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (doubled_bw[i] <= test_ctx->baseline_bw[i]) {
+			SCX_ERR("CPU%d: doubling fair runtime did not increase total_bw (baseline=%ld, doubled=%ld)",
+				i, test_ctx->baseline_bw[i], doubled_bw[i]);
+			ret = SCX_TEST_FAIL;
+			goto restore;
+		}
+	}
+
+	link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops);
+	if (!link) {
+		SCX_ERR("Failed to attach scheduler for detach test");
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values with BPF loaded (detach test)");
+		bpf_link__destroy(link);
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	/*
+	 * In full mode the customized fair_server is detached and ext_server
+	 * is attached at its default runtime - total_bw must match baseline.
+	 */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (loaded_bw[i] != test_ctx->baseline_bw[i]) {
+			SCX_ERR("CPU%d: BPF loaded with 2x fair: expected %ld (fair detached, ext default), got %ld",
+				i, test_ctx->baseline_bw[i], loaded_bw[i]);
+			bpf_link__destroy(link);
+			ret = SCX_TEST_FAIL;
+			goto restore;
+		}
+	}
+
+	bpf_link__destroy(link);
+
+	if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values after BPF unload (detach test)");
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	/*
+	 * After unload, fair_server is re-attached with its preserved 2x
+	 * runtime, so total_bw should return to the doubled value.
+	 */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (unloaded_bw[i] != doubled_bw[i]) {
+			SCX_ERR("CPU%d: BPF unloaded: expected %ld (fair restored at 2x), got %ld",
+				i, doubled_bw[i], unloaded_bw[i]);
+			ret = SCX_TEST_FAIL;
+			goto restore;
+		}
+	}
+
+	fprintf(stderr,
+		"dl_server attach/detach with customized fair runtime verified\n");
+	ret = SCX_TEST_PASS;
+
+restore:
+	if (write_fair_runtime_all(test_ctx->nr_cpus, original_runtime) < 0)
+		SCX_ERR("Failed to restore fair_server runtime to %ld",
+			original_runtime);
+
+	return ret;
 }
 
 static void cleanup(void *ctx)
-- 
2.54.0


* Re: [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test
  2026-05-21 17:33 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
@ 2026-05-21 18:31   ` sashiko-bot
  0 siblings, 0 replies; 10+ messages in thread
From: sashiko-bot @ 2026-05-21 18:31 UTC (permalink / raw)
  To: Andrea Righi; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Medium] Unchecked return value of fclose() in write_server_param() masks kernel-level debugfs write failures.
- [Medium] The test incorrectly assumes all CPUs share the same fair_server runtime and globally restores CPU 0's value to all CPUs.
- [Medium] Early exit in write_fair_runtime_all() prevents full system state restoration if a single CPU write fails during cleanup.
--

commit 694643dec460b3df43c360074223ec82f5be4cc8
Author: Andrea Righi <arighi@nvidia.com>

selftests/sched_ext: Validate dl_server attach/detach in total_bw test

Extend the total_bw selftest to validate the fair/ext dl_server
auto-attach/detach operations.

> diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
> index 5b0a619bab86e..8eea12e8868ac 100644
> --- a/tools/testing/selftests/sched_ext/total_bw.c
> +++ b/tools/testing/selftests/sched_ext/total_bw.c
[ ... ]
> @@ -100,6 +100,66 @@ static int read_total_bw_values(long *bw_values, int max_cpus)
[ ... ]
> +static int write_server_param(const char *server, const char *param,
> +			      int cpu, long value)
> +{
> +	char path[128];
> +	FILE *fp;
> +	int ret = 0;
> +
> +	snprintf(path, sizeof(path),
> +		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
> +		 server, cpu, param);
> +	fp = fopen(path, "w");
> +	if (!fp)
> +		return -1;
> +	if (fprintf(fp, "%ld", value) < 0)
> +		ret = -1;
> +	fclose(fp);

[Severity: Medium]
Since FILE streams are fully buffered or line-buffered by default and
fprintf() writes a string without a newline, couldn't the actual write()
syscall to debugfs be triggered during fclose()?

If the kernel rejects the write (e.g., due to admission control), fclose()
would return EOF but fprintf() would have succeeded. By ignoring the fclose()
return value, could this incorrectly return 0 (success) even when the
parameter was not updated?
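
A minimal sketch of the pattern this question points at, assuming a plain stdio writer: `write_long_checked()` is a hypothetical helper written for this example, not the selftest's function. The point is that the result of fclose() is folded into the return value, because that is where buffered data is actually handed to write().

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a decimal value to path. Returns 0 on success, -1 if the open,
 * the formatted write, or the flush performed by fclose() fails.
 */
static int write_long_checked(const char *path, long value)
{
	FILE *fp;
	int ret = 0;

	assert(path != NULL);

	fp = fopen(path, "w");
	if (!fp)
		return -1;
	/* This usually only fills the stdio buffer and reports success. */
	if (fprintf(fp, "%ld", value) < 0)
		ret = -1;
	/* The buffer is flushed here, so a rejected write() shows up now. */
	if (fclose(fp) != 0)
		ret = -1;

	return ret;
}
```

On Linux, pointing such a helper at /dev/full is one way to exercise the fclose() failure path, since writes to that device fail with ENOSPC while the earlier fprintf() into the buffer still succeeds.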

> +
> +	return ret;
> +}
> +
> +static int write_fair_runtime_all(int nr_cpus, long value)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr_cpus; i++) {
> +		if (write_server_param("fair", "runtime", i, value) < 0) {
> +			SCX_ERR("Failed to write fair_server runtime on CPU %d", i);
> +			return -1;

[Severity: Medium]
If a single CPU write fails when this function is called from the cleanup
path, this early return will abort the loop. Could this prevent the remaining
CPUs from being restored to their original fair_server runtimes, leaving the
system in a permanently customized state?

> +		}
> +	}
> +
> +	return 0;
> +}
[ ... ]
> @@ -217,6 +277,9 @@ static enum scx_test_status run(void *ctx)
[ ... ]
> +	original_runtime = read_server_param("fair", "runtime", 0);
> +	if (original_runtime <= 0) {
> +		fprintf(stderr, "Skipping attach/detach validation: debugfs not accessible\n");
> +		return SCX_TEST_PASS;
> +	}
> +	doubled_runtime = original_runtime * 2;
> +
> +	fprintf(stderr,
> +		"Setting fair_server runtime to %ld ns on all CPUs (orig %ld)\n",
> +		doubled_runtime, original_runtime);
> +
> +	if (write_fair_runtime_all(test_ctx->nr_cpus, doubled_runtime) < 0) {
> +		ret = SCX_TEST_FAIL;
> +		goto restore;
> +	}

[Severity: Medium]
This reads the original fair_server runtime from CPU 0 only, and then
eventually restores this same value to all CPUs during cleanup.

On heterogeneous systems (like big.LITTLE) or manually tuned systems, CPUs may
have different initial runtimes. Does this destructively overwrite all unique
per-CPU runtimes with CPU 0's value after the test completes?
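
The two cleanup concerns raised in this review (per-CPU values and early exit during restore) can be illustrated together. This is a sketch under stated assumptions: `runtime_model[]` stands in for the debugfs files, `fail_cpu` injects a write failure, and all names here are hypothetical rather than taken from the selftest.

```c
#include <assert.h>
#include <stdio.h>

#define NR_CPUS_MODEL 4	/* hypothetical CPU count for the model */

static long runtime_model[NR_CPUS_MODEL];	/* stands in for debugfs */
static int fail_cpu = -1;			/* CPU whose writes fail, -1 = none */

/* Model of a per-CPU write that can be made to fail. */
static int write_runtime(int cpu, long value)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	if (cpu == fail_cpu)
		return -1;
	runtime_model[cpu] = value;
	return 0;
}

/* Save every CPU's value so each one can be put back individually. */
static void snapshot_runtimes(long *saved)
{
	for (int i = 0; i < NR_CPUS_MODEL; i++)
		saved[i] = runtime_model[i];
}

/* Best-effort restore: remember the failure but keep going. */
static int restore_runtimes(const long *saved)
{
	int ret = 0;

	for (int i = 0; i < NR_CPUS_MODEL; i++) {
		if (write_runtime(i, saved[i]) < 0) {
			fprintf(stderr, "restore failed on CPU %d\n", i);
			ret = -1;
		}
	}

	return ret;
}
```

If one CPU's write fails during restore, the function still reports the error, but the remaining CPUs get their own saved values back instead of being left customized or overwritten with CPU 0's value.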

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260521174509.1534623-1-arighi@nvidia.com?part=2

* [PATCHSET v2 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
@ 2026-05-26  8:27 Andrea Righi
  2026-05-26  8:27 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Righi @ 2026-05-26  8:27 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Joel Fernandes, Richard Cheng, Cheng-Yang Chou,
	sched-ext, linux-kernel

Currently, a fixed bandwidth is reserved at boot for both the fair and ext
deadline servers, and this reservation remains unchanged unless explicitly
modified via debugfs. As a result, both servers permanently contribute to global
bandwidth accounting, regardless of whether a BPF scheduler is active.

While unused bandwidth can still be reclaimed at runtime by other classes, this
static reservation prevents RT from fully utilizing available headroom in
situations where either the sched_ext or fair class is guaranteed to be inactive
(for example, when no BPF scheduler is loaded, or when sched_ext runs in full
mode and replaces fair).

As discussed at the VIII OSPM summit in Cambridge [1], a better solution would
be to dynamically register and unregister deadline server bandwidth based on the
active sched_ext state. This allows the kernel to automatically enable bandwidth
accounting only for the scheduling class that is currently active, while
disabling it for inactive ones.

This patch series implements this automatic register/unregister logic. The
sched_ext total_bw kselftest is also modified to validate the correct behavior
across the different scheduling configurations and ensure that bandwidth
accounting follows the expected state transitions.

[1] https://retis.santannapisa.it/ospm-summit/

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git dl-server-bw-v2

Changes in v2:
 - Rework the sched_ext enable path as suggested by Peter: attach ext_server
   before committing the scheduler switch and fail the enable if admission
   control rejects the reservation; detach fair_server only after a successful
   full-mode switch.
 - Added dl_server_swap_bw() for the disable/recovery path so ext_server detach
   and fair_server reattach happen under the same dl_b->lock, closing the
   window where concurrent SCHED_DEADLINE admission could steal the freed
   bandwidth (reported by Sashiko).
 - Fixed the attach/detach accounting issue reported by Sashiko by updating
   rq->dl.this_bw together with root-domain total_bw, draining active or
   non-contending servers before detach and preventing detached servers from
   starting.
 - Reuse dl_rq_change_utilization() to drain the server, so the detach path goes
   through the same machinery as dl_server_apply_params().
 - Made root-domain accounting honor the same cpu_active() conditions used by
   root-domain rebuilds, while preserving runtime/period updates made while a
   server is detached.
 - Fixed the total_bw selftest issues reported by Sashiko: check fclose()
   errors for debugfs writes, preserve per-CPU fair_server runtime values, and
   restore all CPUs on cleanup even if one write fails.
 - Link to v1: https://lore.kernel.org/all/20260521174509.1534623-1-arighi@nvidia.com/

Andrea Righi (2):
      sched_ext: Auto-register/unregister dl_server reservations
      selftests/sched_ext: Validate dl_server attach/detach in total_bw test

 include/linux/sched.h                        |   6 +
 kernel/sched/deadline.c                      | 207 +++++++++++++++++++++++++--
 kernel/sched/ext.c                           |  71 +++++++++
 kernel/sched/sched.h                         |   4 +
 tools/testing/selftests/sched_ext/total_bw.c | 201 +++++++++++++++++++++++++-
 5 files changed, 480 insertions(+), 9 deletions(-)

* [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test
  2026-05-26  8:27 [PATCHSET v2 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
@ 2026-05-26  8:27 ` Andrea Righi
  0 siblings, 0 replies; 10+ messages in thread
From: Andrea Righi @ 2026-05-26  8:27 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Joel Fernandes, Richard Cheng, Cheng-Yang Chou,
	sched-ext, linux-kernel

Extend the total_bw selftest to validate the fair/ext dl_server
auto-attach/detach operations.

After the existing consistency checks, the test now doubles the
fair_server's runtime on every CPU via debugfs and verifies that:
 1. total_bw grew after the customization (proves fair_server was
    attached and apply_params() honored the dl_bw_attached flag),
 2. with the minimal BPF scheduler loaded, total_bw drops back to the
    baseline value (proves fair_server was detached and ext_server was
    attached at its own default runtime),
 3. after unload total_bw matches the doubled value from step 1 (proves
    fair_server was re-attached with the runtime customization preserved
    across the load/unload cycle).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 tools/testing/selftests/sched_ext/total_bw.c | 201 ++++++++++++++++++-
 1 file changed, 200 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
index 5b0a619bab86e..2af01cee90cc0 100644
--- a/tools/testing/selftests/sched_ext/total_bw.c
+++ b/tools/testing/selftests/sched_ext/total_bw.c
@@ -100,6 +100,98 @@ static int read_total_bw_values(long *bw_values, int max_cpus)
 	return cpu_count;
 }
 
+/*
+ * Read a per-CPU dl_server param (runtime or period) from debugfs.
+ * Returns the value in nanoseconds, or -1 on failure.
+ */
+static long read_server_param(const char *server, const char *param, int cpu)
+{
+	char path[128];
+	long value = -1;
+	FILE *fp;
+
+	snprintf(path, sizeof(path),
+		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
+		 server, cpu, param);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -1;
+	if (fscanf(fp, "%ld", &value) != 1)
+		value = -1;
+	fclose(fp);
+
+	return value;
+}
+
+/*
+ * Write a per-CPU dl_server param to debugfs. Returns 0 on success.
+ */
+static int write_server_param(const char *server, const char *param,
+			      int cpu, long value)
+{
+	char path[128];
+	FILE *fp;
+	int ret = 0;
+
+	snprintf(path, sizeof(path),
+		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
+		 server, cpu, param);
+	fp = fopen(path, "w");
+	if (!fp)
+		return -1;
+	if (fprintf(fp, "%ld", value) < 0)
+		ret = -1;
+	if (fclose(fp) != 0)
+		ret = -1;
+
+	return ret;
+}
+
+static int read_fair_runtime_all(int nr_cpus, long *runtimes)
+{
+	int i;
+
+	for (i = 0; i < nr_cpus; i++) {
+		runtimes[i] = read_server_param("fair", "runtime", i);
+		if (runtimes[i] <= 0)
+			return -1;
+	}
+
+	return 0;
+}
+
+static int write_fair_runtime_all(int nr_cpus, long value)
+{
+	int i;
+
+	for (i = 0; i < nr_cpus; i++) {
+		if (write_server_param("fair", "runtime", i, value) < 0) {
+			SCX_ERR("Failed to write fair_server runtime on CPU %d", i);
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Restore per-CPU fair_server runtimes.
+ */
+static int restore_fair_runtime_all(int nr_cpus, const long *runtimes)
+{
+	int ret = 0;
+	int i;
+
+	for (i = 0; i < nr_cpus; i++) {
+		if (write_server_param("fair", "runtime", i, runtimes[i]) < 0) {
+			SCX_ERR("Failed to restore fair_server runtime on CPU %d", i);
+			ret = -1;
+		}
+	}
+
+	return ret;
+}
+
 static bool verify_total_bw_consistency(long *bw_values, int count)
 {
 	int i;
@@ -217,6 +309,9 @@ static enum scx_test_status run(void *ctx)
 	struct bpf_link *link;
 	long loaded_bw[MAX_CPUS];
 	long unloaded_bw[MAX_CPUS];
+	long doubled_bw[MAX_CPUS];
+	long original_runtime[MAX_CPUS], doubled_runtime;
+	enum scx_test_status ret;
 	int i;
 
 	/* Test scenario 2: BPF program loaded */
@@ -257,7 +352,111 @@ static enum scx_test_status run(void *ctx)
 	}
 
 	fprintf(stderr, "All total_bw values are consistent across all scenarios\n");
-	return SCX_TEST_PASS;
+
+	/*
+	 * Validate auto-register/unregister of dl_server bandwidth reservations.
+	 *
+	 * Doubling fair_server's runtime roughly doubles its bw contribution
+	 * (to_ratio() truncates). With a full-mode BPF scheduler (minimal_ops),
+	 * the kernel should detach fair_server and attach ext_server, dropping
+	 * total_bw back to its pre-customization (default ext_server-only)
+	 * value. On unload, the fair_server reservation should come back with
+	 * its customized runtime preserved, restoring the doubled total_bw.
+	 */
+	if (read_fair_runtime_all(test_ctx->nr_cpus, original_runtime) < 0) {
+		fprintf(stderr, "Skipping attach/detach validation: debugfs not accessible\n");
+		return SCX_TEST_PASS;
+	}
+	doubled_runtime = original_runtime[0] * 2;
+
+	fprintf(stderr,
+		"Setting fair_server runtime to %ld ns on all CPUs (orig %ld)\n",
+		doubled_runtime, original_runtime[0]);
+
+	if (write_fair_runtime_all(test_ctx->nr_cpus, doubled_runtime) < 0) {
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	if (fetch_verify_total_bw(doubled_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values after doubling fair runtime");
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	/*
+	 * After doubling the runtime, fair_server's bw contribution must grow.
+	 * We don't assert exactly 2x, because the kernel's to_ratio() truncates
+	 * the value, so 2 * to_ratio(period, runtime) and
+	 * to_ratio(period, 2 * runtime) can differ.
+	 */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (doubled_bw[i] <= test_ctx->baseline_bw[i]) {
+			SCX_ERR("CPU%d: fair did not increase total_bw (baseline=%ld, doubled=%ld)",
+				i, test_ctx->baseline_bw[i], doubled_bw[i]);
+			ret = SCX_TEST_FAIL;
+			goto restore;
+		}
+	}
+
+	link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops);
+	if (!link) {
+		SCX_ERR("Failed to attach scheduler for detach test");
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values with BPF loaded (detach test)");
+		bpf_link__destroy(link);
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	/*
+	 * In full mode the customized fair_server is detached and ext_server is
+	 * attached at its default runtime, so total_bw must match baseline.
+	 */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (loaded_bw[i] != test_ctx->baseline_bw[i]) {
+			SCX_ERR("CPU%d: expected bw %ld (fair detached, ext default), got %ld",
+				i, test_ctx->baseline_bw[i], loaded_bw[i]);
+			bpf_link__destroy(link);
+			ret = SCX_TEST_FAIL;
+			goto restore;
+		}
+	}
+
+	bpf_link__destroy(link);
+
+	if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values after BPF unload (detach test)");
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	/*
+	 * After unload, fair_server is re-attached with its preserved 2x
+	 * runtime, so total_bw should return to the doubled value.
+	 */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (unloaded_bw[i] != doubled_bw[i]) {
+			SCX_ERR("CPU%d: BPF unloaded: expected %ld (fair restored at 2x), got %ld",
+				i, doubled_bw[i], unloaded_bw[i]);
+			ret = SCX_TEST_FAIL;
+			goto restore;
+		}
+	}
+
+	fprintf(stderr,
+		"dl_server attach/detach with customized fair runtime verified\n");
+	ret = SCX_TEST_PASS;
+
+restore:
+	if (restore_fair_runtime_all(test_ctx->nr_cpus, original_runtime) < 0)
+		SCX_ERR("Failed to fully restore per-CPU fair_server runtimes");
+
+	return ret;
 }
 
 static void cleanup(void *ctx)
-- 
2.54.0


* [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth
@ 2026-05-26 16:42 Andrea Righi
  2026-05-26 16:42 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Righi @ 2026-05-26 16:42 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Joel Fernandes, Richard Cheng, Cheng-Yang Chou,
	sched-ext, linux-kernel

Currently, a fixed bandwidth is reserved at boot for both the fair and ext
deadline servers, and this reservation remains unchanged unless explicitly
modified via debugfs. As a result, both servers permanently contribute to global
bandwidth accounting, regardless of whether a BPF scheduler is active.

While unused bandwidth can still be reclaimed at runtime by other classes, this
static reservation prevents RT from fully utilizing available headroom in
situations where either the sched_ext or fair class is guaranteed to be inactive
(for example, when no BPF scheduler is loaded, or when sched_ext runs in full
mode and replaces fair).

As discussed at the VIII OSPM summit in Cambridge [1], a better solution would
be to dynamically register and unregister deadline server bandwidth based on the
active sched_ext state. This allows the kernel to automatically enable bandwidth
accounting only for the scheduling class that is currently active, while
disabling it for inactive ones.

This patch series implements this automatic register/unregister logic. Moreover,
the sched_ext total_bw kselftest is also modified to validate the correct
behavior across the different scheduling configurations and ensure that
bandwidth accounting follows the expected state transitions.

[1] https://retis.santannapisa.it/ospm-summit/

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git dl-server-bw-v3

Changes in v3:
 - Don't bypass __dl_overflow() for detached servers in dl_server_apply_params()
   to reject oversized configs up front (reported by Sashiko).
 - A potential divide-by-zero in dl_server_apply_params() reported by Sashiko
   has been fixed in a separate patch (not introduced by this patch set):
   https://lore.kernel.org/all/20260526100502.575774-1-arighi@nvidia.com/
 - Link to v2: https://lore.kernel.org/all/20260526082954.550958-1-arighi@nvidia.com/

Changes in v2:
 - Rework the sched_ext enable path as suggested by Peter: attach ext_server
   before committing the scheduler switch and fail the enable if admission
   control rejects the reservation; detach fair_server only after a successful
   full-mode switch.
 - Added dl_server_swap_bw() for the disable/recovery path so ext_server detach
   and fair_server reattach happen under the same dl_b->lock, closing the
   window where concurrent SCHED_DEADLINE admission could steal the freed
   bandwidth (reported by Sashiko).
 - Fixed the attach/detach accounting issue reported by Sashiko by updating
   rq->dl.this_bw together with root-domain total_bw, draining active or
   non-contending servers before detach and preventing detached servers from
   starting.
 - Reuse dl_rq_change_utilization() to drain the server, so the detach path goes
   through the same machinery as dl_server_apply_params().
 - Made root-domain accounting honor the same cpu_active() conditions used by
   root-domain rebuilds, while preserving runtime/period updates made while a
   server is detached.
 - Fixed the total_bw selftest issues reported by Sashiko: check fclose()
   errors for debugfs writes, preserve per-CPU fair_server runtime values, and
   restore all CPUs on cleanup even if one write fails.
 - Link to v1: https://lore.kernel.org/all/20260521174509.1534623-1-arighi@nvidia.com/

Andrea Righi (2):
      sched_ext: Auto-register/unregister dl_server reservations
      selftests/sched_ext: Validate dl_server attach/detach in total_bw test

 include/linux/sched.h                        |   6 +
 kernel/sched/deadline.c                      | 204 ++++++++++++++++++++++++++-
 kernel/sched/ext.c                           |  71 ++++++++++
 kernel/sched/sched.h                         |   4 +
 tools/testing/selftests/sched_ext/total_bw.c | 201 +++++++++++++++++++++++++-
 5 files changed, 478 insertions(+), 8 deletions(-)

* [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test
  2026-05-26 16:42 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
@ 2026-05-26 16:42 ` Andrea Righi
  2026-05-26 17:33   ` sashiko-bot
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Righi @ 2026-05-26 16:42 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Joel Fernandes, Richard Cheng, Cheng-Yang Chou,
	sched-ext, linux-kernel

Extend the total_bw selftest to validate the fair/ext dl_server
auto-attach/detach operations.

After the existing consistency checks, the test now doubles the
fair_server's runtime on every CPU via debugfs and verifies that:
 1. total_bw grew after the customization (proves fair_server was
    attached and apply_params() honored the dl_bw_attached flag),
 2. with the minimal BPF scheduler loaded, total_bw drops back to the
    baseline value (proves fair_server was detached and ext_server was
    attached at its own default runtime),
 3. after unload total_bw matches the doubled value from step 1 (proves
    fair_server was re-attached with the runtime customization preserved
    across the load/unload cycle).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 tools/testing/selftests/sched_ext/total_bw.c | 201 ++++++++++++++++++-
 1 file changed, 200 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
index 5b0a619bab86e..2af01cee90cc0 100644
--- a/tools/testing/selftests/sched_ext/total_bw.c
+++ b/tools/testing/selftests/sched_ext/total_bw.c
@@ -100,6 +100,98 @@ static int read_total_bw_values(long *bw_values, int max_cpus)
 	return cpu_count;
 }
 
+/*
+ * Read a per-CPU dl_server param (runtime or period) from debugfs.
+ * Returns the value in nanoseconds, or -1 on failure.
+ */
+static long read_server_param(const char *server, const char *param, int cpu)
+{
+	char path[128];
+	long value = -1;
+	FILE *fp;
+
+	snprintf(path, sizeof(path),
+		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
+		 server, cpu, param);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -1;
+	if (fscanf(fp, "%ld", &value) != 1)
+		value = -1;
+	fclose(fp);
+
+	return value;
+}
+
+/*
+ * Write a per-CPU dl_server param to debugfs. Returns 0 on success.
+ */
+static int write_server_param(const char *server, const char *param,
+			      int cpu, long value)
+{
+	char path[128];
+	FILE *fp;
+	int ret = 0;
+
+	snprintf(path, sizeof(path),
+		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
+		 server, cpu, param);
+	fp = fopen(path, "w");
+	if (!fp)
+		return -1;
+	if (fprintf(fp, "%ld", value) < 0)
+		ret = -1;
+	if (fclose(fp) != 0)
+		ret = -1;
+
+	return ret;
+}
+
+static int read_fair_runtime_all(int nr_cpus, long *runtimes)
+{
+	int i;
+
+	for (i = 0; i < nr_cpus; i++) {
+		runtimes[i] = read_server_param("fair", "runtime", i);
+		if (runtimes[i] <= 0)
+			return -1;
+	}
+
+	return 0;
+}
+
+static int write_fair_runtime_all(int nr_cpus, long value)
+{
+	int i;
+
+	for (i = 0; i < nr_cpus; i++) {
+		if (write_server_param("fair", "runtime", i, value) < 0) {
+			SCX_ERR("Failed to write fair_server runtime on CPU %d", i);
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Restore per-CPU fair_server runtimes.
+ */
+static int restore_fair_runtime_all(int nr_cpus, const long *runtimes)
+{
+	int ret = 0;
+	int i;
+
+	for (i = 0; i < nr_cpus; i++) {
+		if (write_server_param("fair", "runtime", i, runtimes[i]) < 0) {
+			SCX_ERR("Failed to restore fair_server runtime on CPU %d", i);
+			ret = -1;
+		}
+	}
+
+	return ret;
+}
+
 static bool verify_total_bw_consistency(long *bw_values, int count)
 {
 	int i;
@@ -217,6 +309,9 @@ static enum scx_test_status run(void *ctx)
 	struct bpf_link *link;
 	long loaded_bw[MAX_CPUS];
 	long unloaded_bw[MAX_CPUS];
+	long doubled_bw[MAX_CPUS];
+	long original_runtime[MAX_CPUS], doubled_runtime;
+	enum scx_test_status ret;
 	int i;
 
 	/* Test scenario 2: BPF program loaded */
@@ -257,7 +352,111 @@ static enum scx_test_status run(void *ctx)
 	}
 
 	fprintf(stderr, "All total_bw values are consistent across all scenarios\n");
-	return SCX_TEST_PASS;
+
+	/*
+	 * Validate auto-register/unregister of dl_server bandwidth reservations.
+	 *
+	 * Doubling fair_server's runtime doubles its bw contribution. With a
+	 * full-mode BPF scheduler (minimal_ops), the kernel should detach
+	 * fair_server and attach ext_server, dropping total_bw back to its
+	 * pre-customization (default ext_server-only) value. On unload, the
+	 * fair_server reservation should come back with its customized runtime
+	 * preserved, so total_bw doubles again.
+	 */
+	if (read_fair_runtime_all(test_ctx->nr_cpus, original_runtime) < 0) {
+		fprintf(stderr, "Skipping attach/detach validation: debugfs not accessible\n");
+		return SCX_TEST_PASS;
+	}
+	doubled_runtime = original_runtime[0] * 2;
+
+	fprintf(stderr,
+		"Setting fair_server runtime to %ld ns on all CPUs (orig %ld)\n",
+		doubled_runtime, original_runtime[0]);
+
+	if (write_fair_runtime_all(test_ctx->nr_cpus, doubled_runtime) < 0) {
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	if (fetch_verify_total_bw(doubled_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values after doubling fair runtime");
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	/*
+	 * After doubling the runtime, fair_server's bw contribution must grow.
+	 * We don't assert exactly 2x, because the kernel's to_ratio() truncates
+	 * the value, so 2 * to_ratio(period, runtime) and
+	 * to_ratio(period, 2 * runtime) can differ.
+	 */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (doubled_bw[i] <= test_ctx->baseline_bw[i]) {
+			SCX_ERR("CPU%d: fair did not increase total_bw (baseline=%ld, doubled=%ld)",
+				i, test_ctx->baseline_bw[i], doubled_bw[i]);
+			ret = SCX_TEST_FAIL;
+			goto restore;
+		}
+	}
+
+	link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops);
+	if (!link) {
+		SCX_ERR("Failed to attach scheduler for detach test");
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values with BPF loaded (detach test)");
+		bpf_link__destroy(link);
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	/*
+	 * In full mode the customized fair_server is detached and ext_server is
+	 * attached at its default runtime, so total_bw must match baseline.
+	 */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (loaded_bw[i] != test_ctx->baseline_bw[i]) {
+			SCX_ERR("CPU%d: expected bw %ld (fair detached, ext default), got %ld",
+				i, test_ctx->baseline_bw[i], loaded_bw[i]);
+			bpf_link__destroy(link);
+			ret = SCX_TEST_FAIL;
+			goto restore;
+		}
+	}
+
+	bpf_link__destroy(link);
+
+	if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values after BPF unload (detach test)");
+		ret = SCX_TEST_FAIL;
+		goto restore;
+	}
+
+	/*
+	 * After unload, fair_server is re-attached with its preserved 2x
+	 * runtime, so total_bw should return to the doubled value.
+	 */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (unloaded_bw[i] != doubled_bw[i]) {
+			SCX_ERR("CPU%d: BPF unloaded: expected %ld (fair restored at 2x), got %ld",
+				i, doubled_bw[i], unloaded_bw[i]);
+			ret = SCX_TEST_FAIL;
+			goto restore;
+		}
+	}
+
+	fprintf(stderr,
+		"dl_server attach/detach with customized fair runtime verified\n");
+	ret = SCX_TEST_PASS;
+
+restore:
+	if (restore_fair_runtime_all(test_ctx->nr_cpus, original_runtime) < 0)
+		SCX_ERR("Failed to fully restore per-CPU fair_server runtimes");
+
+	return ret;
 }
 
 static void cleanup(void *ctx)
-- 
2.54.0



* Re: [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test
  2026-05-26 16:42 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
@ 2026-05-26 17:33   ` sashiko-bot
  0 siblings, 0 replies; 10+ messages in thread
From: sashiko-bot @ 2026-05-26 17:33 UTC (permalink / raw)
  To: Andrea Righi; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:
- [Medium] The dl_server runtime and period parameters are parsed and written as signed `long` instead of `u64` or `unsigned long long`, leading to potential integer overflow on 32-bit platforms.
--

commit 6475744500400f1c70cab62bc90b17ee9b9cbbea
Author: Andrea Righi <arighi@nvidia.com>

selftests/sched_ext: Validate dl_server attach/detach in total_bw test

Extend the total_bw selftest to validate the fair/ext dl_server
auto-attach/detach operations by doubling the fair_server's runtime
via debugfs and verifying total_bw changes.
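
The test asserts only that total_bw grows after the doubling, not that it
exactly doubles, because the kernel's to_ratio() truncates. A minimal
userspace sketch of that arithmetic follows. It assumes a 20-bit fixed-point
scale (BW_SHIFT) and omits the kernel's RUNTIME_INF special case;
to_ratio_sketch() is a hypothetical name, not a kernel or selftest helper.

```c
#include <stdint.h>

/* Assumption: same fixed-point scale as the kernel's BW_SHIFT (20 bits). */
#define BW_SHIFT 20

/*
 * Userspace approximation of to_ratio(): runtime / period expressed as a
 * fixed-point fraction. The integer division truncates the result.
 * The RUNTIME_INF special case handled by the kernel is left out.
 */
static uint64_t to_ratio_sketch(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;

	return (runtime << BW_SHIFT) / period;
}
```

Assuming a 1 s period and a 50 ms runtime, to_ratio_sketch(1000000000,
50000000) yields 52428, while doubling the runtime to 100 ms yields 104857.
Twice the original ratio is 104856, so the two differ by one unit, which is
why a strict "== 2x" comparison would be fragile.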

> diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
> index 5b0a619bab86e..2af01cee90cc0 100644
> --- a/tools/testing/selftests/sched_ext/total_bw.c
> +++ b/tools/testing/selftests/sched_ext/total_bw.c
[ ... ]
> @@ -100,6 +100,98 @@ static int read_total_bw_values(long *bw_values, int max_cpus)
>  	return cpu_count;
>  }
>  
> +/*
> + * Read a per-CPU dl_server param (runtime or period) from debugfs.
> + * Returns the value in nanoseconds, or -1 on failure.
> + */
> +static long read_server_param(const char *server, const char *param, int cpu)
> +{
> +	char path[128];
> +	long value = -1;

[Severity: Medium]
Will parsing the dl_server runtime and period parameters as signed long
lead to integer overflows on 32-bit platforms?

In the kernel, dl_server parameters like runtime and period are represented
as u64 values (in nanoseconds). On 32-bit platforms, a signed long is 32
bits, which can only represent up to ~2.14 seconds in nanoseconds.

If a test system configures a fair_server runtime larger than ~1.07 seconds,
doubling it could cause a signed integer overflow. Would it be better to use
unsigned long long or u64 with %llu to avoid this?
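
One possible shape for a 64-bit read helper is sketched below. It is only an
illustration under assumptions: read_server_param_u64() is a hypothetical
name, it takes a ready-made path instead of building the debugfs path, and it
reports errors through the return value so the full unsigned range stays
usable for the parameter itself.

```c
#include <stdio.h>

/*
 * Hypothetical 64-bit variant of read_server_param(). Stores the parsed
 * value in *value and returns 0 on success, or -1 if the file cannot be
 * opened or does not start with a number.
 */
static int read_server_param_u64(const char *path, unsigned long long *value)
{
	FILE *fp;
	int ret = 0;

	fp = fopen(path, "r");
	if (!fp)
		return -1;
	/* %llu matches unsigned long long, at least 64 bits on all ABIs. */
	if (fscanf(fp, "%llu", value) != 1)
		ret = -1;
	fclose(fp);

	return ret;
}
```

Separating the status from the value also removes the need to treat -1 as a
sentinel inside the returned number.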

> +	FILE *fp;
> +
> +	snprintf(path, sizeof(path),
> +		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
> +		 server, cpu, param);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -1;
> +	if (fscanf(fp, "%ld", &value) != 1)
> +		value = -1;
> +	fclose(fp);
> +
> +	return value;
> +}
> +
> +/*
> + * Write a per-CPU dl_server param to debugfs. Returns 0 on success.
> + */
> +static int write_server_param(const char *server, const char *param,
> +			      int cpu, long value)

[Severity: Medium]
Similarly here, does taking long value as a parameter restrict the maximum
runtime that can be written on 32-bit systems?
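
A matching write-side sketch, again under assumptions: both function names
are hypothetical, the path is passed in directly, and the doubling step is
split into its own helper so that a wrap-around is rejected instead of being
written to debugfs.

```c
#include <limits.h>
#include <stdio.h>

/*
 * Hypothetical 64-bit variant of write_server_param(). Returns 0 on
 * success, -1 if the file cannot be opened or the write fails.
 */
static int write_server_param_u64(const char *path, unsigned long long value)
{
	FILE *fp;
	int ret = 0;

	fp = fopen(path, "w");
	if (!fp)
		return -1;
	if (fprintf(fp, "%llu", value) < 0)
		ret = -1;
	/* fclose() flushes the buffer, so a failed write can surface here. */
	if (fclose(fp) != 0)
		ret = -1;

	return ret;
}

/*
 * Double a runtime without wrapping. Returns 0 and stores the result in
 * *out, or -1 if the doubled value would not fit.
 */
static int double_runtime(unsigned long long runtime, unsigned long long *out)
{
	if (runtime > ULLONG_MAX / 2)
		return -1;
	*out = runtime * 2;

	return 0;
}
```

With these types, a 1.5 s runtime (1500000000 ns) doubles to 3000000000 ns,
a value that does not fit in a 32-bit signed long.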

> +{
> +	char path[128];
> +	FILE *fp;
> +	int ret = 0;
> +
> +	snprintf(path, sizeof(path),
> +		 "/sys/kernel/debug/sched/%s_server/cpu%d/%s",
> +		 server, cpu, param);
> +	fp = fopen(path, "w");
> +	if (!fp)
> +		return -1;
> +	if (fprintf(fp, "%ld", value) < 0)
> +		ret = -1;
> +	if (fclose(fp) != 0)
> +		ret = -1;
> +
> +	return ret;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260526164420.638711-1-arighi@nvidia.com?part=2



Thread overview: 10+ messages
2026-05-21 17:33 [PATCHSET sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
2026-05-21 17:33 ` [PATCH 1/2] sched_ext: Auto-register/unregister dl_server reservations Andrea Righi
2026-05-21 18:23   ` sashiko-bot
2026-05-22  8:36   ` Peter Zijlstra
2026-05-22 10:02     ` Andrea Righi
2026-05-21 17:33 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
2026-05-21 18:31   ` sashiko-bot
  -- strict thread matches above, loose matches on Subject: below --
2026-05-26  8:27 [PATCHSET v2 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
2026-05-26  8:27 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
2026-05-26 16:42 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Auto-manage ext/fair dl_server bandwidth Andrea Righi
2026-05-26 16:42 ` [PATCH 2/2] selftests/sched_ext: Validate dl_server attach/detach in total_bw test Andrea Righi
2026-05-26 17:33   ` sashiko-bot
