public inbox for linux-kernel@vger.kernel.org
* [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2025-12-17  9:35 [PATCHSET v11 sched_ext/for-6.20] Add a deadline " Andrea Righi
@ 2025-12-17  9:35 ` Andrea Righi
  2025-12-17 15:49   ` Juri Lelli
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Righi @ 2025-12-17  9:35 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo, David Vernet, Changwoo Min,
	Shuah Khan, Joel Fernandes, Christian Loehle, Emil Tsalapatis,
	sched-ext, bpf, linux-kselftest, linux-kernel

sched_ext currently suffers starvation due to RT. The same workload when
converted to EXT can get zero runtime if RT is 100% running, causing EXT
processes to stall. Fix it by adding a DL server for EXT.

A kselftest is also included later to confirm that both DL servers are
functioning correctly:

 # ./runner -t rt_stall
 ===== START =====
 TEST: rt_stall
 DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
 OUTPUT:
 TAP version 13
 1..1
 # Runtime of FAIR task (PID 1511) is 0.250000 seconds
 # Runtime of RT task (PID 1512) is 4.750000 seconds
 # FAIR task got 5.00% of total runtime
 ok 1 PASS: FAIR task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of EXT task (PID 1514) is 0.250000 seconds
 # Runtime of RT task (PID 1515) is 4.750000 seconds
 # EXT task got 5.00% of total runtime
 ok 2 PASS: EXT task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of FAIR task (PID 1517) is 0.250000 seconds
 # Runtime of RT task (PID 1518) is 4.750000 seconds
 # FAIR task got 5.00% of total runtime
 ok 3 PASS: FAIR task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of EXT task (PID 1521) is 0.250000 seconds
 # Runtime of RT task (PID 1522) is 4.750000 seconds
 # EXT task got 5.00% of total runtime
 ok 4 PASS: EXT task got more than 4.00% of runtime
 ok 1 rt_stall #
 =====  END  =====

v4: - initialize EXT server bandwidth reservation at init time and
      always keep it active (Andrea Righi)
    - check for rq->nr_running == 1 to determine when to account idle
      time (Juri Lelli)
v3: - clarify that fair is not the only dl_server (Juri Lelli)
    - remove explicit stop to reduce timer reprogramming overhead
      (Juri Lelli)
    - do not restart pick_task() when it's invoked by the dl_server
      (Tejun Heo)
    - depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi)
v2: - drop ->balance() now that pick_task() has an rf argument
      (Andrea Righi)

Tested-by: Christian Loehle <christian.loehle@arm.com>
Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/core.c     |  6 +++
 kernel/sched/deadline.c | 84 ++++++++++++++++++++++++++++++-----------
 kernel/sched/ext.c      | 42 +++++++++++++++++++++
 kernel/sched/idle.c     |  3 ++
 kernel/sched/sched.h    |  2 +
 kernel/sched/topology.c |  5 +++
 6 files changed, 119 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41ba0be169117..a2400ee33a356 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8475,6 +8475,9 @@ int sched_cpu_dying(unsigned int cpu)
 		dump_rq_tasks(rq, KERN_WARNING);
 	}
 	dl_server_stop(&rq->fair_server);
+#ifdef CONFIG_SCHED_CLASS_EXT
+	dl_server_stop(&rq->ext_server);
+#endif
 	rq_unlock_irqrestore(rq, &rf);
 
 	calc_load_migrate(rq);
@@ -8678,6 +8681,9 @@ void __init sched_init(void)
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
 		fair_server_init(rq);
+#ifdef CONFIG_SCHED_CLASS_EXT
+		ext_server_init(rq);
+#endif
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = rq;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2789db5217cd4..88f2b5ed5678a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1445,8 +1445,8 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 		dl_se->dl_defer_idle = 0;
 
 	/*
-	 * The fair server can consume its runtime while throttled (not queued/
-	 * running as regular CFS).
+	 * The DL server can consume its runtime while throttled (not
+	 * queued / running as regular CFS).
 	 *
 	 * If the server consumes its entire runtime in this state. The server
 	 * is not required for the current period. Thus, reset the server by
@@ -1531,10 +1531,10 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 	}
 
 	/*
-	 * The fair server (sole dl_server) does not account for real-time
-	 * workload because it is running fair work.
+	 * The dl_server does not account for real-time workload because it
+	 * is running fair work.
 	 */
-	if (dl_se == &rq->fair_server)
+	if (dl_se->dl_server)
 		return;
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -1569,9 +1569,9 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
  * In the non-defer mode, the idle time is not accounted, as the
  * server provides a guarantee.
  *
- * If the dl_server is in defer mode, the idle time is also considered
- * as time available for the fair server, avoiding a penalty for the
- * rt scheduler that did not consumed that time.
+ * If the dl_server is in defer mode, the idle time is also considered as
+ * time available for the dl_server, avoiding a penalty for the rt
+ * scheduler that did not consume that time.
  */
 void dl_server_update_idle(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
@@ -1810,6 +1810,7 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 	hrtimer_try_to_cancel(&dl_se->dl_timer);
 	dl_se->dl_defer_armed = 0;
 	dl_se->dl_throttled = 0;
+	dl_se->dl_defer_running = 0;
 	dl_se->dl_defer_idle = 0;
 	dl_se->dl_server_active = 0;
 }
@@ -1844,6 +1845,18 @@ void sched_init_dl_servers(void)
 		dl_se->dl_server = 1;
 		dl_se->dl_defer = 1;
 		setup_new_dl_entity(dl_se);
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+		dl_se = &rq->ext_server;
+
+		WARN_ON(dl_server(dl_se));
+
+		dl_server_apply_params(dl_se, runtime, period, 1);
+
+		dl_se->dl_server = 1;
+		dl_se->dl_defer = 1;
+		setup_new_dl_entity(dl_se);
+#endif
 	}
 }
 
@@ -3183,6 +3196,36 @@ void dl_add_task_root_domain(struct task_struct *p)
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 }
 
+static void dl_server_add_bw(struct root_domain *rd, int cpu)
+{
+	struct sched_dl_entity *dl_se;
+
+	dl_se = &cpu_rq(cpu)->fair_server;
+	if (dl_server(dl_se) && cpu_active(cpu))
+		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+	dl_se = &cpu_rq(cpu)->ext_server;
+	if (dl_server(dl_se) && cpu_active(cpu))
+		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
+#endif
+}
+
+static u64 dl_server_read_bw(int cpu)
+{
+	u64 dl_bw = 0;
+
+	if (cpu_rq(cpu)->fair_server.dl_server)
+		dl_bw += cpu_rq(cpu)->fair_server.dl_bw;
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (cpu_rq(cpu)->ext_server.dl_server)
+		dl_bw += cpu_rq(cpu)->ext_server.dl_bw;
+#endif
+
+	return dl_bw;
+}
+
 void dl_clear_root_domain(struct root_domain *rd)
 {
 	int i;
@@ -3201,12 +3244,8 @@ void dl_clear_root_domain(struct root_domain *rd)
 	 * dl_servers are not tasks. Since dl_add_task_root_domain ignores
 	 * them, we need to account for them here explicitly.
 	 */
-	for_each_cpu(i, rd->span) {
-		struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
-
-		if (dl_server(dl_se) && cpu_active(i))
-			__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
-	}
+	for_each_cpu(i, rd->span)
+		dl_server_add_bw(rd, i);
 }
 
 void dl_clear_root_domain_cpu(int cpu)
@@ -3702,7 +3741,7 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 	unsigned long flags, cap;
 	struct dl_bw *dl_b;
 	bool overflow = 0;
-	u64 fair_server_bw = 0;
+	u64 dl_server_bw = 0;
 
 	rcu_read_lock_sched();
 	dl_b = dl_bw_of(cpu);
@@ -3735,27 +3774,26 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 		cap -= arch_scale_cpu_capacity(cpu);
 
 		/*
-		 * cpu is going offline and NORMAL tasks will be moved away
-		 * from it. We can thus discount dl_server bandwidth
-		 * contribution as it won't need to be servicing tasks after
-		 * the cpu is off.
+		 * cpu is going offline and NORMAL and EXT tasks will be
+		 * moved away from it. We can thus discount dl_server
+		 * bandwidth contribution as it won't need to be servicing
+		 * tasks after the cpu is off.
 		 */
-		if (cpu_rq(cpu)->fair_server.dl_server)
-			fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw;
+		dl_server_bw = dl_server_read_bw(cpu);
 
 		/*
 		 * Not much to check if no DEADLINE bandwidth is present.
 		 * dl_servers we can discount, as tasks will be moved out the
 		 * offlined CPUs anyway.
 		 */
-		if (dl_b->total_bw - fair_server_bw > 0) {
+		if (dl_b->total_bw - dl_server_bw > 0) {
 			/*
 			 * Leaving at least one CPU for DEADLINE tasks seems a
 			 * wise thing to do. As said above, cpu is not offline
 			 * yet, so account for that.
 			 */
 			if (dl_bw_cpus(cpu) - 1)
-				overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0);
+				overflow = __dl_overflow(dl_b, cap, dl_server_bw, 0);
 			else
 				overflow = 1;
 		}
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 94164f2dec6dc..04daaac74f514 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -957,6 +957,8 @@ static void update_curr_scx(struct rq *rq)
 		if (!curr->scx.slice)
 			touch_core_sched(rq, curr);
 	}
+
+	dl_server_update(&rq->ext_server, delta_exec);
 }
 
 static bool scx_dsq_priq_less(struct rb_node *node_a,
@@ -1500,6 +1502,10 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 	if (enq_flags & SCX_ENQ_WAKEUP)
 		touch_core_sched(rq, p);
 
+	/* Start dl_server if this is the first task being enqueued */
+	if (rq->scx.nr_running == 1)
+		dl_server_start(&rq->ext_server);
+
 	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
 out:
 	rq->scx.flags &= ~SCX_RQ_IN_WAKEUP;
@@ -2511,6 +2517,33 @@ static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf)
 	return do_pick_task_scx(rq, rf, false);
 }
 
+/*
+ * Select the next task to run from the ext scheduling class.
+ *
+ * Use do_pick_task_scx() directly with @force_scx enabled, since the
+ * dl_server must always select a sched_ext task.
+ */
+static struct task_struct *
+ext_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
+{
+	if (!scx_enabled())
+		return NULL;
+
+	return do_pick_task_scx(dl_se->rq, rf, true);
+}
+
+/*
+ * Initialize the ext server deadline entity.
+ */
+void ext_server_init(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se = &rq->ext_server;
+
+	init_dl_entity(dl_se);
+
+	dl_server_init(dl_se, rq, ext_server_pick_task);
+}
+
 #ifdef CONFIG_SCHED_CORE
 /**
  * scx_prio_less - Task ordering for core-sched
@@ -3090,6 +3123,15 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
 static void switched_from_scx(struct rq *rq, struct task_struct *p)
 {
 	scx_disable_task(p);
+
+	/*
+	 * After class switch, if the DL server is still active, restart it so
+	 * that DL timers will be queued, in case SCX switched to higher class.
+	 */
+	if (dl_server_active(&rq->ext_server)) {
+		dl_server_stop(&rq->ext_server);
+		dl_server_start(&rq->ext_server);
+	}
 }
 
 static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c174afe1dd177..53793b9a04185 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -530,6 +530,9 @@ static void update_curr_idle(struct rq *rq)
 	se->exec_start = now;
 
 	dl_server_update_idle(&rq->fair_server, delta_exec);
+#ifdef CONFIG_SCHED_CLASS_EXT
+	dl_server_update_idle(&rq->ext_server, delta_exec);
+#endif
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5f..28c24cda1c3ce 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -414,6 +414,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 extern void sched_init_dl_servers(void);
 
 extern void fair_server_init(struct rq *rq);
+extern void ext_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
@@ -1151,6 +1152,7 @@ struct rq {
 	struct dl_rq		dl;
 #ifdef CONFIG_SCHED_CLASS_EXT
 	struct scx_rq		scx;
+	struct sched_dl_entity	ext_server;
 #endif
 
 	struct sched_dl_entity	fair_server;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd2..ac268da917781 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -508,6 +508,11 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (rq->fair_server.dl_server)
 		__dl_server_attach_root(&rq->fair_server, rq);
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (rq->ext_server.dl_server)
+		__dl_server_attach_root(&rq->ext_server, rq);
+#endif
+
 	rq_unlock_irqrestore(rq, &rf);
 
 	if (old_rd)
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2025-12-17  9:35 ` [PATCH 4/7] sched_ext: Add a DL " Andrea Righi
@ 2025-12-17 15:49   ` Juri Lelli
  2025-12-17 20:35     ` Andrea Righi
  0 siblings, 1 reply; 40+ messages in thread
From: Juri Lelli @ 2025-12-17 15:49 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, David Vernet, Changwoo Min, Shuah Khan, Joel Fernandes,
	Christian Loehle, Emil Tsalapatis, sched-ext, bpf,
	linux-kselftest, linux-kernel

Hi!

On 17/12/25 10:35, Andrea Righi wrote:
> sched_ext currently suffers starvation due to RT. The same workload when
> converted to EXT can get zero runtime if RT is 100% running, causing EXT
> processes to stall. Fix it by adding a DL server for EXT.

...

> v4: - initialize EXT server bandwidth reservation at init time and
>       always keep it active (Andrea Righi)
>     - check for rq->nr_running == 1 to determine when to account idle
>       time (Juri Lelli)
> v3: - clarify that fair is not the only dl_server (Juri Lelli)
>     - remove explicit stop to reduce timer reprogramming overhead
>       (Juri Lelli)
>     - do not restart pick_task() when it's invoked by the dl_server
>       (Tejun Heo)
>     - depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi)
> v2: - drop ->balance() now that pick_task() has an rf argument
>       (Andrea Righi)
> 
> Tested-by: Christian Loehle <christian.loehle@arm.com>
> Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---

...

> @@ -3090,6 +3123,15 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
>  static void switched_from_scx(struct rq *rq, struct task_struct *p)
>  {
>  	scx_disable_task(p);
> +
> +	/*
> +	 * After class switch, if the DL server is still active, restart it so
> +	 * that DL timers will be queued, in case SCX switched to higher class.
> +	 */
> +	if (dl_server_active(&rq->ext_server)) {
> +		dl_server_stop(&rq->ext_server);
> +		dl_server_start(&rq->ext_server);
> +	}
>  }

We might have discussed this already; if so, I forgot, sorry. But why do
we need to start the server again when switching away from scx? I
couldn't make sense of the comment that is already present.

Thanks,
Juri



* Re: [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2025-12-17 15:49   ` Juri Lelli
@ 2025-12-17 20:35     ` Andrea Righi
  0 siblings, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2025-12-17 20:35 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, David Vernet, Changwoo Min, Shuah Khan, Joel Fernandes,
	Christian Loehle, Emil Tsalapatis, sched-ext, bpf,
	linux-kselftest, linux-kernel

Hi Juri,

On Wed, Dec 17, 2025 at 04:49:02PM +0100, Juri Lelli wrote:
> Hi!
> 
> On 17/12/25 10:35, Andrea Righi wrote:
> > sched_ext currently suffers starvation due to RT. The same workload when
> > converted to EXT can get zero runtime if RT is 100% running, causing EXT
> > processes to stall. Fix it by adding a DL server for EXT.
> 
> ...
> 
> > v4: - initialize EXT server bandwidth reservation at init time and
> >       always keep it active (Andrea Righi)
> >     - check for rq->nr_running == 1 to determine when to account idle
> >       time (Juri Lelli)
> > v3: - clarify that fair is not the only dl_server (Juri Lelli)
> >     - remove explicit stop to reduce timer reprogramming overhead
> >       (Juri Lelli)
> >     - do not restart pick_task() when it's invoked by the dl_server
> >       (Tejun Heo)
> >     - depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi)
> > v2: - drop ->balance() now that pick_task() has an rf argument
> >       (Andrea Righi)
> > 
> > Tested-by: Christian Loehle <christian.loehle@arm.com>
> > Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
> > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> 
> ...
> 
> > @@ -3090,6 +3123,15 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
> >  static void switched_from_scx(struct rq *rq, struct task_struct *p)
> >  {
> >  	scx_disable_task(p);
> > +
> > +	/*
> > +	 * After class switch, if the DL server is still active, restart it so
> > +	 * that DL timers will be queued, in case SCX switched to higher class.
> > +	 */
> > +	if (dl_server_active(&rq->ext_server)) {
> > +		dl_server_stop(&rq->ext_server);
> > +		dl_server_start(&rq->ext_server);
> > +	}
> >  }
> 
> We might have discussed this already, in that case I forgot, sorry. But,
> why we do need to start the server again if switched from scx? Couldn't
> make sense of the comment that is already present.

The intention was to restart the DL timers, but thinking more about it,
this appears more harmful than helpful, as it may actually disrupt
accounting.

I did a quick test without the restart and everything seems to work. I'll
run more tests and send an updated patch if everything still works
without the restart.

Thanks!
-Andrea


* [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2026-01-20 21:50 [PATCHSET RESEND v11 " Andrea Righi
@ 2026-01-20 21:50 ` Andrea Righi
  2026-01-21 12:29   ` Peter Zijlstra
  2026-01-21 12:31   ` Peter Zijlstra
  0 siblings, 2 replies; 40+ messages in thread
From: Andrea Righi @ 2026-01-20 21:50 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min
  Cc: Shuah Khan, sched-ext, bpf, linux-kernel, Christian Loehle

sched_ext currently suffers starvation due to RT. The same workload when
converted to EXT can get zero runtime if RT is 100% running, causing EXT
processes to stall. Fix it by adding a DL server for EXT.

A kselftest is also included later to confirm that both DL servers are
functioning correctly:

 # ./runner -t rt_stall
 ===== START =====
 TEST: rt_stall
 DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
 OUTPUT:
 TAP version 13
 1..1
 # Runtime of FAIR task (PID 1511) is 0.250000 seconds
 # Runtime of RT task (PID 1512) is 4.750000 seconds
 # FAIR task got 5.00% of total runtime
 ok 1 PASS: FAIR task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of EXT task (PID 1514) is 0.250000 seconds
 # Runtime of RT task (PID 1515) is 4.750000 seconds
 # EXT task got 5.00% of total runtime
 ok 2 PASS: EXT task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of FAIR task (PID 1517) is 0.250000 seconds
 # Runtime of RT task (PID 1518) is 4.750000 seconds
 # FAIR task got 5.00% of total runtime
 ok 3 PASS: FAIR task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of EXT task (PID 1521) is 0.250000 seconds
 # Runtime of RT task (PID 1522) is 4.750000 seconds
 # EXT task got 5.00% of total runtime
 ok 4 PASS: EXT task got more than 4.00% of runtime
 ok 1 rt_stall #
 =====  END  =====

v5: - do not restart the EXT server on switch_class() (Juri Lelli)
v4: - initialize EXT server bandwidth reservation at init time and
      always keep it active (Andrea Righi)
    - check for rq->nr_running == 1 to determine when to account idle
      time (Juri Lelli)
v3: - clarify that fair is not the only dl_server (Juri Lelli)
    - remove explicit stop to reduce timer reprogramming overhead
      (Juri Lelli)
    - do not restart pick_task() when it's invoked by the dl_server
      (Tejun Heo)
    - depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi)
v2: - drop ->balance() now that pick_task() has an rf argument
      (Andrea Righi)

Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/core.c     |  6 +++
 kernel/sched/deadline.c | 84 ++++++++++++++++++++++++++++++-----------
 kernel/sched/ext.c      | 33 ++++++++++++++++
 kernel/sched/idle.c     |  3 ++
 kernel/sched/sched.h    |  2 +
 kernel/sched/topology.c |  5 +++
 6 files changed, 110 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 045f83ad261e2..88476d8b4e3d2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8477,6 +8477,9 @@ int sched_cpu_dying(unsigned int cpu)
 		dump_rq_tasks(rq, KERN_WARNING);
 	}
 	dl_server_stop(&rq->fair_server);
+#ifdef CONFIG_SCHED_CLASS_EXT
+	dl_server_stop(&rq->ext_server);
+#endif
 	rq_unlock_irqrestore(rq, &rf);
 
 	calc_load_migrate(rq);
@@ -8680,6 +8683,9 @@ void __init sched_init(void)
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
 		fair_server_init(rq);
+#ifdef CONFIG_SCHED_CLASS_EXT
+		ext_server_init(rq);
+#endif
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = rq;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 71b58a25e2a91..56c7c99a1067a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1443,8 +1443,8 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 		dl_se->dl_defer_idle = 0;
 
 	/*
-	 * The fair server can consume its runtime while throttled (not queued/
-	 * running as regular CFS).
+	 * The DL server can consume its runtime while throttled (not
+	 * queued / running as regular CFS).
 	 *
 	 * If the server consumes its entire runtime in this state. The server
 	 * is not required for the current period. Thus, reset the server by
@@ -1529,10 +1529,10 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 	}
 
 	/*
-	 * The fair server (sole dl_server) does not account for real-time
-	 * workload because it is running fair work.
+	 * The dl_server does not account for real-time workload because it
+	 * is running fair work.
 	 */
-	if (dl_se == &rq->fair_server)
+	if (dl_se->dl_server)
 		return;
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -1567,9 +1567,9 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
  * In the non-defer mode, the idle time is not accounted, as the
  * server provides a guarantee.
  *
- * If the dl_server is in defer mode, the idle time is also considered
- * as time available for the fair server, avoiding a penalty for the
- * rt scheduler that did not consumed that time.
+ * If the dl_server is in defer mode, the idle time is also considered as
+ * time available for the dl_server, avoiding a penalty for the rt
+ * scheduler that did not consume that time.
  */
 void dl_server_update_idle(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
@@ -1813,6 +1813,7 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 	hrtimer_try_to_cancel(&dl_se->dl_timer);
 	dl_se->dl_defer_armed = 0;
 	dl_se->dl_throttled = 0;
+	dl_se->dl_defer_running = 0;
 	dl_se->dl_defer_idle = 0;
 	dl_se->dl_server_active = 0;
 }
@@ -1848,6 +1849,18 @@ void sched_init_dl_servers(void)
 		dl_se->dl_server = 1;
 		dl_se->dl_defer = 1;
 		setup_new_dl_entity(dl_se);
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+		dl_se = &rq->ext_server;
+
+		WARN_ON(dl_server(dl_se));
+
+		dl_server_apply_params(dl_se, runtime, period, 1);
+
+		dl_se->dl_server = 1;
+		dl_se->dl_defer = 1;
+		setup_new_dl_entity(dl_se);
+#endif
 	}
 }
 
@@ -3179,6 +3192,36 @@ void dl_add_task_root_domain(struct task_struct *p)
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 }
 
+static void dl_server_add_bw(struct root_domain *rd, int cpu)
+{
+	struct sched_dl_entity *dl_se;
+
+	dl_se = &cpu_rq(cpu)->fair_server;
+	if (dl_server(dl_se) && cpu_active(cpu))
+		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+	dl_se = &cpu_rq(cpu)->ext_server;
+	if (dl_server(dl_se) && cpu_active(cpu))
+		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
+#endif
+}
+
+static u64 dl_server_read_bw(int cpu)
+{
+	u64 dl_bw = 0;
+
+	if (cpu_rq(cpu)->fair_server.dl_server)
+		dl_bw += cpu_rq(cpu)->fair_server.dl_bw;
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (cpu_rq(cpu)->ext_server.dl_server)
+		dl_bw += cpu_rq(cpu)->ext_server.dl_bw;
+#endif
+
+	return dl_bw;
+}
+
 void dl_clear_root_domain(struct root_domain *rd)
 {
 	int i;
@@ -3197,12 +3240,8 @@ void dl_clear_root_domain(struct root_domain *rd)
 	 * dl_servers are not tasks. Since dl_add_task_root_domain ignores
 	 * them, we need to account for them here explicitly.
 	 */
-	for_each_cpu(i, rd->span) {
-		struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
-
-		if (dl_server(dl_se) && cpu_active(i))
-			__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
-	}
+	for_each_cpu(i, rd->span)
+		dl_server_add_bw(rd, i);
 }
 
 void dl_clear_root_domain_cpu(int cpu)
@@ -3704,7 +3743,7 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 	unsigned long flags, cap;
 	struct dl_bw *dl_b;
 	bool overflow = 0;
-	u64 fair_server_bw = 0;
+	u64 dl_server_bw = 0;
 
 	rcu_read_lock_sched();
 	dl_b = dl_bw_of(cpu);
@@ -3737,27 +3776,26 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 		cap -= arch_scale_cpu_capacity(cpu);
 
 		/*
-		 * cpu is going offline and NORMAL tasks will be moved away
-		 * from it. We can thus discount dl_server bandwidth
-		 * contribution as it won't need to be servicing tasks after
-		 * the cpu is off.
+		 * cpu is going offline and NORMAL and EXT tasks will be
+		 * moved away from it. We can thus discount dl_server
+		 * bandwidth contribution as it won't need to be servicing
+		 * tasks after the cpu is off.
 		 */
-		if (cpu_rq(cpu)->fair_server.dl_server)
-			fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw;
+		dl_server_bw = dl_server_read_bw(cpu);
 
 		/*
 		 * Not much to check if no DEADLINE bandwidth is present.
 		 * dl_servers we can discount, as tasks will be moved out the
 		 * offlined CPUs anyway.
 		 */
-		if (dl_b->total_bw - fair_server_bw > 0) {
+		if (dl_b->total_bw - dl_server_bw > 0) {
 			/*
 			 * Leaving at least one CPU for DEADLINE tasks seems a
 			 * wise thing to do. As said above, cpu is not offline
 			 * yet, so account for that.
 			 */
 			if (dl_bw_cpus(cpu) - 1)
-				overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0);
+				overflow = __dl_overflow(dl_b, cap, dl_server_bw, 0);
 			else
 				overflow = 1;
 		}
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index afe28c04d5aa7..809f774183202 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -958,6 +958,8 @@ static void update_curr_scx(struct rq *rq)
 		if (!curr->scx.slice)
 			touch_core_sched(rq, curr);
 	}
+
+	dl_server_update(&rq->ext_server, delta_exec);
 }
 
 static bool scx_dsq_priq_less(struct rb_node *node_a,
@@ -1501,6 +1503,10 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 	if (enq_flags & SCX_ENQ_WAKEUP)
 		touch_core_sched(rq, p);
 
+	/* Start dl_server if this is the first task being enqueued */
+	if (rq->scx.nr_running == 1)
+		dl_server_start(&rq->ext_server);
+
 	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
 out:
 	rq->scx.flags &= ~SCX_RQ_IN_WAKEUP;
@@ -2512,6 +2518,33 @@ static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf)
 	return do_pick_task_scx(rq, rf, false);
 }
 
+/*
+ * Select the next task to run from the ext scheduling class.
+ *
+ * Use do_pick_task_scx() directly with @force_scx enabled, since the
+ * dl_server must always select a sched_ext task.
+ */
+static struct task_struct *
+ext_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
+{
+	if (!scx_enabled())
+		return NULL;
+
+	return do_pick_task_scx(dl_se->rq, rf, true);
+}
+
+/*
+ * Initialize the ext server deadline entity.
+ */
+void ext_server_init(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se = &rq->ext_server;
+
+	init_dl_entity(dl_se);
+
+	dl_server_init(dl_se, rq, ext_server_pick_task);
+}
+
 #ifdef CONFIG_SCHED_CORE
 /**
  * scx_prio_less - Task ordering for core-sched
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c174afe1dd177..53793b9a04185 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -530,6 +530,9 @@ static void update_curr_idle(struct rq *rq)
 	se->exec_start = now;
 
 	dl_server_update_idle(&rq->fair_server, delta_exec);
+#ifdef CONFIG_SCHED_CLASS_EXT
+	dl_server_update_idle(&rq->ext_server, delta_exec);
+#endif
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 93fce4bbff5ea..d630f46325379 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -414,6 +414,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 extern void sched_init_dl_servers(void);
 
 extern void fair_server_init(struct rq *rq);
+extern void ext_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
@@ -1151,6 +1152,7 @@ struct rq {
 	struct dl_rq		dl;
 #ifdef CONFIG_SCHED_CLASS_EXT
 	struct scx_rq		scx;
+	struct sched_dl_entity	ext_server;
 #endif
 
 	struct sched_dl_entity	fair_server;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd2..ac268da917781 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -508,6 +508,11 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (rq->fair_server.dl_server)
 		__dl_server_attach_root(&rq->fair_server, rq);
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (rq->ext_server.dl_server)
+		__dl_server_attach_root(&rq->ext_server, rq);
+#endif
+
 	rq_unlock_irqrestore(rq, &rf);
 
 	if (old_rd)
-- 
2.52.0



* Re: [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2026-01-20 21:50 ` [PATCH 4/7] sched_ext: Add a DL " Andrea Righi
@ 2026-01-21 12:29   ` Peter Zijlstra
  2026-01-21 12:49     ` Andrea Righi
  2026-01-21 12:31   ` Peter Zijlstra
  1 sibling, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2026-01-21 12:29 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel, Christian Loehle

On Tue, Jan 20, 2026 at 10:50:35PM +0100, Andrea Righi wrote:

> @@ -1813,6 +1813,7 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
>  	hrtimer_try_to_cancel(&dl_se->dl_timer);
>  	dl_se->dl_defer_armed = 0;
>  	dl_se->dl_throttled = 0;
> +	dl_se->dl_defer_running = 0;
>  	dl_se->dl_defer_idle = 0;
>  	dl_se->dl_server_active = 0;
>  }

This should definitely not be in this patch. Why was this added? Were
you trying to do the same as:

  ca1e8eede4fc ("sched/deadline: Fix server stopping with runnable tasks")



* Re: [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2026-01-20 21:50 ` [PATCH 4/7] sched_ext: Add a DL " Andrea Righi
  2026-01-21 12:29   ` Peter Zijlstra
@ 2026-01-21 12:31   ` Peter Zijlstra
  2026-01-21 12:51     ` Andrea Righi
  1 sibling, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2026-01-21 12:31 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel, Christian Loehle

On Tue, Jan 20, 2026 at 10:50:35PM +0100, Andrea Righi wrote:
> sched_ext currently suffers starvation due to RT. The same workload when
> converted to EXT can get zero runtime if RT is 100% running, causing EXT
> processes to stall. Fix it by adding a DL server for EXT.
> 
> A kselftest is also included later to confirm that both DL servers are
> functioning correctly:
> 
>  # ./runner -t rt_stall
>  ===== START =====
>  TEST: rt_stall
>  DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
>  OUTPUT:
>  TAP version 13
>  1..1
>  # Runtime of FAIR task (PID 1511) is 0.250000 seconds
>  # Runtime of RT task (PID 1512) is 4.750000 seconds
>  # FAIR task got 5.00% of total runtime
>  ok 1 PASS: FAIR task got more than 4.00% of runtime
>  TAP version 13
>  1..1
>  # Runtime of EXT task (PID 1514) is 0.250000 seconds
>  # Runtime of RT task (PID 1515) is 4.750000 seconds
>  # EXT task got 5.00% of total runtime
>  ok 2 PASS: EXT task got more than 4.00% of runtime
>  TAP version 13
>  1..1
>  # Runtime of FAIR task (PID 1517) is 0.250000 seconds
>  # Runtime of RT task (PID 1518) is 4.750000 seconds
>  # FAIR task got 5.00% of total runtime
>  ok 3 PASS: FAIR task got more than 4.00% of runtime
>  TAP version 13
>  1..1
>  # Runtime of EXT task (PID 1521) is 0.250000 seconds
>  # Runtime of RT task (PID 1522) is 4.750000 seconds
>  # EXT task got 5.00% of total runtime
>  ok 4 PASS: EXT task got more than 4.00% of runtime
>  ok 1 rt_stall #
>  =====  END  =====
> 
> v5: - do not restart the EXT server on switch_class() (Juri Lelli)
> v4: - initialize EXT server bandwidth reservation at init time and
>       always keep it active (Andrea Righi)
>     - check for rq->nr_running == 1 to determine when to account idle
>       time (Juri Lelli)
> v3: - clarify that fair is not the only dl_server (Juri Lelli)
>     - remove explicit stop to reduce timer reprogramming overhead
>       (Juri Lelli)
>     - do not restart pick_task() when it's invoked by the dl_server
>       (Tejun Heo)
>     - depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi)
> v2: - drop ->balance() now that pick_task() has an rf argument
>       (Andrea Righi)

FWIW (for all these patches), those v# comments, they go...

> Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
> Tested-by: Christian Loehle <christian.loehle@arm.com>
> Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---

... here, after the ---.


* Re: [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2026-01-21 12:29   ` Peter Zijlstra
@ 2026-01-21 12:49     ` Andrea Righi
  2026-01-21 15:52       ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Righi @ 2026-01-21 12:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel, Christian Loehle

On Wed, Jan 21, 2026 at 01:29:01PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 20, 2026 at 10:50:35PM +0100, Andrea Righi wrote:
> 
> > @@ -1813,6 +1813,7 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
> >  	hrtimer_try_to_cancel(&dl_se->dl_timer);
> >  	dl_se->dl_defer_armed = 0;
> >  	dl_se->dl_throttled = 0;
> > +	dl_se->dl_defer_running = 0;
> >  	dl_se->dl_defer_idle = 0;
> >  	dl_se->dl_server_active = 0;
> >  }
> 
> This should definitely not be in this patch. Why was this added? Were
> you trying to do the same as:
> 
>   ca1e8eede4fc ("sched/deadline: Fix server stopping with runnable tasks")
> 

The problem is that if I remove this, RT can completely stall EXT tasks, even
with ca1e8eede4fc applied.

Example (with the rt_stall kselftest):

 # Runtime of EXT task (PID 2100) is 0.000000 seconds
 # Runtime of RT task (PID 2101) is 4.990000 seconds
 # EXT task got 0.00% of total runtime
 not ok 2 FAIL: EXT task got less than 4.00% of runtime

Thanks,
-Andrea


* Re: [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2026-01-21 12:31   ` Peter Zijlstra
@ 2026-01-21 12:51     ` Andrea Righi
  0 siblings, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2026-01-21 12:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel, Christian Loehle

On Wed, Jan 21, 2026 at 01:31:58PM +0100, Peter Zijlstra wrote:
...
> > v5: - do not restart the EXT server on switch_class() (Juri Lelli)
> > v4: - initialize EXT server bandwidth reservation at init time and
> >       always keep it active (Andrea Righi)
> >     - check for rq->nr_running == 1 to determine when to account idle
> >       time (Juri Lelli)
> > v3: - clarify that fair is not the only dl_server (Juri Lelli)
> >     - remove explicit stop to reduce timer reprogramming overhead
> >       (Juri Lelli)
> >     - do not restart pick_task() when it's invoked by the dl_server
> >       (Tejun Heo)
> >     - depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi)
> > v2: - drop ->balance() now that pick_task() has an rf argument
> >       (Andrea Righi)
> 
> FWIW (for all these patches), those v# comments, they go...
> 
> > Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
> > Tested-by: Christian Loehle <christian.loehle@arm.com>
> > Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
> > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> 
> ... here, after the ---.

Oh... I used to do it this way, then noticed others were including the
version in the patch and started doing the same. I'll revert to my
original way. :)

Thanks,
-Andrea


* Re: [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2026-01-21 12:49     ` Andrea Righi
@ 2026-01-21 15:52       ` Peter Zijlstra
  2026-01-21 17:27         ` Andrea Righi
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2026-01-21 15:52 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel, Christian Loehle

On Wed, Jan 21, 2026 at 01:49:38PM +0100, Andrea Righi wrote:
> On Wed, Jan 21, 2026 at 01:29:01PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 20, 2026 at 10:50:35PM +0100, Andrea Righi wrote:
> > 
> > > @@ -1813,6 +1813,7 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
> > >  	hrtimer_try_to_cancel(&dl_se->dl_timer);
> > >  	dl_se->dl_defer_armed = 0;
> > >  	dl_se->dl_throttled = 0;
> > > +	dl_se->dl_defer_running = 0;
> > >  	dl_se->dl_defer_idle = 0;
> > >  	dl_se->dl_server_active = 0;
> > >  }
> > 
> > This should definitely not be in this patch. Why was this added? Were
> > you trying to do the same as:
> > 
> >   ca1e8eede4fc ("sched/deadline: Fix server stopping with runnable tasks")
> > 
> 
> The problem is that if I remove this, RT can completely stall EXT tasks, even
> with ca1e8eede4fc applied.

But that's not something ext specific, right? Can you pull this change
out and write a sane Changelog for it, describing the problem and so on?


* Re: [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2026-01-21 15:52       ` Peter Zijlstra
@ 2026-01-21 17:27         ` Andrea Righi
  0 siblings, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2026-01-21 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel, Christian Loehle

On Wed, Jan 21, 2026 at 04:52:45PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 21, 2026 at 01:49:38PM +0100, Andrea Righi wrote:
> > On Wed, Jan 21, 2026 at 01:29:01PM +0100, Peter Zijlstra wrote:
> > > On Tue, Jan 20, 2026 at 10:50:35PM +0100, Andrea Righi wrote:
> > > 
> > > > @@ -1813,6 +1813,7 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
> > > >  	hrtimer_try_to_cancel(&dl_se->dl_timer);
> > > >  	dl_se->dl_defer_armed = 0;
> > > >  	dl_se->dl_throttled = 0;
> > > > +	dl_se->dl_defer_running = 0;
> > > >  	dl_se->dl_defer_idle = 0;
> > > >  	dl_se->dl_server_active = 0;
> > > >  }
> > > 
> > > This should definitely not be in this patch. Why was this added? Were
> > > you trying to do the same as:
> > > 
> > >   ca1e8eede4fc ("sched/deadline: Fix server stopping with runnable tasks")
> > > 
> > 
> > The problem is that if I remove this, RT can completely stall EXT tasks, even
> > with ca1e8eede4fc applied.
> 
> But that's not something ext specific, right? Can you pull this change
> out and write a sane Changelog for it, describing the problem and so?

Yeah, this should happen with fair as well. I'll try to reproduce the
problem without ext and gather more info.

Thanks,
-Andrea


* [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks
@ 2026-01-26  9:58 Andrea Righi
  2026-01-26  9:58 ` [PATCH 1/7] sched/deadline: Clear the defer params Andrea Righi
                   ` (7 more replies)
  0 siblings, 8 replies; 40+ messages in thread
From: Andrea Righi @ 2026-01-26  9:58 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo, Joel Fernandes, David Vernet,
	Changwoo Min, Daniel Hodges, Christian Loehle, Emil Tsalapatis,
	sched-ext, linux-kernel

sched_ext tasks can be starved by long-running RT tasks, especially since
RT throttling was replaced by deadline servers to boost only SCHED_NORMAL
tasks.

Several users in the community have reported issues with RT stalling
sched_ext tasks. This is fairly common on distributions or environments
where applications like video compositors, audio services, etc. run as RT
tasks by default.

Example trace (showing a per-CPU kthread stalled due to the sway Wayland
compositor running as an RT task):

 runnable task stall (kworker/0:0[106377] failed to run for 5.043s)
 ...
 CPU 0   : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738
           curr=sway[994] class=rt_sched_class
   R kworker/0:0[106377] -5043ms
       scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
       sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000
       cpus=01

This is often perceived as a bug in the BPF schedulers, but in reality they
can't do much: RT tasks run outside their control and can potentially
consume 100% of the CPU bandwidth.

Fix this by adding a sched_ext deadline server, so that sched_ext tasks are
also boosted and do not suffer starvation.

Two kselftests are also provided to verify that the starvation fix works
and that the bandwidth allocation is correct.

== Design ==

 - The EXT server is initialized at boot time and remains configured
   throughout the system's lifetime
 - It starts automatically when the first sched_ext task is enqueued
   (rq->scx.nr_running == 1)
 - The server's pick function (ext_server_pick_task) always selects
   sched_ext tasks when active
 - Runtime accounting happens in update_curr_scx() during task execution
   and update_curr_idle() when idle
 - Bandwidth accounting includes both fair and ext servers in root domain
   calculations
 - A debugfs interface (/sys/kernel/debug/sched/ext_server/) allows runtime
   tuning of server parameters (see notes below)

== Notes ==

1) As discussed during the sched_ext microconference at LPC Tokyo, the plan
is to start with a simple approach, avoiding automatically creating or
tearing down the EXT server bandwidth reservation when a BPF scheduler is
loaded or unloaded. Instead, the reservation is kept permanently active.
This significantly simplifies the logic while still addressing the
starvation issue.

Any fine-tuning of the bandwidth reservation is delegated to the system
administrator, who can adjust it via the debugfs interface. In the future,
a more suitable interface can be introduced and automatic removal of the
reservation when the BPF scheduler is unloaded can be revisited.

A better interface to adjust the dl_server bandwidth reservation can be
discussed at the upcoming OSPM
(https://lore.kernel.org/lkml/aULDwbALUj0V7cVk@jlelli-thinkpadt14gen4.remote.csb/).

2) IMPORTANT: this patch set requires [1] to function properly (sent
separately, not included in this patch set).

[1] https://lore.kernel.org/all/20260123161645.2181752-1-arighi@nvidia.com/

This patchset is also available in the following git branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server

Changes in v12:
 - Move dl_server execution state reset on stop fix to a separate patch
   (https://lore.kernel.org/all/20260123161645.2181752-1-arighi@nvidia.com/)
 - Removed per-patch changelog (keeping a global changelog here)
 - Link to v11: https://lore.kernel.org/all/20260120215808.188032-1-arighi@nvidia.com/

Changes in v11:
 - do not create/remove the bandwidth reservation for the ext server when a
   BPF scheduler is loaded/unloaded, but keep the reservation bandwidth
   always active
 - change rt_stall kselftest to validate both FAIR and EXT DL servers
 - Link to v10: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/

Changes in v10:
 - reordered patches to better isolate sched_ext changes vs sched/deadline
   changes (Andrea Righi)
 - define ext_server only with CONFIG_SCHED_CLASS_EXT=y (Andrea Righi)
 - add WARN_ON_ONCE(!cpus) check in dl_server_apply_params() (Andrea Righi)
 - wait for inactive_task_timer to fire before removing the bandwidth
   reservation (Juri Lelli)
 - remove explicit dl_server_stop() in dequeue_task_scx() to reduce timer
   reprogramming overhead (Juri Lelli)
 - do not restart pick_task() when invoked by the dl_server (Tejun Heo)
 - rename rq_dl_server to dl_server (Peter Zijlstra)
 - fixed a missing dl_server start in dl_server_on() (Christian Loehle)
 - add a comment to the rt_stall selftest to better explain the 4%
   threshold (Emil Tsalapatis)
 - Link to v9: https://lore.kernel.org/all/20251017093214.70029-1-arighi@nvidia.com/

Changes in v9:
 - Drop the ->balance() logic as its functionality is now integrated into
   ->pick_task(), allowing dl_server to call pick_task_scx() directly
 - Link to v8: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/

Changes in v8:
 - Add tj's patch to de-couple balance and pick_task and avoid changing
   sched/core callbacks to propagate @rf
 - Simplify dl_se->dl_server check (suggested by PeterZ)
 - Small coding style fixes in the kselftests
 - Link to v7: https://lore.kernel.org/all/20250809184800.129831-1-joelagnelf@nvidia.com/

Changes in v7:
 - Rebased to Linus master
 - Link to v6: https://lore.kernel.org/all/20250702232944.3221001-1-joelagnelf@nvidia.com/

Changes in v6:
 - Added Acks to few patches
 - Fixes to few nits suggested by Tejun
 - Link to v5: https://lore.kernel.org/all/20250620203234.3349930-1-joelagnelf@nvidia.com/

Changes in v5:
 - Added a kselftest (total_bw) to sched_ext to verify bandwidth values
   from debugfs
 - Address comment from Andrea about redundant rq clock invalidation
 - Link to v4: https://lore.kernel.org/all/20250617200523.1261231-1-joelagnelf@nvidia.com/

Changes in v4:
 - Fixed issues with hotplugged CPUs having their DL server bandwidth
   altered due to loading SCX
 - Fixed other issues
 - Rebased on Linus master
 - All sched_ext kselftests reliably pass now, also verified that the
   total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches
 - Link to v3: https://lore.kernel.org/all/20250613051734.4023260-1-joelagnelf@nvidia.com/

Changes in v3:
 - Removed code duplication in debugfs. Made ext interface separate
 - Fixed issue where rq_lock_irqsave was not used in the relinquish patch
 - Fixed running bw accounting issue in dl_server_remove_params
 - Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/

Changes in v2:
 - Fixed a hang related to using rq_lock instead of rq_lock_irqsave
 - Added support to remove BW of DL servers when they are switched to/from EXT
 - Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/

Andrea Righi (2):
      sched_ext: Add a DL server for sched_ext tasks
      selftests/sched_ext: Add test for sched_ext dl_server

Joel Fernandes (5):
      sched/deadline: Clear the defer params
      sched/debug: Fix updating of ppos on server write ops
      sched/debug: Stop and start server based on if it was active
      sched/debug: Add support to change sched_ext server params
      selftests/sched_ext: Add test for DL server total_bw consistency

 kernel/sched/core.c                              |   6 +
 kernel/sched/deadline.c                          |  86 +++++--
 kernel/sched/debug.c                             | 171 +++++++++++---
 kernel/sched/ext.c                               |  33 +++
 kernel/sched/idle.c                              |   3 +
 kernel/sched/sched.h                             |   2 +
 kernel/sched/topology.c                          |   5 +
 tools/testing/selftests/sched_ext/Makefile       |   2 +
 tools/testing/selftests/sched_ext/rt_stall.bpf.c |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c     | 240 +++++++++++++++++++
 tools/testing/selftests/sched_ext/total_bw.c     | 281 +++++++++++++++++++++++
 11 files changed, 801 insertions(+), 51 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
 create mode 100644 tools/testing/selftests/sched_ext/total_bw.c


* [PATCH 1/7] sched/deadline: Clear the defer params
  2026-01-26  9:58 [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Andrea Righi
@ 2026-01-26  9:58 ` Andrea Righi
  2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Joel Fernandes
  2026-01-26  9:59 ` [PATCH 2/7] sched/debug: Fix updating of ppos on server write ops Andrea Righi
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Andrea Righi @ 2026-01-26  9:58 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo, Joel Fernandes, David Vernet,
	Changwoo Min, Daniel Hodges, Christian Loehle, Emil Tsalapatis,
	sched-ext, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

The defer params were not cleared in __dl_clear_params. Clear them.

Without this, some of my test cases are flaky and the DL timer does not
start correctly AFAICS.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Tested-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/deadline.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e42867061ea77..28823f7eb8667 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3646,6 +3646,9 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
 	dl_se->dl_non_contending	= 0;
 	dl_se->dl_overrun		= 0;
 	dl_se->dl_server		= 0;
+	dl_se->dl_defer			= 0;
+	dl_se->dl_defer_running		= 0;
+	dl_se->dl_defer_armed		= 0;
 
 #ifdef CONFIG_RT_MUTEXES
 	dl_se->pi_se			= dl_se;
-- 
2.52.0



* [PATCH 2/7] sched/debug: Fix updating of ppos on server write ops
  2026-01-26  9:58 [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Andrea Righi
  2026-01-26  9:58 ` [PATCH 1/7] sched/deadline: Clear the defer params Andrea Righi
@ 2026-01-26  9:59 ` Andrea Righi
  2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Joel Fernandes
  2026-01-26  9:59 ` [PATCH 3/7] sched/debug: Stop and start server based on if it was active Andrea Righi
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Andrea Righi @ 2026-01-26  9:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo, Joel Fernandes, David Vernet,
	Changwoo Min, Daniel Hodges, Christian Loehle, Emil Tsalapatis,
	sched-ext, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

Updating "ppos" on error conditions does not make much sense. The pattern
is to return the error code directly without modifying the position, or
modify the position on success and return the number of bytes written.

Since on success, the return value of apply is 0, there is no point in
modifying ppos either. Fix it by removing all this and just returning the
error code, or the number of bytes written on success.

Tested-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/debug.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680a..93f009e1076d8 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -345,8 +345,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
 	u64 runtime, period;
+	int retval = 0;
 	size_t err;
-	int retval;
 	u64 value;
 
 	err = kstrtoull_from_user(ubuf, cnt, 10, &value);
@@ -380,8 +380,6 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		dl_server_stop(&rq->fair_server);
 
 		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
-		if (retval)
-			cnt = retval;
 
 		if (!runtime)
 			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
@@ -389,6 +387,9 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 
 		if (rq->cfs.h_nr_queued)
 			dl_server_start(&rq->fair_server);
+
+		if (retval < 0)
+			return retval;
 	}
 
 	*ppos += cnt;
-- 
2.52.0



* [PATCH 3/7] sched/debug: Stop and start server based on if it was active
  2026-01-26  9:58 [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Andrea Righi
  2026-01-26  9:58 ` [PATCH 1/7] sched/deadline: Clear the defer params Andrea Righi
  2026-01-26  9:59 ` [PATCH 2/7] sched/debug: Fix updating of ppos on server write ops Andrea Righi
@ 2026-01-26  9:59 ` Andrea Righi
  2026-02-02 21:13   ` Peter Zijlstra
  2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Joel Fernandes
  2026-01-26  9:59 ` [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks Andrea Righi
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 40+ messages in thread
From: Andrea Righi @ 2026-01-26  9:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo, Joel Fernandes, David Vernet,
	Changwoo Min, Daniel Hodges, Christian Loehle, Emil Tsalapatis,
	sched-ext, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

Currently the DL server interface for applying parameters checks
CFS internals to identify if the server is active. This is error-prone
and makes it difficult to add new servers in the future.

Fix it by using dl_server_active(), which is also used by the DL server
code to determine if the DL server was started.

Tested-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/debug.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 93f009e1076d8..dd793f8f3858a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -354,6 +354,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		return err;
 
 	scoped_guard (rq_lock_irqsave, rq) {
+		bool is_active;
+
 		runtime  = rq->fair_server.dl_runtime;
 		period = rq->fair_server.dl_period;
 
@@ -376,8 +378,11 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 			return  -EINVAL;
 		}
 
-		update_rq_clock(rq);
-		dl_server_stop(&rq->fair_server);
+		is_active = dl_server_active(&rq->fair_server);
+		if (is_active) {
+			update_rq_clock(rq);
+			dl_server_stop(&rq->fair_server);
+		}
 
 		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
 
@@ -385,7 +390,7 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
 					cpu_of(rq));
 
-		if (rq->cfs.h_nr_queued)
+		if (is_active && runtime)
 			dl_server_start(&rq->fair_server);
 
 		if (retval < 0)
-- 
2.52.0



* [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2026-01-26  9:58 [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Andrea Righi
                   ` (2 preceding siblings ...)
  2026-01-26  9:59 ` [PATCH 3/7] sched/debug: Stop and start server based on if it was active Andrea Righi
@ 2026-01-26  9:59 ` Andrea Righi
  2026-02-02 19:50   ` Peter Zijlstra
  2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Andrea Righi
  2026-01-26  9:59 ` [PATCH 5/7] sched/debug: Add support to change sched_ext server params Andrea Righi
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 40+ messages in thread
From: Andrea Righi @ 2026-01-26  9:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo, Joel Fernandes, David Vernet,
	Changwoo Min, Daniel Hodges, Christian Loehle, Emil Tsalapatis,
	sched-ext, linux-kernel

sched_ext currently suffers starvation due to RT: the same workload, when
converted to EXT, can get zero runtime if RT is running 100% of the time,
causing EXT processes to stall. Fix it by adding a DL server for EXT.

A kselftest is also included later to confirm that both DL servers are
functioning correctly:

 # ./runner -t rt_stall
 ===== START =====
 TEST: rt_stall
 DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
 OUTPUT:
 TAP version 13
 1..1
 # Runtime of FAIR task (PID 1511) is 0.250000 seconds
 # Runtime of RT task (PID 1512) is 4.750000 seconds
 # FAIR task got 5.00% of total runtime
 ok 1 PASS: FAIR task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of EXT task (PID 1514) is 0.250000 seconds
 # Runtime of RT task (PID 1515) is 4.750000 seconds
 # EXT task got 5.00% of total runtime
 ok 2 PASS: EXT task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of FAIR task (PID 1517) is 0.250000 seconds
 # Runtime of RT task (PID 1518) is 4.750000 seconds
 # FAIR task got 5.00% of total runtime
 ok 3 PASS: FAIR task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of EXT task (PID 1521) is 0.250000 seconds
 # Runtime of RT task (PID 1522) is 4.750000 seconds
 # EXT task got 5.00% of total runtime
 ok 4 PASS: EXT task got more than 4.00% of runtime
 ok 1 rt_stall #
 =====  END  =====

Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/core.c     |  6 +++
 kernel/sched/deadline.c | 83 +++++++++++++++++++++++++++++------------
 kernel/sched/ext.c      | 33 ++++++++++++++++
 kernel/sched/idle.c     |  3 ++
 kernel/sched/sched.h    |  2 +
 kernel/sched/topology.c |  5 +++
 6 files changed, 109 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 045f83ad261e2..88476d8b4e3d2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8477,6 +8477,9 @@ int sched_cpu_dying(unsigned int cpu)
 		dump_rq_tasks(rq, KERN_WARNING);
 	}
 	dl_server_stop(&rq->fair_server);
+#ifdef CONFIG_SCHED_CLASS_EXT
+	dl_server_stop(&rq->ext_server);
+#endif
 	rq_unlock_irqrestore(rq, &rf);
 
 	calc_load_migrate(rq);
@@ -8680,6 +8683,9 @@ void __init sched_init(void)
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
 		fair_server_init(rq);
+#ifdef CONFIG_SCHED_CLASS_EXT
+		ext_server_init(rq);
+#endif
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = rq;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 28823f7eb8667..fda77512c6e47 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1443,8 +1443,8 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 		dl_se->dl_defer_idle = 0;
 
 	/*
-	 * The fair server can consume its runtime while throttled (not queued/
-	 * running as regular CFS).
+	 * The DL server can consume its runtime while throttled (not
+	 * queued / running as regular CFS).
 	 *
 	 * If the server consumes its entire runtime in this state. The server
 	 * is not required for the current period. Thus, reset the server by
@@ -1529,10 +1529,10 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 	}
 
 	/*
-	 * The fair server (sole dl_server) does not account for real-time
-	 * workload because it is running fair work.
+	 * The dl_server does not account for real-time workload because it
+	 * is running fair work.
 	 */
-	if (dl_se == &rq->fair_server)
+	if (dl_se->dl_server)
 		return;
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -1567,9 +1567,9 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
  * In the non-defer mode, the idle time is not accounted, as the
  * server provides a guarantee.
  *
- * If the dl_server is in defer mode, the idle time is also considered
- * as time available for the fair server, avoiding a penalty for the
- * rt scheduler that did not consumed that time.
+ * If the dl_server is in defer mode, the idle time is also considered as
+ * time available for the dl_server, avoiding a penalty for the rt
+ * scheduler that did not consume that time.
  */
 void dl_server_update_idle(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
@@ -1850,6 +1850,18 @@ void sched_init_dl_servers(void)
 		dl_se->dl_server = 1;
 		dl_se->dl_defer = 1;
 		setup_new_dl_entity(dl_se);
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+		dl_se = &rq->ext_server;
+
+		WARN_ON(dl_server(dl_se));
+
+		dl_server_apply_params(dl_se, runtime, period, 1);
+
+		dl_se->dl_server = 1;
+		dl_se->dl_defer = 1;
+		setup_new_dl_entity(dl_se);
+#endif
 	}
 }
 
@@ -3181,6 +3193,36 @@ void dl_add_task_root_domain(struct task_struct *p)
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 }
 
+static void dl_server_add_bw(struct root_domain *rd, int cpu)
+{
+	struct sched_dl_entity *dl_se;
+
+	dl_se = &cpu_rq(cpu)->fair_server;
+	if (dl_server(dl_se) && cpu_active(cpu))
+		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+	dl_se = &cpu_rq(cpu)->ext_server;
+	if (dl_server(dl_se) && cpu_active(cpu))
+		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
+#endif
+}
+
+static u64 dl_server_read_bw(int cpu)
+{
+	u64 dl_bw = 0;
+
+	if (cpu_rq(cpu)->fair_server.dl_server)
+		dl_bw += cpu_rq(cpu)->fair_server.dl_bw;
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (cpu_rq(cpu)->ext_server.dl_server)
+		dl_bw += cpu_rq(cpu)->ext_server.dl_bw;
+#endif
+
+	return dl_bw;
+}
+
 void dl_clear_root_domain(struct root_domain *rd)
 {
 	int i;
@@ -3199,12 +3241,8 @@ void dl_clear_root_domain(struct root_domain *rd)
 	 * dl_servers are not tasks. Since dl_add_task_root_domain ignores
 	 * them, we need to account for them here explicitly.
 	 */
-	for_each_cpu(i, rd->span) {
-		struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
-
-		if (dl_server(dl_se) && cpu_active(i))
-			__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
-	}
+	for_each_cpu(i, rd->span)
+		dl_server_add_bw(rd, i);
 }
 
 void dl_clear_root_domain_cpu(int cpu)
@@ -3706,7 +3744,7 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 	unsigned long flags, cap;
 	struct dl_bw *dl_b;
 	bool overflow = 0;
-	u64 fair_server_bw = 0;
+	u64 dl_server_bw = 0;
 
 	rcu_read_lock_sched();
 	dl_b = dl_bw_of(cpu);
@@ -3739,27 +3777,26 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 		cap -= arch_scale_cpu_capacity(cpu);
 
 		/*
-		 * cpu is going offline and NORMAL tasks will be moved away
-		 * from it. We can thus discount dl_server bandwidth
-		 * contribution as it won't need to be servicing tasks after
-		 * the cpu is off.
+		 * cpu is going offline and NORMAL and EXT tasks will be
+		 * moved away from it. We can thus discount dl_server
+		 * bandwidth contribution as it won't need to be servicing
+		 * tasks after the cpu is off.
 		 */
-		if (cpu_rq(cpu)->fair_server.dl_server)
-			fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw;
+		dl_server_bw = dl_server_read_bw(cpu);
 
 		/*
 		 * Not much to check if no DEADLINE bandwidth is present.
 		 * dl_servers we can discount, as tasks will be moved out the
 		 * offlined CPUs anyway.
 		 */
-		if (dl_b->total_bw - fair_server_bw > 0) {
+		if (dl_b->total_bw - dl_server_bw > 0) {
 			/*
 			 * Leaving at least one CPU for DEADLINE tasks seems a
 			 * wise thing to do. As said above, cpu is not offline
 			 * yet, so account for that.
 			 */
 			if (dl_bw_cpus(cpu) - 1)
-				overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0);
+				overflow = __dl_overflow(dl_b, cap, dl_server_bw, 0);
 			else
 				overflow = 1;
 		}
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index afe28c04d5aa7..809f774183202 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -958,6 +958,8 @@ static void update_curr_scx(struct rq *rq)
 		if (!curr->scx.slice)
 			touch_core_sched(rq, curr);
 	}
+
+	dl_server_update(&rq->ext_server, delta_exec);
 }
 
 static bool scx_dsq_priq_less(struct rb_node *node_a,
@@ -1501,6 +1503,10 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 	if (enq_flags & SCX_ENQ_WAKEUP)
 		touch_core_sched(rq, p);
 
+	/* Start dl_server if this is the first task being enqueued */
+	if (rq->scx.nr_running == 1)
+		dl_server_start(&rq->ext_server);
+
 	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
 out:
 	rq->scx.flags &= ~SCX_RQ_IN_WAKEUP;
@@ -2512,6 +2518,33 @@ static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf)
 	return do_pick_task_scx(rq, rf, false);
 }
 
+/*
+ * Select the next task to run from the ext scheduling class.
+ *
+ * Use do_pick_task_scx() directly with @force_scx enabled, since the
+ * dl_server must always select a sched_ext task.
+ */
+static struct task_struct *
+ext_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
+{
+	if (!scx_enabled())
+		return NULL;
+
+	return do_pick_task_scx(dl_se->rq, rf, true);
+}
+
+/*
+ * Initialize the ext server deadline entity.
+ */
+void ext_server_init(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se = &rq->ext_server;
+
+	init_dl_entity(dl_se);
+
+	dl_server_init(dl_se, rq, ext_server_pick_task);
+}
+
 #ifdef CONFIG_SCHED_CORE
 /**
  * scx_prio_less - Task ordering for core-sched
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index abf8f15d60c9e..d6b4cda176ccf 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -536,6 +536,9 @@ static void update_curr_idle(struct rq *rq)
 	se->exec_start = now;
 
 	dl_server_update_idle(&rq->fair_server, delta_exec);
+#ifdef CONFIG_SCHED_CLASS_EXT
+	dl_server_update_idle(&rq->ext_server, delta_exec);
+#endif
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 93fce4bbff5ea..d630f46325379 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -414,6 +414,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 extern void sched_init_dl_servers(void);
 
 extern void fair_server_init(struct rq *rq);
+extern void ext_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
@@ -1151,6 +1152,7 @@ struct rq {
 	struct dl_rq		dl;
 #ifdef CONFIG_SCHED_CLASS_EXT
 	struct scx_rq		scx;
+	struct sched_dl_entity	ext_server;
 #endif
 
 	struct sched_dl_entity	fair_server;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd2..ac268da917781 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -508,6 +508,11 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (rq->fair_server.dl_server)
 		__dl_server_attach_root(&rq->fair_server, rq);
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (rq->ext_server.dl_server)
+		__dl_server_attach_root(&rq->ext_server, rq);
+#endif
+
 	rq_unlock_irqrestore(rq, &rf);
 
 	if (old_rd)
-- 
2.52.0



* [PATCH 5/7] sched/debug: Add support to change sched_ext server params
  2026-01-26  9:58 [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Andrea Righi
                   ` (3 preceding siblings ...)
  2026-01-26  9:59 ` [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks Andrea Righi
@ 2026-01-26  9:59 ` Andrea Righi
  2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Joel Fernandes
  2026-01-26  9:59 ` [PATCH 6/7] selftests/sched_ext: Add test for sched_ext dl_server Andrea Righi
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Andrea Righi @ 2026-01-26  9:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo, Joel Fernandes, David Vernet,
	Changwoo Min, Daniel Hodges, Christian Loehle, Emil Tsalapatis,
	sched-ext, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

When a sched_ext scheduler is loaded, tasks in the fair class are
automatically moved to the sched_ext class. Add support for modifying
the ext server parameters via debugfs, similar to how the fair server
parameters are modified.

Re-use common code between ext and fair servers as needed.

Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/debug.c | 157 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 133 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index dd793f8f3858a..2e9896668c6fd 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -336,14 +336,16 @@ enum dl_param {
 	DL_PERIOD,
 };
 
-static unsigned long fair_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */
-static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
+static unsigned long dl_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */
+static unsigned long dl_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
 
-static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf,
-				       size_t cnt, loff_t *ppos, enum dl_param param)
+static ssize_t sched_server_write_common(struct file *filp, const char __user *ubuf,
+					 size_t cnt, loff_t *ppos, enum dl_param param,
+					 void *server)
 {
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
+	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
 	u64 runtime, period;
 	int retval = 0;
 	size_t err;
@@ -356,8 +358,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 	scoped_guard (rq_lock_irqsave, rq) {
 		bool is_active;
 
-		runtime  = rq->fair_server.dl_runtime;
-		period = rq->fair_server.dl_period;
+		runtime = dl_se->dl_runtime;
+		period = dl_se->dl_period;
 
 		switch (param) {
 		case DL_RUNTIME:
@@ -373,25 +375,25 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		}
 
 		if (runtime > period ||
-		    period > fair_server_period_max ||
-		    period < fair_server_period_min) {
+		    period > dl_server_period_max ||
+		    period < dl_server_period_min) {
 			return  -EINVAL;
 		}
 
-		is_active = dl_server_active(&rq->fair_server);
+		is_active = dl_server_active(dl_se);
 		if (is_active) {
 			update_rq_clock(rq);
-			dl_server_stop(&rq->fair_server);
+			dl_server_stop(dl_se);
 		}
 
-		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
+		retval = dl_server_apply_params(dl_se, runtime, period, 0);
 
 		if (!runtime)
-			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
-					cpu_of(rq));
+			printk_deferred("%s server disabled in CPU %d, system may crash due to starvation.\n",
+					server == &rq->fair_server ? "Fair" : "Ext", cpu_of(rq));
 
 		if (is_active && runtime)
-			dl_server_start(&rq->fair_server);
+			dl_server_start(dl_se);
 
 		if (retval < 0)
 			return retval;
@@ -401,36 +403,42 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 	return cnt;
 }
 
-static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param)
+static size_t sched_server_show_common(struct seq_file *m, void *v, enum dl_param param,
+				       void *server)
 {
-	unsigned long cpu = (unsigned long) m->private;
-	struct rq *rq = cpu_rq(cpu);
+	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
 	u64 value;
 
 	switch (param) {
 	case DL_RUNTIME:
-		value = rq->fair_server.dl_runtime;
+		value = dl_se->dl_runtime;
 		break;
 	case DL_PERIOD:
-		value = rq->fair_server.dl_period;
+		value = dl_se->dl_period;
 		break;
 	}
 
 	seq_printf(m, "%llu\n", value);
 	return 0;
-
 }
 
 static ssize_t
 sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
 				size_t cnt, loff_t *ppos)
 {
-	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME);
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME,
+					&rq->fair_server);
 }
 
 static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
 {
-	return sched_fair_server_show(m, v, DL_RUNTIME);
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_RUNTIME, &rq->fair_server);
 }
 
 static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
@@ -446,16 +454,57 @@ static const struct file_operations fair_server_runtime_fops = {
 	.release	= single_release,
 };
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+static ssize_t
+sched_ext_server_runtime_write(struct file *filp, const char __user *ubuf,
+			       size_t cnt, loff_t *ppos)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME,
+					&rq->ext_server);
+}
+
+static int sched_ext_server_runtime_show(struct seq_file *m, void *v)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_RUNTIME, &rq->ext_server);
+}
+
+static int sched_ext_server_runtime_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_ext_server_runtime_show, inode->i_private);
+}
+
+static const struct file_operations ext_server_runtime_fops = {
+	.open		= sched_ext_server_runtime_open,
+	.write		= sched_ext_server_runtime_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
 static ssize_t
 sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
 			       size_t cnt, loff_t *ppos)
 {
-	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD);
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD,
+					&rq->fair_server);
 }
 
 static int sched_fair_server_period_show(struct seq_file *m, void *v)
 {
-	return sched_fair_server_show(m, v, DL_PERIOD);
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_PERIOD, &rq->fair_server);
 }
 
 static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
@@ -471,6 +520,40 @@ static const struct file_operations fair_server_period_fops = {
 	.release	= single_release,
 };
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+static ssize_t
+sched_ext_server_period_write(struct file *filp, const char __user *ubuf,
+			      size_t cnt, loff_t *ppos)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD,
+					&rq->ext_server);
+}
+
+static int sched_ext_server_period_show(struct seq_file *m, void *v)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_PERIOD, &rq->ext_server);
+}
+
+static int sched_ext_server_period_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_ext_server_period_show, inode->i_private);
+}
+
+static const struct file_operations ext_server_period_fops = {
+	.open		= sched_ext_server_period_open,
+	.write		= sched_ext_server_period_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
 static struct dentry *debugfs_sched;
 
 static void debugfs_fair_server_init(void)
@@ -494,6 +577,29 @@ static void debugfs_fair_server_init(void)
 	}
 }
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+static void debugfs_ext_server_init(void)
+{
+	struct dentry *d_ext;
+	unsigned long cpu;
+
+	d_ext = debugfs_create_dir("ext_server", debugfs_sched);
+	if (!d_ext)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct dentry *d_cpu;
+		char buf[32];
+
+		snprintf(buf, sizeof(buf), "cpu%lu", cpu);
+		d_cpu = debugfs_create_dir(buf, d_ext);
+
+		debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &ext_server_runtime_fops);
+		debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &ext_server_period_fops);
+	}
+}
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;
@@ -532,6 +638,9 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
 	debugfs_fair_server_init();
+#ifdef CONFIG_SCHED_CLASS_EXT
+	debugfs_ext_server_init();
+#endif
 
 	return 0;
 }
-- 
2.52.0



* [PATCH 6/7] selftests/sched_ext: Add test for sched_ext dl_server
  2026-01-26  9:58 [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Andrea Righi
                   ` (4 preceding siblings ...)
  2026-01-26  9:59 ` [PATCH 5/7] sched/debug: Add support to change sched_ext server params Andrea Righi
@ 2026-01-26  9:59 ` Andrea Righi
  2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Andrea Righi
  2026-01-26  9:59 ` [PATCH 7/7] selftests/sched_ext: Add test for DL server total_bw consistency Andrea Righi
  2026-02-02 16:45 ` [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Tejun Heo
  7 siblings, 1 reply; 40+ messages in thread
From: Andrea Righi @ 2026-01-26  9:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo, Joel Fernandes, David Vernet,
	Changwoo Min, Daniel Hodges, Christian Loehle, Emil Tsalapatis,
	sched-ext, linux-kernel

Add a selftest to validate the correct behavior of the deadline server
for the ext_sched_class.

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 tools/testing/selftests/sched_ext/Makefile    |   1 +
 .../selftests/sched_ext/rt_stall.bpf.c        |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c  | 240 ++++++++++++++++++
 3 files changed, 264 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c

diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index 5fe45f9c5f8fd..c9255d1499b6e 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -183,6 +183,7 @@ auto-test-targets :=			\
 	select_cpu_dispatch_bad_dsq	\
 	select_cpu_dispatch_dbl_dsp	\
 	select_cpu_vtime		\
+	rt_stall			\
 	test_example			\
 
 testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
diff --git a/tools/testing/selftests/sched_ext/rt_stall.bpf.c b/tools/testing/selftests/sched_ext/rt_stall.bpf.c
new file mode 100644
index 0000000000000..80086779dd1eb
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/rt_stall.bpf.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A scheduler that verifies whether RT tasks can stall SCHED_EXT tasks.
+ *
+ * Copyright (c) 2025 NVIDIA Corporation.
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+void BPF_STRUCT_OPS(rt_stall_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops rt_stall_ops = {
+	.exit			= (void *)rt_stall_exit,
+	.name			= "rt_stall",
+};
diff --git a/tools/testing/selftests/sched_ext/rt_stall.c b/tools/testing/selftests/sched_ext/rt_stall.c
new file mode 100644
index 0000000000000..015200f80f6e2
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/rt_stall.c
@@ -0,0 +1,240 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2025 NVIDIA Corporation.
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sched.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <linux/sched.h>
+#include <signal.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <unistd.h>
+#include "rt_stall.bpf.skel.h"
+#include "scx_test.h"
+#include "../kselftest.h"
+
+#define CORE_ID		0	/* CPU to pin tasks to */
+#define RUN_TIME        5	/* How long to run the test in seconds */
+
+/* Simple busy-wait function for test tasks */
+static void process_func(void)
+{
+	while (1) {
+		/* Busy wait */
+		for (volatile unsigned long i = 0; i < 10000000UL; i++)
+			;
+	}
+}
+
+/* Set CPU affinity to a specific core */
+static void set_affinity(int cpu)
+{
+	cpu_set_t mask;
+
+	CPU_ZERO(&mask);
+	CPU_SET(cpu, &mask);
+	if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
+		perror("sched_setaffinity");
+		exit(EXIT_FAILURE);
+	}
+}
+
+/* Set task scheduling policy and priority */
+static void set_sched(int policy, int priority)
+{
+	struct sched_param param;
+
+	param.sched_priority = priority;
+	if (sched_setscheduler(0, policy, &param) != 0) {
+		perror("sched_setscheduler");
+		exit(EXIT_FAILURE);
+	}
+}
+
+/* Get process runtime from /proc/<pid>/stat */
+static float get_process_runtime(int pid)
+{
+	char path[256];
+	FILE *file;
+	long utime, stime;
+	int fields;
+
+	snprintf(path, sizeof(path), "/proc/%d/stat", pid);
+	file = fopen(path, "r");
+	if (file == NULL) {
+		perror("Failed to open stat file");
+		return -1;
+	}
+
+	/* Skip the first 13 fields and read the 14th and 15th */
+	fields = fscanf(file,
+			"%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
+			&utime, &stime);
+	fclose(file);
+
+	if (fields != 2) {
+		fprintf(stderr, "Failed to read stat file\n");
+		return -1;
+	}
+
+	/* Calculate the total time spent in the process */
+	long total_time = utime + stime;
+	long ticks_per_second = sysconf(_SC_CLK_TCK);
+	float runtime_seconds = total_time * 1.0 / ticks_per_second;
+
+	return runtime_seconds;
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct rt_stall *skel;
+
+	skel = rt_stall__open();
+	SCX_FAIL_IF(!skel, "Failed to open");
+	SCX_ENUM_INIT(skel);
+	SCX_FAIL_IF(rt_stall__load(skel), "Failed to load skel");
+
+	*ctx = skel;
+
+	return SCX_TEST_PASS;
+}
+
+static bool sched_stress_test(bool is_ext)
+{
+	/*
+	 * We're expecting the EXT task to get around 5% of CPU time when
+	 * competing with the RT task (small 1% fluctuations are expected).
+	 *
+	 * However, the EXT task should get at least 4% of the CPU to prove
+	 * that the EXT deadline server is working correctly. A percentage
+	 * less than 4% indicates a bug where RT tasks can potentially
+	 * stall SCHED_EXT tasks, causing the test to fail.
+	 */
+	const float expected_min_ratio = 0.04; /* 4% */
+	const char *class_str = is_ext ? "EXT" : "FAIR";
+
+	float ext_runtime, rt_runtime, actual_ratio;
+	int ext_pid, rt_pid;
+
+	ksft_print_header();
+	ksft_set_plan(1);
+
+	/* Create and set up an EXT task */
+	ext_pid = fork();
+	if (ext_pid == 0) {
+		set_affinity(CORE_ID);
+		process_func();
+		exit(0);
+	} else if (ext_pid < 0) {
+		perror("fork task");
+		ksft_exit_fail();
+	}
+
+	/* Create an RT task */
+	rt_pid = fork();
+	if (rt_pid == 0) {
+		set_affinity(CORE_ID);
+		set_sched(SCHED_FIFO, 50);
+		process_func();
+		exit(0);
+	} else if (rt_pid < 0) {
+		perror("fork for RT task");
+		ksft_exit_fail();
+	}
+
+	/* Let the processes run for the specified time */
+	sleep(RUN_TIME);
+
+	/* Get runtime for the EXT task */
+	ext_runtime = get_process_runtime(ext_pid);
+	if (ext_runtime == -1)
+		ksft_exit_fail_msg("Error getting runtime for %s task (PID %d)\n",
+				   class_str, ext_pid);
+	ksft_print_msg("Runtime of %s task (PID %d) is %f seconds\n",
+		       class_str, ext_pid, ext_runtime);
+
+	/* Get runtime for the RT task */
+	rt_runtime = get_process_runtime(rt_pid);
+	if (rt_runtime == -1)
+		ksft_exit_fail_msg("Error getting runtime for RT task (PID %d)\n", rt_pid);
+	ksft_print_msg("Runtime of RT task (PID %d) is %f seconds\n", rt_pid, rt_runtime);
+
+	/* Kill the processes */
+	kill(ext_pid, SIGKILL);
+	kill(rt_pid, SIGKILL);
+	waitpid(ext_pid, NULL, 0);
+	waitpid(rt_pid, NULL, 0);
+
+	/* Verify that the scx task got enough runtime */
+	actual_ratio = ext_runtime / (ext_runtime + rt_runtime);
+	ksft_print_msg("%s task got %.2f%% of total runtime\n",
+		       class_str, actual_ratio * 100);
+
+	if (actual_ratio >= expected_min_ratio) {
+		ksft_test_result_pass("PASS: %s task got more than %.2f%% of runtime\n",
+				      class_str, expected_min_ratio * 100);
+		return true;
+	}
+	ksft_test_result_fail("FAIL: %s task got less than %.2f%% of runtime\n",
+			      class_str, expected_min_ratio * 100);
+	return false;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct rt_stall *skel = ctx;
+	struct bpf_link *link = NULL;
+	bool res;
+	int i;
+
+	/*
+	 * Test if the dl_server is working both with and without the
+	 * sched_ext scheduler attached.
+	 *
+	 * This ensures all the scenarios are covered:
+	 *   - fair_server stop -> ext_server start
+	 *   - ext_server stop -> fair_server start
+	 */
+	for (i = 0; i < 4; i++) {
+		bool is_ext = i % 2;
+
+		if (is_ext) {
+			memset(&skel->data->uei, 0, sizeof(skel->data->uei));
+			link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops);
+			SCX_FAIL_IF(!link, "Failed to attach scheduler");
+		}
+		res = sched_stress_test(is_ext);
+		if (is_ext) {
+			SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE));
+			bpf_link__destroy(link);
+		}
+
+		if (!res)
+			ksft_exit_fail();
+	}
+
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct rt_stall *skel = ctx;
+
+	rt_stall__destroy(skel);
+}
+
+struct scx_test rt_stall = {
+	.name = "rt_stall",
+	.description = "Verify that RT tasks cannot stall SCHED_EXT tasks",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&rt_stall)
-- 
2.52.0



* [PATCH 7/7] selftests/sched_ext: Add test for DL server total_bw consistency
  2026-01-26  9:58 [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Andrea Righi
                   ` (5 preceding siblings ...)
  2026-01-26  9:59 ` [PATCH 6/7] selftests/sched_ext: Add test for sched_ext dl_server Andrea Righi
@ 2026-01-26  9:59 ` Andrea Righi
  2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Joel Fernandes
  2026-02-02 16:45 ` [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Tejun Heo
  7 siblings, 1 reply; 40+ messages in thread
From: Andrea Righi @ 2026-01-26  9:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo, Joel Fernandes, David Vernet,
	Changwoo Min, Daniel Hodges, Christian Loehle, Emil Tsalapatis,
	sched-ext, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

Add a new kselftest to verify that the total_bw value in
/sys/kernel/debug/sched/debug remains consistent across all CPUs
under different sched_ext BPF program states:

1. Before a BPF scheduler is loaded
2. While a BPF scheduler is loaded and active
3. After a BPF scheduler is unloaded

The test runs CPU stress threads to ensure DL server bandwidth
values stabilize before checking consistency. This helps catch
potential issues with DL server bandwidth accounting during
sched_ext transitions.

Tested-by: Christian Loehle <christian.loehle@arm.com>
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 tools/testing/selftests/sched_ext/Makefile   |   1 +
 tools/testing/selftests/sched_ext/total_bw.c | 281 +++++++++++++++++++
 2 files changed, 282 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/total_bw.c

diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index c9255d1499b6e..2c601a7eaff5f 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -185,6 +185,7 @@ auto-test-targets :=			\
 	select_cpu_vtime		\
 	rt_stall			\
 	test_example			\
+	total_bw			\
 
 testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
 
diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
new file mode 100644
index 0000000000000..5b0a619bab86e
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/total_bw.c
@@ -0,0 +1,281 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test to verify that total_bw value remains consistent across all CPUs
+ * in different BPF program states.
+ *
+ * Copyright (C) 2025 NVIDIA Corporation.
+ */
+#include <bpf/bpf.h>
+#include <errno.h>
+#include <pthread.h>
+#include <scx/common.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "minimal.bpf.skel.h"
+#include "scx_test.h"
+
+#define MAX_CPUS 512
+#define STRESS_DURATION_SEC 5
+
+struct total_bw_ctx {
+	struct minimal *skel;
+	long baseline_bw[MAX_CPUS];
+	int nr_cpus;
+};
+
+static void *cpu_stress_thread(void *arg)
+{
+	volatile int i;
+	time_t end_time = time(NULL) + STRESS_DURATION_SEC;
+
+	while (time(NULL) < end_time)
+		for (i = 0; i < 1000000; i++)
+			;
+
+	return NULL;
+}
+
+/*
+ * The first enqueue on a CPU causes the DL server to start, for that
+ * reason run stressor threads in the hopes it schedules on all CPUs.
+ */
+static int run_cpu_stress(int nr_cpus)
+{
+	pthread_t *threads;
+	int i, ret = 0;
+
+	threads = calloc(nr_cpus, sizeof(pthread_t));
+	if (!threads)
+		return -ENOMEM;
+
+	/* Create threads to run on each CPU */
+	for (i = 0; i < nr_cpus; i++) {
+		if (pthread_create(&threads[i], NULL, cpu_stress_thread, NULL)) {
+			ret = -errno;
+			fprintf(stderr, "Failed to create thread %d: %s\n", i, strerror(-ret));
+			break;
+		}
+	}
+
+	/* Wait for all threads to complete */
+	for (i = 0; i < nr_cpus; i++) {
+		if (threads[i])
+			pthread_join(threads[i], NULL);
+	}
+
+	free(threads);
+	return ret;
+}
+
+static int read_total_bw_values(long *bw_values, int max_cpus)
+{
+	FILE *fp;
+	char line[256];
+	int cpu_count = 0;
+
+	fp = fopen("/sys/kernel/debug/sched/debug", "r");
+	if (!fp) {
+		SCX_ERR("Failed to open debug file");
+		return -1;
+	}
+
+	while (fgets(line, sizeof(line), fp)) {
+		char *bw_str = strstr(line, "total_bw");
+
+		if (bw_str) {
+			bw_str = strchr(bw_str, ':');
+			if (bw_str) {
+				/* Only store up to max_cpus values */
+				if (cpu_count < max_cpus)
+					bw_values[cpu_count] = atol(bw_str + 1);
+				cpu_count++;
+			}
+		}
+	}
+
+	fclose(fp);
+	return cpu_count;
+}
+
+static bool verify_total_bw_consistency(long *bw_values, int count)
+{
+	int i;
+	long first_value;
+
+	if (count <= 0)
+		return false;
+
+	first_value = bw_values[0];
+
+	for (i = 1; i < count; i++) {
+		if (bw_values[i] != first_value) {
+			SCX_ERR("Inconsistent total_bw: CPU0=%ld, CPU%d=%ld",
+				first_value, i, bw_values[i]);
+			return false;
+		}
+	}
+
+	return true;
+}
+
+static int fetch_verify_total_bw(long *bw_values, int nr_cpus)
+{
+	int attempts = 0;
+	int max_attempts = 10;
+	int count;
+
+	/*
+	 * The first enqueue on a CPU causes the DL server to start. For that
+	 * reason, run stressor threads in the hope that they get scheduled on
+	 * all CPUs.
+	 */
+	if (run_cpu_stress(nr_cpus) < 0) {
+		SCX_ERR("Failed to run CPU stress");
+		return -1;
+	}
+
+	/* Try multiple times to get stable values */
+	while (attempts < max_attempts) {
+		count = read_total_bw_values(bw_values, nr_cpus);
+		fprintf(stderr, "Read %d total_bw values (testing %d CPUs)\n", count, nr_cpus);
+		/* If system has more CPUs than we're testing, that's OK */
+		if (count < nr_cpus) {
+			SCX_ERR("Expected at least %d CPUs, got %d", nr_cpus, count);
+			attempts++;
+			sleep(1);
+			continue;
+		}
+
+		/* Only verify the CPUs we're testing */
+		if (verify_total_bw_consistency(bw_values, nr_cpus)) {
+			fprintf(stderr, "Values are consistent: %ld\n", bw_values[0]);
+			return 0;
+		}
+
+		attempts++;
+		sleep(1);
+	}
+
+	return -1;
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct total_bw_ctx *test_ctx;
+
+	if (access("/sys/kernel/debug/sched/debug", R_OK) != 0) {
+		fprintf(stderr, "Skipping test: debugfs sched/debug not accessible\n");
+		return SCX_TEST_SKIP;
+	}
+
+	test_ctx = calloc(1, sizeof(*test_ctx));
+	if (!test_ctx)
+		return SCX_TEST_FAIL;
+
+	test_ctx->nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+	if (test_ctx->nr_cpus <= 0) {
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	/* If system has more CPUs than MAX_CPUS, just test the first MAX_CPUS */
+	if (test_ctx->nr_cpus > MAX_CPUS)
+		test_ctx->nr_cpus = MAX_CPUS;
+
+	/* Test scenario 1: BPF program not loaded */
+	/* Read and verify baseline total_bw before loading BPF program */
+	fprintf(stderr, "BPF prog initially not loaded, reading total_bw values\n");
+	if (fetch_verify_total_bw(test_ctx->baseline_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable baseline values");
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	/* Load the BPF skeleton */
+	test_ctx->skel = minimal__open();
+	if (!test_ctx->skel) {
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	SCX_ENUM_INIT(test_ctx->skel);
+	if (minimal__load(test_ctx->skel)) {
+		minimal__destroy(test_ctx->skel);
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	*ctx = test_ctx;
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct total_bw_ctx *test_ctx = ctx;
+	struct bpf_link *link;
+	long loaded_bw[MAX_CPUS];
+	long unloaded_bw[MAX_CPUS];
+	int i;
+
+	/* Test scenario 2: BPF program loaded */
+	link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops);
+	if (!link) {
+		SCX_ERR("Failed to attach scheduler");
+		return SCX_TEST_FAIL;
+	}
+
+	fprintf(stderr, "BPF program loaded, reading total_bw values\n");
+	if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values with BPF loaded");
+		bpf_link__destroy(link);
+		return SCX_TEST_FAIL;
+	}
+	bpf_link__destroy(link);
+
+	/* Test scenario 3: BPF program unloaded */
+	fprintf(stderr, "BPF program unloaded, reading total_bw values\n");
+	if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values after BPF unload");
+		return SCX_TEST_FAIL;
+	}
+
+	/* Verify all three scenarios have the same total_bw values */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (test_ctx->baseline_bw[i] != loaded_bw[i]) {
+			SCX_ERR("CPU%d: baseline_bw=%ld != loaded_bw=%ld",
+				i, test_ctx->baseline_bw[i], loaded_bw[i]);
+			return SCX_TEST_FAIL;
+		}
+
+		if (test_ctx->baseline_bw[i] != unloaded_bw[i]) {
+			SCX_ERR("CPU%d: baseline_bw=%ld != unloaded_bw=%ld",
+				i, test_ctx->baseline_bw[i], unloaded_bw[i]);
+			return SCX_TEST_FAIL;
+		}
+	}
+
+	fprintf(stderr, "All total_bw values are consistent across all scenarios\n");
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct total_bw_ctx *test_ctx = ctx;
+
+	if (test_ctx) {
+		if (test_ctx->skel)
+			minimal__destroy(test_ctx->skel);
+		free(test_ctx);
+	}
+}
+
+struct scx_test total_bw = {
+	.name = "total_bw",
+	.description = "Verify total_bw consistency across BPF program states",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&total_bw)
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks
  2026-01-26  9:58 [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Andrea Righi
                   ` (6 preceding siblings ...)
  2026-01-26  9:59 ` [PATCH 7/7] selftests/sched_ext: Add test for DL server total_bw consistency Andrea Righi
@ 2026-02-02 16:45 ` Tejun Heo
  2026-02-02 19:56   ` Peter Zijlstra
  7 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2026-02-02 16:45 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

Hello,

On Mon, Jan 26, 2026 at 10:58:58AM +0100, Andrea Righi wrote:
> Changes in v12:
>  - Move dl_server execution state reset on stop fix to a separate patch
>    (https://lore.kernel.org/all/20260123161645.2181752-1-arighi@nvidia.com/)
>  - Removed per-patch changelog (keeping a global changelog here)
>  - Link to v11: https://lore.kernel.org/all/20260120215808.188032-1-arighi@nvidia.com/

Peter, Ingo, this patchset has been around the block for a long time and the
remaining deadline and debug patches are reviewed and seem fairly isolated.
Given that the patchset addresses an on-going issue, I'd prefer to land the
series before the merge window. If you want to route 1-3 (or the whole
series) through sched/core, please let me know. Otherwise, I can route them
through the sched_ext tree.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2026-01-26  9:59 ` [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks Andrea Righi
@ 2026-02-02 19:50   ` Peter Zijlstra
  2026-02-02 20:32     ` Andrea Righi
  2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Andrea Righi
  1 sibling, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2026-02-02 19:50 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

On Mon, Jan 26, 2026 at 10:59:02AM +0100, Andrea Righi wrote:

> @@ -3181,6 +3193,36 @@ void dl_add_task_root_domain(struct task_struct *p)
>  	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
>  }
>  
> +static void dl_server_add_bw(struct root_domain *rd, int cpu)
> +{
> +	struct sched_dl_entity *dl_se;
> +
> +	dl_se = &cpu_rq(cpu)->fair_server;
> +	if (dl_server(dl_se) && cpu_active(cpu))
> +		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
> +
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	dl_se = &cpu_rq(cpu)->ext_server;
> +	if (dl_server(dl_se) && cpu_active(cpu))
> +		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
> +#endif
> +}
> +
> +static u64 dl_server_read_bw(int cpu)
> +{
> +	u64 dl_bw = 0;
> +
> +	if (cpu_rq(cpu)->fair_server.dl_server)
> +		dl_bw += cpu_rq(cpu)->fair_server.dl_bw;
> +
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	if (cpu_rq(cpu)->ext_server.dl_server)
> +		dl_bw += cpu_rq(cpu)->ext_server.dl_bw;
> +#endif
> +
> +	return dl_bw;
> +}

Should not this also depend on scx_enabled()? It seems unfortunate to
consume bandwidth if scx isn't even enabled.



> @@ -1501,6 +1503,10 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
>  	if (enq_flags & SCX_ENQ_WAKEUP)
>  		touch_core_sched(rq, p);
>  
> +	/* Start dl_server if this is the first task being enqueued */
> +	if (rq->scx.nr_running == 1)
> +		dl_server_start(&rq->ext_server);
> +
>  	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
>  out:
>  	rq->scx.flags &= ~SCX_RQ_IN_WAKEUP;

So this starts the dl_server for the CPU the thing gets enqueued on, but
SCX being what it is, there is absolutely no guarantee it's actually ever
pickable from there, right?

Does it make sense to delay this until its a DSQ_LOCAL enqueue? Or will
it never get that far without help?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks
  2026-02-02 16:45 ` [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Tejun Heo
@ 2026-02-02 19:56   ` Peter Zijlstra
  2026-02-02 20:20     ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2026-02-02 19:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrea Righi, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

On Mon, Feb 02, 2026 at 06:45:29AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jan 26, 2026 at 10:58:58AM +0100, Andrea Righi wrote:
> > Changes in v12:
> >  - Move dl_server execution state reset on stop fix to a separate patch
> >    (https://lore.kernel.org/all/20260123161645.2181752-1-arighi@nvidia.com/)
> >  - Removed per-patch changelog (keeping a global changelog here)
> >  - Link to v11: https://lore.kernel.org/all/20260120215808.188032-1-arighi@nvidia.com/
> 
> Peter, Ingo, this patchset has been around the block for a long time and the
> remaining deadline and debug patches are reviewed and seem fairly isolated.

They're in well enough shape to merge I suppose, although I think
there's still a few issues to fix.

Notably I think it would be good to not have the scx_server consume dl
bandwidth if scx isn't enabled. And I wonder about that start condition.

> Given that the patchset addresses an on-going issue, I'd prefer to land the
> series before the merge window. If you want to route 1-3 (or the whole
> series) through sched/core, please let me know. Otherwise, I can route them
> through the sched_ext tree.

They seem to apply without issue to tip/sched/core, so if that works for
you I suppose I can stick them all in.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks
  2026-02-02 19:56   ` Peter Zijlstra
@ 2026-02-02 20:20     ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2026-02-02 20:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Righi, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

On Mon, Feb 02, 2026 at 08:56:34PM +0100, Peter Zijlstra wrote:
> > Given that the patchset addresses an on-going issue, I'd prefer to land the
> > series before the merge window. If you want to route 1-3 (or the whole
> > series) through sched/core, please let me know. Otherwise, I can route them
> > through the sched_ext tree.
> 
> They seem to apply without issue to tip/sched/core, so if that works for
> you I suppose I can stick them all in.

Yeah, that sounds great. Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2026-02-02 19:50   ` Peter Zijlstra
@ 2026-02-02 20:32     ` Andrea Righi
  2026-02-02 21:10       ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Righi @ 2026-02-02 20:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

Hi Peter,

On Mon, Feb 02, 2026 at 08:50:35PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 26, 2026 at 10:59:02AM +0100, Andrea Righi wrote:
> 
> > @@ -3181,6 +3193,36 @@ void dl_add_task_root_domain(struct task_struct *p)
> >  	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
> >  }
> >  
> > +static void dl_server_add_bw(struct root_domain *rd, int cpu)
> > +{
> > +	struct sched_dl_entity *dl_se;
> > +
> > +	dl_se = &cpu_rq(cpu)->fair_server;
> > +	if (dl_server(dl_se) && cpu_active(cpu))
> > +		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
> > +
> > +#ifdef CONFIG_SCHED_CLASS_EXT
> > +	dl_se = &cpu_rq(cpu)->ext_server;
> > +	if (dl_server(dl_se) && cpu_active(cpu))
> > +		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
> > +#endif
> > +}
> > +
> > +static u64 dl_server_read_bw(int cpu)
> > +{
> > +	u64 dl_bw = 0;
> > +
> > +	if (cpu_rq(cpu)->fair_server.dl_server)
> > +		dl_bw += cpu_rq(cpu)->fair_server.dl_bw;
> > +
> > +#ifdef CONFIG_SCHED_CLASS_EXT
> > +	if (cpu_rq(cpu)->ext_server.dl_server)
> > +		dl_bw += cpu_rq(cpu)->ext_server.dl_bw;
> > +#endif
> > +
> > +	return dl_bw;
> > +}
> 
> Should not this also depend on scx_enabled()? It seems unfortunate to
> consume bandwidth if scx isn't even enabled.

Yeah, that's a good point. We can just add scx_enabled() here. Let me try
running some tests with this.

> 
> 
> 
> > @@ -1501,6 +1503,10 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
> >  	if (enq_flags & SCX_ENQ_WAKEUP)
> >  		touch_core_sched(rq, p);
> >  
> > +	/* Start dl_server if this is the first task being enqueued */
> > +	if (rq->scx.nr_running == 1)
> > +		dl_server_start(&rq->ext_server);
> > +
> >  	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
> >  out:
> >  	rq->scx.flags &= ~SCX_RQ_IN_WAKEUP;
> 
> So this starts the dl_server for the CPU the thing gets enqueued on, but
> SCX being what it is, there is absolutely no guarantee it's actually ever
> pickable from there, right?
> 
> Does it make sense to delay this until its a DSQ_LOCAL enqueue? Or will
> it never get that far without help?

We could probably move dl_server_start() to local_dsq_post_enq() and start
the dl server when a task is actually dispatched to the local DSQ. I think
it should work, but I'd like to do more extensive testing with this change.

In any case, IMHO these are improvements we can make later; even in its
current form, this is still better than what we have, and it would at least
address the recurring bug reports about RT tasks starving scx tasks...

Thanks!
-Andrea

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2026-02-02 20:32     ` Andrea Righi
@ 2026-02-02 21:10       ` Peter Zijlstra
  2026-02-02 22:18         ` Andrea Righi
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2026-02-02 21:10 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

On Mon, Feb 02, 2026 at 09:32:32PM +0100, Andrea Righi wrote:
> Hi Peter,
> 
> On Mon, Feb 02, 2026 at 08:50:35PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 26, 2026 at 10:59:02AM +0100, Andrea Righi wrote:
> > 
> > > @@ -3181,6 +3193,36 @@ void dl_add_task_root_domain(struct task_struct *p)
> > >  	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
> > >  }
> > >  
> > > +static void dl_server_add_bw(struct root_domain *rd, int cpu)
> > > +{
> > > +	struct sched_dl_entity *dl_se;
> > > +
> > > +	dl_se = &cpu_rq(cpu)->fair_server;
> > > +	if (dl_server(dl_se) && cpu_active(cpu))
> > > +		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
> > > +
> > > +#ifdef CONFIG_SCHED_CLASS_EXT
> > > +	dl_se = &cpu_rq(cpu)->ext_server;
> > > +	if (dl_server(dl_se) && cpu_active(cpu))
> > > +		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
> > > +#endif
> > > +}
> > > +
> > > +static u64 dl_server_read_bw(int cpu)
> > > +{
> > > +	u64 dl_bw = 0;
> > > +
> > > +	if (cpu_rq(cpu)->fair_server.dl_server)
> > > +		dl_bw += cpu_rq(cpu)->fair_server.dl_bw;
> > > +
> > > +#ifdef CONFIG_SCHED_CLASS_EXT
> > > +	if (cpu_rq(cpu)->ext_server.dl_server)
> > > +		dl_bw += cpu_rq(cpu)->ext_server.dl_bw;
> > > +#endif
> > > +
> > > +	return dl_bw;
> > > +}
> > 
> > Should not this also depend on scx_enabled()? It seems unfortunate to
> > consume bandwidth if scx isn't even enabled.
> 
> Yeah, that's a good point. We can just add scx_enabled() here. Let me try
> running some tests with this.

I suspect you need some callbacks around where scx_enabled() is changed
to add/stop/remove things.

> > 
> > > @@ -1501,6 +1503,10 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
> > >  	if (enq_flags & SCX_ENQ_WAKEUP)
> > >  		touch_core_sched(rq, p);
> > >  
> > > +	/* Start dl_server if this is the first task being enqueued */
> > > +	if (rq->scx.nr_running == 1)
> > > +		dl_server_start(&rq->ext_server);
> > > +
> > >  	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
> > >  out:
> > >  	rq->scx.flags &= ~SCX_RQ_IN_WAKEUP;
> > 
> > So this starts the dl_server for the CPU the thing gets enqueued on, but
> > SCX being what it is, there is absolutely no guarantee it's actually ever
> > pickable from there, right?
> > 
> > Does it make sense to delay this until its a DSQ_LOCAL enqueue? Or will
> > it never get that far without help?
> 
> We could probably move dl_server_start() to local_dsq_post_enq() and start
> the dl server when a task is actually dispatched to the local DSQ. I think
> it should work, but I'd like to do more extensive testing with this change.
> 
> In any case, IMHO these are improvements we can make later, even in its
> current form, this is still better than what we have and it would at least
> address the recurring bug reports about RT tasks starving scx tasks...

Yeah, we can do on top.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] sched/debug: Stop and start server based on if it was active
  2026-01-26  9:59 ` [PATCH 3/7] sched/debug: Stop and start server based on if it was active Andrea Righi
@ 2026-02-02 21:13   ` Peter Zijlstra
  2026-02-02 21:14     ` Peter Zijlstra
  2026-02-02 21:17     ` Peter Zijlstra
  2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Joel Fernandes
  1 sibling, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2026-02-02 21:13 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

On Mon, Jan 26, 2026 at 10:59:01AM +0100, Andrea Righi wrote:
> From: Joel Fernandes <joelagnelf@nvidia.com>
> 
> Currently the DL server interface for applying parameters checks
> CFS-internals to identify if the server is active. This is error-prone
> and makes it difficult when adding new servers in the future.
> 
> Fix it by using dl_server_active(), which is also used by the DL server
> code to determine if the DL server was started.
> 
> Tested-by: Christian Loehle <christian.loehle@arm.com>
> Acked-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
> Reviewed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
>  kernel/sched/debug.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 93f009e1076d8..dd793f8f3858a 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -354,6 +354,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
>  		return err;
>  
>  	scoped_guard (rq_lock_irqsave, rq) {
> +		bool is_active;
> +
>  		runtime  = rq->fair_server.dl_runtime;
>  		period = rq->fair_server.dl_period;
>  
> @@ -376,8 +378,11 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
>  			return  -EINVAL;
>  		}
>  
> -		update_rq_clock(rq);
> -		dl_server_stop(&rq->fair_server);
> +		is_active = dl_server_active(&rq->fair_server);
> +		if (is_active) {
> +			update_rq_clock(rq);
> +			dl_server_stop(&rq->fair_server);
> +		}
>  
>  		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
>  
> @@ -385,7 +390,7 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
>  			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
>  					cpu_of(rq));
>  
> -		if (rq->cfs.h_nr_queued)
> +		if (is_active && runtime)
>  			dl_server_start(&rq->fair_server);
>  
>  		if (retval < 0)

Suppose runtime was 0, and gets incremented while there are already
tasks enqueued, then the above isn't going to DTRT.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] sched/debug: Stop and start server based on if it was active
  2026-02-02 21:13   ` Peter Zijlstra
@ 2026-02-02 21:14     ` Peter Zijlstra
  2026-02-02 21:17     ` Peter Zijlstra
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2026-02-02 21:14 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

On Mon, Feb 02, 2026 at 10:13:26PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 26, 2026 at 10:59:01AM +0100, Andrea Righi wrote:
> > From: Joel Fernandes <joelagnelf@nvidia.com>
> > 
> > Currently the DL server interface for applying parameters checks
> > CFS-internals to identify if the server is active. This is error-prone
> > and makes it difficult when adding new servers in the future.
> > 
> > Fix it by using dl_server_active(), which is also used by the DL server
> > code to determine if the DL server was started.
> > 
> > Tested-by: Christian Loehle <christian.loehle@arm.com>
> > Acked-by: Tejun Heo <tj@kernel.org>
> > Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
> > Reviewed-by: Andrea Righi <arighi@nvidia.com>
> > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > ---
> >  kernel/sched/debug.c | 11 ++++++++---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > index 93f009e1076d8..dd793f8f3858a 100644
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -354,6 +354,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> >  		return err;
> >  
> >  	scoped_guard (rq_lock_irqsave, rq) {
> > +		bool is_active;
> > +
> >  		runtime  = rq->fair_server.dl_runtime;
> >  		period = rq->fair_server.dl_period;
> >  
> > @@ -376,8 +378,11 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> >  			return  -EINVAL;
> >  		}
> >  
> > -		update_rq_clock(rq);
> > -		dl_server_stop(&rq->fair_server);
> > +		is_active = dl_server_active(&rq->fair_server);
> > +		if (is_active) {
> > +			update_rq_clock(rq);
> > +			dl_server_stop(&rq->fair_server);
> > +		}
> >  
> >  		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
> >  
> > @@ -385,7 +390,7 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> >  			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
> >  					cpu_of(rq));
> >  
> > -		if (rq->cfs.h_nr_queued)
> > +		if (is_active && runtime)
> >  			dl_server_start(&rq->fair_server);
> >  
> >  		if (retval < 0)
> 
> Suppose runtime was 0, and gets incremented while there are already
> tasks enqueued, then the above isn't going to DTRT.

Perhaps simply make that:

	if (runtime)
		dl_server_start();

That might spuriously start the thing, but that should be harmless. It
will just go back to sleep for not finding any tasks to run and all
that.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] sched/debug: Stop and start server based on if it was active
  2026-02-02 21:13   ` Peter Zijlstra
  2026-02-02 21:14     ` Peter Zijlstra
@ 2026-02-02 21:17     ` Peter Zijlstra
  2026-02-02 22:37       ` Andrea Righi
  2026-02-03 10:11       ` Andrea Righi
  1 sibling, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2026-02-02 21:17 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

On Mon, Feb 02, 2026 at 10:13:26PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 26, 2026 at 10:59:01AM +0100, Andrea Righi wrote:
> > From: Joel Fernandes <joelagnelf@nvidia.com>
> > 
> > Currently the DL server interface for applying parameters checks
> > CFS-internals to identify if the server is active. This is error-prone
> > and makes it difficult when adding new servers in the future.
> > 
> > Fix it by using dl_server_active(), which is also used by the DL server
> > code to determine if the DL server was started.
> > 
> > Tested-by: Christian Loehle <christian.loehle@arm.com>
> > Acked-by: Tejun Heo <tj@kernel.org>
> > Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
> > Reviewed-by: Andrea Righi <arighi@nvidia.com>
> > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > ---
> >  kernel/sched/debug.c | 11 ++++++++---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > index 93f009e1076d8..dd793f8f3858a 100644
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -354,6 +354,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> >  		return err;
> >  
> >  	scoped_guard (rq_lock_irqsave, rq) {
> > +		bool is_active;
> > +
> >  		runtime  = rq->fair_server.dl_runtime;
> >  		period = rq->fair_server.dl_period;
> >  
> > @@ -376,8 +378,11 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> >  			return  -EINVAL;
> >  		}
> >  
> > -		update_rq_clock(rq);
> > -		dl_server_stop(&rq->fair_server);
> > +		is_active = dl_server_active(&rq->fair_server);
> > +		if (is_active) {
> > +			update_rq_clock(rq);
> > +			dl_server_stop(&rq->fair_server);
> > +		}
> >  
> >  		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
> >  
> > @@ -385,7 +390,7 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> >  			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
> >  					cpu_of(rq));
> >  
> > -		if (rq->cfs.h_nr_queued)
> > +		if (is_active && runtime)
> >  			dl_server_start(&rq->fair_server);
> >  
> >  		if (retval < 0)
> 
> Suppose runtime was 0, and gets incremented while there are already
> tasks enqueued, then the above isn't going to DTRT.

Something like so perhaps?

---
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 59e650f9d436..884bdf7a292f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -340,7 +340,7 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
 	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
-	u64 runtime, period;
+	u64 old_runtime, runtime, period;
 	int retval = 0;
 	size_t err;
 	u64 value;
@@ -352,7 +352,7 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
 	scoped_guard (rq_lock_irqsave, rq) {
 		bool is_active;
 
-		runtime = dl_se->dl_runtime;
+		old_runtime = runtime = dl_se->dl_runtime;
 		period = dl_se->dl_period;
 
 		switch (param) {
@@ -382,17 +382,20 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
 
 		retval = dl_server_apply_params(dl_se, runtime, period, 0);
 
-		if (!runtime)
-			printk_deferred("%s server disabled in CPU %d, system may crash due to starvation.\n",
-					server == &rq->fair_server ? "Fair" : "Ext", cpu_of(rq));
-
-		if (is_active && runtime)
+		if (runtime)
 			dl_server_start(dl_se);
 
 		if (retval < 0)
 			return retval;
 	}
 
+	if (!!old_runtime ^ !!runtime) {
+		pr_info("%s server %sabled in CPU %d, system may crash due to starvation.\n",
+			server == &rq->fair_server ? "Fair" : "Ext",
+			runtime ? "en" : "dis",
+			cpu_of(rq));
+	}
+
 	*ppos += cnt;
 	return cnt;
 }

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks
  2026-02-02 21:10       ` Peter Zijlstra
@ 2026-02-02 22:18         ` Andrea Righi
  0 siblings, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2026-02-02 22:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

On Mon, Feb 02, 2026 at 10:10:04PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 09:32:32PM +0100, Andrea Righi wrote:
> > Hi Peter,
> > 
> > On Mon, Feb 02, 2026 at 08:50:35PM +0100, Peter Zijlstra wrote:
> > > On Mon, Jan 26, 2026 at 10:59:02AM +0100, Andrea Righi wrote:
> > > 
> > > > @@ -3181,6 +3193,36 @@ void dl_add_task_root_domain(struct task_struct *p)
> > > >  	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
> > > >  }
> > > >  
> > > > +static void dl_server_add_bw(struct root_domain *rd, int cpu)
> > > > +{
> > > > +	struct sched_dl_entity *dl_se;
> > > > +
> > > > +	dl_se = &cpu_rq(cpu)->fair_server;
> > > > +	if (dl_server(dl_se) && cpu_active(cpu))
> > > > +		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
> > > > +
> > > > +#ifdef CONFIG_SCHED_CLASS_EXT
> > > > +	dl_se = &cpu_rq(cpu)->ext_server;
> > > > +	if (dl_server(dl_se) && cpu_active(cpu))
> > > > +		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
> > > > +#endif
> > > > +}
> > > > +
> > > > +static u64 dl_server_read_bw(int cpu)
> > > > +{
> > > > +	u64 dl_bw = 0;
> > > > +
> > > > +	if (cpu_rq(cpu)->fair_server.dl_server)
> > > > +		dl_bw += cpu_rq(cpu)->fair_server.dl_bw;
> > > > +
> > > > +#ifdef CONFIG_SCHED_CLASS_EXT
> > > > +	if (cpu_rq(cpu)->ext_server.dl_server)
> > > > +		dl_bw += cpu_rq(cpu)->ext_server.dl_bw;
> > > > +#endif
> > > > +
> > > > +	return dl_bw;
> > > > +}
> > > 
> > > Should not this also depend on scx_enabled()? It seems unfortunate to
> > > consume bandwidth if scx isn't even enabled.
> > 
> > Yeah, that's a good point. We can just add scx_enabled() here. Let me try
> > running some tests with this.
> 
> I suspect you need some callbacks around where scx_enabled() is changed
> to add/stop/remove things.

Ah, that's right, simply checking scx_enabled() isn't enough, because the
state can change at runtime; we need callbacks in scx_enable() and
scx_disable_workfn(). I'll test this tomorrow. :)

Thanks!
-Andrea

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] sched/debug: Stop and start server based on if it was active
  2026-02-02 21:17     ` Peter Zijlstra
@ 2026-02-02 22:37       ` Andrea Righi
  2026-02-03 10:34         ` Peter Zijlstra
  2026-02-03 10:11       ` Andrea Righi
  1 sibling, 1 reply; 40+ messages in thread
From: Andrea Righi @ 2026-02-02 22:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

On Mon, Feb 02, 2026 at 10:17:23PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 10:13:26PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 26, 2026 at 10:59:01AM +0100, Andrea Righi wrote:
> > > From: Joel Fernandes <joelagnelf@nvidia.com>
> > > 
> > > Currently the DL server interface for applying parameters checks
> > > CFS-internals to identify if the server is active. This is error-prone
> > > and makes it difficult when adding new servers in the future.
> > > 
> > > Fix it, by using dl_server_active() which is also used by the DL server
> > > code to determine if the DL server was started.
> > > 
> > > Tested-by: Christian Loehle <christian.loehle@arm.com>
> > > Acked-by: Tejun Heo <tj@kernel.org>
> > > Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
> > > Reviewed-by: Andrea Righi <arighi@nvidia.com>
> > > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > > ---
> > >  kernel/sched/debug.c | 11 ++++++++---
> > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > > index 93f009e1076d8..dd793f8f3858a 100644
> > > --- a/kernel/sched/debug.c
> > > +++ b/kernel/sched/debug.c
> > > @@ -354,6 +354,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> > >  		return err;
> > >  
> > >  	scoped_guard (rq_lock_irqsave, rq) {
> > > +		bool is_active;
> > > +
> > >  		runtime  = rq->fair_server.dl_runtime;
> > >  		period = rq->fair_server.dl_period;
> > >  
> > > @@ -376,8 +378,11 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> > >  			return  -EINVAL;
> > >  		}
> > >  
> > > -		update_rq_clock(rq);
> > > -		dl_server_stop(&rq->fair_server);
> > > +		is_active = dl_server_active(&rq->fair_server);
> > > +		if (is_active) {
> > > +			update_rq_clock(rq);
> > > +			dl_server_stop(&rq->fair_server);
> > > +		}
> > >  
> > >  		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
> > >  
> > > @@ -385,7 +390,7 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> > >  			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
> > >  					cpu_of(rq));
> > >  
> > > -		if (rq->cfs.h_nr_queued)
> > > +		if (is_active && runtime)
> > >  			dl_server_start(&rq->fair_server);
> > >  
> > >  		if (retval < 0)
> > 
> > Suppose runtime was 0, and gets incremented while there are already
> > tasks enqueued, then the above isn't going to DTRT.
> 

Right, that's a bug: if the user sets runtime=0, tasks can be enqueued while
the server is disabled; when the user later sets runtime>0 to re-enable the
server, the current code doesn't start it.

Also, as discussed with Juri at LPC, we likely need a better interface than
debugfs for configuring these parameters (possibly a topic for OSPM).

> Something like so perhaps?
> 
> ---
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 59e650f9d436..884bdf7a292f 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -340,7 +340,7 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
>  	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
>  	struct rq *rq = cpu_rq(cpu);
>  	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
> -	u64 runtime, period;
> +	u64 old_runtime, runtime, period;
>  	int retval = 0;
>  	size_t err;
>  	u64 value;
> @@ -352,7 +352,7 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
>  	scoped_guard (rq_lock_irqsave, rq) {
>  		bool is_active;
>  
> -		runtime = dl_se->dl_runtime;
> +		old_runtime = runtime = dl_se->dl_runtime;
>  		period = dl_se->dl_period;
>  
>  		switch (param) {
> @@ -382,17 +382,20 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
>  
>  		retval = dl_server_apply_params(dl_se, runtime, period, 0);
>  
> -		if (!runtime)
> -			printk_deferred("%s server disabled in CPU %d, system may crash due to starvation.\n",
> -					server == &rq->fair_server ? "Fair" : "Ext", cpu_of(rq));
> -
> -		if (is_active && runtime)
> +		if (runtime)
>  			dl_server_start(dl_se);
>  
>  		if (retval < 0)
>  			return retval;
>  	}
>  
> +	if (!!old_runtime ^ !!runtime) {
> +		pr_info("%s server %sabled in CPU %d, system may crash due to starvation.\n",
> +			server == &rq->fair_server ? "Fair" : "Ext",
> +			runtime ? "en" : "dis",
> +			cpu_of(rq));

Or:

    pr_info("%s server %sabled in CPU %d%s\n",
              server == &rq->fair_server ? "Fair" : "Ext",
              runtime ? "en" : "dis",
              cpu_of(rq),
              runtime ? "" : ", system may crash due to starvation");

> +	}
> +
>  	*ppos += cnt;
>  	return cnt;
>  }

I like that, it should fix the issue.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] sched/debug: Stop and start server based on if it was active
  2026-02-02 21:17     ` Peter Zijlstra
  2026-02-02 22:37       ` Andrea Righi
@ 2026-02-03 10:11       ` Andrea Righi
  1 sibling, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2026-02-03 10:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

Hi Peter,

On Mon, Feb 02, 2026 at 10:17:23PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 10:13:26PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 26, 2026 at 10:59:01AM +0100, Andrea Righi wrote:
> > > From: Joel Fernandes <joelagnelf@nvidia.com>
> > > 
> > > Currently the DL server interface for applying parameters checks
> > > CFS-internals to identify if the server is active. This is error-prone
> > > and makes it difficult when adding new servers in the future.
> > > 
> > > Fix it, by using dl_server_active() which is also used by the DL server
> > > code to determine if the DL server was started.
> > > 
> > > Tested-by: Christian Loehle <christian.loehle@arm.com>
> > > Acked-by: Tejun Heo <tj@kernel.org>
> > > Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
> > > Reviewed-by: Andrea Righi <arighi@nvidia.com>
> > > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > > ---
> > >  kernel/sched/debug.c | 11 ++++++++---
> > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > > index 93f009e1076d8..dd793f8f3858a 100644
> > > --- a/kernel/sched/debug.c
> > > +++ b/kernel/sched/debug.c
> > > @@ -354,6 +354,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> > >  		return err;
> > >  
> > >  	scoped_guard (rq_lock_irqsave, rq) {
> > > +		bool is_active;
> > > +
> > >  		runtime  = rq->fair_server.dl_runtime;
> > >  		period = rq->fair_server.dl_period;
> > >  
> > > @@ -376,8 +378,11 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> > >  			return  -EINVAL;
> > >  		}
> > >  
> > > -		update_rq_clock(rq);
> > > -		dl_server_stop(&rq->fair_server);
> > > +		is_active = dl_server_active(&rq->fair_server);
> > > +		if (is_active) {
> > > +			update_rq_clock(rq);
> > > +			dl_server_stop(&rq->fair_server);
> > > +		}
> > >  
> > >  		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
> > >  
> > > @@ -385,7 +390,7 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> > >  			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
> > >  					cpu_of(rq));
> > >  
> > > -		if (rq->cfs.h_nr_queued)
> > > +		if (is_active && runtime)
> > >  			dl_server_start(&rq->fair_server);
> > >  
> > >  		if (retval < 0)
> > 
> > Suppose runtime was 0, and gets incremented while there are already
> > tasks enqueued, then the above isn't going to DTRT.
> 
> Something like so perhaps?
> 
> ---
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 59e650f9d436..884bdf7a292f 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -340,7 +340,7 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
>  	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
>  	struct rq *rq = cpu_rq(cpu);
>  	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
> -	u64 runtime, period;
> +	u64 old_runtime, runtime, period;
>  	int retval = 0;
>  	size_t err;
>  	u64 value;
> @@ -352,7 +352,7 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
>  	scoped_guard (rq_lock_irqsave, rq) {
>  		bool is_active;
>  
> -		runtime = dl_se->dl_runtime;
> +		old_runtime = runtime = dl_se->dl_runtime;
>  		period = dl_se->dl_period;
>  
>  		switch (param) {
> @@ -382,17 +382,20 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
>  
>  		retval = dl_server_apply_params(dl_se, runtime, period, 0);
>  
> -		if (!runtime)
> -			printk_deferred("%s server disabled in CPU %d, system may crash due to starvation.\n",
> -					server == &rq->fair_server ? "Fair" : "Ext", cpu_of(rq));
> -
> -		if (is_active && runtime)
> +		if (runtime)
>  			dl_server_start(dl_se);
>  
>  		if (retval < 0)
>  			return retval;
>  	}
>  
> +	if (!!old_runtime ^ !!runtime) {
> +		pr_info("%s server %sabled in CPU %d, system may crash due to starvation.\n",
> +			server == &rq->fair_server ? "Fair" : "Ext",
> +			runtime ? "en" : "dis",
> +			cpu_of(rq));
> +	}
> +
>  	*ppos += cnt;
>  	return cnt;
>  }

I slightly changed your patch (see below), adding a missing
update_rq_clock(rq) before starting the DL server and updating the pr_info
message, as mentioned in my previous email.

I ran some tests, and with this change the DL server starts correctly when
runtime is changed from 0 to a value > 0, so this fixes the issue.

Would you prefer that I send an updated patch set, or should we apply this
fix on top?

Thanks,
-Andrea

---
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2e9896668c6fd..dbd5e67a16c67 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -346,7 +346,7 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
 	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
-	u64 runtime, period;
+	u64 old_runtime, runtime, period;
 	int retval = 0;
 	size_t err;
 	u64 value;
@@ -358,7 +358,7 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
 	scoped_guard (rq_lock_irqsave, rq) {
 		bool is_active;
 
-		runtime = dl_se->dl_runtime;
+		old_runtime = runtime = dl_se->dl_runtime;
 		period = dl_se->dl_period;
 
 		switch (param) {
@@ -388,17 +388,23 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
 
 		retval = dl_server_apply_params(dl_se, runtime, period, 0);
 
-		if (!runtime)
-			printk_deferred("%s server disabled in CPU %d, system may crash due to starvation.\n",
-					server == &rq->fair_server ? "Fair" : "Ext", cpu_of(rq));
-
-		if (is_active && runtime)
+		if (runtime) {
+			update_rq_clock(rq);
 			dl_server_start(dl_se);
+		}
 
 		if (retval < 0)
 			return retval;
 	}
 
+	if (!!old_runtime ^ !!runtime) {
+		pr_info("%s server %sabled in CPU %d%s\n",
+			server == &rq->fair_server ? "Fair" : "Ext",
+			runtime ? "en" : "dis",
+			cpu_of(rq),
+			runtime ? "" : ", system may crash due to starvation");
+	}
+
 	*ppos += cnt;
 	return cnt;
 }
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] sched/debug: Stop and start server based on if it was active
  2026-02-02 22:37       ` Andrea Righi
@ 2026-02-03 10:34         ` Peter Zijlstra
  2026-02-03 11:18           ` [tip: sched/core] sched/debug: Fix dl_server (re)start conditions tip-bot2 for Peter Zijlstra
  2026-02-03 13:50           ` [PATCH 3/7] sched/debug: Stop and start server based on if it was active Andrea Righi
  0 siblings, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2026-02-03 10:34 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

On Mon, Feb 02, 2026 at 11:37:31PM +0100, Andrea Righi wrote:

> Or:
> 
>     pr_info("%s server %sabled in CPU %d%s\n",
>               server == &rq->fair_server ? "Fair" : "Ext",
>               runtime ? "en" : "dis",
>               cpu_of(rq),
>               runtime ? "" : ", system may crash due to starvation");

Yeah, I noticed it was a bit wonky. I made it thus.

> > +	}
> > +
> >  	*ppos += cnt;
> >  	return cnt;
> >  }
> 
> I like that, it should fix the issue.

There is one more issue: when dl_server_apply_params() fails, we should
test old_runtime to determine whether we should (re)start the dl_server.

I've ended up with this.

---
Subject: sched/debug: Fix dl_server (re)start conditions
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Feb 3 11:05:12 CET 2026

There are two problems with sched_server_write_common() that can cause the
dl_server to malfunction upon attempting to change the parameters:

1) when, after having disabled the dl_server by setting runtime=0, it is
   enabled again while tasks are already enqueued. In this case is_active would
   still be 0 and dl_server_start() would not be called.

2) when dl_server_apply_params() fails, runtime is not applied and does
   not reflect the new state.

Instead have dl_server_start() check its actual dl_runtime, and have
sched_server_write_common() unconditionally (re)start the dl_server. It will
automatically stop if there isn't anything to do, so spurious activation is
harmless -- while failing to start it is a problem.

While there, move the printk out of the locked region and make it symmetric,
also printing on enable.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/deadline.c |    5 ++---
 kernel/sched/debug.c    |   32 ++++++++++++++------------------
 2 files changed, 16 insertions(+), 21 deletions(-)

--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1784,7 +1784,7 @@ void dl_server_start(struct sched_dl_ent
 {
 	struct rq *rq = dl_se->rq;
 
-	if (!dl_server(dl_se) || dl_se->dl_server_active)
+	if (!dl_server(dl_se) || dl_se->dl_server_active || !dl_se->dl_runtime)
 		return;
 
 	/*
@@ -1882,7 +1882,6 @@ int dl_server_apply_params(struct sched_
 	int cpu = cpu_of(rq);
 	struct dl_bw *dl_b;
 	unsigned long cap;
-	int retval = 0;
 	int cpus;
 
 	dl_b = dl_bw_of(cpu);
@@ -1914,7 +1913,7 @@ int dl_server_apply_params(struct sched_
 	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
 	dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
 
-	return retval;
+	return 0;
 }
 
 /*
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -338,9 +338,9 @@ static ssize_t sched_server_write_common
 					 void *server)
 {
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
-	struct rq *rq = cpu_rq(cpu);
 	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
-	u64 runtime, period;
+	u64 old_runtime, runtime, period;
+	struct rq *rq = cpu_rq(cpu);
 	int retval = 0;
 	size_t err;
 	u64 value;
@@ -350,9 +350,7 @@ static ssize_t sched_server_write_common
 		return err;
 
 	scoped_guard (rq_lock_irqsave, rq) {
-		bool is_active;
-
-		runtime = dl_se->dl_runtime;
+		old_runtime = runtime = dl_se->dl_runtime;
 		period = dl_se->dl_period;
 
 		switch (param) {
@@ -374,25 +372,23 @@ static ssize_t sched_server_write_common
 			return  -EINVAL;
 		}
 
-		is_active = dl_server_active(dl_se);
-		if (is_active) {
-			update_rq_clock(rq);
-			dl_server_stop(dl_se);
-		}
-
+		update_rq_clock(rq);
+		dl_server_stop(dl_se);
 		retval = dl_server_apply_params(dl_se, runtime, period, 0);
-
-		if (!runtime)
-			printk_deferred("%s server disabled in CPU %d, system may crash due to starvation.\n",
-					server == &rq->fair_server ? "Fair" : "Ext", cpu_of(rq));
-
-		if (is_active && runtime)
-			dl_server_start(dl_se);
+		dl_server_start(dl_se);
 
 		if (retval < 0)
 			return retval;
 	}
 
+	if (!!old_runtime ^ !!runtime) {
+		pr_info("%s server %sabled on CPU %d%s.\n",
+			server == &rq->fair_server ? "Fair" : "Ext",
+			runtime ? "en" : "dis",
+			cpu_of(rq),
+			runtime ? "" : ", system may malfunction due to starvation");
+	}
+
 	*ppos += cnt;
 	return cnt;
 }

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tip: sched/core] selftests/sched_ext: Add test for DL server total_bw consistency
  2026-01-26  9:59 ` [PATCH 7/7] selftests/sched_ext: Add test for DL server total_bw consistency Andrea Righi
@ 2026-02-03 11:18   ` tip-bot2 for Joel Fernandes
  0 siblings, 0 replies; 40+ messages in thread
From: tip-bot2 for Joel Fernandes @ 2026-02-03 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Andrea Righi, Joel Fernandes, Peter Zijlstra (Intel),
	Christian Loehle, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     dd6a37e8faa723c680cb8615efa5b042691b927f
Gitweb:        https://git.kernel.org/tip/dd6a37e8faa723c680cb8615efa5b042691b927f
Author:        Joel Fernandes <joelagnelf@nvidia.com>
AuthorDate:    Mon, 26 Jan 2026 10:59:05 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 03 Feb 2026 12:04:18 +01:00

selftests/sched_ext: Add test for DL server total_bw consistency

Add a new kselftest to verify that the total_bw value in
/sys/kernel/debug/sched/debug remains consistent across all CPUs
under different sched_ext BPF program states:

1. Before a BPF scheduler is loaded
2. While a BPF scheduler is loaded and active
3. After a BPF scheduler is unloaded

The test runs CPU stress threads to ensure DL server bandwidth
values stabilize before checking consistency. This helps catch
potential issues with DL server bandwidth accounting during
sched_ext transitions.

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-8-arighi@nvidia.com
---
 tools/testing/selftests/sched_ext/Makefile   |   1 +-
 tools/testing/selftests/sched_ext/total_bw.c | 281 ++++++++++++++++++-
 2 files changed, 282 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/total_bw.c

diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index c9255d1..2c601a7 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -185,6 +185,7 @@ auto-test-targets :=			\
 	select_cpu_vtime		\
 	rt_stall			\
 	test_example			\
+	total_bw			\
 
 testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
 
diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
new file mode 100644
index 0000000..5b0a619
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/total_bw.c
@@ -0,0 +1,281 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test to verify that total_bw value remains consistent across all CPUs
+ * in different BPF program states.
+ *
+ * Copyright (C) 2025 NVIDIA Corporation.
+ */
+#include <bpf/bpf.h>
+#include <errno.h>
+#include <pthread.h>
+#include <scx/common.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "minimal.bpf.skel.h"
+#include "scx_test.h"
+
+#define MAX_CPUS 512
+#define STRESS_DURATION_SEC 5
+
+struct total_bw_ctx {
+	struct minimal *skel;
+	long baseline_bw[MAX_CPUS];
+	int nr_cpus;
+};
+
+static void *cpu_stress_thread(void *arg)
+{
+	volatile int i;
+	time_t end_time = time(NULL) + STRESS_DURATION_SEC;
+
+	while (time(NULL) < end_time)
+		for (i = 0; i < 1000000; i++)
+			;
+
+	return NULL;
+}
+
+/*
+ * The first enqueue on a CPU causes the DL server to start; for that
+ * reason, run stressor threads in the hope that they schedule on all CPUs.
+ */
+static int run_cpu_stress(int nr_cpus)
+{
+	pthread_t *threads;
+	int i, ret = 0;
+
+	threads = calloc(nr_cpus, sizeof(pthread_t));
+	if (!threads)
+		return -ENOMEM;
+
+	/* Create threads to run on each CPU */
+	for (i = 0; i < nr_cpus; i++) {
+		if (pthread_create(&threads[i], NULL, cpu_stress_thread, NULL)) {
+			ret = -errno;
+			fprintf(stderr, "Failed to create thread %d: %s\n", i, strerror(-ret));
+			break;
+		}
+	}
+
+	/* Wait for all threads to complete */
+	for (i = 0; i < nr_cpus; i++) {
+		if (threads[i])
+			pthread_join(threads[i], NULL);
+	}
+
+	free(threads);
+	return ret;
+}
+
+static int read_total_bw_values(long *bw_values, int max_cpus)
+{
+	FILE *fp;
+	char line[256];
+	int cpu_count = 0;
+
+	fp = fopen("/sys/kernel/debug/sched/debug", "r");
+	if (!fp) {
+		SCX_ERR("Failed to open debug file");
+		return -1;
+	}
+
+	while (fgets(line, sizeof(line), fp)) {
+		char *bw_str = strstr(line, "total_bw");
+
+		if (bw_str) {
+			bw_str = strchr(bw_str, ':');
+			if (bw_str) {
+				/* Only store up to max_cpus values */
+				if (cpu_count < max_cpus)
+					bw_values[cpu_count] = atol(bw_str + 1);
+				cpu_count++;
+			}
+		}
+	}
+
+	fclose(fp);
+	return cpu_count;
+}
+
+static bool verify_total_bw_consistency(long *bw_values, int count)
+{
+	int i;
+	long first_value;
+
+	if (count <= 0)
+		return false;
+
+	first_value = bw_values[0];
+
+	for (i = 1; i < count; i++) {
+		if (bw_values[i] != first_value) {
+			SCX_ERR("Inconsistent total_bw: CPU0=%ld, CPU%d=%ld",
+				first_value, i, bw_values[i]);
+			return false;
+		}
+	}
+
+	return true;
+}
+
+static int fetch_verify_total_bw(long *bw_values, int nr_cpus)
+{
+	int attempts = 0;
+	int max_attempts = 10;
+	int count;
+
+	/*
+	 * The first enqueue on a CPU causes the DL server to start; for that
+	 * reason, run stressor threads in the hope that they schedule on all CPUs.
+	 */
+	if (run_cpu_stress(nr_cpus) < 0) {
+		SCX_ERR("Failed to run CPU stress");
+		return -1;
+	}
+
+	/* Try multiple times to get stable values */
+	while (attempts < max_attempts) {
+		count = read_total_bw_values(bw_values, nr_cpus);
+		fprintf(stderr, "Read %d total_bw values (testing %d CPUs)\n", count, nr_cpus);
+		/* If system has more CPUs than we're testing, that's OK */
+		if (count < nr_cpus) {
+			SCX_ERR("Expected at least %d CPUs, got %d", nr_cpus, count);
+			attempts++;
+			sleep(1);
+			continue;
+		}
+
+		/* Only verify the CPUs we're testing */
+		if (verify_total_bw_consistency(bw_values, nr_cpus)) {
+			fprintf(stderr, "Values are consistent: %ld\n", bw_values[0]);
+			return 0;
+		}
+
+		attempts++;
+		sleep(1);
+	}
+
+	return -1;
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct total_bw_ctx *test_ctx;
+
+	if (access("/sys/kernel/debug/sched/debug", R_OK) != 0) {
+		fprintf(stderr, "Skipping test: debugfs sched/debug not accessible\n");
+		return SCX_TEST_SKIP;
+	}
+
+	test_ctx = calloc(1, sizeof(*test_ctx));
+	if (!test_ctx)
+		return SCX_TEST_FAIL;
+
+	test_ctx->nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+	if (test_ctx->nr_cpus <= 0) {
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	/* If system has more CPUs than MAX_CPUS, just test the first MAX_CPUS */
+	if (test_ctx->nr_cpus > MAX_CPUS)
+		test_ctx->nr_cpus = MAX_CPUS;
+
+	/* Test scenario 1: BPF program not loaded */
+	/* Read and verify baseline total_bw before loading BPF program */
+	fprintf(stderr, "BPF prog initially not loaded, reading total_bw values\n");
+	if (fetch_verify_total_bw(test_ctx->baseline_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable baseline values");
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	/* Load the BPF skeleton */
+	test_ctx->skel = minimal__open();
+	if (!test_ctx->skel) {
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	SCX_ENUM_INIT(test_ctx->skel);
+	if (minimal__load(test_ctx->skel)) {
+		minimal__destroy(test_ctx->skel);
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	*ctx = test_ctx;
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct total_bw_ctx *test_ctx = ctx;
+	struct bpf_link *link;
+	long loaded_bw[MAX_CPUS];
+	long unloaded_bw[MAX_CPUS];
+	int i;
+
+	/* Test scenario 2: BPF program loaded */
+	link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops);
+	if (!link) {
+		SCX_ERR("Failed to attach scheduler");
+		return SCX_TEST_FAIL;
+	}
+
+	fprintf(stderr, "BPF program loaded, reading total_bw values\n");
+	if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values with BPF loaded");
+		bpf_link__destroy(link);
+		return SCX_TEST_FAIL;
+	}
+	bpf_link__destroy(link);
+
+	/* Test scenario 3: BPF program unloaded */
+	fprintf(stderr, "BPF program unloaded, reading total_bw values\n");
+	if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values after BPF unload");
+		return SCX_TEST_FAIL;
+	}
+
+	/* Verify all three scenarios have the same total_bw values */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (test_ctx->baseline_bw[i] != loaded_bw[i]) {
+			SCX_ERR("CPU%d: baseline_bw=%ld != loaded_bw=%ld",
+				i, test_ctx->baseline_bw[i], loaded_bw[i]);
+			return SCX_TEST_FAIL;
+		}
+
+		if (test_ctx->baseline_bw[i] != unloaded_bw[i]) {
+			SCX_ERR("CPU%d: baseline_bw=%ld != unloaded_bw=%ld",
+				i, test_ctx->baseline_bw[i], unloaded_bw[i]);
+			return SCX_TEST_FAIL;
+		}
+	}
+
+	fprintf(stderr, "All total_bw values are consistent across all scenarios\n");
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct total_bw_ctx *test_ctx = ctx;
+
+	if (test_ctx) {
+		if (test_ctx->skel)
+			minimal__destroy(test_ctx->skel);
+		free(test_ctx);
+	}
+}
+
+struct scx_test total_bw = {
+	.name = "total_bw",
+	.description = "Verify total_bw consistency across BPF program states",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&total_bw)

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [tip: sched/core] selftests/sched_ext: Add test for sched_ext dl_server
  2026-01-26  9:59 ` [PATCH 6/7] selftests/sched_ext: Add test for sched_ext dl_server Andrea Righi
@ 2026-02-03 11:18   ` tip-bot2 for Andrea Righi
  0 siblings, 0 replies; 40+ messages in thread
From: tip-bot2 for Andrea Righi @ 2026-02-03 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Joel Fernandes, Andrea Righi, Peter Zijlstra (Intel),
	Emil Tsalapatis, Christian Loehle, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     be621a76341caa911ff98175114ff072618d7d4a
Gitweb:        https://git.kernel.org/tip/be621a76341caa911ff98175114ff072618d7d4a
Author:        Andrea Righi <arighi@nvidia.com>
AuthorDate:    Mon, 26 Jan 2026 10:59:04 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 03 Feb 2026 12:04:18 +01:00

selftests/sched_ext: Add test for sched_ext dl_server

Add a selftest to validate the correct behavior of the deadline server
for the ext_sched_class.

Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-7-arighi@nvidia.com
---
 tools/testing/selftests/sched_ext/Makefile       |   1 +-
 tools/testing/selftests/sched_ext/rt_stall.bpf.c |  23 +-
 tools/testing/selftests/sched_ext/rt_stall.c     | 240 ++++++++++++++-
 3 files changed, 264 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c

diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index 5fe45f9..c9255d1 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -183,6 +183,7 @@ auto-test-targets :=			\
 	select_cpu_dispatch_bad_dsq	\
 	select_cpu_dispatch_dbl_dsp	\
 	select_cpu_vtime		\
+	rt_stall			\
 	test_example			\
 
 testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
diff --git a/tools/testing/selftests/sched_ext/rt_stall.bpf.c b/tools/testing/selftests/sched_ext/rt_stall.bpf.c
new file mode 100644
index 0000000..8008677
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/rt_stall.bpf.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A scheduler that verifies whether RT tasks can stall SCHED_EXT tasks.
+ *
+ * Copyright (c) 2025 NVIDIA Corporation.
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+void BPF_STRUCT_OPS(rt_stall_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops rt_stall_ops = {
+	.exit			= (void *)rt_stall_exit,
+	.name			= "rt_stall",
+};
diff --git a/tools/testing/selftests/sched_ext/rt_stall.c b/tools/testing/selftests/sched_ext/rt_stall.c
new file mode 100644
index 0000000..015200f
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/rt_stall.c
@@ -0,0 +1,240 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2025 NVIDIA Corporation.
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sched.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <linux/sched.h>
+#include <signal.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <unistd.h>
+#include "rt_stall.bpf.skel.h"
+#include "scx_test.h"
+#include "../kselftest.h"
+
+#define CORE_ID		0	/* CPU to pin tasks to */
+#define RUN_TIME        5	/* How long to run the test in seconds */
+
+/* Simple busy-wait function for test tasks */
+static void process_func(void)
+{
+	while (1) {
+		/* Busy wait */
+		for (volatile unsigned long i = 0; i < 10000000UL; i++)
+			;
+	}
+}
+
+/* Set CPU affinity to a specific core */
+static void set_affinity(int cpu)
+{
+	cpu_set_t mask;
+
+	CPU_ZERO(&mask);
+	CPU_SET(cpu, &mask);
+	if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
+		perror("sched_setaffinity");
+		exit(EXIT_FAILURE);
+	}
+}
+
+/* Set task scheduling policy and priority */
+static void set_sched(int policy, int priority)
+{
+	struct sched_param param;
+
+	param.sched_priority = priority;
+	if (sched_setscheduler(0, policy, &param) != 0) {
+		perror("sched_setscheduler");
+		exit(EXIT_FAILURE);
+	}
+}
+
+/* Get process runtime from /proc/<pid>/stat */
+static float get_process_runtime(int pid)
+{
+	char path[256];
+	FILE *file;
+	long utime, stime;
+	int fields;
+
+	snprintf(path, sizeof(path), "/proc/%d/stat", pid);
+	file = fopen(path, "r");
+	if (file == NULL) {
+		perror("Failed to open stat file");
+		return -1;
+	}
+
+	/* Skip the first 13 fields and read the 14th and 15th */
+	fields = fscanf(file,
+			"%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
+			&utime, &stime);
+	fclose(file);
+
+	if (fields != 2) {
+		fprintf(stderr, "Failed to read stat file\n");
+		return -1;
+	}
+
+	/* Calculate the total time spent in the process */
+	long total_time = utime + stime;
+	long ticks_per_second = sysconf(_SC_CLK_TCK);
+	float runtime_seconds = total_time * 1.0 / ticks_per_second;
+
+	return runtime_seconds;
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct rt_stall *skel;
+
+	skel = rt_stall__open();
+	SCX_FAIL_IF(!skel, "Failed to open");
+	SCX_ENUM_INIT(skel);
+	SCX_FAIL_IF(rt_stall__load(skel), "Failed to load skel");
+
+	*ctx = skel;
+
+	return SCX_TEST_PASS;
+}
+
+static bool sched_stress_test(bool is_ext)
+{
+	/*
+	 * We're expecting the EXT task to get around 5% of CPU time when
+	 * competing with the RT task (small 1% fluctuations are expected).
+	 *
+	 * However, the EXT task should get at least 4% of the CPU to prove
+	 * that the EXT deadline server is working correctly. A percentage
+	 * less than 4% indicates a bug where RT tasks can potentially
+	 * stall SCHED_EXT tasks, causing the test to fail.
+	 */
+	const float expected_min_ratio = 0.04; /* 4% */
+	const char *class_str = is_ext ? "EXT" : "FAIR";
+
+	float ext_runtime, rt_runtime, actual_ratio;
+	int ext_pid, rt_pid;
+
+	ksft_print_header();
+	ksft_set_plan(1);
+
+	/* Create and set up the workload (FAIR or EXT) task */
+	ext_pid = fork();
+	if (ext_pid == 0) {
+		set_affinity(CORE_ID);
+		process_func();
+		exit(0);
+	} else if (ext_pid < 0) {
+		perror("fork task");
+		ksft_exit_fail();
+	}
+
+	/* Create an RT task */
+	rt_pid = fork();
+	if (rt_pid == 0) {
+		set_affinity(CORE_ID);
+		set_sched(SCHED_FIFO, 50);
+		process_func();
+		exit(0);
+	} else if (rt_pid < 0) {
+		perror("fork for RT task");
+		ksft_exit_fail();
+	}
+
+	/* Let the processes run for the specified time */
+	sleep(RUN_TIME);
+
+	/* Get runtime for the EXT task */
+	ext_runtime = get_process_runtime(ext_pid);
+	if (ext_runtime == -1)
+		ksft_exit_fail_msg("Error getting runtime for %s task (PID %d)\n",
+				   class_str, ext_pid);
+	ksft_print_msg("Runtime of %s task (PID %d) is %f seconds\n",
+		       class_str, ext_pid, ext_runtime);
+
+	/* Get runtime for the RT task */
+	rt_runtime = get_process_runtime(rt_pid);
+	if (rt_runtime == -1)
+		ksft_exit_fail_msg("Error getting runtime for RT task (PID %d)\n", rt_pid);
+	ksft_print_msg("Runtime of RT task (PID %d) is %f seconds\n", rt_pid, rt_runtime);
+
+	/* Kill the processes */
+	kill(ext_pid, SIGKILL);
+	kill(rt_pid, SIGKILL);
+	waitpid(ext_pid, NULL, 0);
+	waitpid(rt_pid, NULL, 0);
+
+	/* Verify that the workload task got enough runtime */
+	actual_ratio = ext_runtime / (ext_runtime + rt_runtime);
+	ksft_print_msg("%s task got %.2f%% of total runtime\n",
+		       class_str, actual_ratio * 100);
+
+	if (actual_ratio >= expected_min_ratio) {
+		ksft_test_result_pass("PASS: %s task got more than %.2f%% of runtime\n",
+				      class_str, expected_min_ratio * 100);
+		return true;
+	}
+	ksft_test_result_fail("FAIL: %s task got less than %.2f%% of runtime\n",
+			      class_str, expected_min_ratio * 100);
+	return false;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct rt_stall *skel = ctx;
+	struct bpf_link *link = NULL;
+	bool res;
+	int i;
+
+	/*
+	 * Test if the dl_server is working both with and without the
+	 * sched_ext scheduler attached.
+	 *
+	 * This ensures all the scenarios are covered:
+	 *   - fair_server stop -> ext_server start
+	 *   - ext_server stop -> fair_server start
+	 */
+	for (i = 0; i < 4; i++) {
+		bool is_ext = i % 2;
+
+		if (is_ext) {
+			memset(&skel->data->uei, 0, sizeof(skel->data->uei));
+			link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops);
+			SCX_FAIL_IF(!link, "Failed to attach scheduler");
+		}
+		res = sched_stress_test(is_ext);
+		if (is_ext) {
+			SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE));
+			bpf_link__destroy(link);
+		}
+
+		if (!res)
+			ksft_exit_fail();
+	}
+
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct rt_stall *skel = ctx;
+
+	rt_stall__destroy(skel);
+}
+
+struct scx_test rt_stall = {
+	.name = "rt_stall",
+	.description = "Verify that RT tasks cannot stall SCHED_EXT tasks",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&rt_stall)
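
For readers skimming the selftest above, the acceptance criterion reduces to a simple ratio check. A minimal standalone sketch (hypothetical helper names, not part of the patch):

```c
#include <assert.h>

/*
 * Hypothetical helpers mirroring the selftest's acceptance check: the
 * workload (FAIR or EXT) task must receive at least min_ratio of the
 * combined runtime, otherwise the DL server is considered broken.
 */
static double runtime_ratio(double task_sec, double rt_sec)
{
	return task_sec / (task_sec + rt_sec);
}

static int dl_server_ok(double task_sec, double rt_sec, double min_ratio)
{
	return runtime_ratio(task_sec, rt_sec) >= min_ratio;
}
```

With the numbers from the TAP output above (0.25s vs. 4.75s), the ratio is 5%, comfortably above the 4% threshold.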

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [tip: sched/core] sched/debug: Fix dl_server (re)start conditions
  2026-02-03 10:34         ` Peter Zijlstra
@ 2026-02-03 11:18           ` tip-bot2 for Peter Zijlstra
  2026-02-03 13:50           ` [PATCH 3/7] sched/debug: Stop and start server based on if it was active Andrea Righi
  1 sibling, 0 replies; 40+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-03 11:18 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5a40a9bb56d455e7548ba4b6d7787918323cbaf0
Gitweb:        https://git.kernel.org/tip/5a40a9bb56d455e7548ba4b6d7787918323cbaf0
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 03 Feb 2026 11:05:12 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 03 Feb 2026 12:04:18 +01:00

sched/debug: Fix dl_server (re)start conditions

There are two problems with sched_server_write_common() that can cause the
dl_server to malfunction upon attempting to change the parameters:

1) when, after having disabled the dl_server by setting runtime=0, it is
   enabled again while tasks are already enqueued. In this case, is_active would
   still be 0 and dl_server_start() would not be called.

2) when dl_server_apply_params() would fail, runtime is not applied and does
   not reflect the new state.

Instead have dl_server_start() check its actual dl_runtime, and have
sched_server_write_common() unconditionally (re)start the dl_server. It will
automatically stop if there isn't anything to do, so spurious activation is
harmless -- while failing to start it is a problem.

While there, move the printk out of the locked region and make it symmetric,
also printing on enable.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260203103407.GK1282955@noisy.programming.kicks-ass.net
---
 kernel/sched/deadline.c |  5 ++---
 kernel/sched/debug.c    | 32 ++++++++++++++------------------
 2 files changed, 16 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index eae14e5..d08b004 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1799,7 +1799,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 	struct rq *rq = dl_se->rq;
 
 	dl_se->dl_defer_idle = 0;
-	if (!dl_server(dl_se) || dl_se->dl_server_active)
+	if (!dl_server(dl_se) || dl_se->dl_server_active || !dl_se->dl_runtime)
 		return;
 
 	/*
@@ -1898,7 +1898,6 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 	int cpu = cpu_of(rq);
 	struct dl_bw *dl_b;
 	unsigned long cap;
-	int retval = 0;
 	int cpus;
 
 	dl_b = dl_bw_of(cpu);
@@ -1930,7 +1929,7 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
 	dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
 
-	return retval;
+	return 0;
 }
 
 /*
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 59e650f..b24f40f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -338,9 +338,9 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
 					 void *server)
 {
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
-	struct rq *rq = cpu_rq(cpu);
 	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
-	u64 runtime, period;
+	u64 old_runtime, runtime, period;
+	struct rq *rq = cpu_rq(cpu);
 	int retval = 0;
 	size_t err;
 	u64 value;
@@ -350,9 +350,7 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
 		return err;
 
 	scoped_guard (rq_lock_irqsave, rq) {
-		bool is_active;
-
-		runtime = dl_se->dl_runtime;
+		old_runtime = runtime = dl_se->dl_runtime;
 		period = dl_se->dl_period;
 
 		switch (param) {
@@ -374,25 +372,23 @@ static ssize_t sched_server_write_common(struct file *filp, const char __user *u
 			return  -EINVAL;
 		}
 
-		is_active = dl_server_active(dl_se);
-		if (is_active) {
-			update_rq_clock(rq);
-			dl_server_stop(dl_se);
-		}
-
+		update_rq_clock(rq);
+		dl_server_stop(dl_se);
 		retval = dl_server_apply_params(dl_se, runtime, period, 0);
-
-		if (!runtime)
-			printk_deferred("%s server disabled in CPU %d, system may crash due to starvation.\n",
-					server == &rq->fair_server ? "Fair" : "Ext", cpu_of(rq));
-
-		if (is_active && runtime)
-			dl_server_start(dl_se);
+		dl_server_start(dl_se);
 
 		if (retval < 0)
 			return retval;
 	}
 
+	if (!!old_runtime ^ !!runtime) {
+		pr_info("%s server %sabled on CPU %d%s.\n",
+			server == &rq->fair_server ? "Fair" : "Ext",
+			runtime ? "en" : "dis",
+			cpu_of(rq),
+			runtime ? "" : ", system may malfunction due to starvation");
+	}
+
 	*ppos += cnt;
 	return cnt;
 }
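
The fixed control flow can be summarized with a toy model (hypothetical types, not kernel code): start is a no-op when runtime is zero, so the debugfs write path can unconditionally stop and restart the server.

```c
#include <assert.h>

/*
 * Toy model of the fixed logic: toy_start() refuses to activate when
 * runtime is 0 (mirroring the new !dl_se->dl_runtime check), so the
 * write path can stop and unconditionally restart; a spurious start
 * is harmless, while failing to start is not.
 */
struct toy_server {
	int active;
	unsigned long runtime;
};

static void toy_start(struct toy_server *s)
{
	if (s->active || !s->runtime)
		return;
	s->active = 1;
}

static void toy_stop(struct toy_server *s)
{
	s->active = 0;
}

static void toy_apply_runtime(struct toy_server *s, unsigned long runtime)
{
	toy_stop(s);
	s->runtime = runtime;
	toy_start(s);	/* restart unconditionally; no-op if runtime == 0 */
}
```

In this model, re-enabling after a runtime=0 write always reactivates the server, which is exactly the case the old is_active check got wrong.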


* [tip: sched/core] sched/debug: Add support to change sched_ext server params
  2026-01-26  9:59 ` [PATCH 5/7] sched/debug: Add support to change sched_ext server params Andrea Righi
@ 2026-02-03 11:18   ` tip-bot2 for Joel Fernandes
  0 siblings, 0 replies; 40+ messages in thread
From: tip-bot2 for Joel Fernandes @ 2026-02-03 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Andrea Righi, Joel Fernandes, Peter Zijlstra (Intel), Juri Lelli,
	Christian Loehle, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     76d12132ba459ab929cb66eb2030c666aacdb69a
Gitweb:        https://git.kernel.org/tip/76d12132ba459ab929cb66eb2030c666aacdb69a
Author:        Joel Fernandes <joelagnelf@nvidia.com>
AuthorDate:    Mon, 26 Jan 2026 10:59:03 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 03 Feb 2026 12:04:17 +01:00

sched/debug: Add support to change sched_ext server params

When a sched_ext server is loaded, tasks in the fair class are
automatically moved to the sched_ext class. Add support to modify the
ext server parameters similar to how the fair server parameters are
modified.

Re-use common code between ext and fair servers as needed.

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-6-arighi@nvidia.com
---
 kernel/sched/debug.c | 157 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 133 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41e3895..59e650f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -330,14 +330,16 @@ enum dl_param {
 	DL_PERIOD,
 };
 
-static unsigned long fair_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */
-static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
+static unsigned long dl_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */
+static unsigned long dl_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
 
-static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf,
-				       size_t cnt, loff_t *ppos, enum dl_param param)
+static ssize_t sched_server_write_common(struct file *filp, const char __user *ubuf,
+					 size_t cnt, loff_t *ppos, enum dl_param param,
+					 void *server)
 {
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
+	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
 	u64 runtime, period;
 	int retval = 0;
 	size_t err;
@@ -350,8 +352,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 	scoped_guard (rq_lock_irqsave, rq) {
 		bool is_active;
 
-		runtime  = rq->fair_server.dl_runtime;
-		period = rq->fair_server.dl_period;
+		runtime = dl_se->dl_runtime;
+		period = dl_se->dl_period;
 
 		switch (param) {
 		case DL_RUNTIME:
@@ -367,25 +369,25 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		}
 
 		if (runtime > period ||
-		    period > fair_server_period_max ||
-		    period < fair_server_period_min) {
+		    period > dl_server_period_max ||
+		    period < dl_server_period_min) {
 			return  -EINVAL;
 		}
 
-		is_active = dl_server_active(&rq->fair_server);
+		is_active = dl_server_active(dl_se);
 		if (is_active) {
 			update_rq_clock(rq);
-			dl_server_stop(&rq->fair_server);
+			dl_server_stop(dl_se);
 		}
 
-		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
+		retval = dl_server_apply_params(dl_se, runtime, period, 0);
 
 		if (!runtime)
-			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
-					cpu_of(rq));
+			printk_deferred("%s server disabled in CPU %d, system may crash due to starvation.\n",
+					server == &rq->fair_server ? "Fair" : "Ext", cpu_of(rq));
 
 		if (is_active && runtime)
-			dl_server_start(&rq->fair_server);
+			dl_server_start(dl_se);
 
 		if (retval < 0)
 			return retval;
@@ -395,36 +397,42 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 	return cnt;
 }
 
-static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param)
+static size_t sched_server_show_common(struct seq_file *m, void *v, enum dl_param param,
+				       void *server)
 {
-	unsigned long cpu = (unsigned long) m->private;
-	struct rq *rq = cpu_rq(cpu);
+	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
 	u64 value;
 
 	switch (param) {
 	case DL_RUNTIME:
-		value = rq->fair_server.dl_runtime;
+		value = dl_se->dl_runtime;
 		break;
 	case DL_PERIOD:
-		value = rq->fair_server.dl_period;
+		value = dl_se->dl_period;
 		break;
 	}
 
 	seq_printf(m, "%llu\n", value);
 	return 0;
-
 }
 
 static ssize_t
 sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
 				size_t cnt, loff_t *ppos)
 {
-	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME);
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME,
+					&rq->fair_server);
 }
 
 static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
 {
-	return sched_fair_server_show(m, v, DL_RUNTIME);
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_RUNTIME, &rq->fair_server);
 }
 
 static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
@@ -440,16 +448,57 @@ static const struct file_operations fair_server_runtime_fops = {
 	.release	= single_release,
 };
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+static ssize_t
+sched_ext_server_runtime_write(struct file *filp, const char __user *ubuf,
+			       size_t cnt, loff_t *ppos)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME,
+					&rq->ext_server);
+}
+
+static int sched_ext_server_runtime_show(struct seq_file *m, void *v)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_RUNTIME, &rq->ext_server);
+}
+
+static int sched_ext_server_runtime_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_ext_server_runtime_show, inode->i_private);
+}
+
+static const struct file_operations ext_server_runtime_fops = {
+	.open		= sched_ext_server_runtime_open,
+	.write		= sched_ext_server_runtime_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
 static ssize_t
 sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
 			       size_t cnt, loff_t *ppos)
 {
-	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD);
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD,
+					&rq->fair_server);
 }
 
 static int sched_fair_server_period_show(struct seq_file *m, void *v)
 {
-	return sched_fair_server_show(m, v, DL_PERIOD);
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_PERIOD, &rq->fair_server);
 }
 
 static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
@@ -465,6 +514,40 @@ static const struct file_operations fair_server_period_fops = {
 	.release	= single_release,
 };
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+static ssize_t
+sched_ext_server_period_write(struct file *filp, const char __user *ubuf,
+			      size_t cnt, loff_t *ppos)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD,
+					&rq->ext_server);
+}
+
+static int sched_ext_server_period_show(struct seq_file *m, void *v)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_PERIOD, &rq->ext_server);
+}
+
+static int sched_ext_server_period_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_ext_server_period_show, inode->i_private);
+}
+
+static const struct file_operations ext_server_period_fops = {
+	.open		= sched_ext_server_period_open,
+	.write		= sched_ext_server_period_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
 static struct dentry *debugfs_sched;
 
 static void debugfs_fair_server_init(void)
@@ -488,6 +571,29 @@ static void debugfs_fair_server_init(void)
 	}
 }
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+static void debugfs_ext_server_init(void)
+{
+	struct dentry *d_ext;
+	unsigned long cpu;
+
+	d_ext = debugfs_create_dir("ext_server", debugfs_sched);
+	if (!d_ext)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct dentry *d_cpu;
+		char buf[32];
+
+		snprintf(buf, sizeof(buf), "cpu%lu", cpu);
+		d_cpu = debugfs_create_dir(buf, d_ext);
+
+		debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &ext_server_runtime_fops);
+		debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &ext_server_period_fops);
+	}
+}
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;
@@ -526,6 +632,9 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
 	debugfs_fair_server_init();
+#ifdef CONFIG_SCHED_CLASS_EXT
+	debugfs_ext_server_init();
+#endif
 
 	return 0;
 }
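
Once this patch lands, each CPU exposes runtime/period knobs under debugfs. A small userspace sketch for building the knob path (assumes the default debugfs mount point; the layout follows the snprintf in debugfs_ext_server_init() above):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/sys/kernel/debug/sched/ext_server/cpu<N>/<attr>"; attr is
 * "runtime" or "period" per the patch. Returns the formatted length,
 * as snprintf() does.
 */
static int ext_server_knob_path(char *buf, size_t len,
				unsigned long cpu, const char *attr)
{
	return snprintf(buf, len,
			"/sys/kernel/debug/sched/ext_server/cpu%lu/%s",
			cpu, attr);
}
```

Writing to the resulting file requires root and a mounted debugfs, just like the existing fair_server knobs.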


* [tip: sched/core] sched_ext: Add a DL server for sched_ext tasks
  2026-01-26  9:59 ` [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks Andrea Righi
  2026-02-02 19:50   ` Peter Zijlstra
@ 2026-02-03 11:18   ` tip-bot2 for Andrea Righi
  1 sibling, 0 replies; 40+ messages in thread
From: tip-bot2 for Andrea Righi @ 2026-02-03 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Joel Fernandes, Andrea Righi, Peter Zijlstra (Intel), Juri Lelli,
	Christian Loehle, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     cd959a3562050d1c676be37f1d256a96cb067868
Gitweb:        https://git.kernel.org/tip/cd959a3562050d1c676be37f1d256a96cb067868
Author:        Andrea Righi <arighi@nvidia.com>
AuthorDate:    Mon, 26 Jan 2026 10:59:02 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 03 Feb 2026 12:04:17 +01:00

sched_ext: Add a DL server for sched_ext tasks

sched_ext currently suffers starvation due to RT. The same workload when
converted to EXT can get zero runtime if RT is 100% running, causing EXT
processes to stall. Fix it by adding a DL server for EXT.

A kselftest is also included later to confirm that both DL servers are
functioning correctly:

 # ./runner -t rt_stall
 ===== START =====
 TEST: rt_stall
 DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
 OUTPUT:
 TAP version 13
 1..1
 # Runtime of FAIR task (PID 1511) is 0.250000 seconds
 # Runtime of RT task (PID 1512) is 4.750000 seconds
 # FAIR task got 5.00% of total runtime
 ok 1 PASS: FAIR task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of EXT task (PID 1514) is 0.250000 seconds
 # Runtime of RT task (PID 1515) is 4.750000 seconds
 # EXT task got 5.00% of total runtime
 ok 2 PASS: EXT task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of FAIR task (PID 1517) is 0.250000 seconds
 # Runtime of RT task (PID 1518) is 4.750000 seconds
 # FAIR task got 5.00% of total runtime
 ok 3 PASS: FAIR task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of EXT task (PID 1521) is 0.250000 seconds
 # Runtime of RT task (PID 1522) is 4.750000 seconds
 # EXT task got 5.00% of total runtime
 ok 4 PASS: EXT task got more than 4.00% of runtime
 ok 1 rt_stall #
 =====  END  =====

Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-5-arighi@nvidia.com
---
 kernel/sched/core.c     |  6 +++-
 kernel/sched/deadline.c | 83 ++++++++++++++++++++++++++++------------
 kernel/sched/ext.c      | 33 ++++++++++++++++-
 kernel/sched/idle.c     |  3 +-
 kernel/sched/sched.h    |  2 +-
 kernel/sched/topology.c |  5 ++-
 6 files changed, 109 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 260633e..8f2dc0a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8484,6 +8484,9 @@ int sched_cpu_dying(unsigned int cpu)
 		dump_rq_tasks(rq, KERN_WARNING);
 	}
 	dl_server_stop(&rq->fair_server);
+#ifdef CONFIG_SCHED_CLASS_EXT
+	dl_server_stop(&rq->ext_server);
+#endif
 	rq_unlock_irqrestore(rq, &rf);
 
 	calc_load_migrate(rq);
@@ -8689,6 +8692,9 @@ void __init sched_init(void)
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
 		fair_server_init(rq);
+#ifdef CONFIG_SCHED_CLASS_EXT
+		ext_server_init(rq);
+#endif
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = rq;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7e181ec..eae14e5 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1449,8 +1449,8 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 
 		dl_se->dl_defer_idle = 0;
 
 	/*
-	 * The fair server can consume its runtime while throttled (not queued/
-	 * running as regular CFS).
+	 * The DL server can consume its runtime while throttled (not
+	 * queued / running as regular CFS).
 	 *
 	 * If the server consumes its entire runtime in this state. The server
 	 * is not required for the current period. Thus, reset the server by
@@ -1535,10 +1535,10 @@ throttle:
 	}
 
 	/*
-	 * The fair server (sole dl_server) does not account for real-time
-	 * workload because it is running fair work.
+	 * The dl_server does not account for real-time workload because it
+	 * is running fair work.
 	 */
-	if (dl_se == &rq->fair_server)
+	if (dl_se->dl_server)
 		return;
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -1573,9 +1573,9 @@ throttle:
  * In the non-defer mode, the idle time is not accounted, as the
  * server provides a guarantee.
  *
- * If the dl_server is in defer mode, the idle time is also considered
- * as time available for the fair server, avoiding a penalty for the
- * rt scheduler that did not consumed that time.
+ * If the dl_server is in defer mode, the idle time is also considered as
+ * time available for the dl_server, avoiding a penalty for the rt
+ * scheduler that did not consume that time.
  */
 void dl_server_update_idle(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
@@ -1860,6 +1860,18 @@ void sched_init_dl_servers(void)
 		dl_se->dl_server = 1;
 		dl_se->dl_defer = 1;
 		setup_new_dl_entity(dl_se);
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+		dl_se = &rq->ext_server;
+
+		WARN_ON(dl_server(dl_se));
+
+		dl_server_apply_params(dl_se, runtime, period, 1);
+
+		dl_se->dl_server = 1;
+		dl_se->dl_defer = 1;
+		setup_new_dl_entity(dl_se);
+#endif
 	}
 }
 
@@ -3198,6 +3210,36 @@ void dl_add_task_root_domain(struct task_struct *p)
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 }
 
+static void dl_server_add_bw(struct root_domain *rd, int cpu)
+{
+	struct sched_dl_entity *dl_se;
+
+	dl_se = &cpu_rq(cpu)->fair_server;
+	if (dl_server(dl_se) && cpu_active(cpu))
+		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+	dl_se = &cpu_rq(cpu)->ext_server;
+	if (dl_server(dl_se) && cpu_active(cpu))
+		__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu));
+#endif
+}
+
+static u64 dl_server_read_bw(int cpu)
+{
+	u64 dl_bw = 0;
+
+	if (cpu_rq(cpu)->fair_server.dl_server)
+		dl_bw += cpu_rq(cpu)->fair_server.dl_bw;
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (cpu_rq(cpu)->ext_server.dl_server)
+		dl_bw += cpu_rq(cpu)->ext_server.dl_bw;
+#endif
+
+	return dl_bw;
+}
+
 void dl_clear_root_domain(struct root_domain *rd)
 {
 	int i;
@@ -3216,12 +3258,8 @@ void dl_clear_root_domain(struct root_domain *rd)
 	 * dl_servers are not tasks. Since dl_add_task_root_domain ignores
 	 * them, we need to account for them here explicitly.
 	 */
-	for_each_cpu(i, rd->span) {
-		struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
-
-		if (dl_server(dl_se) && cpu_active(i))
-			__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
-	}
+	for_each_cpu(i, rd->span)
+		dl_server_add_bw(rd, i);
 }
 
 void dl_clear_root_domain_cpu(int cpu)
@@ -3720,7 +3758,7 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 	unsigned long flags, cap;
 	struct dl_bw *dl_b;
 	bool overflow = 0;
-	u64 fair_server_bw = 0;
+	u64 dl_server_bw = 0;
 
 	rcu_read_lock_sched();
 	dl_b = dl_bw_of(cpu);
@@ -3753,27 +3791,26 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 		cap -= arch_scale_cpu_capacity(cpu);
 
 		/*
-		 * cpu is going offline and NORMAL tasks will be moved away
-		 * from it. We can thus discount dl_server bandwidth
-		 * contribution as it won't need to be servicing tasks after
-		 * the cpu is off.
+		 * cpu is going offline and NORMAL and EXT tasks will be
+		 * moved away from it. We can thus discount dl_server
+		 * bandwidth contribution as it won't need to be servicing
+		 * tasks after the cpu is off.
 		 */
-		if (cpu_rq(cpu)->fair_server.dl_server)
-			fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw;
+		dl_server_bw = dl_server_read_bw(cpu);
 
 		/*
 		 * Not much to check if no DEADLINE bandwidth is present.
 		 * dl_servers we can discount, as tasks will be moved out the
 		 * offlined CPUs anyway.
 		 */
-		if (dl_b->total_bw - fair_server_bw > 0) {
+		if (dl_b->total_bw - dl_server_bw > 0) {
 			/*
 			 * Leaving at least one CPU for DEADLINE tasks seems a
 			 * wise thing to do. As said above, cpu is not offline
 			 * yet, so account for that.
 			 */
 			if (dl_bw_cpus(cpu) - 1)
-				overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0);
+				overflow = __dl_overflow(dl_b, cap, dl_server_bw, 0);
 			else
 				overflow = 1;
 		}
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ce5e64b..3bc49dc 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -958,6 +958,8 @@ static void update_curr_scx(struct rq *rq)
 		if (!curr->scx.slice)
 			touch_core_sched(rq, curr);
 	}
+
+	dl_server_update(&rq->ext_server, delta_exec);
 }
 
 static bool scx_dsq_priq_less(struct rb_node *node_a,
@@ -1501,6 +1503,10 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 	if (enq_flags & SCX_ENQ_WAKEUP)
 		touch_core_sched(rq, p);
 
+	/* Start dl_server if this is the first task being enqueued */
+	if (rq->scx.nr_running == 1)
+		dl_server_start(&rq->ext_server);
+
 	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
 out:
 	rq->scx.flags &= ~SCX_RQ_IN_WAKEUP;
@@ -2512,6 +2518,33 @@ static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf)
 	return do_pick_task_scx(rq, rf, false);
 }
 
+/*
+ * Select the next task to run from the ext scheduling class.
+ *
+ * Use do_pick_task_scx() directly with @force_scx enabled, since the
+ * dl_server must always select a sched_ext task.
+ */
+static struct task_struct *
+ext_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
+{
+	if (!scx_enabled())
+		return NULL;
+
+	return do_pick_task_scx(dl_se->rq, rf, true);
+}
+
+/*
+ * Initialize the ext server deadline entity.
+ */
+void ext_server_init(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se = &rq->ext_server;
+
+	init_dl_entity(dl_se);
+
+	dl_server_init(dl_se, rq, ext_server_pick_task);
+}
+
 #ifdef CONFIG_SCHED_CORE
 /**
  * scx_prio_less - Task ordering for core-sched
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 46a9845..3681b6a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -537,6 +537,9 @@ static void update_curr_idle(struct rq *rq)
 	se->exec_start = now;
 
 	dl_server_update_idle(&rq->fair_server, delta_exec);
+#ifdef CONFIG_SCHED_CLASS_EXT
+	dl_server_update_idle(&rq->ext_server, delta_exec);
+#endif
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 309101c..2aa4251 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -414,6 +414,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 extern void sched_init_dl_servers(void);
 
 extern void fair_server_init(struct rq *rq);
+extern void ext_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
@@ -1171,6 +1172,7 @@ struct rq {
 	struct dl_rq		dl;
 #ifdef CONFIG_SCHED_CLASS_EXT
 	struct scx_rq		scx;
+	struct sched_dl_entity	ext_server;
 #endif
 
 	struct sched_dl_entity	fair_server;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5..ac268da 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -508,6 +508,11 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (rq->fair_server.dl_server)
 		__dl_server_attach_root(&rq->fair_server, rq);
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (rq->ext_server.dl_server)
+		__dl_server_attach_root(&rq->ext_server, rq);
+#endif
+
 	rq_unlock_irqrestore(rq, &rf);
 
 	if (old_rd)
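
The bandwidth accounting introduced above simply sums the per-CPU server contributions. A simplified model of dl_server_read_bw() from the deadline.c hunk (hypothetical types, for illustration only):

```c
#include <assert.h>

/*
 * Simplified model of dl_server_read_bw(): the DL bandwidth reserved
 * on a CPU is the fair server's share plus, when the ext server has
 * been initialized (CONFIG_SCHED_CLASS_EXT), the ext server's share.
 */
struct toy_cpu_bw {
	unsigned long long fair_bw;
	unsigned long long ext_bw;
	int ext_server_up;	/* models ext_server.dl_server */
};

static unsigned long long toy_read_bw(const struct toy_cpu_bw *c)
{
	unsigned long long bw = c->fair_bw;

	if (c->ext_server_up)
		bw += c->ext_bw;
	return bw;
}
```

This is the quantity dl_bw_manage() discounts when a CPU goes offline, since neither server needs to service tasks on a dead CPU.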


* [tip: sched/core] sched/debug: Stop and start server based on if it was active
  2026-01-26  9:59 ` [PATCH 3/7] sched/debug: Stop and start server based on if it was active Andrea Righi
  2026-02-02 21:13   ` Peter Zijlstra
@ 2026-02-03 11:18   ` tip-bot2 for Joel Fernandes
  1 sibling, 0 replies; 40+ messages in thread
From: tip-bot2 for Joel Fernandes @ 2026-02-03 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Joel Fernandes, Peter Zijlstra (Intel), Juri Lelli, Andrea Righi,
	Tejun Heo, Christian Loehle, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     68ec89d0e99156803bdea3c986c0198624e40ea2
Gitweb:        https://git.kernel.org/tip/68ec89d0e99156803bdea3c986c0198624e40ea2
Author:        Joel Fernandes <joelagnelf@nvidia.com>
AuthorDate:    Mon, 26 Jan 2026 10:59:01 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 03 Feb 2026 12:04:17 +01:00

sched/debug: Stop and start server based on if it was active

Currently the DL server interface for applying parameters checks
CFS-internals to identify if the server is active. This is error-prone
and makes it difficult when adding new servers in the future.

Fix it, by using dl_server_active() which is also used by the DL server
code to determine if the DL server was started.

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-4-arighi@nvidia.com
---
 kernel/sched/debug.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ed9254d..41e3895 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -348,6 +348,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		return err;
 
 	scoped_guard (rq_lock_irqsave, rq) {
+		bool is_active;
+
 		runtime  = rq->fair_server.dl_runtime;
 		period = rq->fair_server.dl_period;
 
@@ -370,8 +372,11 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 			return  -EINVAL;
 		}
 
-		update_rq_clock(rq);
-		dl_server_stop(&rq->fair_server);
+		is_active = dl_server_active(&rq->fair_server);
+		if (is_active) {
+			update_rq_clock(rq);
+			dl_server_stop(&rq->fair_server);
+		}
 
 		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
 
@@ -379,7 +384,7 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
 					cpu_of(rq));
 
-		if (rq->cfs.h_nr_queued)
+		if (is_active && runtime)
 			dl_server_start(&rq->fair_server);
 
 		if (retval < 0)


* [tip: sched/core] sched/debug: Fix updating of ppos on server write ops
  2026-01-26  9:59 ` [PATCH 2/7] sched/debug: Fix updating of ppos on server write ops Andrea Righi
@ 2026-02-03 11:18   ` tip-bot2 for Joel Fernandes
  0 siblings, 0 replies; 40+ messages in thread
From: tip-bot2 for Joel Fernandes @ 2026-02-03 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Joel Fernandes, Peter Zijlstra (Intel), Juri Lelli, Andrea Righi,
	Tejun Heo, Christian Loehle, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     6080fb211672aec6ce8f2f5a2e0b4eae736f2027
Gitweb:        https://git.kernel.org/tip/6080fb211672aec6ce8f2f5a2e0b4eae736f2027
Author:        Joel Fernandes <joelagnelf@nvidia.com>
AuthorDate:    Mon, 26 Jan 2026 10:59:00 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 03 Feb 2026 12:04:16 +01:00

sched/debug: Fix updating of ppos on server write ops

Updating "ppos" on error conditions does not make much sense. The pattern
is to return the error code directly without modifying the position, or
modify the position on success and return the number of bytes written.

Since on success, the return value of apply is 0, there is no point in
modifying ppos either. Fix it by removing all this and just returning
error code or number of bytes written on success.

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-3-arighi@nvidia.com
---
 kernel/sched/debug.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 929fdf0..ed9254d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -339,8 +339,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
 	u64 runtime, period;
+	int retval = 0;
 	size_t err;
-	int retval;
 	u64 value;
 
 	err = kstrtoull_from_user(ubuf, cnt, 10, &value);
@@ -374,8 +374,6 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		dl_server_stop(&rq->fair_server);
 
 		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
-		if (retval)
-			cnt = retval;
 
 		if (!runtime)
 			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
@@ -383,6 +381,9 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 
 		if (rq->cfs.h_nr_queued)
 			dl_server_start(&rq->fair_server);
+
+		if (retval < 0)
+			return retval;
 	}
 
 	*ppos += cnt;


* [tip: sched/core] sched/deadline: Clear the defer params
  2026-01-26  9:58 ` [PATCH 1/7] sched/deadline: Clear the defer params Andrea Righi
@ 2026-02-03 11:18   ` tip-bot2 for Joel Fernandes
  0 siblings, 0 replies; 40+ messages in thread
From: tip-bot2 for Joel Fernandes @ 2026-02-03 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Joel Fernandes, Peter Zijlstra (Intel), Andrea Righi, Juri Lelli,
	Christian Loehle, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     3cb3b27693bf30defb16aa096158a3b24583b8d2
Gitweb:        https://git.kernel.org/tip/3cb3b27693bf30defb16aa096158a3b24583b8d2
Author:        Joel Fernandes <joelagnelf@nvidia.com>
AuthorDate:    Mon, 26 Jan 2026 10:58:59 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 03 Feb 2026 12:04:16 +01:00

sched/deadline: Clear the defer params

The defer params were not cleared in __dl_clear_params. Clear them.

Without this, some of my test cases are flaking and the DL timer is
not starting correctly, AFAICS.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-2-arighi@nvidia.com
---
 kernel/sched/deadline.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 82e7a21..7e181ec 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3660,6 +3660,9 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
 	dl_se->dl_non_contending	= 0;
 	dl_se->dl_overrun		= 0;
 	dl_se->dl_server		= 0;
+	dl_se->dl_defer			= 0;
+	dl_se->dl_defer_running		= 0;
+	dl_se->dl_defer_armed		= 0;
 
 #ifdef CONFIG_RT_MUTEXES
 	dl_se->pi_se			= dl_se;


* Re: [PATCH 3/7] sched/debug: Stop and start server based on if it was active
  2026-02-03 10:34         ` Peter Zijlstra
  2026-02-03 11:18           ` [tip: sched/core] sched/debug: Fix dl_server (re)start conditions tip-bot2 for Peter Zijlstra
@ 2026-02-03 13:50           ` Andrea Righi
  1 sibling, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2026-02-03 13:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Joel Fernandes, David Vernet, Changwoo Min,
	Daniel Hodges, Christian Loehle, Emil Tsalapatis, sched-ext,
	linux-kernel

On Tue, Feb 03, 2026 at 11:34:07AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 11:37:31PM +0100, Andrea Righi wrote:
> 
> > Or:
> > 
> >     pr_info("%s server %sabled in CPU %d%s\n",
> >               server == &rq->fair_server ? "Fair" : "Ext",
> >               runtime ? "en" : "dis",
> >               cpu_of(rq),
> >               runtime ? "" : ", system may crash due to starvation");
> 
> Yeah, I noticed it was a bit wonky. I made it thus.
> 
> > > +	}
> > > +
> > >  	*ppos += cnt;
> > >  	return cnt;
> > >  }
> > 
> > I like that, it should fix the issue.
> 
> There is one more issue when dl_server_apply_params() fails, in that
> case we should test old_runtime to determine if we should (re)start the
> dl_server.
> 
> I've ended up with this.

LGTM, I also re-ran all my stress tests with this applied, everything is
working great and the runtime=0 issue is fixed.

Tested-by: Andrea Righi <arighi@nvidia.com>

Thanks!
-Andrea

> Subject: sched/debug: Fix dl_server (re)start conditions
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Feb 3 11:05:12 CET 2026
> 
> There are two problems with sched_server_write_common() that can cause the
> dl_server to malfunction upon attempting to change the parameters:
> 
> 1) when, after having disabled the dl_server by setting runtime=0, it is
>    enabled again while tasks are already enqueued. In this case is_active would
>    still be 0 and dl_server_start() would not be called.
> 
> 2) when dl_server_apply_params() would fail, runtime is not applied and does
>    not reflect the new state.
> 
> Instead have dl_server_start() check its actual dl_runtime, and have
> sched_server_write_common() unconditionally (re)start the dl_server. It will
> automatically stop if there isn't anything to do, so spurious activation is
> harmless -- while failing to start it is a problem.
> 
> While there, move the printk out of the locked region and make it symmetric,
> also printing on enable.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/deadline.c |    5 ++---
>  kernel/sched/debug.c    |   32 ++++++++++++++------------------
>  2 files changed, 16 insertions(+), 21 deletions(-)
> 
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1784,7 +1784,7 @@ void dl_server_start(struct sched_dl_ent
>  {
>  	struct rq *rq = dl_se->rq;
>  
> -	if (!dl_server(dl_se) || dl_se->dl_server_active)
> +	if (!dl_server(dl_se) || dl_se->dl_server_active || !dl_se->dl_runtime)
>  		return;
>  
>  	/*
> @@ -1882,7 +1882,6 @@ int dl_server_apply_params(struct sched_
>  	int cpu = cpu_of(rq);
>  	struct dl_bw *dl_b;
>  	unsigned long cap;
> -	int retval = 0;
>  	int cpus;
>  
>  	dl_b = dl_bw_of(cpu);
> @@ -1914,7 +1913,7 @@ int dl_server_apply_params(struct sched_
>  	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
>  	dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
>  
> -	return retval;
> +	return 0;
>  }
>  
>  /*
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -338,9 +338,9 @@ static ssize_t sched_server_write_common
>  					 void *server)
>  {
>  	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
> -	struct rq *rq = cpu_rq(cpu);
>  	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
> -	u64 runtime, period;
> +	u64 old_runtime, runtime, period;
> +	struct rq *rq = cpu_rq(cpu);
>  	int retval = 0;
>  	size_t err;
>  	u64 value;
> @@ -350,9 +350,7 @@ static ssize_t sched_server_write_common
>  		return err;
>  
>  	scoped_guard (rq_lock_irqsave, rq) {
> -		bool is_active;
> -
> -		runtime = dl_se->dl_runtime;
> +		old_runtime = runtime = dl_se->dl_runtime;
>  		period = dl_se->dl_period;
>  
>  		switch (param) {
> @@ -374,25 +372,23 @@ static ssize_t sched_server_write_common
>  			return  -EINVAL;
>  		}
>  
> -		is_active = dl_server_active(dl_se);
> -		if (is_active) {
> -			update_rq_clock(rq);
> -			dl_server_stop(dl_se);
> -		}
> -
> +		update_rq_clock(rq);
> +		dl_server_stop(dl_se);
>  		retval = dl_server_apply_params(dl_se, runtime, period, 0);
> -
> -		if (!runtime)
> -			printk_deferred("%s server disabled in CPU %d, system may crash due to starvation.\n",
> -					server == &rq->fair_server ? "Fair" : "Ext", cpu_of(rq));
> -
> -		if (is_active && runtime)
> -			dl_server_start(dl_se);
> +		dl_server_start(dl_se);
>  
>  		if (retval < 0)
>  			return retval;
>  	}
>  
> +	if (!!old_runtime ^ !!runtime) {
> +		pr_info("%s server %sabled on CPU %d%s.\n",
> +			server == &rq->fair_server ? "Fair" : "Ext",
> +			runtime ? "en" : "dis",
> +			cpu_of(rq),
> +			runtime ? "" : ", system may malfunction due to starvation");
> +	}
> +
>  	*ppos += cnt;
>  	return cnt;
>  }


end of thread, other threads:[~2026-02-03 13:50 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2026-01-26  9:58 [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Andrea Righi
2026-01-26  9:58 ` [PATCH 1/7] sched/deadline: Clear the defer params Andrea Righi
2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Joel Fernandes
2026-01-26  9:59 ` [PATCH 2/7] sched/debug: Fix updating of ppos on server write ops Andrea Righi
2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Joel Fernandes
2026-01-26  9:59 ` [PATCH 3/7] sched/debug: Stop and start server based on if it was active Andrea Righi
2026-02-02 21:13   ` Peter Zijlstra
2026-02-02 21:14     ` Peter Zijlstra
2026-02-02 21:17     ` Peter Zijlstra
2026-02-02 22:37       ` Andrea Righi
2026-02-03 10:34         ` Peter Zijlstra
2026-02-03 11:18           ` [tip: sched/core] sched/debug: Fix dl_server (re)start conditions tip-bot2 for Peter Zijlstra
2026-02-03 13:50           ` [PATCH 3/7] sched/debug: Stop and start server based on if it was active Andrea Righi
2026-02-03 10:11       ` Andrea Righi
2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Joel Fernandes
2026-01-26  9:59 ` [PATCH 4/7] sched_ext: Add a DL server for sched_ext tasks Andrea Righi
2026-02-02 19:50   ` Peter Zijlstra
2026-02-02 20:32     ` Andrea Righi
2026-02-02 21:10       ` Peter Zijlstra
2026-02-02 22:18         ` Andrea Righi
2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Andrea Righi
2026-01-26  9:59 ` [PATCH 5/7] sched/debug: Add support to change sched_ext server params Andrea Righi
2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Joel Fernandes
2026-01-26  9:59 ` [PATCH 6/7] selftests/sched_ext: Add test for sched_ext dl_server Andrea Righi
2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Andrea Righi
2026-01-26  9:59 ` [PATCH 7/7] selftests/sched_ext: Add test for DL server total_bw consistency Andrea Righi
2026-02-03 11:18   ` [tip: sched/core] " tip-bot2 for Joel Fernandes
2026-02-02 16:45 ` [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks Tejun Heo
2026-02-02 19:56   ` Peter Zijlstra
2026-02-02 20:20     ` Tejun Heo
  -- strict thread matches above, loose matches on Subject: below --
2026-01-20 21:50 [PATCHSET RESEND v11 " Andrea Righi
2026-01-20 21:50 ` [PATCH 4/7] sched_ext: Add a DL " Andrea Righi
2026-01-21 12:29   ` Peter Zijlstra
2026-01-21 12:49     ` Andrea Righi
2026-01-21 15:52       ` Peter Zijlstra
2026-01-21 17:27         ` Andrea Righi
2026-01-21 12:31   ` Peter Zijlstra
2026-01-21 12:51     ` Andrea Righi
2025-12-17  9:35 [PATCHSET v11 sched_ext/for-6.20] Add a deadline " Andrea Righi
2025-12-17  9:35 ` [PATCH 4/7] sched_ext: Add a DL " Andrea Righi
2025-12-17 15:49   ` Juri Lelli
2025-12-17 20:35     ` Andrea Righi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox