* [PATCHSET v8 sched_ext/for-6.18] Add a deadline server for sched_ext tasks
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

sched_ext tasks can be starved by long-running RT tasks, especially since
RT throttling was replaced by deadline servers to boost only SCHED_NORMAL
tasks.

Several users in the community have reported issues with RT stalling
sched_ext tasks. This is fairly common on distributions or environments
where applications like video compositors, audio services, etc. run as RT
tasks by default.

Example trace (showing a per-CPU kthread stalled due to the sway Wayland
compositor running as an RT task):

 runnable task stall (kworker/0:0[106377] failed to run for 5.043s)
 ...
 CPU 0   : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738
           curr=sway[994] class=rt_sched_class
   R kworker/0:0[106377] -5043ms
       scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
       sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000
       cpus=01

This is often perceived as a bug in the BPF schedulers, but in reality they
can't do much: RT tasks run outside their control and can potentially
consume 100% of the CPU bandwidth.

Fix this by adding a sched_ext deadline server as well so that sched_ext
tasks are also boosted and do not suffer starvation.

Two kselftests are also provided to verify that the starvation fix
works and that the bandwidth allocation is correct.

This patchset is also available in the following git branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server

Changes in v8:
 - Add tj's patch to de-couple balance and pick_task and avoid changing
   sched/core callbacks to propagate @rf
 - Simplify dl_se->dl_server check (suggested by PeterZ)
 - Small coding style fixes in the kselftests
 - Link to v7: https://lore.kernel.org/all/20250809184800.129831-1-joelagnelf@nvidia.com/

Changes in v7:
 - Rebased to Linus master
 - Link to v6: https://lore.kernel.org/all/20250702232944.3221001-1-joelagnelf@nvidia.com/

Changes in v6:
 - Added Acks to a few patches
 - Fixed a few nits suggested by Tejun
 - Link to v5: https://lore.kernel.org/all/20250620203234.3349930-1-joelagnelf@nvidia.com/

Changes in v5:
 - Added a kselftest (total_bw) to sched_ext to verify bandwidth values
   from debugfs
 - Address comment from Andrea about redundant rq clock invalidation
 - Link to v4: https://lore.kernel.org/all/20250617200523.1261231-1-joelagnelf@nvidia.com/

Changes in v4:
 - Fixed issues with hotplugged CPUs having their DL server bandwidth
   altered due to loading SCX
 - Fixed other issues
 - Rebased on Linus master
 - All sched_ext kselftests reliably pass now, also verified that the
   total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches
 - Link to v3: https://lore.kernel.org/all/20250613051734.4023260-1-joelagnelf@nvidia.com/

Changes in v3:
 - Removed code duplication in debugfs. Made ext interface separate
 - Fixed issue where rq_lock_irqsave was not used in the relinquish patch
 - Fixed running bw accounting issue in dl_server_remove_params
 - Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/

Changes in v2:
 - Fixed a hang related to using rq_lock instead of rq_lock_irqsave
 - Added support to remove BW of DL servers when they are switched to/from EXT
 - Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/

Andrea Righi (6):
      sched_ext: Exit early on hotplug events during attach
      sched/deadline: Add support to remove DL server's bandwidth contribution
      sched/deadline: Account ext server bandwidth
      sched/deadline: Allow to initialize DL server when needed
      sched_ext: Selectively enable ext and fair DL servers
      selftests/sched_ext: Add test for sched_ext dl_server

Joel Fernandes (9):
      sched/debug: Fix updating of ppos on server write ops
      sched/debug: Stop and start server based on if it was active
      sched/deadline: Clear the defer params
      sched/deadline: Return EBUSY if dl_bw_cpus is zero
      sched: Add a server arg to dl_server_update_idle_time()
      sched_ext: Add a DL server for sched_ext tasks
      sched/debug: Add support to change sched_ext server params
      sched/deadline: Fix DL server crash in inactive_timer callback
      selftests/sched_ext: Add test for DL server total_bw consistency

Tejun Heo (1):
      sched/deadline: De-couple balance and pick_task

 include/linux/sched.h                            |   2 +
 kernel/sched/core.c                              |  17 +-
 kernel/sched/deadline.c                          | 152 +++++++++---
 kernel/sched/debug.c                             | 161 ++++++++++---
 kernel/sched/ext.c                               | 175 ++++++++++++--
 kernel/sched/fair.c                              |   4 +-
 kernel/sched/idle.c                              |   2 +-
 kernel/sched/sched.h                             |  15 +-
 kernel/sched/topology.c                          |   5 +
 tools/testing/selftests/sched_ext/Makefile       |   2 +
 tools/testing/selftests/sched_ext/rt_stall.bpf.c |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c     | 214 +++++++++++++++++
 tools/testing/selftests/sched_ext/total_bw.c     | 281 +++++++++++++++++++++++
 13 files changed, 968 insertions(+), 85 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
 create mode 100644 tools/testing/selftests/sched_ext/total_bw.c

* [PATCH 01/16] sched_ext: Exit early on hotplug events during attach
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

There is no need to complete the scx initialization if the current
scheduler fails to attach due to a hotplug event.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7dedc9a16281b..63d9273278e5e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5512,7 +5512,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops)
 	return ERR_PTR(ret);
 }
 
-static void check_hotplug_seq(struct scx_sched *sch,
+static int check_hotplug_seq(struct scx_sched *sch,
 			      const struct sched_ext_ops *ops)
 {
 	unsigned long long global_hotplug_seq;
@@ -5529,8 +5529,11 @@ static void check_hotplug_seq(struct scx_sched *sch,
 				 SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG,
 				 "expected hotplug seq %llu did not match actual %llu",
 				 ops->hotplug_seq, global_hotplug_seq);
+			return -EBUSY;
 		}
 	}
+
+	return 0;
 }
 
 static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
@@ -5627,11 +5630,15 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 		if (((void (**)(void))ops)[i])
 			set_bit(i, sch->has_op);
 
-	check_hotplug_seq(sch, ops);
-	scx_idle_update_selcpu_topology(ops);
+	ret = check_hotplug_seq(sch, ops);
+	if (!ret)
+		scx_idle_update_selcpu_topology(ops);
 
 	cpus_read_unlock();
 
+	if (ret)
+		goto err_disable;
+
 	ret = validate_ops(sch, ops);
 	if (ret)
 		goto err_disable;
-- 
2.51.0


* [PATCH 02/16] sched/debug: Fix updating of ppos on server write ops
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

Updating "ppos" on error conditions does not make much sense. The usual
pattern is to return the error code directly without modifying the
position, or to modify the position on success and return the number of
bytes written.

Since dl_server_apply_params() returns 0 on success, there is also no
point in folding its return value into cnt before updating ppos. Fix it
by returning the error code directly on failure, and the number of
bytes written on success.
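
For context, a minimal sketch of the pattern the code is moved to
(illustrative handler, not from this patch; apply_value() is a
hypothetical stand-in for dl_server_apply_params()):

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/kernel.h>

/* Hypothetical "apply" step standing in for dl_server_apply_params(). */
static int apply_value(u64 value)
{
	return value ? 0 : -EINVAL;
}

static ssize_t example_write(struct file *filp, const char __user *ubuf,
			     size_t cnt, loff_t *ppos)
{
	u64 value;
	int ret;

	ret = kstrtoull_from_user(ubuf, cnt, 10, &value);
	if (ret)
		return ret;		/* error: *ppos left untouched */

	ret = apply_value(value);
	if (ret < 0)
		return ret;		/* error: *ppos left untouched */

	*ppos += cnt;			/* success: advance the position... */
	return cnt;			/* ...and report the bytes consumed */
}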

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/debug.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 3f06ab84d53f0..dbe2aee8628ce 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -345,8 +345,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
 	u64 runtime, period;
+	int retval = 0;
 	size_t err;
-	int retval;
 	u64 value;
 
 	err = kstrtoull_from_user(ubuf, cnt, 10, &value);
@@ -382,8 +382,6 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		}
 
 		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
-		if (retval)
-			cnt = retval;
 
 		if (!runtime)
 			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
@@ -391,6 +389,9 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 
 		if (rq->cfs.h_nr_queued)
 			dl_server_start(&rq->fair_server);
+
+		if (retval < 0)
+			return retval;
 	}
 
 	*ppos += cnt;
-- 
2.51.0


* [PATCH 03/16] sched/debug: Stop and start server based on if it was active
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

Currently the DL server interface for applying parameters checks CFS
internals to identify whether the server is active. This is error-prone
and makes it harder to add new servers in the future.

Fix it by using dl_server_active(), which is also what the DL server
code uses to determine whether the DL server was started.
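
For reference, dl_server_active() is a trivial accessor on the server
entity itself (reproduced from memory from kernel/sched/sched.h, so the
exact shape may differ across kernel versions):

static inline bool dl_server_active(struct sched_dl_entity *dl_se)
{
	return dl_se->dl_server_active;
}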

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/debug.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index dbe2aee8628ce..e71f6618c1a6a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -354,6 +354,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		return err;
 
 	scoped_guard (rq_lock_irqsave, rq) {
+		bool is_active;
+
 		runtime  = rq->fair_server.dl_runtime;
 		period = rq->fair_server.dl_period;
 
@@ -376,7 +378,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 			return  -EINVAL;
 		}
 
-		if (rq->cfs.h_nr_queued) {
+		is_active = dl_server_active(&rq->fair_server);
+		if (is_active) {
 			update_rq_clock(rq);
 			dl_server_stop(&rq->fair_server);
 		}
@@ -387,7 +390,7 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
 					cpu_of(rq));
 
-		if (rq->cfs.h_nr_queued)
+		if (is_active)
 			dl_server_start(&rq->fair_server);
 
 		if (retval < 0)
-- 
2.51.0


* [PATCH 04/16] sched/deadline: Clear the defer params
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

The defer params were not cleared in __dl_clear_params(). Clear them.

Without this, some of my test cases flake and the DL timer does not
start correctly, AFAICS.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/deadline.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e2d51f4306b31..3c478a1b2890d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3381,6 +3381,9 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
 	dl_se->dl_non_contending	= 0;
 	dl_se->dl_overrun		= 0;
 	dl_se->dl_server		= 0;
+	dl_se->dl_defer			= 0;
+	dl_se->dl_defer_running		= 0;
+	dl_se->dl_defer_armed		= 0;
 
 #ifdef CONFIG_RT_MUTEXES
 	dl_se->pi_se			= dl_se;
-- 
2.51.0


* [PATCH 05/16] sched/deadline: Return EBUSY if dl_bw_cpus is zero
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

Hotplugged CPUs coming online do an enqueue but are not yet part of any
root domain containing cpu_active() CPUs, so in this case don't touch
the accounting; we can retry later. Without this patch, the sched_ext
hotplug selftest crashes due to a divide by zero.
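
For context, the division by the CPU count happens in the bandwidth
accounting helpers; roughly (paraphrased from kernel/sched/deadline.c,
modulo version drift):

static inline
void __dl_sub(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
{
	dl_b->total_bw -= tsk_bw;
	__dl_update(dl_b, (s32)tsk_bw / cpus);	/* cpus == 0 -> divide by zero */
}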

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/deadline.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3c478a1b2890d..753e50b1e86fc 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1689,7 +1689,12 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 	cpus = dl_bw_cpus(cpu);
 	cap = dl_bw_capacity(cpu);
 
-	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
+	/*
+	 * Hotplugged CPUs coming online do an enqueue but are not a part of any
+	 * root domain containing cpu_active() CPUs. So in this case, don't mess
+	 * with accounting and we can retry later.
+	 */
+	if (!cpus || __dl_overflow(dl_b, cap, old_bw, new_bw))
 		return -EBUSY;
 
 	if (init) {
-- 
2.51.0


* [PATCH 06/16] sched: Add a server arg to dl_server_update_idle_time()
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

Since we are adding more servers, make dl_server_update_idle_time()
accept a server argument rather than operating on a specific server.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/deadline.c | 16 ++++++++--------
 kernel/sched/fair.c     |  2 +-
 kernel/sched/idle.c     |  2 +-
 kernel/sched/sched.h    |  3 ++-
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 753e50b1e86fc..75289385f310a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1549,26 +1549,26 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
  * as time available for the fair server, avoiding a penalty for the
  * rt scheduler that did not consumed that time.
  */
-void dl_server_update_idle_time(struct rq *rq, struct task_struct *p)
+void dl_server_update_idle_time(struct rq *rq, struct task_struct *p,
+			       struct sched_dl_entity *rq_dl_server)
 {
 	s64 delta_exec;
 
-	if (!rq->fair_server.dl_defer)
+	if (!rq_dl_server->dl_defer)
 		return;
 
 	/* no need to discount more */
-	if (rq->fair_server.runtime < 0)
+	if (rq_dl_server->runtime < 0)
 		return;
 
 	delta_exec = rq_clock_task(rq) - p->se.exec_start;
 	if (delta_exec < 0)
 		return;
 
-	rq->fair_server.runtime -= delta_exec;
-
-	if (rq->fair_server.runtime < 0) {
-		rq->fair_server.dl_defer_running = 0;
-		rq->fair_server.runtime = 0;
+	rq_dl_server->runtime -= delta_exec;
+	if (rq_dl_server->runtime < 0) {
+		rq_dl_server->dl_defer_running = 0;
+		rq_dl_server->runtime = 0;
 	}
 
 	p->se.exec_start = rq_clock_task(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c2..7573baca9a85a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6917,7 +6917,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
 		/* Account for idle runtime */
 		if (!rq->nr_running)
-			dl_server_update_idle_time(rq, rq->curr);
+			dl_server_update_idle_time(rq, rq->curr, &rq->fair_server);
 		dl_server_start(&rq->fair_server);
 	}
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c39b089d4f09b..63c8b17d8e7cf 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -454,7 +454,7 @@ static void wakeup_preempt_idle(struct rq *rq, struct task_struct *p, int flags)
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct task_struct *next)
 {
-	dl_server_update_idle_time(rq, prev);
+	dl_server_update_idle_time(rq, prev, &rq->fair_server);
 	scx_update_idle(rq, false, true);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f75..f3089d0b76493 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -388,7 +388,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 extern void sched_init_dl_servers(void);
 
 extern void dl_server_update_idle_time(struct rq *rq,
-		    struct task_struct *p);
+		    struct task_struct *p,
+		    struct sched_dl_entity *rq_dl_server);
 extern void fair_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
-- 
2.51.0


* [PATCH 07/16] sched_ext: Add a DL server for sched_ext tasks
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel, Luigi De Matteis

From: Joel Fernandes <joelagnelf@nvidia.com>

sched_ext currently suffers starvation due to RT tasks. The same
workload, when converted to EXT, can get zero runtime if RT tasks are
running 100% of the time, causing EXT processes to stall. Fix it by
adding a DL server for EXT.

A kselftest is also provided later to verify:

./runner -t rt_stall
===== START =====
TEST: rt_stall
DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
OUTPUT:
TAP version 13
1..1
ok 1 PASS: CFS task got more than 4.00% of runtime
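
For reference, this kind of starvation can be reproduced with a trivial
userspace RT spinner pinned to one CPU (illustrative only, not part of
the selftest):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 50 };
	cpu_set_t mask;

	/* Pin to CPU 0 so the spinner monopolizes a single CPU. */
	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask))
		perror("sched_setaffinity");

	/* Become an RT task and burn 100% of the CPU. */
	if (sched_setscheduler(0, SCHED_FIFO, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}
	for (;;)
		;
}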

Cc: Luigi De Matteis <ldematteis123@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/core.c     |  3 ++
 kernel/sched/deadline.c |  2 +-
 kernel/sched/ext.c      | 62 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h    |  2 ++
 4 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be00629f0ba4c..f1a7ad7e560fb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8808,6 +8808,9 @@ void __init sched_init(void)
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
 		fair_server_init(rq);
+#ifdef CONFIG_SCHED_CLASS_EXT
+		ext_server_init(rq);
+#endif
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = rq;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 75289385f310a..bfa08eba1d1b7 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1510,7 +1510,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 	 * The fair server (sole dl_server) does not account for real-time
 	 * workload because it is running fair work.
 	 */
-	if (dl_se == &rq->fair_server)
+	if (dl_se->dl_server)
 		return;
 
 #ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 63d9273278e5e..f7e2f9157496b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1923,6 +1923,9 @@ static void update_curr_scx(struct rq *rq)
 		if (!curr->scx.slice)
 			touch_core_sched(rq, curr);
 	}
+
+	if (dl_server_active(&rq->ext_server))
+		dl_server_update(&rq->ext_server, delta_exec);
 }
 
 static bool scx_dsq_priq_less(struct rb_node *node_a,
@@ -2410,6 +2413,15 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 	if (enq_flags & SCX_ENQ_WAKEUP)
 		touch_core_sched(rq, p);
 
+	if (rq->scx.nr_running == 1) {
+		/* Account for idle runtime */
+		if (!rq->nr_running)
+			dl_server_update_idle_time(rq, rq->curr, &rq->ext_server);
+
+		/* Start dl_server if this is the first task being enqueued */
+		dl_server_start(&rq->ext_server);
+	}
+
 	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
 out:
 	rq->scx.flags &= ~SCX_RQ_IN_WAKEUP;
@@ -2509,6 +2521,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
 	sub_nr_running(rq, 1);
 
 	dispatch_dequeue(rq, p);
+
+	/* Stop the server if this was the last task */
+	if (rq->scx.nr_running == 0)
+		dl_server_stop(&rq->ext_server);
+
 	return true;
 }
 
@@ -4045,6 +4062,15 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
 static void switched_from_scx(struct rq *rq, struct task_struct *p)
 {
 	scx_disable_task(p);
+
+	/*
+	 * After class switch, if the DL server is still active, restart it so
+	 * that DL timers will be queued, in case SCX switched to higher class.
+	 */
+	if (dl_server_active(&rq->ext_server)) {
+		dl_server_stop(&rq->ext_server);
+		dl_server_start(&rq->ext_server);
+	}
 }
 
 static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
@@ -7311,8 +7337,8 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu)
  * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
  * schedutil cpufreq governor chooses the target frequency.
  *
- * The actual performance level chosen, CPU grouping, and the overhead and
- * latency of the operations are dependent on the hardware and cpufreq driver in
+ * The actual performance level chosen, CPU grouping, and the overhead and latency
+ * of the operations are dependent on the hardware and cpufreq driver in
  * use. Consult hardware and cpufreq documentation for more information. The
  * current performance level can be monitored using scx_bpf_cpuperf_cur().
  */
@@ -7604,6 +7630,38 @@ BTF_ID_FLAGS(func, scx_bpf_now)
 BTF_ID_FLAGS(func, scx_bpf_events, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(scx_kfunc_ids_any)
 
+/*
+ * Check if ext scheduler has tasks ready to run.
+ */
+static bool ext_server_has_tasks(struct sched_dl_entity *dl_se)
+{
+	return !!dl_se->rq->scx.nr_running;
+}
+
+/*
+ * Select the next task to run from the ext scheduling class.
+ */
+static struct task_struct *ext_server_pick_task(struct sched_dl_entity *dl_se,
+						void *flags)
+{
+	struct rq_flags *rf = flags;
+
+	balance_scx(dl_se->rq, dl_se->rq->curr, rf);
+	return pick_task_scx(dl_se->rq, rf);
+}
+
+/*
+ * Initialize the ext server deadline entity.
+ */
+void ext_server_init(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se = &rq->ext_server;
+
+	init_dl_entity(dl_se);
+
+	dl_server_init(dl_se, rq, ext_server_has_tasks, ext_server_pick_task);
+}
+
 static const struct btf_kfunc_id_set scx_kfunc_set_any = {
 	.owner			= THIS_MODULE,
 	.set			= &scx_kfunc_ids_any,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f3089d0b76493..45add55ed161e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -391,6 +391,7 @@ extern void dl_server_update_idle_time(struct rq *rq,
 		    struct task_struct *p,
 		    struct sched_dl_entity *rq_dl_server);
 extern void fair_server_init(struct rq *rq);
+extern void ext_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
@@ -1125,6 +1126,7 @@ struct rq {
 #endif
 
 	struct sched_dl_entity	fair_server;
+	struct sched_dl_entity	ext_server;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this CPU: */
-- 
2.51.0


* [PATCH 08/16] sched/debug: Add support to change sched_ext server params
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

When a sched_ext scheduler is loaded, CFS tasks are converted to run in
the sched_ext class. Add support to modify the ext server parameters,
similar to how the fair server parameters are modified.

Re-use common code between ext and fair servers as needed.
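
A minimal userspace sketch of poking the new knobs (paths inferred from
the debugfs layout in the diff below; assumes debugfs is mounted at
/sys/kernel/debug and values are in nanoseconds, as for the fair
server):

#include <stdio.h>

int main(void)
{
	const char *path = "/sys/kernel/debug/sched/ext_server/cpu0/runtime";
	unsigned long long runtime;
	FILE *f;

	/* Read the current ext server runtime for CPU 0. */
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &runtime) == 1)
		printf("ext server runtime: %llu ns\n", runtime);
	fclose(f);

	/* Lower it to 40 ms. */
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "40000000\n");
	fclose(f);
	return 0;
}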

[ arighi: Use dl_se->dl_server to determine if dl_se is a DL server, as
          suggested by PeterZ. ]

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/debug.c | 149 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 125 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e71f6618c1a6a..00ad35b812f76 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -336,14 +336,16 @@ enum dl_param {
 	DL_PERIOD,
 };
 
-static unsigned long fair_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */
-static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
+static unsigned long dl_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */
+static unsigned long dl_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
 
-static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf,
-				       size_t cnt, loff_t *ppos, enum dl_param param)
+static ssize_t sched_server_write_common(struct file *filp, const char __user *ubuf,
+					 size_t cnt, loff_t *ppos, enum dl_param param,
+					 void *server)
 {
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
+	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
 	u64 runtime, period;
 	int retval = 0;
 	size_t err;
@@ -356,8 +358,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 	scoped_guard (rq_lock_irqsave, rq) {
 		bool is_active;
 
-		runtime  = rq->fair_server.dl_runtime;
-		period = rq->fair_server.dl_period;
+		runtime = dl_se->dl_runtime;
+		period = dl_se->dl_period;
 
 		switch (param) {
 		case DL_RUNTIME:
@@ -373,25 +375,25 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		}
 
 		if (runtime > period ||
-		    period > fair_server_period_max ||
-		    period < fair_server_period_min) {
+		    period > dl_server_period_max ||
+		    period < dl_server_period_min) {
 			return  -EINVAL;
 		}
 
-		is_active = dl_server_active(&rq->fair_server);
+		is_active = dl_server_active(dl_se);
 		if (is_active) {
 			update_rq_clock(rq);
-			dl_server_stop(&rq->fair_server);
+			dl_server_stop(dl_se);
 		}
 
-		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
+		retval = dl_server_apply_params(dl_se, runtime, period, 0);
 
 		if (!runtime)
-			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
-					cpu_of(rq));
+			printk_deferred("%s server disabled on CPU %d, system may crash due to starvation.\n",
+					server == &rq->fair_server ? "Fair" : "Ext", cpu_of(rq));
 
 		if (is_active)
-			dl_server_start(&rq->fair_server);
+			dl_server_start(dl_se);
 
 		if (retval < 0)
 			return retval;
@@ -401,36 +403,42 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 	return cnt;
 }
 
-static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param)
+static size_t sched_server_show_common(struct seq_file *m, void *v, enum dl_param param,
+				       void *server)
 {
-	unsigned long cpu = (unsigned long) m->private;
-	struct rq *rq = cpu_rq(cpu);
+	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
 	u64 value;
 
 	switch (param) {
 	case DL_RUNTIME:
-		value = rq->fair_server.dl_runtime;
+		value = dl_se->dl_runtime;
 		break;
 	case DL_PERIOD:
-		value = rq->fair_server.dl_period;
+		value = dl_se->dl_period;
 		break;
 	}
 
 	seq_printf(m, "%llu\n", value);
 	return 0;
-
 }
 
 static ssize_t
 sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
 				size_t cnt, loff_t *ppos)
 {
-	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME);
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME,
+					&rq->fair_server);
 }
 
 static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
 {
-	return sched_fair_server_show(m, v, DL_RUNTIME);
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_RUNTIME, &rq->fair_server);
 }
 
 static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
@@ -446,16 +454,55 @@ static const struct file_operations fair_server_runtime_fops = {
 	.release	= single_release,
 };
 
+static ssize_t
+sched_ext_server_runtime_write(struct file *filp, const char __user *ubuf,
+			       size_t cnt, loff_t *ppos)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME,
+					&rq->ext_server);
+}
+
+static int sched_ext_server_runtime_show(struct seq_file *m, void *v)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_RUNTIME, &rq->ext_server);
+}
+
+static int sched_ext_server_runtime_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_ext_server_runtime_show, inode->i_private);
+}
+
+static const struct file_operations ext_server_runtime_fops = {
+	.open		= sched_ext_server_runtime_open,
+	.write		= sched_ext_server_runtime_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static ssize_t
 sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
 			       size_t cnt, loff_t *ppos)
 {
-	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD);
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD,
+					&rq->fair_server);
 }
 
 static int sched_fair_server_period_show(struct seq_file *m, void *v)
 {
-	return sched_fair_server_show(m, v, DL_PERIOD);
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_PERIOD, &rq->fair_server);
 }
 
 static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
@@ -471,6 +518,38 @@ static const struct file_operations fair_server_period_fops = {
 	.release	= single_release,
 };
 
+static ssize_t
+sched_ext_server_period_write(struct file *filp, const char __user *ubuf,
+			      size_t cnt, loff_t *ppos)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD,
+					&rq->ext_server);
+}
+
+static int sched_ext_server_period_show(struct seq_file *m, void *v)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_PERIOD, &rq->ext_server);
+}
+
+static int sched_ext_server_period_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_ext_server_period_show, inode->i_private);
+}
+
+static const struct file_operations ext_server_period_fops = {
+	.open		= sched_ext_server_period_open,
+	.write		= sched_ext_server_period_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static struct dentry *debugfs_sched;
 
 static void debugfs_fair_server_init(void)
@@ -494,6 +573,27 @@ static void debugfs_fair_server_init(void)
 	}
 }
 
+static void debugfs_ext_server_init(void)
+{
+	struct dentry *d_ext;
+	unsigned long cpu;
+
+	d_ext = debugfs_create_dir("ext_server", debugfs_sched);
+	if (!d_ext)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct dentry *d_cpu;
+		char buf[32];
+
+		snprintf(buf, sizeof(buf), "cpu%lu", cpu);
+		d_cpu = debugfs_create_dir(buf, d_ext);
+
+		debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &ext_server_runtime_fops);
+		debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &ext_server_period_fops);
+	}
+}
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;
@@ -532,6 +632,7 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
 	debugfs_fair_server_init();
+	debugfs_ext_server_init();
 
 	return 0;
 }
-- 
2.51.0


* [PATCH 09/16] sched/deadline: Add support to remove DL server's bandwidth contribution
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

When tasks switch from sched_ext to FAIR and vice versa, we need
support for removing the bandwidth contribution of either DL server.
Add support for that.

Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/deadline.c | 31 +++++++++++++++++++++++++++++++
 kernel/sched/sched.h    |  1 +
 2 files changed, 32 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index bfa08eba1d1b7..31d397aa777b9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1707,6 +1707,12 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 		dl_rq_change_utilization(rq, dl_se, new_bw);
 	}
 
+	/* Clear these so that the dl_server is reinitialized */
+	if (new_bw == 0) {
+		dl_se->dl_defer = 0;
+		dl_se->dl_server = 0;
+	}
+
 	dl_se->dl_runtime = runtime;
 	dl_se->dl_deadline = period;
 	dl_se->dl_period = period;
@@ -1720,6 +1726,31 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 	return retval;
 }
 
+/**
+ * dl_server_remove_params - Remove bandwidth reservation for a DL server
+ * @dl_se: The DL server entity to remove bandwidth for
+ *
+ * This function removes the bandwidth reservation for a DL server entity,
+ * cleaning up all bandwidth accounting and server state.
+ *
+ * Returns: 0 on success, negative error code on failure
+ */
+int dl_server_remove_params(struct sched_dl_entity *dl_se)
+{
+	if (!dl_se->dl_server)
+		return 0; /* Already disabled */
+
+	/*
+	 * First dequeue if still queued. It should not be queued since
+	 * we call this only after the last dl_server_stop().
+	 */
+	if (WARN_ON_ONCE(on_dl_rq(dl_se)))
+		dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+
+	/* Remove bandwidth reservation */
+	return dl_server_apply_params(dl_se, 0, dl_se->dl_period, false);
+}
+
 /*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 45add55ed161e..928874ab9b2db 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -395,6 +395,7 @@ extern void ext_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
+extern int dl_server_remove_params(struct sched_dl_entity *dl_se);
 
 static inline bool dl_server_active(struct sched_dl_entity *dl_se)
 {
-- 
2.51.0


* [PATCH 10/16] sched/deadline: Account ext server bandwidth
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

Always account for both the ext_server and fair_server bandwidths,
especially during CPU hotplug operations. Ignoring either can lead to
imbalances in total_bw when sched_ext schedulers are active and CPUs are
brought online / offline.
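
As a worked example of what the accounting has to preserve: with the
default 50 ms / 1000 ms server parameters, each server reserves
50/1000 = 5% of a CPU, i.e. (50 << 20) / 1000 ~= 52428 out of
BW_UNIT (1 << 20) in fixed point (assuming the usual BW_SHIFT = 20
from kernel/sched/sched.h). total_bw on an N-CPU root domain should
therefore stay at N * 52428 regardless of which server, fair or ext,
currently holds the reservation.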

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/deadline.c | 29 +++++++++++++++++++++--------
 kernel/sched/topology.c |  5 +++++
 2 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 31d397aa777b9..165b12553e10d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2981,9 +2981,17 @@ void dl_clear_root_domain(struct root_domain *rd)
 	 * them, we need to account for them here explicitly.
 	 */
 	for_each_cpu(i, rd->span) {
-		struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
+		struct sched_dl_entity *dl_se;
 
-		if (dl_server(dl_se) && cpu_active(i))
+		if (!cpu_active(i))
+			continue;
+
+		dl_se = &cpu_rq(i)->fair_server;
+		if (dl_server(dl_se))
+			__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
+
+		dl_se = &cpu_rq(i)->ext_server;
+		if (dl_server(dl_se))
 			__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
 	}
 }
@@ -3478,6 +3486,7 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 	struct dl_bw *dl_b;
 	bool overflow = 0;
 	u64 fair_server_bw = 0;
+	u64 ext_server_bw = 0;
 
 	rcu_read_lock_sched();
 	dl_b = dl_bw_of(cpu);
@@ -3510,27 +3519,31 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 		cap -= arch_scale_cpu_capacity(cpu);
 
 		/*
-		 * cpu is going offline and NORMAL tasks will be moved away
-		 * from it. We can thus discount dl_server bandwidth
-		 * contribution as it won't need to be servicing tasks after
-		 * the cpu is off.
+		 * cpu is going offline and NORMAL and EXT tasks will be
+		 * moved away from it. We can thus discount dl_server
+		 * bandwidth contribution as it won't need to be servicing
+		 * tasks after the cpu is off.
 		 */
 		if (cpu_rq(cpu)->fair_server.dl_server)
 			fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw;
 
+		if (cpu_rq(cpu)->ext_server.dl_server)
+			ext_server_bw = cpu_rq(cpu)->ext_server.dl_bw;
+
 		/*
 		 * Not much to check if no DEADLINE bandwidth is present.
 		 * dl_servers we can discount, as tasks will be moved out the
 		 * offlined CPUs anyway.
 		 */
-		if (dl_b->total_bw - fair_server_bw > 0) {
+		if (dl_b->total_bw - fair_server_bw - ext_server_bw > 0) {
 			/*
 			 * Leaving at least one CPU for DEADLINE tasks seems a
 			 * wise thing to do. As said above, cpu is not offline
 			 * yet, so account for that.
 			 */
 			if (dl_bw_cpus(cpu) - 1)
-				overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0);
+				overflow = __dl_overflow(dl_b, cap,
+							 fair_server_bw + ext_server_bw, 0);
 			else
 				overflow = 1;
 		}
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 977e133bb8a44..f4574b0cf8ebc 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -508,6 +508,11 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (rq->fair_server.dl_server)
 		__dl_server_attach_root(&rq->fair_server, rq);
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (rq->ext_server.dl_server)
+		__dl_server_attach_root(&rq->ext_server, rq);
+#endif
+
 	rq_unlock_irqrestore(rq, &rf);
 
 	if (old_rd)
-- 
2.51.0


* [PATCH 11/16] sched/deadline: Allow to initialize DL server when needed
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

When switching between fair and sched_ext, we need to initialize the
bandwidth contribution of the DL server independently for each class.

Add support for on-demand initialization to handle such transitions.
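
For reference, dl_server_init_params() below keeps the existing
defaults (runtime = 50 ms, period = 1000 ms), i.e. each DL server
reserves 50/1000 = 5% of a CPU, which is consistent with the >4%
runtime threshold checked by the rt_stall selftest earlier in the
series.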

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/deadline.c | 36 +++++++++++++++++++++++++++++-------
 kernel/sched/sched.h    |  1 +
 2 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 165b12553e10d..b744187ec6372 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1583,6 +1583,32 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 	}
 }
 
+/**
+ * dl_server_init_params - Initialize bandwidth reservation for a DL server
+ * @dl_se: The DL server entity to remove bandwidth for
+ *
+ * This function initializes the bandwidth reservation for a DL server
+ * entity, its bandwidth accounting and server state.
+ *
+ * Returns: 0 on success, negative error code on failure
+ */
+int dl_server_init_params(struct sched_dl_entity *dl_se)
+{
+	u64 runtime =  50 * NSEC_PER_MSEC;
+	u64 period = 1000 * NSEC_PER_MSEC;
+	int err;
+
+	err = dl_server_apply_params(dl_se, runtime, period, 1);
+	if (err)
+		return err;
+
+	dl_se->dl_server = 1;
+	dl_se->dl_defer = 1;
+	setup_new_dl_entity(dl_se);
+
+	return err;
+}
+
 void dl_server_start(struct sched_dl_entity *dl_se)
 {
 	struct rq *rq = dl_se->rq;
@@ -1638,8 +1664,7 @@ void sched_init_dl_servers(void)
 	struct sched_dl_entity *dl_se;
 
 	for_each_online_cpu(cpu) {
-		u64 runtime =  50 * NSEC_PER_MSEC;
-		u64 period = 1000 * NSEC_PER_MSEC;
+		int err;
 
 		rq = cpu_rq(cpu);
 
@@ -1649,11 +1674,8 @@ void sched_init_dl_servers(void)
 
 		WARN_ON(dl_server(dl_se));
 
-		dl_server_apply_params(dl_se, runtime, period, 1);
-
-		dl_se->dl_server = 1;
-		dl_se->dl_defer = 1;
-		setup_new_dl_entity(dl_se);
+		err = dl_server_init_params(dl_se);
+		WARN_ON_ONCE(err);
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 928874ab9b2db..1fbf4ffbcb208 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -395,6 +395,7 @@ extern void ext_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
+extern int dl_server_init_params(struct sched_dl_entity *dl_se);
 extern int dl_server_remove_params(struct sched_dl_entity *dl_se);
 
 static inline bool dl_server_active(struct sched_dl_entity *dl_se)
-- 
2.51.0


* [PATCH 12/16] sched_ext: Selectively enable ext and fair DL servers
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

Enable or disable the appropriate DL servers (ext and fair) depending on
whether an scx scheduler is started in full or partial mode:

 - in full mode, disable the fair DL server and enable the ext DL server
   on all online CPUs,
 - in partial mode (%SCX_OPS_SWITCH_PARTIAL), keep both fair and ext DL
   servers active to support tasks in both scheduling classes.

Additionally, handle CPU hotplug events by selectively enabling or
disabling the relevant DL servers on the CPU that is going
offline/online. This ensures correct bandwidth reservation also when
CPUs are brought online or offline.
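
For context, partial mode is something the BPF scheduler opts into via
its ops flags; a hypothetical minimal BPF-side fragment (the scheduler
name and single-struct layout are illustrative, all callbacks left at
their defaults):

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

SEC(".struct_ops.link")
struct sched_ext_ops minimal_ops = {
	/* Only tasks explicitly switched to SCHED_EXT are managed, so
	 * both the fair and ext DL servers must stay active. */
	.flags	= SCX_OPS_SWITCH_PARTIAL,
	.name	= "minimal",
};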

Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 97 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 87 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f7e2f9157496b..69163927a29cd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3603,6 +3603,57 @@ static void set_cpus_allowed_scx(struct task_struct *p,
 				 p, (struct cpumask *)p->cpus_ptr);
 }
 
+static void dl_server_on(struct rq *rq, bool switch_all)
+{
+	struct rq_flags rf;
+	int err;
+
+	rq_lock_irqsave(rq, &rf);
+	update_rq_clock(rq);
+
+	if (switch_all) {
+		/*
+		 * If all fair tasks are moved to the scx scheduler, we
+		 * don't need the fair DL servers anymore, so remove it.
+		 *
+		 * When the current scx scheduler is unloaded, the fair DL
+		 * server will be re-initialized.
+		 */
+		if (dl_server_active(&rq->fair_server))
+			dl_server_stop(&rq->fair_server);
+		dl_server_remove_params(&rq->fair_server);
+	}
+
+	err = dl_server_init_params(&rq->ext_server);
+	WARN_ON_ONCE(err);
+
+	rq_unlock_irqrestore(rq, &rf);
+}
+
+static void dl_server_off(struct rq *rq, bool switch_all)
+{
+	struct rq_flags rf;
+	int err;
+
+	rq_lock_irqsave(rq, &rf);
+	update_rq_clock(rq);
+
+	if (dl_server_active(&rq->ext_server))
+		dl_server_stop(&rq->ext_server);
+	dl_server_remove_params(&rq->ext_server);
+
+	if (switch_all) {
+		/*
+		 * Re-initialize the fair DL server if it was previously disabled
+		 * because all fair tasks had been moved to the ext class.
+		 */
+		err = dl_server_init_params(&rq->fair_server);
+		WARN_ON_ONCE(err);
+	}
+
+	rq_unlock_irqrestore(rq, &rf);
+}
+
 static void handle_hotplug(struct rq *rq, bool online)
 {
 	struct scx_sched *sch = scx_root;
@@ -3618,9 +3669,20 @@ static void handle_hotplug(struct rq *rq, bool online)
 	if (unlikely(!sch))
 		return;
 
-	if (scx_enabled())
+	if (scx_enabled()) {
+		bool is_switching_all = READ_ONCE(scx_switching_all);
+
 		scx_idle_update_selcpu_topology(&sch->ops);
 
+		/*
+		 * Update ext and fair DL servers on hotplug events.
+		 */
+		if (online)
+			dl_server_on(rq, is_switching_all);
+		else
+			dl_server_off(rq, is_switching_all);
+	}
+
 	if (online && SCX_HAS_OP(sch, cpu_online))
 		SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cpu_online, NULL, cpu);
 	else if (!online && SCX_HAS_OP(sch, cpu_offline))
@@ -4969,6 +5031,7 @@ static void scx_disable_workfn(struct kthread_work *work)
 	struct scx_exit_info *ei = sch->exit_info;
 	struct scx_task_iter sti;
 	struct task_struct *p;
+	bool is_switching_all = READ_ONCE(scx_switching_all);
 	int kind, cpu;
 
 	kind = atomic_read(&sch->exit_kind);
@@ -5024,6 +5087,22 @@ static void scx_disable_workfn(struct kthread_work *work)
 
 	scx_init_task_enabled = false;
 
+	for_each_online_cpu(cpu) {
+		struct rq *rq = cpu_rq(cpu);
+
+		/*
+		 * Invalidate all the rq clocks to prevent getting outdated
+		 * rq clocks from a previous scx scheduler.
+		 */
+		scx_rq_clock_invalidate(rq);
+
+		/*
+		 * We are unloading the sched_ext scheduler, we do not need its
+		 * DL server bandwidth anymore, remove it for all CPUs.
+		 */
+		dl_server_off(rq, is_switching_all);
+	}
+
 	scx_task_iter_start(&sti);
 	while ((p = scx_task_iter_next_locked(&sti))) {
 		const struct sched_class *old_class = p->sched_class;
@@ -5047,15 +5126,6 @@ static void scx_disable_workfn(struct kthread_work *work)
 	scx_task_iter_stop(&sti);
 	percpu_up_write(&scx_fork_rwsem);
 
-	/*
-	 * Invalidate all the rq clocks to prevent getting outdated
-	 * rq clocks from a previous scx scheduler.
-	 */
-	for_each_possible_cpu(cpu) {
-		struct rq *rq = cpu_rq(cpu);
-		scx_rq_clock_invalidate(rq);
-	}
-
 	/* no task is on scx, turn off all the switches and flush in-progress calls */
 	static_branch_disable(&__scx_enabled);
 	bitmap_zero(sch->has_op, SCX_OPI_END);
@@ -5796,6 +5866,13 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 		check_class_changed(task_rq(p), p, old_class, p->prio);
 	}
 	scx_task_iter_stop(&sti);
+
+	/*
+	 * Enable the ext DL server on all online CPUs.
+	 */
+	for_each_online_cpu(cpu)
+		dl_server_on(cpu_rq(cpu), !(ops->flags & SCX_OPS_SWITCH_PARTIAL));
+
 	percpu_up_write(&scx_fork_rwsem);
 
 	scx_bypass(false);
-- 
2.51.0


* [PATCH 13/16] sched/deadline: Fix DL server crash in inactive_timer callback
From: Andrea Righi @ 2025-09-03  9:33 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

When sched_ext is rapidly disabled/enabled (the reload_loop selftest),
the following crash is observed. This happens because the timer handler
could not be cancelled and still fires even though the dl_server
bandwidth may have been removed.

hrtimer_try_to_cancel() does not guarantee timer cancellation. This
results in a NULL pointer dereference, because dl_task_of() on a
server's dl_se yields a bogus 'p'.

This likely happens because the timer may be about to run, but its
softirq has not executed yet. In that window hrtimer_try_to_cancel()
cannot stop the timer from firing, yet dl_server_apply_params() has
already cleared dl_se->dl_server. When the timer handler eventually
runs, it crashes.

[   24.771835] BUG: kernel NULL pointer dereference, address: 000000000000006c
[   24.772097] #PF: supervisor read access in kernel mode
[   24.772248] #PF: error_code(0x0000) - not-present page
[   24.772404] PGD 0 P4D 0
[   24.772499] Oops: Oops: 0000 [#1] SMP PTI
[   24.772614] CPU: 9 UID: 0 PID: 0 Comm: swapper/9 [..] #74 PREEMPT(voluntary)
[   24.772932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), [...]
[   24.773149] Sched_ext: maximal (disabling)
[   24.773944] RSP: 0018:ffffb162c0348ee0 EFLAGS: 00010046
[   24.774100] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88d4412f1800
[   24.774302] RDX: 0000000000000001 RSI: 0000000000000010 RDI: ffffffffac939240
[   24.774498] RBP: ffff88d47e65b940 R08: 0000000000000010 R09: 00000008bad3370a
[   24.774742] R10: 0000000000000000 R11: ffffffffa9f159d0 R12: ffff88d47e65b900
[   24.774962] R13: ffff88d47e65b960 R14: ffff88d47e66a340 R15: ffff88d47e66aed0
[   24.775182] FS:  0000000000000000(0000) GS:ffff88d4d1d56000(0000) knlGS:[...]
[   24.775392] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   24.775579] CR2: 000000000000006c CR3: 0000000002bb0003 CR4: 0000000000770ef0
[   24.775810] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   24.776023] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   24.776225] PKRU: 55555554
[   24.776292] Call Trace:
[   24.776373]  <IRQ>
[   24.776453]  ? __pfx_inactive_task_timer+0x10/0x10
[   24.776591]  __hrtimer_run_queues+0xf1/0x270
[   24.776744]  hrtimer_interrupt+0xfa/0x220
[   24.776847]  __sysvec_apic_timer_interrupt+0x4d/0x190
[   24.776988]  sysvec_apic_timer_interrupt+0x69/0x80
[   24.777132]  </IRQ>
[   24.777194]  <TASK>
[   24.777256]  asm_sysvec_apic_timer_interrupt+0x1a/0x20

Fix this by also checking the DL server's ->server_has_tasks pointer,
which is only set for server entities. This resolves the crash.
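
A minimal sketch of the resulting check (hypothetical helper, only for
illustration; the actual change is in the diff below):

static inline bool dl_se_is_server(struct sched_dl_entity *dl_se)
{
	/*
	 * dl_server_apply_params() may have cleared dl_se->dl_server while
	 * the inactive timer was already firing. ->server_has_tasks is only
	 * ever set for server entities, so use it as a second, stable
	 * indication that this dl_se belongs to a DL server.
	 */
	return dl_server(dl_se) || dl_se->server_has_tasks;
}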

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/deadline.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b744187ec6372..84c7172ee805c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1807,7 +1807,13 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 	struct rq_flags rf;
 	struct rq *rq;
 
-	if (!dl_server(dl_se)) {
+	/*
+	 * It is possible that after dl_server_apply_params() dl_se->dl_server is 0,
+	 * while the inactive timer is still queued and could not be canceled. Double
+	 * check ->server_has_tasks to make sure we are really dealing with a
+	 * non-server here, otherwise p may be bogus and we will crash.
+	 */
+	if (!dl_server(dl_se) && !dl_se->server_has_tasks) {
 		p = dl_task_of(dl_se);
 		rq = task_rq_lock(p, &rf);
 	} else {
@@ -1818,7 +1824,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 	sched_clock_tick();
 	update_rq_clock(rq);
 
-	if (dl_server(dl_se))
+	if (dl_server(dl_se) || dl_se->server_has_tasks)
 		goto no_task;
 
 	if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
@@ -1846,7 +1852,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 	dl_se->dl_non_contending = 0;
 unlock:
 
-	if (!dl_server(dl_se)) {
+	if (!dl_server(dl_se) && !dl_se->server_has_tasks) {
 		task_rq_unlock(rq, p, &rf);
 		put_task_struct(p);
 	} else {
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 14/16] sched/deadline: De-couple balance and pick_task
  2025-09-03  9:33 [PATCHSET v8 sched_ext/for-6.18] Add a deadline server for sched_ext tasks Andrea Righi
                   ` (12 preceding siblings ...)
  2025-09-03  9:33 ` [PATCH 13/16] sched/deadline: Fix DL server crash in inactive_timer callback Andrea Righi
@ 2025-09-03  9:33 ` Andrea Righi
  2025-09-03  9:33 ` [PATCH 15/16] selftests/sched_ext: Add test for sched_ext dl_server Andrea Righi
  2025-09-03  9:33 ` [PATCH 16/16] selftests/sched_ext: Add test for DL server total_bw consistency Andrea Righi
  15 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2025-09-03  9:33 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

From: Tejun Heo <tj@kernel.org>

Allow a dl_server to trigger ->balance() from balance_dl() for sched
classes that are always expecting a ->balance() call before
->pick_task(), e.g. sched_ext.

[ arighi:
    - adjust patch after dropping @rf from pick_task()
    - update dl_server_init() to take an additional @balance parameter
    - activate DL server balance only if there's any pending work ]

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched.h   |  2 ++
 kernel/sched/core.c     | 14 +++++++++++---
 kernel/sched/deadline.c | 16 ++++++++++------
 kernel/sched/ext.c      | 17 ++++++++++-------
 kernel/sched/fair.c     |  2 +-
 kernel/sched/sched.h    |  8 +++++++-
 6 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b272382673d6..aa3ae42da51a9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -635,6 +635,7 @@ struct sched_rt_entity {
 } __randomize_layout;
 
 typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
+typedef void (*dl_server_balance_f)(struct sched_dl_entity *, void *);
 typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
 
 struct sched_dl_entity {
@@ -734,6 +735,7 @@ struct sched_dl_entity {
 	 */
 	struct rq			*rq;
 	dl_server_has_tasks_f		server_has_tasks;
+	dl_server_balance_f		server_balance;
 	dl_server_pick_f		server_pick_task;
 
 #ifdef CONFIG_RT_MUTEXES
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1a7ad7e560fb..3c2863d961f38 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5950,14 +5950,22 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
 
 #ifdef CONFIG_SCHED_CLASS_EXT
 	/*
-	 * SCX requires a balance() call before every pick_task() including when
-	 * waking up from SCHED_IDLE. If @start_class is below SCX, start from
-	 * SCX instead. Also, set a flag to detect missing balance() call.
+	 * SCX requires a balance() call before every pick_task() including
+	 * when waking up from SCHED_IDLE.
+	 *
+	 * If @start_class is below SCX, start balancing from SCX. If the
+	 * DL server has any pending work, start from the DL class instead.
+	 * This ensures the DL server is given a chance to trigger its own
+	 * balance() pass on every prev_balance() invocation.
+	 *
+	 * Also, set a flag to detect missing balance() call.
 	 */
 	if (scx_enabled()) {
 		rq->scx.flags |= SCX_RQ_BAL_PENDING;
 		if (sched_class_above(&ext_sched_class, start_class))
 			start_class = &ext_sched_class;
+		if (on_dl_rq(&rq->ext_server))
+			start_class = &dl_sched_class;
 	}
 #endif
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 84c7172ee805c..1f79b1e49b49c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -88,11 +88,6 @@ static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
 	return &rq_of_dl_se(dl_se)->dl;
 }
 
-static inline int on_dl_rq(struct sched_dl_entity *dl_se)
-{
-	return !RB_EMPTY_NODE(&dl_se->rb_node);
-}
-
 #ifdef CONFIG_RT_MUTEXES
 static inline struct sched_dl_entity *pi_of(struct sched_dl_entity *dl_se)
 {
@@ -1650,11 +1645,13 @@ static bool dl_server_stopped(struct sched_dl_entity *dl_se)
 
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
-		    dl_server_pick_f pick_task)
+		    dl_server_pick_f pick_task,
+		    dl_server_balance_f balance)
 {
 	dl_se->rq = rq;
 	dl_se->server_has_tasks = has_tasks;
 	dl_se->server_pick_task = pick_task;
+	dl_se->server_balance = balance;
 }
 
 void sched_init_dl_servers(void)
@@ -2349,8 +2346,12 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	resched_curr(rq);
 }
 
+static struct sched_dl_entity *pick_next_dl_entity(struct dl_rq *dl_rq);
+
 static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
+	struct sched_dl_entity *dl_se;
+
 	if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
@@ -2363,6 +2364,9 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 		rq_repin_lock(rq, rf);
 	}
 
+	dl_se = pick_next_dl_entity(&rq->dl);
+	if (dl_se && dl_server(dl_se) && dl_se->server_balance)
+		dl_se->server_balance(dl_se, rf);
 	return sched_stop_runnable(rq) || sched_dl_runnable(rq);
 }
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 69163927a29cd..e6d84b9aa70dc 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7715,16 +7715,19 @@ static bool ext_server_has_tasks(struct sched_dl_entity *dl_se)
 	return !!dl_se->rq->scx.nr_running;
 }
 
-/*
- * Select the next task to run from the ext scheduling class.
- */
-static struct task_struct *ext_server_pick_task(struct sched_dl_entity *dl_se,
-						void *flags)
+static void ext_server_balance(struct sched_dl_entity *dl_se, void *flags)
 {
 	struct rq_flags *rf = flags;
 
 	balance_scx(dl_se->rq, dl_se->rq->curr, rf);
-	return pick_task_scx(dl_se->rq, rf);
+}
+
+/*
+ * Select the next task to run from the ext scheduling class.
+ */
+static struct task_struct *ext_server_pick_task(struct sched_dl_entity *dl_se)
+{
+	return pick_task_scx(dl_se->rq);
 }
 
 /*
@@ -7736,7 +7739,7 @@ void ext_server_init(struct rq *rq)
 
 	init_dl_entity(dl_se);
 
-	dl_server_init(dl_se, rq, ext_server_has_tasks, ext_server_pick_task);
+	dl_server_init(dl_se, rq, ext_server_has_tasks, ext_server_pick_task, ext_server_balance);
 }
 
 static const struct btf_kfunc_id_set scx_kfunc_set_any = {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7573baca9a85a..0c16944d43db8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8875,7 +8875,7 @@ void fair_server_init(struct rq *rq)
 
 	init_dl_entity(dl_se);
 
-	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_task);
+	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_task, NULL);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1fbf4ffbcb208..a8615bdd6bdfa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -384,7 +384,8 @@ extern void dl_server_start(struct sched_dl_entity *dl_se);
 extern void dl_server_stop(struct sched_dl_entity *dl_se);
 extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
-		    dl_server_pick_f pick_task);
+		    dl_server_pick_f pick_task,
+		    dl_server_balance_f balance);
 extern void sched_init_dl_servers(void);
 
 extern void dl_server_update_idle_time(struct rq *rq,
@@ -403,6 +404,11 @@ static inline bool dl_server_active(struct sched_dl_entity *dl_se)
 	return dl_se->dl_server_active;
 }
 
+static inline int on_dl_rq(struct sched_dl_entity *dl_se)
+{
+	return !RB_EMPTY_NODE(&dl_se->rb_node);
+}
+
 #ifdef CONFIG_CGROUP_SCHED
 
 extern struct list_head task_groups;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 15/16] selftests/sched_ext: Add test for sched_ext dl_server
  2025-09-03  9:33 [PATCHSET v8 sched_ext/for-6.18] Add a deadline server for sched_ext tasks Andrea Righi
                   ` (13 preceding siblings ...)
  2025-09-03  9:33 ` [PATCH 14/16] sched/deadline: De-couple balance and pick_task Andrea Righi
@ 2025-09-03  9:33 ` Andrea Righi
  2025-09-03  9:33 ` [PATCH 16/16] selftests/sched_ext: Add test for DL server total_bw consistency Andrea Righi
  15 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2025-09-03  9:33 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

Add a selftest to validate the correct behavior of the deadline server
for the ext_sched_class.

[ Joel: Replaced occurrences of CFS in the test with EXT. ]

Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 tools/testing/selftests/sched_ext/Makefile    |   1 +
 .../selftests/sched_ext/rt_stall.bpf.c        |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c  | 214 ++++++++++++++++++
 3 files changed, 238 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c

diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index 9d9d6b4c38b01..f0a8cba3a99f1 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -182,6 +182,7 @@ auto-test-targets :=			\
 	select_cpu_dispatch_bad_dsq	\
 	select_cpu_dispatch_dbl_dsp	\
 	select_cpu_vtime		\
+	rt_stall			\
 	test_example			\
 
 testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
diff --git a/tools/testing/selftests/sched_ext/rt_stall.bpf.c b/tools/testing/selftests/sched_ext/rt_stall.bpf.c
new file mode 100644
index 0000000000000..80086779dd1eb
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/rt_stall.bpf.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A scheduler that verifies whether RT tasks can stall SCHED_EXT tasks.
+ *
+ * Copyright (c) 2025 NVIDIA Corporation.
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+void BPF_STRUCT_OPS(rt_stall_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops rt_stall_ops = {
+	.exit			= (void *)rt_stall_exit,
+	.name			= "rt_stall",
+};
diff --git a/tools/testing/selftests/sched_ext/rt_stall.c b/tools/testing/selftests/sched_ext/rt_stall.c
new file mode 100644
index 0000000000000..e9a0def9ee323
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/rt_stall.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2025 NVIDIA Corporation.
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sched.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <linux/sched.h>
+#include <signal.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "rt_stall.bpf.skel.h"
+#include "scx_test.h"
+#include "../kselftest.h"
+
+#define CORE_ID		0	/* CPU to pin tasks to */
+#define RUN_TIME        5	/* How long to run the test in seconds */
+
+/* Simple busy-wait function for test tasks */
+static void process_func(void)
+{
+	while (1) {
+		/* Busy wait */
+		for (volatile unsigned long i = 0; i < 10000000UL; i++)
+			;
+	}
+}
+
+/* Set CPU affinity to a specific core */
+static void set_affinity(int cpu)
+{
+	cpu_set_t mask;
+
+	CPU_ZERO(&mask);
+	CPU_SET(cpu, &mask);
+	if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
+		perror("sched_setaffinity");
+		exit(EXIT_FAILURE);
+	}
+}
+
+/* Set task scheduling policy and priority */
+static void set_sched(int policy, int priority)
+{
+	struct sched_param param;
+
+	param.sched_priority = priority;
+	if (sched_setscheduler(0, policy, &param) != 0) {
+		perror("sched_setscheduler");
+		exit(EXIT_FAILURE);
+	}
+}
+
+/* Get process runtime from /proc/<pid>/stat */
+static float get_process_runtime(int pid)
+{
+	char path[256];
+	FILE *file;
+	long utime, stime;
+	int fields;
+
+	snprintf(path, sizeof(path), "/proc/%d/stat", pid);
+	file = fopen(path, "r");
+	if (file == NULL) {
+		perror("Failed to open stat file");
+		return -1;
+	}
+
+	/* Skip the first 13 fields and read the 14th and 15th */
+	fields = fscanf(file,
+			"%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
+			&utime, &stime);
+	fclose(file);
+
+	if (fields != 2) {
+		fprintf(stderr, "Failed to read stat file\n");
+		return -1;
+	}
+
+	/* Calculate the total time spent in the process */
+	long total_time = utime + stime;
+	long ticks_per_second = sysconf(_SC_CLK_TCK);
+	float runtime_seconds = total_time * 1.0 / ticks_per_second;
+
+	return runtime_seconds;
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct rt_stall *skel;
+
+	skel = rt_stall__open();
+	SCX_FAIL_IF(!skel, "Failed to open");
+	SCX_ENUM_INIT(skel);
+	SCX_FAIL_IF(rt_stall__load(skel), "Failed to load skel");
+
+	*ctx = skel;
+
+	return SCX_TEST_PASS;
+}
+
+static bool sched_stress_test(void)
+{
+	float cfs_runtime, rt_runtime, actual_ratio;
+	int cfs_pid, rt_pid;
+	float expected_min_ratio = 0.04; /* 4% */
+
+	ksft_print_header();
+	ksft_set_plan(1);
+
+	/* Create and set up an EXT task */
+	cfs_pid = fork();
+	if (cfs_pid == 0) {
+		set_affinity(CORE_ID);
+		process_func();
+		exit(0);
+	} else if (cfs_pid < 0) {
+		perror("fork for EXT task");
+		ksft_exit_fail();
+	}
+
+	/* Create an RT task */
+	rt_pid = fork();
+	if (rt_pid == 0) {
+		set_affinity(CORE_ID);
+		set_sched(SCHED_FIFO, 50);
+		process_func();
+		exit(0);
+	} else if (rt_pid < 0) {
+		perror("fork for RT task");
+		ksft_exit_fail();
+	}
+
+	/* Let the processes run for the specified time */
+	sleep(RUN_TIME);
+
+	/* Get runtime for the EXT task */
+	cfs_runtime = get_process_runtime(cfs_pid);
+	if (cfs_runtime != -1)
+		ksft_print_msg("Runtime of EXT task (PID %d) is %f seconds\n",
+			       cfs_pid, cfs_runtime);
+	else
+		ksft_exit_fail_msg("Error getting runtime for EXT task (PID %d)\n", cfs_pid);
+
+	/* Get runtime for the RT task */
+	rt_runtime = get_process_runtime(rt_pid);
+	if (rt_runtime != -1)
+		ksft_print_msg("Runtime of RT task (PID %d) is %f seconds\n", rt_pid, rt_runtime);
+	else
+		ksft_exit_fail_msg("Error getting runtime for RT task (PID %d)\n", rt_pid);
+
+	/* Kill the processes */
+	kill(cfs_pid, SIGKILL);
+	kill(rt_pid, SIGKILL);
+	waitpid(cfs_pid, NULL, 0);
+	waitpid(rt_pid, NULL, 0);
+
+	/* Verify that the scx task got enough runtime */
+	actual_ratio = cfs_runtime / (cfs_runtime + rt_runtime);
+	ksft_print_msg("EXT task got %.2f%% of total runtime\n", actual_ratio * 100);
+
+	if (actual_ratio >= expected_min_ratio) {
+		ksft_test_result_pass("PASS: EXT task got more than %.2f%% of runtime\n",
+				      expected_min_ratio * 100);
+		return true;
+	}
+	ksft_test_result_fail("FAIL: EXT task got less than %.2f%% of runtime\n",
+			      expected_min_ratio * 100);
+	return false;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct rt_stall *skel = ctx;
+	struct bpf_link *link;
+	bool res;
+
+	link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops);
+	SCX_FAIL_IF(!link, "Failed to attach scheduler");
+
+	res = sched_stress_test();
+
+	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE));
+	bpf_link__destroy(link);
+
+	if (!res)
+		ksft_exit_fail();
+
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct rt_stall *skel = ctx;
+
+	rt_stall__destroy(skel);
+}
+
+struct scx_test rt_stall = {
+	.name = "rt_stall",
+	.description = "Verify that RT tasks cannot stall SCHED_EXT tasks",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&rt_stall)
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 16/16] selftests/sched_ext: Add test for DL server total_bw consistency
  2025-09-03  9:33 [PATCHSET v8 sched_ext/for-6.18] Add a deadline server for sched_ext tasks Andrea Righi
                   ` (14 preceding siblings ...)
  2025-09-03  9:33 ` [PATCH 15/16] selftests/sched_ext: Add test for sched_ext dl_server Andrea Righi
@ 2025-09-03  9:33 ` Andrea Righi
  15 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2025-09-03  9:33 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan
  Cc: sched-ext, bpf, linux-kernel

From: Joel Fernandes <joelagnelf@nvidia.com>

Add a new kselftest to verify that the total_bw value in
/sys/kernel/debug/sched/debug remains consistent across all CPUs
under different sched_ext BPF program states:

1. Before a BPF scheduler is loaded
2. While a BPF scheduler is loaded and active
3. After a BPF scheduler is unloaded

The test runs CPU stress threads to ensure DL server bandwidth
values stabilize before checking consistency. This helps catch
potential issues with DL server bandwidth accounting during
sched_ext transitions.

[ arighi: small coding style fixes ]

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 tools/testing/selftests/sched_ext/Makefile   |   1 +
 tools/testing/selftests/sched_ext/total_bw.c | 281 +++++++++++++++++++
 2 files changed, 282 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/total_bw.c

diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index f0a8cba3a99f1..d48be158b0a1b 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -184,6 +184,7 @@ auto-test-targets :=			\
 	select_cpu_vtime		\
 	rt_stall			\
 	test_example			\
+	total_bw			\
 
 testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
 
diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
new file mode 100644
index 0000000000000..740c90a6ceab8
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/total_bw.c
@@ -0,0 +1,281 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test to verify that total_bw value remains consistent across all CPUs
+ * in different BPF program states.
+ *
+ * Copyright (C) 2025 Nvidia Corporation.
+ */
+#include <bpf/bpf.h>
+#include <errno.h>
+#include <pthread.h>
+#include <scx/common.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "minimal.bpf.skel.h"
+#include "scx_test.h"
+
+#define MAX_CPUS 512
+#define STRESS_DURATION_SEC 5
+
+struct total_bw_ctx {
+	struct minimal *skel;
+	long baseline_bw[MAX_CPUS];
+	int nr_cpus;
+};
+
+static void *cpu_stress_thread(void *arg)
+{
+	volatile int i;
+	time_t end_time = time(NULL) + STRESS_DURATION_SEC;
+
+	while (time(NULL) < end_time)
+		for (i = 0; i < 1000000; i++)
+			;
+
+	return NULL;
+}
+
+/*
+ * The first enqueue on a CPU starts that CPU's DL server, so run
+ * stressor threads in the hope that they get scheduled on all CPUs.
+ */
+static int run_cpu_stress(int nr_cpus)
+{
+	pthread_t *threads;
+	int i, ret = 0;
+
+	threads = calloc(nr_cpus, sizeof(pthread_t));
+	if (!threads)
+		return -ENOMEM;
+
+	/* Create threads to run on each CPU */
+	for (i = 0; i < nr_cpus; i++) {
+		if (pthread_create(&threads[i], NULL, cpu_stress_thread, NULL)) {
+			ret = -errno;
+			fprintf(stderr, "Failed to create thread %d: %s\n", i, strerror(-ret));
+			break;
+		}
+	}
+
+	/* Wait for all threads to complete */
+	for (i = 0; i < nr_cpus; i++) {
+		if (threads[i])
+			pthread_join(threads[i], NULL);
+	}
+
+	free(threads);
+	return ret;
+}
+
+static int read_total_bw_values(long *bw_values, int max_cpus)
+{
+	FILE *fp;
+	char line[256];
+	int cpu_count = 0;
+
+	fp = fopen("/sys/kernel/debug/sched/debug", "r");
+	if (!fp) {
+		SCX_ERR("Failed to open debug file");
+		return -1;
+	}
+
+	while (fgets(line, sizeof(line), fp)) {
+		char *bw_str = strstr(line, "total_bw");
+
+		if (bw_str) {
+			bw_str = strchr(bw_str, ':');
+			if (bw_str) {
+				/* Only store up to max_cpus values */
+				if (cpu_count < max_cpus)
+					bw_values[cpu_count] = atol(bw_str + 1);
+				cpu_count++;
+			}
+		}
+	}
+
+	fclose(fp);
+	return cpu_count;
+}
+
+static bool verify_total_bw_consistency(long *bw_values, int count)
+{
+	int i;
+	long first_value;
+
+	if (count <= 0)
+		return false;
+
+	first_value = bw_values[0];
+
+	for (i = 1; i < count; i++) {
+		if (bw_values[i] != first_value) {
+			SCX_ERR("Inconsistent total_bw: CPU0=%ld, CPU%d=%ld",
+				first_value, i, bw_values[i]);
+			return false;
+		}
+	}
+
+	return true;
+}
+
+static int fetch_verify_total_bw(long *bw_values, int nr_cpus)
+{
+	int attempts = 0;
+	int max_attempts = 10;
+	int count;
+
+	/*
+	 * The first enqueue on a CPU starts that CPU's DL server, so run
+	 * stressor threads in the hope that they get scheduled on all CPUs.
+	 */
+	if (run_cpu_stress(nr_cpus) < 0) {
+		SCX_ERR("Failed to run CPU stress");
+		return -1;
+	}
+
+	/* Try multiple times to get stable values */
+	while (attempts < max_attempts) {
+		count = read_total_bw_values(bw_values, nr_cpus);
+		fprintf(stderr, "Read %d total_bw values (testing %d CPUs)\n", count, nr_cpus);
+		/* If system has more CPUs than we're testing, that's OK */
+		if (count < nr_cpus) {
+			SCX_ERR("Expected at least %d CPUs, got %d", nr_cpus, count);
+			attempts++;
+			sleep(1);
+			continue;
+		}
+
+		/* Only verify the CPUs we're testing */
+		if (verify_total_bw_consistency(bw_values, nr_cpus)) {
+			fprintf(stderr, "Values are consistent: %ld\n", bw_values[0]);
+			return 0;
+		}
+
+		attempts++;
+		sleep(1);
+	}
+
+	return -1;
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct total_bw_ctx *test_ctx;
+
+	if (access("/sys/kernel/debug/sched/debug", R_OK) != 0) {
+		fprintf(stderr, "Skipping test: debugfs sched/debug not accessible\n");
+		return SCX_TEST_SKIP;
+	}
+
+	test_ctx = calloc(1, sizeof(*test_ctx));
+	if (!test_ctx)
+		return SCX_TEST_FAIL;
+
+	test_ctx->nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+	if (test_ctx->nr_cpus <= 0) {
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	/* If system has more CPUs than MAX_CPUS, just test the first MAX_CPUS */
+	if (test_ctx->nr_cpus > MAX_CPUS)
+		test_ctx->nr_cpus = MAX_CPUS;
+
+	/* Test scenario 1: BPF program not loaded */
+	/* Read and verify baseline total_bw before loading BPF program */
+	fprintf(stderr, "BPF prog initially not loaded, reading total_bw values\n");
+	if (fetch_verify_total_bw(test_ctx->baseline_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable baseline values");
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	/* Load the BPF skeleton */
+	test_ctx->skel = minimal__open();
+	if (!test_ctx->skel) {
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	SCX_ENUM_INIT(test_ctx->skel);
+	if (minimal__load(test_ctx->skel)) {
+		minimal__destroy(test_ctx->skel);
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	*ctx = test_ctx;
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct total_bw_ctx *test_ctx = ctx;
+	struct bpf_link *link;
+	long loaded_bw[MAX_CPUS];
+	long unloaded_bw[MAX_CPUS];
+	int i;
+
+	/* Test scenario 2: BPF program loaded */
+	link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops);
+	if (!link) {
+		SCX_ERR("Failed to attach scheduler");
+		return SCX_TEST_FAIL;
+	}
+
+	fprintf(stderr, "BPF program loaded, reading total_bw values\n");
+	if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values with BPF loaded");
+		bpf_link__destroy(link);
+		return SCX_TEST_FAIL;
+	}
+	bpf_link__destroy(link);
+
+	/* Test scenario 3: BPF program unloaded */
+	fprintf(stderr, "BPF program unloaded, reading total_bw values\n");
+	if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values after BPF unload");
+		return SCX_TEST_FAIL;
+	}
+
+	/* Verify all three scenarios have the same total_bw values */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (test_ctx->baseline_bw[i] != loaded_bw[i]) {
+			SCX_ERR("CPU%d: baseline_bw=%ld != loaded_bw=%ld",
+				i, test_ctx->baseline_bw[i], loaded_bw[i]);
+			return SCX_TEST_FAIL;
+		}
+
+		if (test_ctx->baseline_bw[i] != unloaded_bw[i]) {
+			SCX_ERR("CPU%d: baseline_bw=%ld != unloaded_bw=%ld",
+				i, test_ctx->baseline_bw[i], unloaded_bw[i]);
+			return SCX_TEST_FAIL;
+		}
+	}
+
+	fprintf(stderr, "All total_bw values are consistent across all scenarios\n");
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct total_bw_ctx *test_ctx = ctx;
+
+	if (test_ctx) {
+		if (test_ctx->skel)
+			minimal__destroy(test_ctx->skel);
+		free(test_ctx);
+	}
+}
+
+struct scx_test total_bw = {
+	.name = "total_bw",
+	.description = "Verify total_bw consistency across BPF program states",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&total_bw)
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 03/16] sched/debug: Stop and start server based on if it was active
  2025-09-03  9:33 ` [PATCH 03/16] sched/debug: Stop and start server based on if it was active Andrea Righi
@ 2025-09-03 14:43   ` Juri Lelli
  2025-09-03 15:02     ` Andrea Righi
  0 siblings, 1 reply; 34+ messages in thread
From: Juri Lelli @ 2025-09-03 14:43 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel

Hi,

On 03/09/25 11:33, Andrea Righi wrote:
> From: Joel Fernandes <joelagnelf@nvidia.com>
> 
> Currently the DL server interface for applying parameters checks
> CFS-internals to identify if the server is active. This is error-prone
> and makes it difficult when adding new servers in the future.
> 
> Fix it, by using dl_server_active() which is also used by the DL server
> code to determine if the DL server was started.
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
>  kernel/sched/debug.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index dbe2aee8628ce..e71f6618c1a6a 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -354,6 +354,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
>  		return err;
>  
>  	scoped_guard (rq_lock_irqsave, rq) {
> +		bool is_active;
> +
>  		runtime  = rq->fair_server.dl_runtime;
>  		period = rq->fair_server.dl_period;
>  
> @@ -376,7 +378,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
>  			return  -EINVAL;
>  		}
>  
> -		if (rq->cfs.h_nr_queued) {
> +		is_active = dl_server_active(&rq->fair_server);
> +		if (is_active) {
>  			update_rq_clock(rq);
>  			dl_server_stop(&rq->fair_server);
>  		}

I believe this chunk will unfortunately conflict with bb4700adc3ab
("sched/deadline: Always stop dl-server before changing parameters"),
but it should be an easy fix. :)

Thanks,
Juri


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/16] sched/deadline: Clear the defer params
  2025-09-03  9:33 ` [PATCH 04/16] sched/deadline: Clear the defer params Andrea Righi
@ 2025-09-03 14:44   ` Juri Lelli
  0 siblings, 0 replies; 34+ messages in thread
From: Juri Lelli @ 2025-09-03 14:44 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel

Hi,

On 03/09/25 11:33, Andrea Righi wrote:
> From: Joel Fernandes <joelagnelf@nvidia.com>
> 
> The defer params were not cleared in __dl_clear_params. Clear them.
> 
> Without this, some of my test cases are flaking and the DL timer is
> not starting correctly AFAICS.
> 
> Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
> Reviewed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>

Acked-by: Juri Lelli <juri.lelli@redhat.com>

Thanks!
Juri


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/16] sched/deadline: Return EBUSY if dl_bw_cpus is zero
  2025-09-03  9:33 ` [PATCH 05/16] sched/deadline: Return EBUSY if dl_bw_cpus is zero Andrea Righi
@ 2025-09-03 14:53   ` Juri Lelli
  2025-09-03 15:10     ` Andrea Righi
  2025-09-03 20:05     ` Peter Zijlstra
  0 siblings, 2 replies; 34+ messages in thread
From: Juri Lelli @ 2025-09-03 14:53 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel, Luca Abeni, Yuri Andriaccio

Hi,

On 03/09/25 11:33, Andrea Righi wrote:
> From: Joel Fernandes <joelagnelf@nvidia.com>
> 
> Hotplugged CPUs coming online do an enqueue but are not a part of any
> root domain containing cpu_active() CPUs. So in this case, don't mess
> with accounting and we can retry later. Without this patch, we see
> crashes with sched_ext selftest's hotplug test due to divide by zero.
> 
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
>  kernel/sched/deadline.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 3c478a1b2890d..753e50b1e86fc 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1689,7 +1689,12 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
>  	cpus = dl_bw_cpus(cpu);
>  	cap = dl_bw_capacity(cpu);
>  
> -	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
> +	/*
> +	 * Hotplugged CPUs coming online do an enqueue but are not a part of any
> +	 * root domain containing cpu_active() CPUs. So in this case, don't mess
> +	 * with accounting and we can retry later.
> +	 */
> +	if (!cpus || __dl_overflow(dl_b, cap, old_bw, new_bw))
>  		return -EBUSY;
>  
>  	if (init) {

Yuri is proposing to ignore dl-servers bandwidth contribution from
admission control (as they essentially operate on the remaining
bandwidth portion not available to RT/DEADLINE tasks):

https://lore.kernel.org/lkml/20250903114448.664452-1-yurand2000@gmail.com/

His patch should make this patch not required. Would you be able and
willing to test this assumption?

I don't believe Peter already expressed his opinion on what Yuri is
proposing, so this might be moot. But if we go that way, all dl-servers
should share that non-RT portion of bandwidth, I would guess. And we will
probably need to add checks and subdivide among active dl-servers, won't
we? Peter, others, what do you think?

Thanks,
Juri


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 03/16] sched/debug: Stop and start server based on if it was active
  2025-09-03 14:43   ` Juri Lelli
@ 2025-09-03 15:02     ` Andrea Righi
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2025-09-03 15:02 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel

On Wed, Sep 03, 2025 at 04:43:19PM +0200, Juri Lelli wrote:
> Hi,
> 
> On 03/09/25 11:33, Andrea Righi wrote:
> > From: Joel Fernandes <joelagnelf@nvidia.com>
> > 
> > Currently the DL server interface for applying parameters checks
> > CFS-internals to identify if the server is active. This is error-prone
> > and makes it difficult when adding new servers in the future.
> > 
> > Fix it, by using dl_server_active() which is also used by the DL server
> > code to determine if the DL server was started.
> > 
> > Acked-by: Tejun Heo <tj@kernel.org>
> > Reviewed-by: Andrea Righi <arighi@nvidia.com>
> > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > ---
> >  kernel/sched/debug.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > index dbe2aee8628ce..e71f6618c1a6a 100644
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -354,6 +354,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> >  		return err;
> >  
> >  	scoped_guard (rq_lock_irqsave, rq) {
> > +		bool is_active;
> > +
> >  		runtime  = rq->fair_server.dl_runtime;
> >  		period = rq->fair_server.dl_period;
> >  
> > @@ -376,7 +378,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
> >  			return  -EINVAL;
> >  		}
> >  
> > -		if (rq->cfs.h_nr_queued) {
> > +		is_active = dl_server_active(&rq->fair_server);
> > +		if (is_active) {
> >  			update_rq_clock(rq);
> >  			dl_server_stop(&rq->fair_server);
> >  		}
> 
> I believe this chunk will unfortunately conflict with bb4700adc3ab
> ("sched/deadline: Always stop dl-server before changing parameters"),
> but it should be an easy fix. :)

Right, I also tested that in a separate branch, this patchset is rebased on
top of Tejun's branch that doesn't have bb4700adc3ab yet. But from a
sched_ext perspective everything seems to work fine either way.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/16] sched/deadline: Return EBUSY if dl_bw_cpus is zero
  2025-09-03 14:53   ` Juri Lelli
@ 2025-09-03 15:10     ` Andrea Righi
  2025-09-03 15:15       ` Juri Lelli
  2025-09-03 20:05     ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Andrea Righi @ 2025-09-03 15:10 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel, Luca Abeni, Yuri Andriaccio

On Wed, Sep 03, 2025 at 04:53:59PM +0200, Juri Lelli wrote:
> Hi,
> 
> On 03/09/25 11:33, Andrea Righi wrote:
> > From: Joel Fernandes <joelagnelf@nvidia.com>
> > 
> > Hotplugged CPUs coming online do an enqueue but are not a part of any
> > root domain containing cpu_active() CPUs. So in this case, don't mess
> > with accounting and we can retry later. Without this patch, we see
> > crashes with sched_ext selftest's hotplug test due to divide by zero.
> > 
> > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > ---
> >  kernel/sched/deadline.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 3c478a1b2890d..753e50b1e86fc 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -1689,7 +1689,12 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
> >  	cpus = dl_bw_cpus(cpu);
> >  	cap = dl_bw_capacity(cpu);
> >  
> > -	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
> > +	/*
> > +	 * Hotplugged CPUs coming online do an enqueue but are not a part of any
> > +	 * root domain containing cpu_active() CPUs. So in this case, don't mess
> > +	 * with accounting and we can retry later.
> > +	 */
> > +	if (!cpus || __dl_overflow(dl_b, cap, old_bw, new_bw))
> >  		return -EBUSY;
> >  
> >  	if (init) {
> 
> Yuri is proposing to ignore dl-servers bandwidth contribution from
> admission control (as they essentially operate on the remaining
> bandwidth portion not available to RT/DEADLINE tasks):
> 
> https://lore.kernel.org/lkml/20250903114448.664452-1-yurand2000@gmail.com/
> 
> His patch should make this patch not required. Would you be able and
> willing to test this assumption?

I'll run some tests with Yuri's patch applied and dropping this one (and we
may also need to drop "[PATCH 10/16] sched/deadline: Account ext server
bandwidth").

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/16] sched/deadline: Return EBUSY if dl_bw_cpus is zero
  2025-09-03 15:10     ` Andrea Righi
@ 2025-09-03 15:15       ` Juri Lelli
  2025-09-03 15:24         ` Andrea Righi
  0 siblings, 1 reply; 34+ messages in thread
From: Juri Lelli @ 2025-09-03 15:15 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel, Luca Abeni, Yuri Andriaccio

On 03/09/25 17:10, Andrea Righi wrote:
> On Wed, Sep 03, 2025 at 04:53:59PM +0200, Juri Lelli wrote:
> > Hi,
> > 
> > On 03/09/25 11:33, Andrea Righi wrote:
> > > From: Joel Fernandes <joelagnelf@nvidia.com>
> > > 
> > > Hotplugged CPUs coming online do an enqueue but are not a part of any
> > > root domain containing cpu_active() CPUs. So in this case, don't mess
> > > with accounting and we can retry later. Without this patch, we see
> > > crashes with sched_ext selftest's hotplug test due to divide by zero.
> > > 
> > > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > > ---
> > >  kernel/sched/deadline.c | 7 ++++++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > > index 3c478a1b2890d..753e50b1e86fc 100644
> > > --- a/kernel/sched/deadline.c
> > > +++ b/kernel/sched/deadline.c
> > > @@ -1689,7 +1689,12 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
> > >  	cpus = dl_bw_cpus(cpu);
> > >  	cap = dl_bw_capacity(cpu);
> > >  
> > > -	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
> > > +	/*
> > > +	 * Hotplugged CPUs coming online do an enqueue but are not a part of any
> > > +	 * root domain containing cpu_active() CPUs. So in this case, don't mess
> > > +	 * with accounting and we can retry later.
> > > +	 */
> > > +	if (!cpus || __dl_overflow(dl_b, cap, old_bw, new_bw))
> > >  		return -EBUSY;
> > >  
> > >  	if (init) {
> > 
> > Yuri is proposing to ignore dl-servers bandwidth contribution from
> > admission control (as they essentially operate on the remaining
> > bandwidth portion not available to RT/DEADLINE tasks):
> > 
> > https://lore.kernel.org/lkml/20250903114448.664452-1-yurand2000@gmail.com/
> > 
> > His patch should make this patch not required. Would you be able and
> > willing to test this assumption?
> 
> I'll run some tests with Yuri's patch applied and dropping this one (and we
> may also need to drop "[PATCH 10/16] sched/deadline: Account ext server
> bandwidth").

Please mind that Yuri's change is still under discussion! :))

I just wanted to mention it here as it might change how we account for
dl-servers if we decide to go that way.

Thanks,
Juri


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/16] sched/deadline: Return EBUSY if dl_bw_cpus is zero
  2025-09-03 15:15       ` Juri Lelli
@ 2025-09-03 15:24         ` Andrea Righi
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2025-09-03 15:24 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel, Luca Abeni, Yuri Andriaccio

On Wed, Sep 03, 2025 at 05:15:18PM +0200, Juri Lelli wrote:
> On 03/09/25 17:10, Andrea Righi wrote:
> > On Wed, Sep 03, 2025 at 04:53:59PM +0200, Juri Lelli wrote:
> > > Hi,
> > > 
> > > On 03/09/25 11:33, Andrea Righi wrote:
> > > > From: Joel Fernandes <joelagnelf@nvidia.com>
> > > > 
> > > > Hotplugged CPUs coming online do an enqueue but are not a part of any
> > > > root domain containing cpu_active() CPUs. So in this case, don't mess
> > > > with accounting and we can retry later. Without this patch, we see
> > > > crashes with sched_ext selftest's hotplug test due to divide by zero.
> > > > 
> > > > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > > > ---
> > > >  kernel/sched/deadline.c | 7 ++++++-
> > > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > > > index 3c478a1b2890d..753e50b1e86fc 100644
> > > > --- a/kernel/sched/deadline.c
> > > > +++ b/kernel/sched/deadline.c
> > > > @@ -1689,7 +1689,12 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
> > > >  	cpus = dl_bw_cpus(cpu);
> > > >  	cap = dl_bw_capacity(cpu);
> > > >  
> > > > -	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
> > > > +	/*
> > > > +	 * Hotplugged CPUs coming online do an enqueue but are not a part of any
> > > > +	 * root domain containing cpu_active() CPUs. So in this case, don't mess
> > > > +	 * with accounting and we can retry later.
> > > > +	 */
> > > > +	if (!cpus || __dl_overflow(dl_b, cap, old_bw, new_bw))
> > > >  		return -EBUSY;
> > > >  
> > > >  	if (init) {
> > > 
> > > Yuri is proposing to ignore dl-servers bandwidth contribution from
> > > admission control (as they essentially operate on the remaining
> > > bandwidth portion not available to RT/DEADLINE tasks):
> > > 
> > > https://lore.kernel.org/lkml/20250903114448.664452-1-yurand2000@gmail.com/
> > > 
> > > His patch should make this patch not required. Would you be able and
> > > willing to test this assumption?
> > 
> > I'll run some tests with Yuri's patch applied and dropping this one (and we
> > may also need to drop "[PATCH 10/16] sched/deadline: Account ext server
> > bandwidth").
> 
> Please mind that Yuri's change is still under discussion! :))
> 
> I just wanted to mention it here as it might change how we account for
> dl-servers if we decide to go that way.

That's fine, I've already done a quick test. :)

It seems to work (more or less), meaning that in case of RT/sched_ext
contention the sched_ext tasks seem to get the right amount of CPU
bandwidth (5%), but the total_bw kselftest is quite broken and it's always
reporting a bw value of 0... in any case, even if we go this way there's no
major disruption apparently.

-Andrea

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/16] sched_ext: Exit early on hotplug events during attach
  2025-09-03  9:33 ` [PATCH 01/16] sched_ext: Exit early on hotplug events during attach Andrea Righi
@ 2025-09-03 19:44   ` Tejun Heo
  2025-09-03 21:40     ` Andrea Righi
  0 siblings, 1 reply; 34+ messages in thread
From: Tejun Heo @ 2025-09-03 19:44 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, David Vernet, Changwoo Min,
	Shuah Khan, sched-ext, bpf, linux-kernel

Hello,

On Wed, Sep 03, 2025 at 11:33:27AM +0200, Andrea Righi wrote:
>  static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
> @@ -5627,11 +5630,15 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  		if (((void (**)(void))ops)[i])
>  			set_bit(i, sch->has_op);
>  
> -	check_hotplug_seq(sch, ops);
> -	scx_idle_update_selcpu_topology(ops);
> +	ret = check_hotplug_seq(sch, ops);
> +	if (!ret)
> +		scx_idle_update_selcpu_topology(ops);
>  
>  	cpus_read_unlock();
>  
> +	if (ret)
> +		goto err_disable;

The double testing is a bit jarring. Maybe just add cpus_read_unlock() in
the error block so that error return can take place right after
check_hotplug_seq()? Alternatively, create a new error jump target - e.g.
err_disable_unlock_cpus and share it between here and the init failure path?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 07/16] sched_ext: Add a DL server for sched_ext tasks
  2025-09-03  9:33 ` [PATCH 07/16] sched_ext: Add a DL server for sched_ext tasks Andrea Righi
@ 2025-09-03 19:54   ` Tejun Heo
  2025-09-03 20:08     ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Tejun Heo @ 2025-09-03 19:54 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, David Vernet, Changwoo Min,
	Shuah Khan, sched-ext, bpf, linux-kernel, Luigi De Matteis

Hello,

On Wed, Sep 03, 2025 at 11:33:33AM +0200, Andrea Righi wrote:
> +static struct task_struct *ext_server_pick_task(struct sched_dl_entity *dl_se,
> +						void *flags)
> +{
> +	struct rq_flags *rf = flags;
> +
> +	balance_scx(dl_se->rq, dl_se->rq->curr, rf);
> +	return pick_task_scx(dl_se->rq, rf);
> +}

I'm a bit confused. This series doesn't have prep patches to add @rf to
dl_server_pick_f. Is this the right patch?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/16] sched/deadline: Return EBUSY if dl_bw_cpus is zero
  2025-09-03 14:53   ` Juri Lelli
  2025-09-03 15:10     ` Andrea Righi
@ 2025-09-03 20:05     ` Peter Zijlstra
  2025-09-04  7:12       ` luca abeni
  2025-09-04  7:17       ` Juri Lelli
  1 sibling, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2025-09-03 20:05 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Andrea Righi, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel, Luca Abeni, Yuri Andriaccio

On Wed, Sep 03, 2025 at 04:53:59PM +0200, Juri Lelli wrote:
> Hi,
> 
> On 03/09/25 11:33, Andrea Righi wrote:
> > From: Joel Fernandes <joelagnelf@nvidia.com>
> > 
> > Hotplugged CPUs coming online do an enqueue but are not a part of any
> > root domain containing cpu_active() CPUs. So in this case, don't mess
> > with accounting and we can retry later. Without this patch, we see
> > crashes with sched_ext selftest's hotplug test due to divide by zero.
> > 
> > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > ---
> >  kernel/sched/deadline.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 3c478a1b2890d..753e50b1e86fc 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -1689,7 +1689,12 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
> >  	cpus = dl_bw_cpus(cpu);
> >  	cap = dl_bw_capacity(cpu);
> >  
> > -	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
> > +	/*
> > +	 * Hotplugged CPUs coming online do an enqueue but are not a part of any
> > +	 * root domain containing cpu_active() CPUs. So in this case, don't mess
> > +	 * with accounting and we can retry later.
> > +	 */
> > +	if (!cpus || __dl_overflow(dl_b, cap, old_bw, new_bw))
> >  		return -EBUSY;
> >  
> >  	if (init) {
> 
> Yuri is proposing to ignore dl-servers bandwidth contribution from
> admission control (as they essentially operate on the remaining
> bandwidth portion not available to RT/DEADLINE tasks):
> 
> https://lore.kernel.org/lkml/20250903114448.664452-1-yurand2000@gmail.com/
> 
> His patch should make this patch not required. Would you be able and
> willing to test this assumption?
> 
> I don't believe Peter already expressed his opinion on what Yuri is
> proposing, so this might be moot. 

Urgh, yeah, I don't like that at all. That reasoning makes no sense
whatsoever. That 5% is not lost time; that 5% is a very optimistic
'model' of otherwise unaccountable time like IRQ and random overheads.

Thinking you can give out 100% CPU time to a bandwidth limited group of
tasks is delusional.

Explicitly not accounting things that you *can* is just plain wrong. So
no, Yuri's thing is not going to go anywhere.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 07/16] sched_ext: Add a DL server for sched_ext tasks
  2025-09-03 19:54   ` Tejun Heo
@ 2025-09-03 20:08     ` Peter Zijlstra
  2025-09-03 20:41       ` Tejun Heo
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2025-09-03 20:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrea Righi, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, David Vernet, Changwoo Min,
	Shuah Khan, sched-ext, bpf, linux-kernel, Luigi De Matteis

On Wed, Sep 03, 2025 at 09:54:58AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 03, 2025 at 11:33:33AM +0200, Andrea Righi wrote:
> > +static struct task_struct *ext_server_pick_task(struct sched_dl_entity *dl_se,
> > +						void *flags)
> > +{
> > +	struct rq_flags *rf = flags;
> > +
> > +	balance_scx(dl_se->rq, dl_se->rq->curr, rf);
> > +	return pick_task_scx(dl_se->rq, rf);
> > +}
> 
> I'm a bit confused. This series doesn't have prep patches to add @rf to
> dl_server_pick_f. Is this the right patch?

Patch 14 seems to be the proposed alternative, and I'm not liking that
at all.

That rf passing was very much also needed for that other issue; I'm not
sure why that's gone away.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 07/16] sched_ext: Add a DL server for sched_ext tasks
  2025-09-03 20:08     ` Peter Zijlstra
@ 2025-09-03 20:41       ` Tejun Heo
  2025-09-03 20:56         ` Peter Zijlstra
  2025-09-03 21:33         ` Andrea Righi
  0 siblings, 2 replies; 34+ messages in thread
From: Tejun Heo @ 2025-09-03 20:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Righi, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, David Vernet, Changwoo Min,
	Shuah Khan, sched-ext, bpf, linux-kernel, Luigi De Matteis

Hello,

On Wed, Sep 03, 2025 at 10:08:22PM +0200, Peter Zijlstra wrote:
> > I'm a bit confused. This series doesn't have prep patches to add @rf to
> > dl_server_pick_f. Is this the right patch?
> 
> Patch 14 seems to be the proposed alternative, and I'm not liking that
> at all.
> 
> That rf passing was very much also needed for that other issue; I'm not
> sure why that's gone away.

Using balance() was my suggestion to stay within the current framework. If
we want to add @rf to pick_task(), that's a more fundamental change. We
dropped the discussion in the other thread but I found it odd to add @rf to
pick_task() while disallowing the use of @rf in non-dl-server pick path and
if we want to allow that, we gotta solve the race between pick_task()
dropping rq lock and the ttwu inserting high pri task.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 07/16] sched_ext: Add a DL server for sched_ext tasks
  2025-09-03 20:41       ` Tejun Heo
@ 2025-09-03 20:56         ` Peter Zijlstra
  2025-09-03 21:33         ` Andrea Righi
  1 sibling, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2025-09-03 20:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrea Righi, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, David Vernet, Changwoo Min,
	Shuah Khan, sched-ext, bpf, linux-kernel, Luigi De Matteis

On Wed, Sep 03, 2025 at 10:41:21AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 03, 2025 at 10:08:22PM +0200, Peter Zijlstra wrote:
> > > I'm a bit confused. This series doesn't have prep patches to add @rf to
> > > dl_server_pick_f. Is this the right patch?
> > 
> > Patch 14 seems to be the proposed alternative, and I'm not liking that
> > at all.
> > 
> > That rf passing was very much also needed for that other issue; I'm not
> > sure why that's gone away.
> 
> Using balance() was my suggestion to stay within the current framework. If
> we want to add @rf to pick_task(), that's a more fundamental change. We
> dropped the discussion in the other thread but I found it odd to add @rf to
> pick_task() while disallowing the use of @rf in non-dl-server pick path and
> if we want to allow that, we gotta solve the race between pick_task()
> dropping rq lock and the ttwu inserting high pri task.

I thought the idea was to add @rf unconditionally, dl-server or not; it
is needed in both cases.

Yes, that race needs dealing with. We have an existing pattern that
checks whether a higher class has runnable tasks and restarts the pick.
This is currently only done for pick_next_task_fair(), but it can easily
be extended.

You suggested maybe moving this to the ttwu side -- but up to this point
I thought we were in agreement. I'm not sure moving it to the ttwu side
makes things better; it would need ttwu to know that a pick is in
progress and for which class. The existing restart-the-pick approach is
simpler, I think.

Yes, the restart is somewhat more complicated if we want to deal with
the dl-server, but not terribly so. It could just store a snapshot of
rq->dl.dl_nr_running from before the pick and only restart if that went
up.
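
Roughly something like the below -- a hand-wavy sketch only, with a
made-up helper name (per this series the pick takes @rf), just to
illustrate the snapshot-and-restart idea:

static struct task_struct *pick_task_restartable(struct rq *rq,
						 struct rq_flags *rf)
{
	struct task_struct *p;
	unsigned int dl_nr;

again:
	/* Snapshot the higher-class runnable count before the pick. */
	dl_nr = rq->dl.dl_nr_running;

	/* The pick may drop and retake rq->lock... */
	p = pick_task_scx(rq, rf);

	/*
	 * ...so a concurrent ttwu() may have enqueued a DL task in the
	 * meantime; if so, redo the pick.
	 */
	if (rq->dl.dl_nr_running > dl_nr)
		goto again;

	return p;
}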

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 07/16] sched_ext: Add a DL server for sched_ext tasks
  2025-09-03 20:41       ` Tejun Heo
  2025-09-03 20:56         ` Peter Zijlstra
@ 2025-09-03 21:33         ` Andrea Righi
  1 sibling, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2025-09-03 21:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, David Vernet, Changwoo Min,
	Shuah Khan, sched-ext, bpf, linux-kernel, Luigi De Matteis

On Wed, Sep 03, 2025 at 10:41:21AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 03, 2025 at 10:08:22PM +0200, Peter Zijlstra wrote:
> > > I'm a bit confused. This series doesn't have prep patches to add @rf to
> > > dl_server_pick_f. Is this the right patch?
> > 
> > Patch 14 seems to be the proposed alternative, and I'm not liking that
> > at all.
> > 
> > That rf passing was very much also needed for that other issue; I'm not
> > sure why that's gone away.
> 
> Using balance() was my suggestion to stay within the current framework. If
> we want to add @rf to pick_task(), that's a more fundamental change. We
> dropped the discussion in the other thread, but I found it odd to add @rf
> to pick_task() while disallowing the use of @rf in the non-dl-server pick
> path, and if we want to allow that, we gotta solve the race between
> pick_task() dropping the rq lock and a ttwu() inserting a high-priority
> task.

Yeah, patch 14 is fixing this, but it needs to be changed, because we
dropped the patch that adds @rf to pick_task(). I'll fix it in the next
version if we decide to stick with this approach.

About balance() vs @rf: IIUC, after pick_task() drops the rq lock a
concurrent ttwu() can already enqueue a higher-priority task, so the race
isn't really specific to @rf; it's more about making sure we don't start
using @rf in ways that rely on the pick being stable until the actual
switch, right?

If that's correct, extending @rf to pick_task() wouldn't make things worse
than what we have, though sticking with balance() may still be the safer
incremental step...

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/16] sched_ext: Exit early on hotplug events during attach
  2025-09-03 19:44   ` Tejun Heo
@ 2025-09-03 21:40     ` Andrea Righi
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2025-09-03 21:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, David Vernet, Changwoo Min,
	Shuah Khan, sched-ext, bpf, linux-kernel

On Wed, Sep 03, 2025 at 09:44:32AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 03, 2025 at 11:33:27AM +0200, Andrea Righi wrote:
> >  static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
> > @@ -5627,11 +5630,15 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
> >  		if (((void (**)(void))ops)[i])
> >  			set_bit(i, sch->has_op);
> >  
> > -	check_hotplug_seq(sch, ops);
> > -	scx_idle_update_selcpu_topology(ops);
> > +	ret = check_hotplug_seq(sch, ops);
> > +	if (!ret)
> > +		scx_idle_update_selcpu_topology(ops);
> >  
> >  	cpus_read_unlock();
> >  
> > +	if (ret)
> > +		goto err_disable;
> 
> The double testing is a bit jarring. Maybe just add cpus_read_unlock() in
> the error block so that the error return can take place right after
> check_hotplug_seq()? Alternatively, create a new error jump target - e.g.
> err_disable_unlock_cpus - and share it between here and the init failure
> path?

Makes sense, I'll adjust this in the next version.
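
Something along these lines, I guess (a rough sketch only, reusing the
existing err_disable target and the label name you suggested):

	ret = check_hotplug_seq(sch, ops);
	if (ret)
		goto err_disable_unlock_cpus;

	scx_idle_update_selcpu_topology(ops);

	cpus_read_unlock();
	...

err_disable_unlock_cpus:
	cpus_read_unlock();
err_disable:
	...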

Actually, this patch doesn't necessarily need to be part of this series;
I'll probably submit it separately.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/16] sched/deadline: Return EBUSY if dl_bw_cpus is zero
  2025-09-03 20:05     ` Peter Zijlstra
@ 2025-09-04  7:12       ` luca abeni
  2025-09-04  7:17       ` Juri Lelli
  1 sibling, 0 replies; 34+ messages in thread
From: luca abeni @ 2025-09-04  7:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Andrea Righi, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Joel Fernandes, Tejun Heo, David Vernet,
	Changwoo Min, Shuah Khan, sched-ext, bpf, linux-kernel,
	Yuri Andriaccio

Hi Peter,

On Wed, 3 Sep 2025 22:05:20 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > Yuri is proposing to exclude the dl-servers' bandwidth contribution
> > from admission control (as they essentially operate on the remaining
> > bandwidth portion not available to RT/DEADLINE tasks):
> > 
> > https://lore.kernel.org/lkml/20250903114448.664452-1-yurand2000@gmail.com/
> > 
> > His patch should make this patch unnecessary. Would you be able and
> > willing to test this assumption?
> > 
> > I don't believe Peter has expressed his opinion yet on what Yuri is
> > proposing, so this might be moot.
> 
> Urgh, yeah, I don't like that at all. That reasoning makes no sense
> whatsoever. That 5% is not lost time; that 5% is being very
> optimistic and 'models' otherwise unaccountable time like IRQs and
> random overheads.
> 
> Thinking you can give out 100% CPU time to a bandwidth-limited group
> of tasks is delusional.
> 
> Explicitly not accounting things that you *can* account is just plain
> wrong. So no, Yuri's thing is not going to go anywhere.

The goal of Yuri's patch was not to avoid accounting things... The goal
was to avoid subtracting the fair dl server utilization from the
utilization reserved for real-time tasks (assuming that
/proc/sys/kernel/sched_rt_runtime_us / /proc/sys/kernel/sched_rt_period_us
represents the fraction of CPU time reserved for real-time tasks).

Maybe we made errors in describing the patch (or in some details of the
implementation), but the final goal was just to ensure that the
sched_rt_runtime_us/sched_rt_period_us fraction of CPU time goes to RT
tasks; the remaining fraction is shared by SCHED_OTHER tasks, fair dl
servers, IRQs, and other overhead (the fair dl server utilization can be
smaller than 1 - sched_rt_runtime_us/sched_rt_period_us, so some time can
be explicitly left for IRQs and the kernel).
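
To put numbers on it, with the default sysctl values:

	U_rt   = sched_rt_runtime_us / sched_rt_period_us
	       = 950000 / 1000000 = 0.95
	U_rest = 1 - U_rt = 0.05

so under this scheme RT tasks keep their 0.95, while SCHED_OTHER tasks,
fair dl servers, IRQs, and other overhead share the remaining 0.05.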


				Luca

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/16] sched/deadline: Return EBUSY if dl_bw_cpus is zero
  2025-09-03 20:05     ` Peter Zijlstra
  2025-09-04  7:12       ` luca abeni
@ 2025-09-04  7:17       ` Juri Lelli
  1 sibling, 0 replies; 34+ messages in thread
From: Juri Lelli @ 2025-09-04  7:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Righi, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Joel Fernandes, Tejun Heo, David Vernet, Changwoo Min, Shuah Khan,
	sched-ext, bpf, linux-kernel, Luca Abeni, Yuri Andriaccio

On 03/09/25 22:05, Peter Zijlstra wrote:
> On Wed, Sep 03, 2025 at 04:53:59PM +0200, Juri Lelli wrote:
> > Hi,
> > 
> > On 03/09/25 11:33, Andrea Righi wrote:
> > > From: Joel Fernandes <joelagnelf@nvidia.com>
> > > 
> > > Hotplugged CPUs coming online do an enqueue but are not a part of any
> > > root domain containing cpu_active() CPUs. So in this case, don't mess
> > > with accounting and we can retry later. Without this patch, we see
> > > crashes with sched_ext selftest's hotplug test due to divide by zero.
> > > 
> > > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > > ---
> > >  kernel/sched/deadline.c | 7 ++++++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > > index 3c478a1b2890d..753e50b1e86fc 100644
> > > --- a/kernel/sched/deadline.c
> > > +++ b/kernel/sched/deadline.c
> > > @@ -1689,7 +1689,12 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
> > >  	cpus = dl_bw_cpus(cpu);
> > >  	cap = dl_bw_capacity(cpu);
> > >  
> > > -	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
> > > +	/*
> > > +	 * Hotplugged CPUs coming online do an enqueue but are not a part of any
> > > +	 * root domain containing cpu_active() CPUs. So in this case, don't mess
> > > +	 * with accounting and we can retry later.
> > > +	 */
> > > +	if (!cpus || __dl_overflow(dl_b, cap, old_bw, new_bw))
> > >  		return -EBUSY;
> > >  
> > >  	if (init) {
> > 
> > Yuri is proposing to exclude the dl-servers' bandwidth contribution
> > from admission control (as they essentially operate on the remaining
> > bandwidth portion not available to RT/DEADLINE tasks):
> > 
> > https://lore.kernel.org/lkml/20250903114448.664452-1-yurand2000@gmail.com/
> > 
> > His patch should make this patch unnecessary. Would you be able and
> > willing to test this assumption?
> > 
> > I don't believe Peter has expressed his opinion yet on what Yuri is
> > proposing, so this might be moot.
> 
> Urgh, yeah, I don't like that at all. That reasoning makes no sense
> whatsoever. That 5% is not lost time; that 5% is being very optimistic
> and 'models' otherwise unaccountable time like IRQs and random overheads.

But, wait. For dealing with IRQs and random overheads we usually say
'inflate your reservations', e.g. add 3-5% to your runtime so that it is
sound against reality. And that is already included in the default 95%
max cap and in the schedulability tests.

I believe what Yuri is saying is that dl-servers are different, because
they are only a safety net and don't provide any guarantees. With RT
throttling we used to run non-RT tasks on the remaining 5% (beyond the
95%), and with Yuri's change we are going back to doing the same, but
with dl-server(s). If we don't do that, we somewhat end up paying
overheads twice: first, we must inflate real reservations or tasks get
prematurely throttled; second, we remove 5% of the overall bandwidth if
dl-servers are accounted for together with the real reservations.
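
To make the double payment concrete (using the defaults, and assuming for
illustration a fair dl-server configured at 5%):

	0.95 (max cap) - 0.05 (dl-server) = 0.90 left for real reservations

each of which has already been inflated by a few percent to absorb IRQs
and overheads.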

What do you think? :)


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2025-09-04  7:17 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-03  9:33 [PATCHSET v8 sched_ext/for-6.18] Add a deadline server for sched_ext tasks Andrea Righi
2025-09-03  9:33 ` [PATCH 01/16] sched_ext: Exit early on hotplug events during attach Andrea Righi
2025-09-03 19:44   ` Tejun Heo
2025-09-03 21:40     ` Andrea Righi
2025-09-03  9:33 ` [PATCH 02/16] sched/debug: Fix updating of ppos on server write ops Andrea Righi
2025-09-03  9:33 ` [PATCH 03/16] sched/debug: Stop and start server based on if it was active Andrea Righi
2025-09-03 14:43   ` Juri Lelli
2025-09-03 15:02     ` Andrea Righi
2025-09-03  9:33 ` [PATCH 04/16] sched/deadline: Clear the defer params Andrea Righi
2025-09-03 14:44   ` Juri Lelli
2025-09-03  9:33 ` [PATCH 05/16] sched/deadline: Return EBUSY if dl_bw_cpus is zero Andrea Righi
2025-09-03 14:53   ` Juri Lelli
2025-09-03 15:10     ` Andrea Righi
2025-09-03 15:15       ` Juri Lelli
2025-09-03 15:24         ` Andrea Righi
2025-09-03 20:05     ` Peter Zijlstra
2025-09-04  7:12       ` luca abeni
2025-09-04  7:17       ` Juri Lelli
2025-09-03  9:33 ` [PATCH 06/16] sched: Add a server arg to dl_server_update_idle_time() Andrea Righi
2025-09-03  9:33 ` [PATCH 07/16] sched_ext: Add a DL server for sched_ext tasks Andrea Righi
2025-09-03 19:54   ` Tejun Heo
2025-09-03 20:08     ` Peter Zijlstra
2025-09-03 20:41       ` Tejun Heo
2025-09-03 20:56         ` Peter Zijlstra
2025-09-03 21:33         ` Andrea Righi
2025-09-03  9:33 ` [PATCH 08/16] sched/debug: Add support to change sched_ext server params Andrea Righi
2025-09-03  9:33 ` [PATCH 09/16] sched/deadline: Add support to remove DL server's bandwidth contribution Andrea Righi
2025-09-03  9:33 ` [PATCH 10/16] sched/deadline: Account ext server bandwidth Andrea Righi
2025-09-03  9:33 ` [PATCH 11/16] sched/deadline: Allow to initialize DL server when needed Andrea Righi
2025-09-03  9:33 ` [PATCH 12/16] sched_ext: Selectively enable ext and fair DL servers Andrea Righi
2025-09-03  9:33 ` [PATCH 13/16] sched/deadline: Fix DL server crash in inactive_timer callback Andrea Righi
2025-09-03  9:33 ` [PATCH 14/16] sched/deadline: De-couple balance and pick_task Andrea Righi
2025-09-03  9:33 ` [PATCH 15/16] selftests/sched_ext: Add test for sched_ext dl_server Andrea Righi
2025-09-03  9:33 ` [PATCH 16/16] selftests/sched_ext: Add test for DL server total_bw consistency Andrea Righi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).