linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 00/10] Add a deadline server for sched_ext tasks
@ 2025-06-13  5:17 Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 01/10] sched/debug: Fix updating of ppos on server write ops Joel Fernandes
                   ` (10 more replies)
  0 siblings, 11 replies; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13  5:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tejun Heo, David Vernet,
	Andrea Righi, Changwoo Min, bpf

sched_ext tasks are currently starved by RT hoggers, especially since RT
throttling was replaced by deadline servers that boost only CFS tasks. Several
users in the community have reported issues with RT stalling sched_ext tasks.
Add a sched_ext deadline server as well, so that sched_ext tasks are also
boosted and do not suffer starvation.

A kselftest is also provided to verify the starvation issues are now fixed.

Btw, there is still something funky going on with CPU hotplug and the
relinquish patch. Sometimes the sched_ext's hotplug self-test locks up
(./runner -t hotplug). Reverting that patch fixes it, so I am suspecting
something is off in dl_server_remove_params() when it is being called on
offline CPUs.

v2->v3:
 - Removed code duplication in debugfs. Made ext interface separate.
 - Fixed issue where rq_lock_irqsave was not used in the relinquish patch.
 - Fixed running bw accounting issue in dl_server_remove_params.

Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/
Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/

Andrea Righi (1):
  selftests/sched_ext: Add test for sched_ext dl_server

Joel Fernandes (9):
  sched/debug: Fix updating of ppos on server write ops
  sched/debug: Stop and start server based on if it was active
  sched/deadline: Clear the defer params
  sched: Add support to pick functions to take rf
  sched: Add a server arg to dl_server_update_idle_time()
  sched/ext: Add a DL server for sched_ext tasks
  sched/debug: Add support to change sched_ext server params
  sched/deadline: Add support to remove DL server bandwidth
  sched/ext: Relinquish DL server reservations when not needed

 include/linux/sched.h                         |   2 +-
 kernel/sched/core.c                           |  19 +-
 kernel/sched/deadline.c                       |  78 +++++--
 kernel/sched/debug.c                          | 171 +++++++++++---
 kernel/sched/ext.c                            | 108 ++++++++-
 kernel/sched/fair.c                           |  15 +-
 kernel/sched/idle.c                           |   4 +-
 kernel/sched/rt.c                             |   2 +-
 kernel/sched/sched.h                          |  13 +-
 kernel/sched/stop_task.c                      |   2 +-
 tools/testing/selftests/sched_ext/Makefile    |   1 +
 .../selftests/sched_ext/rt_stall.bpf.c        |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c  | 213 ++++++++++++++++++
 13 files changed, 579 insertions(+), 72 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3 01/10] sched/debug: Fix updating of ppos on server write ops
  2025-06-13  5:17 [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
@ 2025-06-13  5:17 ` Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 02/10] sched/debug: Stop and start server based on if it was active Joel Fernandes
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13  5:17 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joel Fernandes, Tejun Heo, David Vernet, Andrea Righi,
	Changwoo Min

Updating "ppos" on error conditions does not make much sense. The convention
is to return the error code directly without modifying the position, or to
modify the position on success and return the number of bytes written.

Since dl_server_apply_params() returns 0 on success, there is also no point
in routing its return value through "cnt" into "ppos". Fix this by removing
all that and returning either the error code, or the number of bytes written
on success.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/debug.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 557246880a7e..77b5d4bebc59 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -350,8 +350,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
 	u64 runtime, period;
+	int retval = 0;
 	size_t err;
-	int retval;
 	u64 value;
 
 	err = kstrtoull_from_user(ubuf, cnt, 10, &value);
@@ -387,8 +387,6 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		}
 
 		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
-		if (retval)
-			cnt = retval;
 
 		if (!runtime)
 			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
@@ -396,6 +394,9 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 
 		if (rq->cfs.h_nr_queued)
 			dl_server_start(&rq->fair_server);
+
+		if (retval < 0)
+			return retval;
 	}
 
 	*ppos += cnt;
-- 
2.34.1



* [PATCH v3 02/10] sched/debug: Stop and start server based on if it was active
  2025-06-13  5:17 [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 01/10] sched/debug: Fix updating of ppos on server write ops Joel Fernandes
@ 2025-06-13  5:17 ` Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 03/10] sched/deadline: Clear the defer params Joel Fernandes
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13  5:17 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joel Fernandes, Tejun Heo, David Vernet, Andrea Righi,
	Changwoo Min

Currently the DL server interface for applying parameters checks CFS
internals to identify whether the server is active. This is error-prone
and makes it difficult to add new servers in the future.

Fix it by using dl_server_active(), which is also used by the DL server
code to determine whether the DL server was started.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/debug.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 77b5d4bebc59..6866f0a9e88c 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -349,6 +349,7 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 {
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
+	bool was_active = false;
 	u64 runtime, period;
 	int retval = 0;
 	size_t err;
@@ -381,7 +382,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 			return  -EINVAL;
 		}
 
-		if (rq->cfs.h_nr_queued) {
+		if (dl_server_active(&rq->fair_server)) {
+			was_active = true;
 			update_rq_clock(rq);
 			dl_server_stop(&rq->fair_server);
 		}
@@ -392,7 +394,7 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
 					cpu_of(rq));
 
-		if (rq->cfs.h_nr_queued)
+		if (was_active)
 			dl_server_start(&rq->fair_server);
 
 		if (retval < 0)
-- 
2.34.1



* [PATCH v3 03/10] sched/deadline: Clear the defer params
  2025-06-13  5:17 [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 01/10] sched/debug: Fix updating of ppos on server write ops Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 02/10] sched/debug: Stop and start server based on if it was active Joel Fernandes
@ 2025-06-13  5:17 ` Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 04/10] sched: Add support to pick functions to take rf Joel Fernandes
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13  5:17 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Daniel Bristot de Oliveira
  Cc: Joel Fernandes, Tejun Heo, David Vernet, Andrea Righi,
	Changwoo Min

The defer params were not cleared in __dl_clear_params(). Clear them.

Without this, some of my test cases flake and the DL timer does not
start correctly, AFAICS.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/deadline.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ad45a8fea245..ae15ec6294cf 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3431,6 +3431,9 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
 	dl_se->dl_non_contending	= 0;
 	dl_se->dl_overrun		= 0;
 	dl_se->dl_server		= 0;
+	dl_se->dl_defer			= 0;
+	dl_se->dl_defer_running	= 0;
+	dl_se->dl_defer_armed	= 0;
 
 #ifdef CONFIG_RT_MUTEXES
 	dl_se->pi_se			= dl_se;
-- 
2.34.1



* [PATCH v3 04/10] sched: Add support to pick functions to take rf
  2025-06-13  5:17 [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
                   ` (2 preceding siblings ...)
  2025-06-13  5:17 ` [PATCH v3 03/10] sched/deadline: Clear the defer params Joel Fernandes
@ 2025-06-13  5:17 ` Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 05/10] sched: Add a server arg to dl_server_update_idle_time() Joel Fernandes
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13  5:17 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tejun Heo, David Vernet,
	Andrea Righi, Changwoo Min
  Cc: Joel Fernandes

Some pick functions, like the internal pick_next_task_fair(), already take
an rq_flags pointer (rf), but others don't. We need this for scx's server
pick function. Prepare for it by having all pick functions accept rf.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 include/linux/sched.h    |  2 +-
 kernel/sched/core.c      | 16 ++++++++--------
 kernel/sched/deadline.c  |  8 ++++----
 kernel/sched/ext.c       |  2 +-
 kernel/sched/fair.c      | 13 ++++++++-----
 kernel/sched/idle.c      |  2 +-
 kernel/sched/rt.c        |  2 +-
 kernel/sched/sched.h     |  7 ++++---
 kernel/sched/stop_task.c |  2 +-
 9 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45e5953b8f32..1000d2aa8482 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -630,7 +630,7 @@ struct sched_rt_entity {
 } __randomize_layout;
 
 typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
-typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
+typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *, void *);
 
 struct sched_dl_entity {
 	struct rb_node			rb_node;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 62b3416f5e43..19b393b0b096 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6045,7 +6045,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 		/* Assume the next prioritized class is idle_sched_class */
 		if (!p) {
-			p = pick_task_idle(rq);
+			p = pick_task_idle(rq, rf);
 			put_prev_set_next_task(rq, prev, p);
 		}
 
@@ -6057,11 +6057,11 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	for_each_active_class(class) {
 		if (class->pick_next_task) {
-			p = class->pick_next_task(rq, prev);
+			p = class->pick_next_task(rq, prev, rf);
 			if (p)
 				return p;
 		} else {
-			p = class->pick_task(rq);
+			p = class->pick_task(rq, rf);
 			if (p) {
 				put_prev_set_next_task(rq, prev, p);
 				return p;
@@ -6091,7 +6091,7 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
-static inline struct task_struct *pick_task(struct rq *rq)
+static inline struct task_struct *pick_task(struct rq *rq, struct rq_flags *rf)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -6099,7 +6099,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
 	rq->dl_server = NULL;
 
 	for_each_active_class(class) {
-		p = class->pick_task(rq);
+		p = class->pick_task(rq, rf);
 		if (p)
 			return p;
 	}
@@ -6199,7 +6199,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * and there are no cookied tasks running on siblings.
 	 */
 	if (!need_sync) {
-		next = pick_task(rq);
+		next = pick_task(rq, rf);
 		if (!next->core_cookie) {
 			rq->core_pick = NULL;
 			rq->core_dl_server = NULL;
@@ -6230,7 +6230,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		if (i != cpu && (rq_i != rq->core || !core_clock_updated))
 			update_rq_clock(rq_i);
 
-		rq_i->core_pick = p = pick_task(rq_i);
+		rq_i->core_pick = p = pick_task(rq_i, rf);
 		rq_i->core_dl_server = rq_i->dl_server;
 
 		if (!max || prio_less(max, p, fi_before))
@@ -6252,7 +6252,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			if (cookie)
 				p = sched_core_find(rq_i, cookie);
 			if (!p)
-				p = idle_sched_class.pick_task(rq_i);
+				p = idle_sched_class.pick_task(rq_i, rf);
 		}
 
 		rq_i->core_pick = p;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ae15ec6294cf..62d7c18bff64 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2419,7 +2419,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct dl_rq *dl_rq)
  * __pick_next_task_dl - Helper to pick the next -deadline task to run.
  * @rq: The runqueue to pick the next task from.
  */
-static struct task_struct *__pick_task_dl(struct rq *rq)
+static struct task_struct *__pick_task_dl(struct rq *rq, struct rq_flags *rf)
 {
 	struct sched_dl_entity *dl_se;
 	struct dl_rq *dl_rq = &rq->dl;
@@ -2433,7 +2433,7 @@ static struct task_struct *__pick_task_dl(struct rq *rq)
 	WARN_ON_ONCE(!dl_se);
 
 	if (dl_server(dl_se)) {
-		p = dl_se->server_pick_task(dl_se);
+		p = dl_se->server_pick_task(dl_se, rf);
 		if (!p) {
 			if (dl_server_active(dl_se)) {
 				dl_se->dl_yielded = 1;
@@ -2449,9 +2449,9 @@ static struct task_struct *__pick_task_dl(struct rq *rq)
 	return p;
 }
 
-static struct task_struct *pick_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq, struct rq_flags *rf)
 {
-	return __pick_task_dl(rq);
+	return __pick_task_dl(rq, rf);
 }
 
 static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct task_struct *next)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f5133249fd4d..d765379cd94c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3281,7 +3281,7 @@ static struct task_struct *first_local_task(struct rq *rq)
 					struct task_struct, scx.dsq_list.node);
 }
 
-static struct task_struct *pick_task_scx(struct rq *rq)
+static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf)
 {
 	struct task_struct *prev = rq->curr;
 	struct task_struct *p;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 125912c0e9dd..2b7958d2fb06 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8842,7 +8842,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	resched_curr_lazy(rq);
 }
 
-static struct task_struct *pick_task_fair(struct rq *rq)
+static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
@@ -8880,7 +8880,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	int new_tasks;
 
 again:
-	p = pick_task_fair(rq);
+	p = pick_task_fair(rq, rf);
 	if (!p)
 		goto idle;
 	se = &p->se;
@@ -8959,7 +8959,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	return NULL;
 }
 
-static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_struct *prev)
+static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_struct *prev,
+												 struct rq_flags *rf)
 {
 	return pick_next_task_fair(rq, prev, NULL);
 }
@@ -8969,9 +8970,11 @@ static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
 	return !!dl_se->rq->cfs.nr_queued;
 }
 
-static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
+static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se, void *flags)
 {
-	return pick_task_fair(dl_se->rq);
+	struct rq_flags *rf = flags;
+
+	return pick_task_fair(dl_se->rq, rf);
 }
 
 void fair_server_init(struct rq *rq)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 2c85c86b455f..01e9612deefe 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -463,7 +463,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 	next->se.exec_start = rq_clock_task(rq);
 }
 
-struct task_struct *pick_task_idle(struct rq *rq)
+struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf)
 {
 	scx_update_idle(rq, true, false);
 	return rq->idle;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e40422c37033..6e62fe531067 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1744,7 +1744,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return rt_task_of(rt_se);
 }
 
-static struct task_struct *pick_task_rt(struct rq *rq)
+static struct task_struct *pick_task_rt(struct rq *rq, struct rq_flags *rf)
 {
 	struct task_struct *p;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5a6a503eb6d..b4b9c98f0c6d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2401,7 +2401,7 @@ struct sched_class {
 	void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);
 
 	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
-	struct task_struct *(*pick_task)(struct rq *rq);
+	struct task_struct *(*pick_task)(struct rq *rq, struct rq_flags *rf);
 	/*
 	 * Optional! When implemented pick_next_task() should be equivalent to:
 	 *
@@ -2411,7 +2411,8 @@ struct sched_class {
 	 *       set_next_task_first(next);
 	 *   }
 	 */
-	struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev);
+	struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev,
+										  struct rq_flags *rf);
 
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct task_struct *next);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
@@ -2574,7 +2575,7 @@ static inline bool sched_fair_runnable(struct rq *rq)
 }
 
 extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
-extern struct task_struct *pick_task_idle(struct rq *rq);
+extern struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf);
 
 #define SCA_CHECK		0x01
 #define SCA_MIGRATE_DISABLE	0x02
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 058dd42e3d9b..1c70123cb6a4 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -33,7 +33,7 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool fir
 	stop->se.exec_start = rq_clock_task(rq);
 }
 
-static struct task_struct *pick_task_stop(struct rq *rq)
+static struct task_struct *pick_task_stop(struct rq *rq, struct rq_flags *rf)
 {
 	if (!sched_stop_runnable(rq))
 		return NULL;
-- 
2.34.1



* [PATCH v3 05/10] sched: Add a server arg to dl_server_update_idle_time()
  2025-06-13  5:17 [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
                   ` (3 preceding siblings ...)
  2025-06-13  5:17 ` [PATCH v3 04/10] sched: Add support to pick functions to take rf Joel Fernandes
@ 2025-06-13  5:17 ` Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 06/10] sched/ext: Add a DL server for sched_ext tasks Joel Fernandes
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13  5:17 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joel Fernandes, Tejun Heo, David Vernet, Andrea Righi,
	Changwoo Min

Since we are adding more servers, make dl_server_update_idle_time()
accept a server argument rather than operating on a hard-coded server.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/deadline.c | 17 +++++++++--------
 kernel/sched/fair.c     |  2 +-
 kernel/sched/idle.c     |  2 +-
 kernel/sched/sched.h    |  3 ++-
 4 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 62d7c18bff64..eb2521584f15 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1609,28 +1609,29 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
  * as time available for the fair server, avoiding a penalty for the
  * rt scheduler that did not consumed that time.
  */
-void dl_server_update_idle_time(struct rq *rq, struct task_struct *p)
+void dl_server_update_idle_time(struct rq *rq, struct task_struct *p,
+			       struct sched_dl_entity *rq_dl_server)
 {
 	s64 delta_exec, scaled_delta_exec;
 
-	if (!rq->fair_server.dl_defer)
+	if (!rq_dl_server->dl_defer)
 		return;
 
 	/* no need to discount more */
-	if (rq->fair_server.runtime < 0)
+	if (rq_dl_server->runtime < 0)
 		return;
 
 	delta_exec = rq_clock_task(rq) - p->se.exec_start;
 	if (delta_exec < 0)
 		return;
 
-	scaled_delta_exec = dl_scaled_delta_exec(rq, &rq->fair_server, delta_exec);
+	scaled_delta_exec = dl_scaled_delta_exec(rq, rq_dl_server, delta_exec);
 
-	rq->fair_server.runtime -= scaled_delta_exec;
+	rq_dl_server->runtime -= scaled_delta_exec;
 
-	if (rq->fair_server.runtime < 0) {
-		rq->fair_server.dl_defer_running = 0;
-		rq->fair_server.runtime = 0;
+	if (rq_dl_server->runtime < 0) {
+		rq_dl_server->dl_defer_running = 0;
+		rq_dl_server->runtime = 0;
 	}
 
 	p->se.exec_start = rq_clock_task(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b7958d2fb06..6fd4100fd5db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7016,7 +7016,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
 		/* Account for idle runtime */
 		if (!rq->nr_running)
-			dl_server_update_idle_time(rq, rq->curr);
+			dl_server_update_idle_time(rq, rq->curr, &rq->fair_server);
 		dl_server_start(&rq->fair_server);
 	}
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 01e9612deefe..13a3d20d35e2 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -451,7 +451,7 @@ static void wakeup_preempt_idle(struct rq *rq, struct task_struct *p, int flags)
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct task_struct *next)
 {
-	dl_server_update_idle_time(rq, prev);
+	dl_server_update_idle_time(rq, prev, &rq->fair_server);
 	scx_update_idle(rq, false, true);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b4b9c98f0c6d..467e39205ebf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -386,7 +386,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_pick_f pick_task);
 
 extern void dl_server_update_idle_time(struct rq *rq,
-		    struct task_struct *p);
+		    struct task_struct *p,
+		    struct sched_dl_entity *rq_dl_server);
 extern void fair_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
-- 
2.34.1



* [PATCH v3 06/10] sched/ext: Add a DL server for sched_ext tasks
  2025-06-13  5:17 [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
                   ` (4 preceding siblings ...)
  2025-06-13  5:17 ` [PATCH v3 05/10] sched: Add a server arg to dl_server_update_idle_time() Joel Fernandes
@ 2025-06-13  5:17 ` Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 07/10] sched/debug: Add support to change sched_ext server params Joel Fernandes
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13  5:17 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tejun Heo, David Vernet,
	Andrea Righi, Changwoo Min
  Cc: Joel Fernandes, Luigi De Matteis

sched_ext currently suffers starvation due to RT: the same workload, when
converted to EXT, can get zero runtime if RT is running 100% of the time,
causing EXT processes to stall. Fix it by adding a DL server for EXT.

A kselftest is also provided later to verify:

./runner -t rt_stall
===== START =====
TEST: rt_stall
DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
OUTPUT:
TAP version 13
1..1
ok 1 PASS: CFS task got more than 4.00% of runtime

Cc: Luigi De Matteis <ldematteis123@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/core.c     |  3 ++
 kernel/sched/deadline.c |  2 +-
 kernel/sched/ext.c      | 62 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h    |  2 ++
 4 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 19b393b0b096..17e7cab0ddf5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8645,6 +8645,9 @@ void __init sched_init(void)
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
 		fair_server_init(rq);
+#ifdef CONFIG_SCHED_CLASS_EXT
+		ext_server_init(rq);
+#endif
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = rq;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index eb2521584f15..4ed61266f3ea 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1570,7 +1570,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 	 * The fair server (sole dl_server) does not account for real-time
 	 * workload because it is running fair work.
 	 */
-	if (dl_se == &rq->fair_server)
+	if (dl_se == &rq->fair_server || dl_se == &rq->ext_server)
 		return;
 
 #ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d765379cd94c..52f98c3944ed 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1830,6 +1830,9 @@ static void update_curr_scx(struct rq *rq)
 		if (!curr->scx.slice)
 			touch_core_sched(rq, curr);
 	}
+
+	if (dl_server_active(&rq->ext_server))
+		dl_server_update(&rq->ext_server, delta_exec);
 }
 
 static bool scx_dsq_priq_less(struct rb_node *node_a,
@@ -2308,6 +2311,15 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 	if (enq_flags & SCX_ENQ_WAKEUP)
 		touch_core_sched(rq, p);
 
+	if (rq->scx.nr_running == 1) {
+		/* Account for idle runtime */
+		if (!rq->nr_running)
+			dl_server_update_idle_time(rq, rq->curr, &rq->ext_server);
+
+		/* Start dl_server if this is the first task being enqueued */
+		dl_server_start(&rq->ext_server);
+	}
+
 	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
 out:
 	rq->scx.flags &= ~SCX_RQ_IN_WAKEUP;
@@ -2403,6 +2415,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
 	sub_nr_running(rq, 1);
 
 	dispatch_dequeue(rq, p);
+
+	/* Stop the server if this was the last task */
+	if (rq->scx.nr_running == 0)
+		dl_server_stop(&rq->ext_server);
+
 	return true;
 }
 
@@ -3894,6 +3911,15 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
 static void switched_from_scx(struct rq *rq, struct task_struct *p)
 {
 	scx_ops_disable_task(p);
+
+	/*
+	 * After class switch, if the DL server is still active, restart it so
+	 * that DL timers will be queued, in case SCX switched to higher class.
+	 */
+	if (dl_server_active(&rq->ext_server)) {
+		dl_server_stop(&rq->ext_server);
+		dl_server_start(&rq->ext_server);
+	}
 }
 
 static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
@@ -7106,8 +7132,8 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu)
  * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
  * schedutil cpufreq governor chooses the target frequency.
  *
- * The actual performance level chosen, CPU grouping, and the overhead and
- * latency of the operations are dependent on the hardware and cpufreq driver in
+ * The actual performance level chosen, CPU grouping, and the overhead and latency
+ * of the operations are dependent on the hardware and cpufreq driver in
  * use. Consult hardware and cpufreq documentation for more information. The
  * current performance level can be monitored using scx_bpf_cpuperf_cur().
  */
@@ -7385,6 +7411,38 @@ BTF_ID_FLAGS(func, scx_bpf_now)
 BTF_ID_FLAGS(func, scx_bpf_events, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(scx_kfunc_ids_any)
 
+/*
+ * Check if ext scheduler has tasks ready to run.
+ */
+static bool ext_server_has_tasks(struct sched_dl_entity *dl_se)
+{
+	return !!dl_se->rq->scx.nr_running;
+}
+
+/*
+ * Select the next task to run from the ext scheduling class.
+ */
+static struct task_struct *ext_server_pick_task(struct sched_dl_entity *dl_se,
+						void *flags)
+{
+	struct rq_flags *rf = flags;
+
+	balance_scx(dl_se->rq, dl_se->rq->curr, rf);
+	return pick_task_scx(dl_se->rq, rf);
+}
+
+/*
+ * Initialize the ext server deadline entity.
+ */
+void ext_server_init(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se = &rq->ext_server;
+
+	init_dl_entity(dl_se);
+
+	dl_server_init(dl_se, rq, ext_server_has_tasks, ext_server_pick_task);
+}
+
 static const struct btf_kfunc_id_set scx_kfunc_set_any = {
 	.owner			= THIS_MODULE,
 	.set			= &scx_kfunc_ids_any,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 467e39205ebf..d206421b1146 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -389,6 +389,7 @@ extern void dl_server_update_idle_time(struct rq *rq,
 		    struct task_struct *p,
 		    struct sched_dl_entity *rq_dl_server);
 extern void fair_server_init(struct rq *rq);
+extern void ext_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
@@ -1137,6 +1138,7 @@ struct rq {
 #endif
 
 	struct sched_dl_entity	fair_server;
+	struct sched_dl_entity	ext_server;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this CPU: */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 07/10] sched/debug: Add support to change sched_ext server params
  2025-06-13  5:17 [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
                   ` (5 preceding siblings ...)
  2025-06-13  5:17 ` [PATCH v3 06/10] sched/ext: Add a DL server for sched_ext tasks Joel Fernandes
@ 2025-06-13  5:17 ` Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 08/10] sched/deadline: Add support to remove DL server bandwidth Joel Fernandes
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13  5:17 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joel Fernandes, Tejun Heo, David Vernet, Andrea Righi,
	Changwoo Min

When a sched_ext scheduler is loaded, tasks in CFS are converted to run in
the sched_ext class. Add support for modifying the ext server parameters,
similar to how the fair server parameters are modified.

Re-use common code between ext and fair servers as needed.
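With this applied, the ext server knobs should appear alongside the fair
server's, under /sys/kernel/debug/sched/ext_server/cpu<N>/{runtime,period}.
As a rough sketch (Python, not the kernel code itself), the shared write path
rejects parameters the same way for both servers: the runtime must fit within
the period, and the period must stay within roughly [100us, ~4.2s]:

```python
NSEC_PER_USEC = 1000

# Bounds mirrored from the patch: (1 << 22) us ~= 4.19 s, minimum 100 us.
DL_SERVER_PERIOD_MAX = (1 << 22) * NSEC_PER_USEC
DL_SERVER_PERIOD_MIN = 100 * NSEC_PER_USEC

def server_params_valid(runtime_ns, period_ns):
    """Mirrors the -EINVAL check in sched_server_write_common()."""
    if runtime_ns > period_ns:
        return False
    if period_ns > DL_SERVER_PERIOD_MAX or period_ns < DL_SERVER_PERIOD_MIN:
        return False
    return True

# The default 50 ms / 1 s reservation passes; a 50 us period does not.
print(server_params_valid(50_000_000, 1_000_000_000))  # True
print(server_params_valid(25_000, 50_000))             # False
```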

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/debug.c | 160 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 135 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 6866f0a9e88c..de1f14f73077 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -341,14 +341,16 @@ enum dl_param {
 	DL_PERIOD,
 };
 
-static unsigned long fair_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */
-static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
+static unsigned long dl_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */
+static unsigned long dl_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
 
-static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf,
-				       size_t cnt, loff_t *ppos, enum dl_param param)
+static ssize_t sched_server_write_common(struct file *filp, const char __user *ubuf,
+					 size_t cnt, loff_t *ppos, enum dl_param param,
+					 void *server)
 {
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
+	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
 	bool was_active = false;
 	u64 runtime, period;
 	int retval = 0;
@@ -360,8 +362,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		return err;
 
 	scoped_guard (rq_lock_irqsave, rq) {
-		runtime  = rq->fair_server.dl_runtime;
-		period = rq->fair_server.dl_period;
+		runtime  = dl_se->dl_runtime;
+		period = dl_se->dl_period;
 
 		switch (param) {
 		case DL_RUNTIME:
@@ -377,25 +379,30 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		}
 
 		if (runtime > period ||
-		    period > fair_server_period_max ||
-		    period < fair_server_period_min) {
+		    period > dl_server_period_max ||
+		    period < dl_server_period_min) {
 			return  -EINVAL;
 		}
 
-		if (dl_server_active(&rq->fair_server)) {
+		if (dl_server_active(dl_se)) {
 			was_active = true;
 			update_rq_clock(rq);
-			dl_server_stop(&rq->fair_server);
+			dl_server_stop(dl_se);
 		}
 
-		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
+		retval = dl_server_apply_params(dl_se, runtime, period, 0);
 
-		if (!runtime)
-			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
-					cpu_of(rq));
+		if (!runtime) {
+			if (server == &rq->fair_server)
+				printk_deferred("Fair server disabled on CPU %d, system may crash due to starvation.\n",
+						cpu_of(rq));
+			else
+				printk_deferred("Ext server disabled on CPU %d, system may crash due to starvation.\n",
+						cpu_of(rq));
+		}
 
 		if (was_active)
-			dl_server_start(&rq->fair_server);
+			dl_server_start(dl_se);
 
 		if (retval < 0)
 			return retval;
@@ -405,36 +412,46 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 	return cnt;
 }
 
-static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param)
+
+
+static size_t sched_server_show_common(struct seq_file *m, void *v, enum dl_param param,
+				       void *server)
 {
-	unsigned long cpu = (unsigned long) m->private;
-	struct rq *rq = cpu_rq(cpu);
+	struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server;
 	u64 value;
 
 	switch (param) {
 	case DL_RUNTIME:
-		value = rq->fair_server.dl_runtime;
+		value = dl_se->dl_runtime;
 		break;
 	case DL_PERIOD:
-		value = rq->fair_server.dl_period;
+		value = dl_se->dl_period;
 		break;
 	}
 
 	seq_printf(m, "%llu\n", value);
 	return 0;
-
 }
 
+
+
 static ssize_t
 sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
 				size_t cnt, loff_t *ppos)
 {
-	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME);
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME,
+					&rq->fair_server);
 }
 
 static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
 {
-	return sched_fair_server_show(m, v, DL_RUNTIME);
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_RUNTIME, &rq->fair_server);
 }
 
 static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
@@ -450,16 +467,55 @@ static const struct file_operations fair_server_runtime_fops = {
 	.release	= single_release,
 };
 
+static ssize_t
+sched_ext_server_runtime_write(struct file *filp, const char __user *ubuf,
+			       size_t cnt, loff_t *ppos)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME,
+					&rq->ext_server);
+}
+
+static int sched_ext_server_runtime_show(struct seq_file *m, void *v)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_RUNTIME, &rq->ext_server);
+}
+
+static int sched_ext_server_runtime_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_ext_server_runtime_show, inode->i_private);
+}
+
+static const struct file_operations ext_server_runtime_fops = {
+	.open		= sched_ext_server_runtime_open,
+	.write		= sched_ext_server_runtime_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static ssize_t
 sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
 			       size_t cnt, loff_t *ppos)
 {
-	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD);
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD,
+					&rq->fair_server);
 }
 
 static int sched_fair_server_period_show(struct seq_file *m, void *v)
 {
-	return sched_fair_server_show(m, v, DL_PERIOD);
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_PERIOD, &rq->fair_server);
 }
 
 static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
@@ -475,6 +531,38 @@ static const struct file_operations fair_server_period_fops = {
 	.release	= single_release,
 };
 
+static ssize_t
+sched_ext_server_period_write(struct file *filp, const char __user *ubuf,
+			      size_t cnt, loff_t *ppos)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD,
+					&rq->ext_server);
+}
+
+static int sched_ext_server_period_show(struct seq_file *m, void *v)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	return sched_server_show_common(m, v, DL_PERIOD, &rq->ext_server);
+}
+
+static int sched_ext_server_period_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_ext_server_period_show, inode->i_private);
+}
+
+static const struct file_operations ext_server_period_fops = {
+	.open		= sched_ext_server_period_open,
+	.write		= sched_ext_server_period_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static struct dentry *debugfs_sched;
 
 static void debugfs_fair_server_init(void)
@@ -498,6 +586,27 @@ static void debugfs_fair_server_init(void)
 	}
 }
 
+static void debugfs_ext_server_init(void)
+{
+	struct dentry *d_ext;
+	unsigned long cpu;
+
+	d_ext = debugfs_create_dir("ext_server", debugfs_sched);
+	if (!d_ext)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct dentry *d_cpu;
+		char buf[32];
+
+		snprintf(buf, sizeof(buf), "cpu%lu", cpu);
+		d_cpu = debugfs_create_dir(buf, d_ext);
+
+		debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &ext_server_runtime_fops);
+		debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &ext_server_period_fops);
+	}
+}
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;
@@ -538,6 +647,7 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
 	debugfs_fair_server_init();
+	debugfs_ext_server_init();
 
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 08/10] sched/deadline: Add support to remove DL server bandwidth
  2025-06-13  5:17 [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
                   ` (6 preceding siblings ...)
  2025-06-13  5:17 ` [PATCH v3 07/10] sched/debug: Add support to change sched_ext server params Joel Fernandes
@ 2025-06-13  5:17 ` Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 09/10] sched/ext: Relinquish DL server reservations when not needed Joel Fernandes
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13  5:17 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joel Fernandes, Tejun Heo, David Vernet, Andrea Righi,
	Changwoo Min

The sched_ext DL server's bandwidth must be removed when sched_ext is
unloaded. Add support for this to the deadline code, so that the
sched_ext DL server can relinquish its reservation.
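A toy model of the accounting this entails (not the kernel code: the real
dl_server_remove_params() runs under dl_b->lock, also cancels the dl/inactive
timers, and uses sub_rq_bw()/__dl_sub()): bail out if nothing is reserved,
subtract the server's bandwidth from the aggregate, then clear the parameters
so a later dl_server_apply_params() can reconfigure the server:

```python
from dataclasses import dataclass

@dataclass
class ToyRootDomainBw:
    total_bw: int = 0   # sum of reserved bandwidth on the root domain

@dataclass
class ToyDlServer:
    dl_runtime: int = 0
    dl_period: int = 0
    dl_bw: int = 0      # fixed-point runtime/period ratio

def toy_remove_params(server, dl_b):
    """Rough shape of dl_server_remove_params()."""
    if not server.dl_runtime:
        return 0  # already disabled, nothing reserved
    dl_b.total_bw -= server.dl_bw  # __dl_sub()/sub_rq_bw() analogue
    server.dl_runtime = server.dl_period = server.dl_bw = 0  # __dl_clear_params()
    return 0

dl_b = ToyRootDomainBw(total_bw=52428)
srv = ToyDlServer(dl_runtime=50_000_000, dl_period=1_000_000_000, dl_bw=52428)
toy_remove_params(srv, dl_b)
print(dl_b.total_bw)  # 0
toy_remove_params(srv, dl_b)  # second call is a no-op
print(dl_b.total_bw)  # still 0
```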

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/deadline.c | 55 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h    |  1 +
 2 files changed, 56 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4ed61266f3ea..0e73274d8c31 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1752,6 +1752,61 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 	return retval;
 }
 
+/**
+ * dl_server_remove_params - Remove bandwidth reservation for a DL server
+ * @dl_se: The DL server entity to remove bandwidth for
+ *
+ * This function removes the bandwidth reservation for a DL server entity,
+ * cleaning up all bandwidth accounting and server state.
+ *
+ * Returns: 0 on success, negative error code on failure
+ */
+int dl_server_remove_params(struct sched_dl_entity *dl_se)
+{
+	struct rq *rq = dl_se->rq;
+	int cpu = cpu_of(rq);
+	struct dl_bw *dl_b;
+	int cpus;
+
+	if (!dl_se->dl_runtime)
+		return 0;  /* Already disabled */
+
+	/*
+	 * First dequeue if still queued. It should not be queued since
+	 * we call this only after the last dl_server_stop().
+	 */
+	if (WARN_ON_ONCE(on_dl_rq(dl_se)))
+		dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+
+	/* Cancel any pending timers */
+	hrtimer_try_to_cancel(&dl_se->dl_timer);
+	hrtimer_try_to_cancel(&dl_se->inactive_timer);
+
+	/* Remove bandwidth from both runqueue and root domain accounting */
+	dl_b = dl_bw_of(cpu);
+	guard(raw_spinlock)(&dl_b->lock);
+	cpus = dl_bw_cpus(cpu);
+
+	sub_rq_bw(dl_se, &rq->dl);
+	__dl_sub(dl_b, dl_se->dl_bw, cpus);
+
+	/*
+	 * If server was active and consuming bandwidth, remove it from
+	 * running bandwidth accounting. This should not happen because
+	 * we call this only after the last dl_server_stop().
+	 */
+	if (WARN_ON_ONCE(!dl_se->dl_non_contending))
+		sub_running_bw(dl_se, &rq->dl);
+
+	/*
+	 * Clear all server parameters. This will also clear ->dl_server so
+	 * the next dl_server_apply_params() will reconfigure the server.
+	 */
+	__dl_clear_params(dl_se);
+
+	return 0;
+}
+
 /*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d206421b1146..e6af0c1fc985 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -393,6 +393,7 @@ extern void ext_server_init(struct rq *rq);
 extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
+extern int dl_server_remove_params(struct sched_dl_entity *dl_se);
 
 static inline bool dl_server_active(struct sched_dl_entity *dl_se)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 09/10] sched/ext: Relinquish DL server reservations when not needed
  2025-06-13  5:17 [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
                   ` (7 preceding siblings ...)
  2025-06-13  5:17 ` [PATCH v3 08/10] sched/deadline: Add support to remove DL server bandwidth Joel Fernandes
@ 2025-06-13  5:17 ` Joel Fernandes
  2025-06-13  5:17 ` [PATCH v3 10/10] selftests/sched_ext: Add test for sched_ext dl_server Joel Fernandes
  2025-06-13 17:35 ` [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
  10 siblings, 0 replies; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13  5:17 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tejun Heo, David Vernet,
	Andrea Righi, Changwoo Min
  Cc: Joel Fernandes

I tested this by loading a test SCX program and verifying the bandwidth
both before and after applying the patch:

Without patch:
Before loading scx:
  .dl_bw->total_bw               : 1887408
After unloading scx:
  .dl_bw->total_bw               : 3774816

After patch:
Before loading scx:
  .dl_bw->total_bw               : 1887408
After unloading scx:
  .dl_bw->total_bw               : 1887408
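Those numbers line up with the default 50 ms / 1 s server reservation (a
back-of-the-envelope check, assuming the kernel's BW_SHIFT = 20 fixed-point
format): each server contributes to_ratio(1s, 50ms) = 52428 to total_bw, and
the leaked delta without the patch is exactly 36 such reservations, which
would be consistent with one stale ext server per CPU on a 36-CPU machine:

```python
BW_SHIFT = 20  # fixed-point shift used by the kernel's to_ratio()

def to_ratio(period_ns, runtime_ns):
    # to_ratio() analogue: (runtime << BW_SHIFT) / period
    return (runtime_ns << BW_SHIFT) // period_ns

per_server = to_ratio(1_000_000_000, 50_000_000)
print(per_server)                          # 52428
print((3774816 - 1887408) // per_server)   # 36 leaked reservations
```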

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/deadline.c | 11 ++---------
 kernel/sched/ext.c      | 44 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0e73274d8c31..924dbbfb4b40 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1786,18 +1786,11 @@ int dl_server_remove_params(struct sched_dl_entity *dl_se)
 	dl_b = dl_bw_of(cpu);
 	guard(raw_spinlock)(&dl_b->lock);
 	cpus = dl_bw_cpus(cpu);
-
+	if (dl_se->dl_non_contending)
+		sub_running_bw(dl_se, &rq->dl);
 	sub_rq_bw(dl_se, &rq->dl);
 	__dl_sub(dl_b, dl_se->dl_bw, cpus);
 
-	/*
-	 * If server was active and consuming bandwidth, remove it from
-	 * running bandwidth accounting. This should not happen because
-	 * we call this only after the last dl_server_stop().
-	 */
-	if (WARN_ON_ONCE(!dl_se->dl_non_contending))
-		sub_running_bw(dl_se, &rq->dl);
-
 	/*
 	 * Clear all server parameters. This will also clear ->dl_server so
 	 * the next dl_server_apply_params() will reconfigure the server.
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 52f98c3944ed..2e77d9971c22 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4784,13 +4784,28 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
 	scx_task_iter_stop(&sti);
 	percpu_up_write(&scx_fork_rwsem);
 
-	/*
-	 * Invalidate all the rq clocks to prevent getting outdated
-	 * rq clocks from a previous scx scheduler.
-	 */
 	for_each_possible_cpu(cpu) {
 		struct rq *rq = cpu_rq(cpu);
+		struct rq_flags rf;
+
+		/*
+		 * Invalidate all the rq clocks to prevent getting outdated
+		 * rq clocks from a previous scx scheduler.
+		 */
 		scx_rq_clock_invalidate(rq);
+
+		/*
+		 * We are unloading the sched_ext scheduler, we do not need its
+		 * DL server bandwidth anymore, remove it for all CPUs. Whenever
+		 * the first SCX task is enqueued (when scx is re-loaded), its DL
+		 * server bandwidth will be re-initialized.
+		 */
+		rq_lock_irqsave(rq, &rf);
+		if (dl_server_active(&rq->ext_server)) {
+			dl_server_stop(&rq->ext_server);
+		}
+		dl_server_remove_params(&rq->ext_server);
+		rq_unlock_irqrestore(rq, &rf);
 	}
 
 	/* no task is on scx, turn off all the switches and flush in-progress calls */
@@ -5547,6 +5562,27 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 		check_class_changed(task_rq(p), p, old_class, p->prio);
 	}
 	scx_task_iter_stop(&sti);
+
+	if (scx_switching_all) {
+		for_each_possible_cpu(cpu) {
+			struct rq *rq = cpu_rq(cpu);
+			struct rq_flags rf;
+
+			/*
+			 * We are switching all fair tasks to the sched_ext scheduler,
+			 * we do not need fair server's DL bandwidth anymore, remove it
+			 * for all CPUs. Whenever the first CFS task is enqueued (when
+			 * scx is unloaded), the fair server's DL bandwidth will be
+			 * re-initialized.
+			 */
+			rq_lock_irqsave(rq, &rf);
+			if (dl_server_active(&rq->fair_server))
+				dl_server_stop(&rq->fair_server);
+			dl_server_remove_params(&rq->fair_server);
+			rq_unlock_irqrestore(rq, &rf);
+		}
+	}
+
 	percpu_up_write(&scx_fork_rwsem);
 
 	scx_ops_bypass(false);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 10/10] selftests/sched_ext: Add test for sched_ext dl_server
  2025-06-13  5:17 [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
                   ` (8 preceding siblings ...)
  2025-06-13  5:17 ` [PATCH v3 09/10] sched/ext: Relinquish DL server reservations when not needed Joel Fernandes
@ 2025-06-13  5:17 ` Joel Fernandes
  2025-06-13 17:35 ` [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
  10 siblings, 0 replies; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13  5:17 UTC (permalink / raw)
  To: linux-kernel, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min,
	Shuah Khan
  Cc: Joel Fernandes, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kselftest, bpf

From: Andrea Righi <arighi@nvidia.com>

Add a selftest to validate the correct behavior of the deadline server
for the ext_sched_class.

[ Joel: Replaced occurrences of CFS in the test with EXT. ]
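For context on the test's pass threshold (an assumption based on the defaults
used by dl_server_start() in this series: a 50 ms runtime per 1 s period), the
ext DL server nominally reserves 5% of the CPU, and the 4% threshold in
rt_stall.c leaves some slack for accounting noise:

```python
runtime_ms, period_ms = 50, 1000           # default ext DL server reservation
nominal_share = runtime_ms / period_ms     # fraction of CPU reserved for EXT
expected_min_ratio = 0.04                  # threshold used by rt_stall.c

print(nominal_share)                       # 0.05
print(nominal_share > expected_min_ratio)  # True
```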

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 tools/testing/selftests/sched_ext/Makefile    |   1 +
 .../selftests/sched_ext/rt_stall.bpf.c        |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c  | 213 ++++++++++++++++++
 3 files changed, 237 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c

diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index f4531327b8e7..dcc803eeab39 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -181,6 +181,7 @@ auto-test-targets :=			\
 	select_cpu_dispatch_bad_dsq	\
 	select_cpu_dispatch_dbl_dsp	\
 	select_cpu_vtime		\
+	rt_stall			\
 	test_example			\
 
 testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
diff --git a/tools/testing/selftests/sched_ext/rt_stall.bpf.c b/tools/testing/selftests/sched_ext/rt_stall.bpf.c
new file mode 100644
index 000000000000..80086779dd1e
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/rt_stall.bpf.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A scheduler that verifies whether RT tasks can stall SCHED_EXT tasks.
+ *
+ * Copyright (c) 2025 NVIDIA Corporation.
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+void BPF_STRUCT_OPS(rt_stall_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops rt_stall_ops = {
+	.exit			= (void *)rt_stall_exit,
+	.name			= "rt_stall",
+};
diff --git a/tools/testing/selftests/sched_ext/rt_stall.c b/tools/testing/selftests/sched_ext/rt_stall.c
new file mode 100644
index 000000000000..d4cb545ebfd8
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/rt_stall.c
@@ -0,0 +1,213 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2025 NVIDIA Corporation.
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sched.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <linux/sched.h>
+#include <signal.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "rt_stall.bpf.skel.h"
+#include "scx_test.h"
+#include "../kselftest.h"
+
+#define CORE_ID		0	/* CPU to pin tasks to */
+#define RUN_TIME        5	/* How long to run the test in seconds */
+
+/* Simple busy-wait function for test tasks */
+static void process_func(void)
+{
+	while (1) {
+		/* Busy wait */
+		for (volatile unsigned long i = 0; i < 10000000UL; i++);
+	}
+}
+
+/* Set CPU affinity to a specific core */
+static void set_affinity(int cpu)
+{
+	cpu_set_t mask;
+
+	CPU_ZERO(&mask);
+	CPU_SET(cpu, &mask);
+	if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
+		perror("sched_setaffinity");
+		exit(EXIT_FAILURE);
+	}
+}
+
+/* Set task scheduling policy and priority */
+static void set_sched(int policy, int priority)
+{
+	struct sched_param param;
+
+	param.sched_priority = priority;
+	if (sched_setscheduler(0, policy, &param) != 0) {
+		perror("sched_setscheduler");
+		exit(EXIT_FAILURE);
+	}
+}
+
+/* Get process runtime from /proc/<pid>/stat */
+static float get_process_runtime(int pid)
+{
+	char path[256];
+	FILE *file;
+	long utime, stime;
+	int fields;
+
+	snprintf(path, sizeof(path), "/proc/%d/stat", pid);
+	file = fopen(path, "r");
+	if (file == NULL) {
+		perror("Failed to open stat file");
+		return -1;
+	}
+
+	/* Skip the first 13 fields and read the 14th and 15th */
+	fields = fscanf(file,
+			"%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
+			&utime, &stime);
+	fclose(file);
+
+	if (fields != 2) {
+		fprintf(stderr, "Failed to read stat file\n");
+		return -1;
+	}
+
+	/* Calculate the total time spent in the process */
+	long total_time = utime + stime;
+	long ticks_per_second = sysconf(_SC_CLK_TCK);
+	float runtime_seconds = total_time * 1.0 / ticks_per_second;
+
+	return runtime_seconds;
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct rt_stall *skel;
+
+	skel = rt_stall__open();
+	SCX_FAIL_IF(!skel, "Failed to open");
+	SCX_ENUM_INIT(skel);
+	SCX_FAIL_IF(rt_stall__load(skel), "Failed to load skel");
+
+	*ctx = skel;
+
+	return SCX_TEST_PASS;
+}
+
+static bool sched_stress_test(void)
+{
+	float cfs_runtime, rt_runtime;
+	int cfs_pid, rt_pid;
+	float expected_min_ratio = 0.04; /* 4% */
+
+	ksft_print_header();
+	ksft_set_plan(1);
+
+	/* Create and set up an EXT task */
+	cfs_pid = fork();
+	if (cfs_pid == 0) {
+		set_affinity(CORE_ID);
+		process_func();
+		exit(0);
+	} else if (cfs_pid < 0) {
+		perror("fork for EXT task");
+		ksft_exit_fail();
+	}
+
+	/* Create an RT task */
+	rt_pid = fork();
+	if (rt_pid == 0) {
+		set_affinity(CORE_ID);
+		set_sched(SCHED_FIFO, 50);
+		process_func();
+		exit(0);
+	} else if (rt_pid < 0) {
+		perror("fork for RT task");
+		ksft_exit_fail();
+	}
+
+	/* Let the processes run for the specified time */
+	sleep(RUN_TIME);
+
+	/* Get runtime for the EXT task */
+	cfs_runtime = get_process_runtime(cfs_pid);
+	if (cfs_runtime != -1)
+		ksft_print_msg("Runtime of EXT task (PID %d) is %f seconds\n", cfs_pid, cfs_runtime);
+	else
+		ksft_exit_fail_msg("Error getting runtime for EXT task (PID %d)\n", cfs_pid);
+
+	/* Get runtime for the RT task */
+	rt_runtime = get_process_runtime(rt_pid);
+	if (rt_runtime != -1)
+		ksft_print_msg("Runtime of RT task (PID %d) is %f seconds\n", rt_pid, rt_runtime);
+	else
+		ksft_exit_fail_msg("Error getting runtime for RT task (PID %d)\n", rt_pid);
+
+	/* Kill the processes */
+	kill(cfs_pid, SIGKILL);
+	kill(rt_pid, SIGKILL);
+	waitpid(cfs_pid, NULL, 0);
+	waitpid(rt_pid, NULL, 0);
+
+	/* Verify that the scx task got enough runtime */
+	float actual_ratio = cfs_runtime / (cfs_runtime + rt_runtime);
+	ksft_print_msg("EXT task got %.2f%% of total runtime\n", actual_ratio * 100);
+
+	if (actual_ratio >= expected_min_ratio) {
+		ksft_test_result_pass("PASS: EXT task got more than %.2f%% of runtime\n",
+				      expected_min_ratio * 100);
+		return true;
+	} else {
+		ksft_test_result_fail("FAIL: EXT task got less than %.2f%% of runtime\n",
+				      expected_min_ratio * 100);
+		return false;
+	}
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct rt_stall *skel = ctx;
+	struct bpf_link *link;
+	bool res;
+
+	link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops);
+	SCX_FAIL_IF(!link, "Failed to attach scheduler");
+
+	res = sched_stress_test();
+
+	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE));
+	bpf_link__destroy(link);
+
+	if (!res)
+		ksft_exit_fail();
+
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct rt_stall *skel = ctx;
+
+	rt_stall__destroy(skel);
+}
+
+struct scx_test rt_stall = {
+	.name = "rt_stall",
+	.description = "Verify that RT tasks cannot stall SCHED_EXT tasks",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&rt_stall)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 00/10] Add a deadline server for sched_ext tasks
  2025-06-13  5:17 [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
                   ` (9 preceding siblings ...)
  2025-06-13  5:17 ` [PATCH v3 10/10] selftests/sched_ext: Add test for sched_ext dl_server Joel Fernandes
@ 2025-06-13 17:35 ` Joel Fernandes
  2025-06-13 18:05   ` Joel Fernandes
  10 siblings, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo, David Vernet, Andrea Righi,
	Changwoo Min, bpf



On 6/13/2025 1:17 AM, Joel Fernandes wrote:
> sched_ext tasks currently are starved by RT hoggers especially since RT
> throttling was replaced by deadline servers to boost only CFS tasks. Several
> users in the community have reported issues with RT stalling sched_ext tasks.
> Add a sched_ext deadline server as well so that sched_ext tasks are also
> boosted and do not suffer starvation.
> 
> A kselftest is also provided to verify the starvation issues are now fixed.
> 
> Btw, there is still something funky going on with CPU hotplug and the
> relinquish patch. Sometimes the sched_ext's hotplug self-test locks up
> (./runner -t hotplug). Reverting that patch fixes it, so I am suspecting
> something is off in dl_server_remove_params() when it is being called on
> offline CPUs.

I made some progress on this sched_ext hotplug test, but I'm not all the way
there yet. Juri, Andrea, Tejun, could you take a look at the below when you get
a chance?

In the hotplug test, when the CPU is brought online, I see the following warning
fire [1]. Basically, dl_server_apply_params() fails with -EBUSY due to overflow
checks.

@@ -1657,8 +1657,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
                u64 runtime =  50 * NSEC_PER_MSEC;
                u64 period = 1000 * NSEC_PER_MSEC;

-               dl_server_apply_params(dl_se, runtime, period, 1);
-
+               WARN_ON_ONCE(dl_server_apply_params(dl_se, runtime, period, 1));
                dl_se->dl_server = 1;
                dl_se->dl_defer = 1;
                setup_new_dl_entity(dl_se);

I dug deeper, and it seems CPU 1 had previously been brought offline and then
back online, and the warning fired during *that* onlining:
enqueue_task_scx() -> dl_server_start() was called, but dl_server_apply_params()
returned -EBUSY.

In dl_server_apply_params() -> __dl_overflow(), it appears dl_bw_cpus() = 0 and
cap = 0. That is really odd and probably the reason for the warning. Is that
because the CPU was offlined earlier and is not yet attached to the root domain?

The question also comes down to why this happens only when my
dl_server_remove_params() is called and not otherwise, and why dl_bw_cpus() is
returning 0 when there are at least two other CPUs online at the time.
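Plugging the logged values into a simplified model of the __dl_overflow()
admission check (a sketch, not the exact kernel code) shows why it trips: with
cap = 0 the capacity-scaled available bandwidth is zero, so any nonzero new_bw
fails admission and dl_server_apply_params() returns -EBUSY:

```python
SCHED_CAPACITY_SHIFT = 10

def cap_scale(bw, cap):
    # Scale the per-CPU bandwidth limit by the (summed) CPU capacity.
    return (bw * cap) >> SCHED_CAPACITY_SHIFT

def dl_overflow(dl_b_bw, cap, total_bw, old_bw, new_bw):
    """Simplified __dl_overflow(): True means admission must fail."""
    if dl_b_bw == -1:
        return False  # bandwidth control disabled
    return cap_scale(dl_b_bw, cap) < total_bw - old_bw + new_bw

# Values from the warning: dl_b->bw=996147, cap=0, total_bw=0, old_bw=0, new_bw=52428
print(dl_overflow(996147, 0, 0, 0, 52428))  # True -> -EBUSY
```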

Anyway, other than this mystery, I fixed all other bandwidth-related warnings
due to dl_server_remove_params() and the updated patch below [2].

[1] Warning:

[   11.878005] DL server bandwidth overflow on CPU 1: dl_b->bw=996147, cap=0,
total_bw=0, old_bw=0, new_bw=52428, dl_bw_cpus=0
[   11.878356] ------------[ cut here ]------------
[   11.878528] WARNING: CPU: 0 PID: 145 at kernel/sched/deadline.c:1670 dl_server_start+0x96/0xa0
[   11.879400] Sched_ext: hotplug_cbs (enabled+all), task: runnable_at=+0ms
[   11.879404] RIP: 0010:dl_server_start+0x96/0xa0
[   11.879732] Code: 53 10 75 1d 49 8b 86 10 0c 00 00 48 8b
[   11.882510] Call Trace:
[   11.882592]  <TASK>
[   11.882685]  enqueue_task_scx+0x190/0x280
[   11.882802]  ttwu_do_activate+0xaa/0x2a0
[   11.882925]  try_to_wake_up+0x371/0x600
[   11.883047]  cpuhp_bringup_ap+0xd6/0x170
[   11.883172]  cpuhp_invoke_callback+0x142/0x540
[   11.883327]  _cpu_up+0x15b/0x270
[   11.883450]  cpu_up+0x52/0xb0
[   11.883576]  cpu_subsys_online+0x32/0x120
[   11.883704]  online_store+0x98/0x130
[   11.883824]  kernfs_fop_write_iter+0xeb/0x170
[   11.883972]  vfs_write+0x2c7/0x430
[   11.884091]  ksys_write+0x70/0xe0
[   11.884209]  do_syscall_64+0xd6/0x250
[   11.884327]  ? clear_bhb_loop+0x40/0x90
[   11.884443]  entry_SYSCALL_64_after_hwframe+0x77/0x7f


[2]: Updated patch "sched/ext: Relinquish DL server reservations when not needed":
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=sched/scx-dlserver-boost-rebase&id=56581c2a6bb8e78593df80ad47520a8399055eae

thanks,

 - Joel


> 
> v2->v3:
>  - Removed code duplication in debugfs. Made ext interface separate.
>  - Fixed issue where rq_lock_irqsave was not used in the relinquish patch.
>  - Fixed running bw accounting issue in dl_server_remove_params.
> 
> Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/
> Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/
> 
> Andrea Righi (1):
>   selftests/sched_ext: Add test for sched_ext dl_server
> 
> Joel Fernandes (9):
>   sched/debug: Fix updating of ppos on server write ops
>   sched/debug: Stop and start server based on if it was active
>   sched/deadline: Clear the defer params
>   sched: Add support to pick functions to take rf
>   sched: Add a server arg to dl_server_update_idle_time()
>   sched/ext: Add a DL server for sched_ext tasks
>   sched/debug: Add support to change sched_ext server params
>   sched/deadline: Add support to remove DL server bandwidth
>   sched/ext: Relinquish DL server reservations when not needed
> 
>  include/linux/sched.h                         |   2 +-
>  kernel/sched/core.c                           |  19 +-
>  kernel/sched/deadline.c                       |  78 +++++--
>  kernel/sched/debug.c                          | 171 +++++++++++---
>  kernel/sched/ext.c                            | 108 ++++++++-
>  kernel/sched/fair.c                           |  15 +-
>  kernel/sched/idle.c                           |   4 +-
>  kernel/sched/rt.c                             |   2 +-
>  kernel/sched/sched.h                          |  13 +-
>  kernel/sched/stop_task.c                      |   2 +-
>  tools/testing/selftests/sched_ext/Makefile    |   1 +
>  .../selftests/sched_ext/rt_stall.bpf.c        |  23 ++
>  tools/testing/selftests/sched_ext/rt_stall.c  | 213 ++++++++++++++++++
>  13 files changed, 579 insertions(+), 72 deletions(-)
>  create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
>  create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
> 



* Re: [PATCH v3 00/10] Add a deadline server for sched_ext tasks
  2025-06-13 17:35 ` [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
@ 2025-06-13 18:05   ` Joel Fernandes
  2025-06-13 22:44     ` Andrea Righi
  0 siblings, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2025-06-13 18:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo, David Vernet, Andrea Righi,
	Changwoo Min, bpf



On 6/13/2025 1:35 PM, Joel Fernandes wrote:
> 
> 
> On 6/13/2025 1:17 AM, Joel Fernandes wrote:
>> sched_ext tasks currently are starved by RT hoggers especially since RT
>> throttling was replaced by deadline servers to boost only CFS tasks. Several
>> users in the community have reported issues with RT stalling sched_ext tasks.
>> Add a sched_ext deadline server as well so that sched_ext tasks are also
>> boosted and do not suffer starvation.
>>
>> A kselftest is also provided to verify the starvation issues are now fixed.
>>
>> Btw, there is still something funky going on with CPU hotplug and the
>> relinquish patch. Sometimes the sched_ext's hotplug self-test locks up
>> (./runner -t hotplug). Reverting that patch fixes it, so I am suspecting
>> something is off in dl_server_remove_params() when it is being called on
>> offline CPUs.
> 
I think I've made some progress with this sched_ext hotplug test, but I'm not
all the way there yet. Juri, Andrea, Tejun, can you take a look at the below
when you get a chance?

The following patch makes the sched_ext hotplug test reliably pass for me now.
Thoughts?

From: Joel Fernandes <joelagnelf@nvidia.com>
Subject: [PATCH] sched/deadline: Prevent setting server as started if params
 couldn't be applied

In the following call trace, dl_server_apply_params() fails because
dl_bw_cpus() is 0 during CPU onlining:

[   11.878356] ------------[ cut here ]------------
[   11.882592]  <TASK>
[   11.882685]  enqueue_task_scx+0x190/0x280
[   11.882802]  ttwu_do_activate+0xaa/0x2a0
[   11.882925]  try_to_wake_up+0x371/0x600
[   11.883047]  cpuhp_bringup_ap+0xd6/0x170
[   11.883172]  cpuhp_invoke_callback+0x142/0x540
[   11.883327]  _cpu_up+0x15b/0x270
[   11.883450]  cpu_up+0x52/0xb0
[   11.883576]  cpu_subsys_online+0x32/0x120
[   11.883704]  online_store+0x98/0x130
[   11.883824]  kernfs_fop_write_iter+0xeb/0x170
[   11.883972]  vfs_write+0x2c7/0x430
[   11.884091]  ksys_write+0x70/0xe0
[   11.884209]  do_syscall_64+0xd6/0x250
[   11.884327]  ? clear_bhb_loop+0x40/0x90
[   11.884443]  entry_SYSCALL_64_after_hwframe+0x77/0x7f

It seems too early to start the server at this point. Simply defer starting
the server to the next enqueue if dl_server_apply_params() returns an error.
In any case, we should not pretend the server has started when its parameters
could not be applied, and doing so appears to interfere with the sched_ext
CPU hotplug test.

With this, the sched_ext hotplug test reliably passes.

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/deadline.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f0cd1dbca4b8..8dd0c6d71489 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1657,8 +1657,8 @@ void dl_server_start(struct sched_dl_entity *dl_se)
                u64 runtime =  50 * NSEC_PER_MSEC;
                u64 period = 1000 * NSEC_PER_MSEC;

-               dl_server_apply_params(dl_se, runtime, period, 1);
-
+               if (dl_server_apply_params(dl_se, runtime, period, 1))
+                       return;
                dl_se->dl_server = 1;
                dl_se->dl_defer = 1;
                setup_new_dl_entity(dl_se);
@@ -1675,7 +1675,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)

 void dl_server_stop(struct sched_dl_entity *dl_se)
 {
-       if (!dl_se->dl_runtime)
+       if (!dl_se->dl_runtime || !dl_se->dl_server_active)
                return;

        dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);


* Re: [PATCH v3 00/10] Add a deadline server for sched_ext tasks
  2025-06-13 18:05   ` Joel Fernandes
@ 2025-06-13 22:44     ` Andrea Righi
  0 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2025-06-13 22:44 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tejun Heo, David Vernet,
	Changwoo Min, bpf

Hi Joel,

On Fri, Jun 13, 2025 at 02:05:03PM -0400, Joel Fernandes wrote:
> 
> 
> On 6/13/2025 1:35 PM, Joel Fernandes wrote:
> > 
> > 
> > On 6/13/2025 1:17 AM, Joel Fernandes wrote:
> >> sched_ext tasks currently are starved by RT hoggers especially since RT
> >> throttling was replaced by deadline servers to boost only CFS tasks. Several
> >> users in the community have reported issues with RT stalling sched_ext tasks.
> >> Add a sched_ext deadline server as well so that sched_ext tasks are also
> >> boosted and do not suffer starvation.
> >>
> >> A kselftest is also provided to verify the starvation issues are now fixed.
> >>
> >> Btw, there is still something funky going on with CPU hotplug and the
> >> relinquish patch. Sometimes the sched_ext's hotplug self-test locks up
> >> (./runner -t hotplug). Reverting that patch fixes it, so I am suspecting
> >> something is off in dl_server_remove_params() when it is being called on
> >> offline CPUs.
> > 
> > I think I got somewhere here with this sched_ext hotplug test but still not
> > there yet. Juri, Andrea, Tejun, can you take a look at the below when you get a
> > chance?
> 
> The following patch makes the sched_ext hotplug test reliably pass for me now.
> Thoughts?

For me it gets stuck here, when the hotplug test tries to bring the CPU
offline:

TEST: hotplug
DESCRIPTION: Verify hotplug behavior
OUTPUT:
[    5.042497] smpboot: CPU 1 is now offline
[    5.069691] sched_ext: BPF scheduler "hotplug_cbs" enabled
[    5.108705] smpboot: Booting Node 0 Processor 1 APIC 0x1
[    5.149484] sched_ext: BPF scheduler "hotplug_cbs" disabled (unregistered from BPF)
EXIT: unregistered from BPF (hotplug event detected (1 going online))
[    5.204500] sched_ext: BPF scheduler "hotplug_cbs" enabled
Failed to bring CPU offline (Device or resource busy)

However, if I don't stop rq->fair_server in the scx_switching_all case,
everything seems to work (though I still don't understand why).

I didn't have much time to look at this today, I'll investigate more
tomorrow.

-Andrea

> 
> From: Joel Fernandes <joelagnelf@nvidia.com>
> Subject: [PATCH] sched/deadline: Prevent setting server as started if params
>  couldn't be applied
> 
> The following call trace fails to set dl_server_apply_params() as
> dl_bw_cpus() is 0 during CPU onlining in the below path.
> 
> [   11.878356] ------------[ cut here ]------------
> [   11.882592]  <TASK>
> [   11.882685]  enqueue_task_scx+0x190/0x280
> [   11.882802]  ttwu_do_activate+0xaa/0x2a0
> [   11.882925]  try_to_wake_up+0x371/0x600
> [   11.883047]  cpuhp_bringup_ap+0xd6/0x170
> [   11.883172]  cpuhp_invoke_callback+0x142/0x540
> [   11.883327]  _cpu_up+0x15b/0x270
> [   11.883450]  cpu_up+0x52/0xb0
> [   11.883576]  cpu_subsys_online+0x32/0x120
> [   11.883704]  online_store+0x98/0x130
> [   11.883824]  kernfs_fop_write_iter+0xeb/0x170
> [   11.883972]  vfs_write+0x2c7/0x430
> [   11.884091]  ksys_write+0x70/0xe0
> [   11.884209]  do_syscall_64+0xd6/0x250
> [   11.884327]  ? clear_bhb_loop+0x40/0x90
> [   11.884443]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> It seems too early to start the server. Simply defer the starting of the
> server to the next enqueue if dl_server_apply_params() returns an error.
> In any case, we should not pretend like the server started and it does
> seem to mess up with the sched_ext CPU hotplug test.
> 
> With this, the sched_ext hotplug test reliably passes.
> 
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
>  kernel/sched/deadline.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index f0cd1dbca4b8..8dd0c6d71489 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1657,8 +1657,8 @@ void dl_server_start(struct sched_dl_entity *dl_se)
>                 u64 runtime =  50 * NSEC_PER_MSEC;
>                 u64 period = 1000 * NSEC_PER_MSEC;
> 
> -               dl_server_apply_params(dl_se, runtime, period, 1);
> -
> +               if (dl_server_apply_params(dl_se, runtime, period, 1))
> +                       return;
>                 dl_se->dl_server = 1;
>                 dl_se->dl_defer = 1;
>                 setup_new_dl_entity(dl_se);
> @@ -1675,7 +1675,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
> 
>  void dl_server_stop(struct sched_dl_entity *dl_se)
>  {
> -       if (!dl_se->dl_runtime)
> +       if (!dl_se->dl_runtime || !dl_se->dl_server_active)
>                 return;
> 
>         dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);


End of thread (newest message: 2025-06-13 22:44 UTC)

Thread overview: 14+ messages
2025-06-13  5:17 [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
2025-06-13  5:17 ` [PATCH v3 01/10] sched/debug: Fix updating of ppos on server write ops Joel Fernandes
2025-06-13  5:17 ` [PATCH v3 02/10] sched/debug: Stop and start server based on if it was active Joel Fernandes
2025-06-13  5:17 ` [PATCH v3 03/10] sched/deadline: Clear the defer params Joel Fernandes
2025-06-13  5:17 ` [PATCH v3 04/10] sched: Add support to pick functions to take rf Joel Fernandes
2025-06-13  5:17 ` [PATCH v3 05/10] sched: Add a server arg to dl_server_update_idle_time() Joel Fernandes
2025-06-13  5:17 ` [PATCH v3 06/10] sched/ext: Add a DL server for sched_ext tasks Joel Fernandes
2025-06-13  5:17 ` [PATCH v3 07/10] sched/debug: Add support to change sched_ext server params Joel Fernandes
2025-06-13  5:17 ` [PATCH v3 08/10] sched/deadline: Add support to remove DL server bandwidth Joel Fernandes
2025-06-13  5:17 ` [PATCH v3 09/10] sched/ext: Relinquish DL server reservations when not needed Joel Fernandes
2025-06-13  5:17 ` [PATCH v3 10/10] selftests/sched_ext: Add test for sched_ext dl_server Joel Fernandes
2025-06-13 17:35 ` [PATCH v3 00/10] Add a deadline server for sched_ext tasks Joel Fernandes
2025-06-13 18:05   ` Joel Fernandes
2025-06-13 22:44     ` Andrea Righi
