public inbox for linux-kernel@vger.kernel.org
* [PATCH V7 0/9] SCHED_DEADLINE server infrastructure
@ 2024-05-27 12:06 Daniel Bristot de Oliveira
  2024-05-27 12:06 ` [PATCH V7 1/9] sched/deadline: Comment sched_dl_entity::dl_server variable Daniel Bristot de Oliveira
                   ` (10 more replies)
  0 siblings, 11 replies; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-05-27 12:06 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld, Suleiman Souhlal,
	Youssef Esmat

This is v7 of Peter's SCHED_DEADLINE server infrastructure
implementation [1].

SCHED_DEADLINE servers can help fix starvation issues of low-priority
tasks (e.g., SCHED_OTHER) when higher-priority tasks monopolize CPU
cycles. Today we have RT Throttling; DEADLINE servers should be able to
replace and improve that.
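
For reference, RT throttling is configured today via sysctl knobs; the
values below are the usual defaults, meaning RT tasks can consume at
most 950 ms out of every 1 s, leaving 50 ms for everything else:

 ~# sysctl kernel.sched_rt_period_us kernel.sched_rt_runtime_us
 kernel.sched_rt_period_us = 1000000
 kernel.sched_rt_runtime_us = 950000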

In v1, there was discussion about the consequences of using
deadline-based servers on fixed-priority workloads. As a demonstration,
here is the baseline of timerlat scheduling latency as-is, with a
kernel build as background workload:

 # rtla timerlat top -u -d 10m

  --------------------- %< ------------------------
                                     Timer Latency
  0 01:42:24   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)      |    Ret user Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max |      cur       min       avg       max
  0 #6143559   |        0         0         0        92 |        2         1         3        98 |        4         1         5       100
  1 #6143559   |        1         0         0        97 |        7         1         5       101 |        9         1         7       103
  2 #6143559   |        0         0         0        88 |        3         1         5        95 |        5         1         7        99
  3 #6143559   |        0         0         0        90 |        6         1         5       103 |       10         1         7       126
  4 #6143558   |        1         0         0        81 |        7         1         4        86 |        9         1         7        90
  5 #6143558   |        0         0         0        74 |        3         1         5        79 |        4         1         7        83
  6 #6143558   |        0         0         0        83 |        2         1         5        89 |        3         0         7       108
  7 #6143558   |        0         0         0        85 |        3         1         4       126 |        5         1         6       137
  --------------------- >% ------------------------

And this is the same test with the DL server activating without any delay:
  --------------------- %< ------------------------
  0 00:10:01   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)      |    Ret user Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max |      cur       min       avg       max
  0 #579147    |        0         0         0        54 |        2         1        52     61095 |        2         2        56     61102
  1 #578766    |        0         0         0        83 |        2         1        49     55824 |        3         2        53     55831
  2 #578559    |        0         0         1        59 |        2         1        50     55760 |        3         2        54     55770
  3 #578318    |        0         0         0        76 |        2         1        49     55751 |        3         2        54     55760
  4 #578611    |        0         0         0        64 |        2         1        49     55811 |        3         2        53     55820
  5 #578347    |        0         0         1        40 |        2         1        50     56121 |        3         2        55     56133
  6 #578938    |        0         0         1        75 |        2         1        49     55755 |        3         2        53     55764
  7 #578631    |        0         0         1        36 |        3         1        51     55528 |        4         2        55     55541
  --------------------- >% ------------------------

The problem with a DL-server-only implementation is that FIFO tasks
might suffer preemption from NORMAL tasks even when spare CPU cycles are
available. In fact, the fair deadline server is enqueued right away when
NORMAL tasks wake up, and they are first scheduled by the server, thus
potentially preempting a well-behaving FIFO task. This is of course not
ideal.

We had discussions about it, and one of the possibilities would be
using a different scheduling algorithm for this. But IMHO that is
overkill.

Juri and I discussed this and thought about delaying the server
activation by (period - runtime), thus enabling the server only if the
fair scheduler is about to starve. We called it the defer server.

The defer mode postpones the server start to the (absolute deadline -
runtime) point in time. This is achieved by starting the dl server
throttled, with the next replenishing time set to activate the server
at (absolute deadline - runtime).

The server is enqueued with the runtime replenished. As the fair
scheduler runs without boost, its runtime is consumed. If the fair
server consumes all of its runtime before the (deadline - runtime)
time, a new period is set, and the timer is armed for the new
deadline.
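
As an illustration (numbers picked only to make the timing concrete,
not the defaults), consider a server with runtime = 50 ms and
period = deadline = 1 s, started at t = 0 when fair tasks wake up on a
CPU monopolized by FIFO tasks:

	t = 0:          the server starts throttled, deadline = 1 s,
	                timer armed for (deadline - runtime) = 950 ms
	t in [0, 950):  time the fair tasks manage to run consumes the
	                50 ms of runtime in background; if it is fully
	                consumed, the server is replenished while still
	                throttled, and the timer is pushed to the new
	                (deadline - runtime)
	t = 950 ms:     the timer fires with runtime left: the server is
	                enqueued and boosts the fair tasks for up to
	                50 ms, until the 1 s deadline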

The interface is per CPU and has two knobs:
	fair_server_runtime (950 ms)
	fair_server_period  (1s)
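
For example, assuming debugfs is mounted at /sys/kernel/debug (the
paths below are illustrative), the knobs can be set per CPU, in ns:

  --------------------- %< ------------------------
 ~# echo 50000000   > /sys/kernel/debug/sched/fair_server/cpu0/runtime
 ~# echo 1000000000 > /sys/kernel/debug/sched/fair_server/cpu0/period
  --------------------- >% ------------------------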

With defer enabled on CPUs [0:3], the results get better, having a
behavior similar to the one we have with RT throttling.

  --------------------- %< ------------------------
                                     Timer Latency                                                                                       
  0 00:10:01   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)      |    Ret user Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max |      cur       min       avg       max
  0 #599979    |        0         0         0        64 |        4         1         4        67 |        6         1         5        69
  1 #599979    |        0         0         1        17 |        6         1         5        50 |       10         2         7        71
  2 #599984    |        1         0         1        22 |        4         1         5        78 |        5         2         7       107
  3 #599986    |        0         0         1        72 |        7         1         5        79 |       10         2         7        82
  4 #581580    |        1         0         1        37 |        6         1        38     52797 |       10         2        41     52805
  5 #583270    |        1         0         1        41 |        9         1        36     52617 |       12         2        38     52623
  6 #581240    |        0         0         1        25 |        7         1        39     52870 |       11         2        41     52876
  7 #581208    |        0         0         1        69 |        6         1        39     52917 |        9         2        41     52923
  --------------------- >% ------------------------

Here are some osnoise measurements, with osnoise threads running as
FIFO:1 in different setups (defer enabled):
 - CPU 2 isolated
 - CPU 3 isolated shared with a CFS busy loop task
 - CPU 8 non-isolated
 - CPU 9 non-isolated shared with a CFS busy loop task

  --------------------- %< ------------------------
 ~# pgrep ktimer | while read pid; do chrt -p -f 2 $pid; done # for RT kernel
 ~# tuna  isolate -c 2
 ~# tuna  isolate -c 3
 ~# taskset -c 3 ./f &
 ~# taskset -c 9 ./f &
 ~# osnoise -P f:1 -c 2,3,8,9 -T 1 -d 10m -H 1
                                          Operating System Noise
duration:   0 00:10:00 | time is in us
CPU Period       Runtime        Noise  % CPU Aval   Max Noise   Max Single          HW          NMI          IRQ      Softirq       Thread
  2 #599       599000000          178    99.99997          18            2           0            0          270            0            0
  3 #598       598054434     31351553    94.75774      104442       104442           0            0      2837523            0         1794
  8 #599       599000001       567456    99.90526        3260         2375           2           89       620490            0        13539
  9 #598       598021196     31742537    94.69207       71707        53357           0           90      3411023            0         1762
   --------------------- >% ------------------------

The system runs fine!
	- no crashes (famous last words)
	- the FIFO property is kept
	- per-CPU interface, because it is more flexible - and to detach this
	  from the throttling concept.
	In addition:
	 - This version has been on my korg repo for three weeks without any
	   0-day robot complaining.
	 - no regressions found on basic CFS tests, like kernel compilation.
	 - also tested with PREEMPT_RT 6.9-rt.

Global scheduling breaks only if the fair server activates (which is
fair, as the RT tasks are not behaving anyway).

The selftest mentioned in the sched/core patches is here:
	https://lore.kernel.org/all/20240313012451.1693807-8-joel@joelfernandes.org/

Thanks to the people at Google for testing/suggestions:
 - Suleiman Souhlal <suleiman@google.com>
 - Youssef Esmat <youssefesmat@google.com>
 - Joel Fernandes (Google) <joel@joelfernandes.org>
 - Vineeth Pillai <vineeth@bitbyteword.org>

Changes from v6:
  - Rebased on top of v6.10-rc1
  - Improved comments (Daniel)
  - Fix division by 0 on adding bw to the rq (Daniel)
  - Use guard and scoped_guard (Peter)
  - Remove the defer knob (Peter)
  - Adjusted comments and code styling (Peter)
  - Be aware of cfs throttling (Vineeth)
  - Split the fixes and the feature from the basic fair server (Peter)
Changes from v5:
  - Fixes DL server for core scheduling (patches 3 and 4)
  (Joel/Vineeth/Suleiman)
  - Add a function to attach the fair server bandwidth to the
    new root domain when the rq changes root domain (Daniel)
  - Postpone the replenishment timer of the server if the defer reservation
    could consume runtime while waiting to boost (patch 2) (Daniel)
  - Add the running state to defer mode to avoid forcing the defer mechanism
    if the server continues to be activated due to starvation (patch 2)
    (Daniel)
  - Consider idle time as time for the defer server to avoid penalty on RT
    tasks (patch 2) (Daniel)
  - Mark DL server as unthrottled before enqueue (patch 2)(Joel)
  - Make start_dl_timer callers more robust (patch 2) (Joel)
  - Do not restart the DL server on replenish from timer (patch 2)(Joel)
  - Fix reversed args to dl_time_before in replenish (patch 2) (Suleiman Souhlal)
  - Removed the negative runtime optimization (patch 2)
  - Start the dl server as disabled in patch 1, enabling it only after
    removing the RT throttling, to avoid having two mechanisms together
    by default (patch 1) (Daniel).
  - Added a need-resched check after dl_server_start (patch 1) (Daniel)
  - Reset the dl_server pointer at put_prev_task_balance (patch 1) (Joel)
  - Do not include the already merged patches
  - Rebased to 6.9-rc2
Changes from v4:
  - Enable the server when the number of fair tasks is > 0 (Peter)
  - Consume runtime if the zerolax server is not boosted (peterz)
  - Adjust interface to deal with admission control (peterz)
  - Rebased to 6.6
Changes from v3:
  - Add the defer server (Daniel)
  - Add a per-rq interface (Daniel, with Peter's feedback)
  - Add an option to not defer the server
  - Typos and 1-liner fixes (Valentin, Luca, Peter)
  - Fair scheduler running on dl server does not account as RT task (Daniel)
  - Changed the condition to enable the server (RT & fair tasks) (Daniel)
Changes from v2:
  - Refactor/rephrase/typos changes
  - Deferrable server using throttling
  - The server starts when RT && Fair tasks are enqueued
  - Interface with runtime/period/defer option
Changes from v1:
  - rebased on 6.4-rc1 tip/sched/core

Daniel Bristot de Oliveira (3):
  sched/deadline: Comment sched_dl_entity::dl_server variable
  sched/deadline: Deferrable dl server
  sched/fair: Fair server interface

Joel Fernandes (Google) (3):
  sched/core: Add clearing of ->dl_server in put_prev_task_balance()
  sched/core: Fix priority checking for DL server picks
  sched/core: Fix picking of tasks for core scheduling with DL server

Peter Zijlstra (2):
  sched/fair: Add trivial fair server
  sched/rt: Remove default bandwidth control

Youssef Esmat (1):
  sched/core: Clear prev->dl_server in CFS pick fast path

 include/linux/sched.h   |  17 +-
 kernel/sched/core.c     |  56 +++--
 kernel/sched/deadline.c | 449 ++++++++++++++++++++++++++++++++++------
 kernel/sched/debug.c    | 162 +++++++++++++++
 kernel/sched/fair.c     |  75 ++++++-
 kernel/sched/idle.c     |   2 +
 kernel/sched/rt.c       | 242 ++++++++++------------
 kernel/sched/sched.h    |  17 +-
 kernel/sched/topology.c |   8 +
 9 files changed, 813 insertions(+), 215 deletions(-)

-- 
2.45.1



* [PATCH V7 1/9] sched/deadline: Comment sched_dl_entity::dl_server variable
  2024-05-27 12:06 [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
@ 2024-05-27 12:06 ` Daniel Bristot de Oliveira
  2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Daniel Bristot de Oliveira
  2024-05-27 12:06 ` [PATCH V7 2/9] sched/core: Add clearing of ->dl_server in put_prev_task_balance() Daniel Bristot de Oliveira
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-05-27 12:06 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld, Suleiman Souhlal,
	Youssef Esmat, stable

Add an explanation for the newly added variable.

Cc: stable@vger.kernel.org
Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 include/linux/sched.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac6eab6..abce356932cc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -637,6 +637,8 @@ struct sched_dl_entity {
 	 *
 	 * @dl_overrun tells if the task asked to be informed about runtime
 	 * overruns.
+	 *
+	 * @dl_server tells if this is a server entity.
 	 */
 	unsigned int			dl_throttled      : 1;
 	unsigned int			dl_yielded        : 1;
-- 
2.45.1



* [PATCH V7 2/9] sched/core: Add clearing of ->dl_server in put_prev_task_balance()
  2024-05-27 12:06 [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
  2024-05-27 12:06 ` [PATCH V7 1/9] sched/deadline: Comment sched_dl_entity::dl_server variable Daniel Bristot de Oliveira
@ 2024-05-27 12:06 ` Daniel Bristot de Oliveira
  2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
  2024-05-27 12:06 ` [PATCH V7 3/9] sched/core: Clear prev->dl_server in CFS pick fast path Daniel Bristot de Oliveira
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-05-27 12:06 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld, Suleiman Souhlal,
	Youssef Esmat, stable

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>

Paths using put_prev_task_balance() need to do a pick shortly
after. Make sure they also clear the ->dl_server on prev as a
part of that.

Cc: stable@vger.kernel.org
Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/core.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcf2c4cc0522..08c409457152 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6006,6 +6006,14 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
 #endif
 
 	put_prev_task(rq, prev);
+
+	/*
+	 * We've updated @prev and no longer need the server link, clear it.
+	 * Must be done before ->pick_next_task() because that can (re)set
+	 * ->dl_server.
+	 */
+	if (prev->dl_server)
+		prev->dl_server = NULL;
 }
 
 /*
@@ -6049,14 +6057,6 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 restart:
 	put_prev_task_balance(rq, prev, rf);
 
-	/*
-	 * We've updated @prev and no longer need the server link, clear it.
-	 * Must be done before ->pick_next_task() because that can (re)set
-	 * ->dl_server.
-	 */
-	if (prev->dl_server)
-		prev->dl_server = NULL;
-
 	for_each_class(class) {
 		p = class->pick_next_task(rq);
 		if (p)
-- 
2.45.1



* [PATCH V7 3/9] sched/core: Clear prev->dl_server in CFS pick fast path
  2024-05-27 12:06 [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
  2024-05-27 12:06 ` [PATCH V7 1/9] sched/deadline: Comment sched_dl_entity::dl_server variable Daniel Bristot de Oliveira
  2024-05-27 12:06 ` [PATCH V7 2/9] sched/core: Add clearing of ->dl_server in put_prev_task_balance() Daniel Bristot de Oliveira
@ 2024-05-27 12:06 ` Daniel Bristot de Oliveira
  2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Youssef Esmat
  2024-05-27 12:06 ` [PATCH V7 4/9] sched/fair: Add trivial fair server Daniel Bristot de Oliveira
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-05-27 12:06 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld, Suleiman Souhlal,
	Youssef Esmat, stable

From: Youssef Esmat <youssefesmat@google.com>

In case the previous pick was a DL server pick, ->dl_server might be
set. Clear it in the fast path as well.

Cc: stable@vger.kernel.org
Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Youssef Esmat <youssefesmat@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 08c409457152..6d01863f93ca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6044,6 +6044,13 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			p = pick_next_task_idle(rq);
 		}
 
+		/*
+		 * This is a normal CFS pick, but the previous could be a DL pick.
+		 * Clear it as previous is no longer picked.
+		 */
+		if (prev->dl_server)
+			prev->dl_server = NULL;
+
 		/*
 		 * This is the fast path; it cannot be a DL server pick;
 		 * therefore even if @p == @prev, ->dl_server must be NULL.
-- 
2.45.1



* [PATCH V7 4/9] sched/fair: Add trivial fair server
  2024-05-27 12:06 [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (2 preceding siblings ...)
  2024-05-27 12:06 ` [PATCH V7 3/9] sched/core: Clear prev->dl_server in CFS pick fast path Daniel Bristot de Oliveira
@ 2024-05-27 12:06 ` Daniel Bristot de Oliveira
  2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2024-05-27 12:06 ` [PATCH V7 5/9] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-05-27 12:06 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld, Suleiman Souhlal,
	Youssef Esmat

From: Peter Zijlstra <peterz@infradead.org>

Use deadline servers to service fair tasks.

This patch adds a fair_server deadline entity which acts as a container
for fair entities and can be used to fix starvation when higher priority
(wrt fair) tasks are monopolizing CPU(s).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/core.c     |  1 +
 kernel/sched/deadline.c | 23 +++++++++++++++++++++++
 kernel/sched/fair.c     | 34 ++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h    |  4 ++++
 4 files changed, 62 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d01863f93ca..53f0470a1d0a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10057,6 +10057,7 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+		fair_server_init(rq);
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = rq;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c75d1307d86d..b69d6c3e1587 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1381,6 +1381,13 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 			resched_curr(rq);
 	}
 
+	/*
+	 * The fair server (sole dl_server) does not account for real-time
+	 * workload because it is running fair work.
+	 */
+	if (dl_se == &rq->fair_server)
+		return;
+
 	/*
 	 * Because -- for now -- we share the rt bandwidth, we need to
 	 * account our runtime there too, otherwise actual rt tasks
@@ -1414,15 +1421,31 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 
 void dl_server_start(struct sched_dl_entity *dl_se)
 {
+	struct rq *rq = dl_se->rq;
+
 	if (!dl_server(dl_se)) {
+		/* Disabled */
+		dl_se->dl_runtime = 0;
+		dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
+		dl_se->dl_period = 1000 * NSEC_PER_MSEC;
+
 		dl_se->dl_server = 1;
 		setup_new_dl_entity(dl_se);
 	}
+
+	if (!dl_se->dl_runtime)
+		return;
+
 	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
+	if (!dl_task(rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
+		resched_curr(rq);
 }
 
 void dl_server_stop(struct sched_dl_entity *dl_se)
 {
+	if (!dl_se->dl_runtime)
+		return;
+
 	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..2d5d3e6c1e72 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5766,6 +5766,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 	long task_delta, idle_task_delta, dequeue = 1;
+	long rq_h_nr_running = rq->cfs.h_nr_running;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -5838,6 +5839,9 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	sub_nr_running(rq, task_delta);
 
 done:
+	/* Stop the fair server if throttling resulted in no runnable tasks */
+	if (rq_h_nr_running && !rq->cfs.h_nr_running)
+		dl_server_stop(&rq->fair_server);
 	/*
 	 * Note: distribution will already see us throttled via the
 	 * throttled-list.  rq->lock protects completion.
@@ -5855,6 +5859,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 	long task_delta, idle_task_delta;
+	long rq_h_nr_running = rq->cfs.h_nr_running;
 
 	se = cfs_rq->tg->se[cpu_of(rq)];
 
@@ -5930,6 +5935,10 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 unthrottle_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
+	/* Start the fair server if un-throttling resulted in new runnable tasks */
+	if (!rq_h_nr_running && rq->cfs.h_nr_running)
+		dl_server_start(&rq->fair_server);
+
 	/* Determine whether we need to wake up potentially idle CPU: */
 	if (rq->curr == rq->idle && rq->cfs.nr_running)
 		resched_curr(rq);
@@ -6760,6 +6769,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
+	if (!throttled_hierarchy(task_cfs_rq(p)) && !rq->cfs.h_nr_running)
+		dl_server_start(&rq->fair_server);
+
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
 	 * utilization updates, so do it here explicitly with the IOWAIT flag
@@ -6904,6 +6916,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		rq->next_balance = jiffies;
 
 dequeue_throttle:
+	if (!throttled_hierarchy(task_cfs_rq(p)) && !rq->cfs.h_nr_running)
+		dl_server_stop(&rq->fair_server);
+
 	util_est_update(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
 }
@@ -8607,6 +8622,25 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq)
 	return pick_next_task_fair(rq, NULL, NULL);
 }
 
+static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
+{
+	return !!dl_se->rq->cfs.nr_running;
+}
+
+static struct task_struct *fair_server_pick(struct sched_dl_entity *dl_se)
+{
+	return pick_next_task_fair(dl_se->rq, NULL, NULL);
+}
+
+void fair_server_init(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se = &rq->fair_server;
+
+	init_dl_entity(dl_se);
+
+	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+}
+
 /*
  * Account for a descheduled task:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a831af102070..39c9669b23a7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -356,6 +356,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick);
 
+extern void fair_server_init(struct rq *rq);
+
 #ifdef CONFIG_CGROUP_SCHED
 
 struct cfs_rq;
@@ -1037,6 +1039,8 @@ struct rq {
 	struct rt_rq		rt;
 	struct dl_rq		dl;
 
+	struct sched_dl_entity	fair_server;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this CPU: */
 	struct list_head	leaf_cfs_rq_list;
-- 
2.45.1



* [PATCH V7 5/9] sched/deadline: Deferrable dl server
  2024-05-27 12:06 [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (3 preceding siblings ...)
  2024-05-27 12:06 ` [PATCH V7 4/9] sched/fair: Add trivial fair server Daniel Bristot de Oliveira
@ 2024-05-27 12:06 ` Daniel Bristot de Oliveira
  2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Daniel Bristot de Oliveira
  2024-05-27 12:06 ` [PATCH V7 6/9] sched/fair: Fair server interface Daniel Bristot de Oliveira
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-05-27 12:06 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld, Suleiman Souhlal,
	Youssef Esmat

Among the motivations for the DL servers is replacing the real-time
throttling mechanism. That mechanism works by throttling the rt_rq
after it runs for a long period without leaving space for fair tasks.

The base dl server avoids this problem by boosting fair tasks instead
of throttling the rt_rq. The point is that it boosts without waiting
for potential starvation, causing some non-intuitive cases.

For example, an IRQ dispatches two tasks on an idle system, a fair one
and an RT one. The DL server will be activated, running the fair task
before the RT one. This problem can be avoided by deferring the dl
server activation.

By setting the defer option, the dl_server will dispatch a
SCHED_DEADLINE reservation with replenished runtime, but throttled.

The dl_timer will be set for the defer time at (period - runtime) ns
from the start time, thus boosting the fair rq at the defer time.

If the fair scheduler has the opportunity to run while waiting for the
defer time, the dl server runtime will be consumed. If the runtime is
completely consumed before the defer time, the server will be
replenished while still in a throttled state. Then, the dl_timer will
be reset to the new defer time.

If the fair server reaches the defer time without consuming its
runtime, the server will start running, following CBS rules (thus
without breaking SCHED_DEADLINE). Then the server will continue in the
running state (without deferring) until its fair tasks are able to
execute as the regular fair scheduler (end of the starvation).
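
In summary, the new flags (documented in the sched.h hunk below) work
like this:

	dl_defer:          the server is a deferred server
	dl_defer_armed:    the server is throttled, waiting for the
	                   replenishment timer to fire at
	                   (deadline - runtime)
	dl_defer_running:  starvation took place, and the server is
	                   actually running, skipping the defer phase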

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 include/linux/sched.h   |  12 ++
 kernel/sched/deadline.c | 301 ++++++++++++++++++++++++++++++++++------
 kernel/sched/fair.c     |  24 +++-
 kernel/sched/idle.c     |   2 +
 kernel/sched/sched.h    |   4 +-
 5 files changed, 298 insertions(+), 45 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index abce356932cc..611771fec4df 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -639,12 +639,24 @@ struct sched_dl_entity {
 	 * overruns.
 	 *
 	 * @dl_server tells if this is a server entity.
+	 *
+	 * @dl_defer tells if this is a deferred or regular server. For
+	 * now, only the defer server exists.
+	 *
+	 * @dl_defer_armed tells if the deferrable server is waiting
+	 * for the replenishment timer to activate it.
+	 *
+	 * @dl_defer_running tells if the deferrable server is actually
+	 * running, skipping the defer phase.
 	 */
 	unsigned int			dl_throttled      : 1;
 	unsigned int			dl_yielded        : 1;
 	unsigned int			dl_non_contending : 1;
 	unsigned int			dl_overrun	  : 1;
 	unsigned int			dl_server         : 1;
+	unsigned int			dl_defer	  : 1;
+	unsigned int			dl_defer_armed	  : 1;
+	unsigned int			dl_defer_running  : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b69d6c3e1587..eddfe18d9762 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -771,6 +771,15 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
 	/* for non-boosted task, pi_of(dl_se) == dl_se */
 	dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
 	dl_se->runtime = pi_of(dl_se)->dl_runtime;
+
+	/*
+	 * If it is a deferred reservation, and the server
+	 * is not handling a starvation case, defer it.
+	 */
+	if (dl_se->dl_defer && !dl_se->dl_defer_running) {
+		dl_se->dl_throttled = 1;
+		dl_se->dl_defer_armed = 1;
+	}
 }
 
 /*
@@ -809,6 +818,9 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
 	replenish_dl_new_period(dl_se, rq);
 }
 
+static int start_dl_timer(struct sched_dl_entity *dl_se);
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
+
 /*
  * Pure Earliest Deadline First (EDF) scheduling does not deal with the
  * possibility of a entity lasting more than what it declared, and thus
@@ -837,9 +849,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	/*
 	 * This could be the case for a !-dl task that is boosted.
 	 * Just go with full inherited parameters.
+	 *
+	 * Or, it could be the case of a deferred reservation that
+	 * was not able to consume its runtime in background and
+	 * reached this point with current u > U.
+	 *
+	 * In both cases, set a new period.
 	 */
-	if (dl_se->dl_deadline == 0)
-		replenish_dl_new_period(dl_se, rq);
+	if (dl_se->dl_deadline == 0 ||
+	    (dl_se->dl_defer_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
+		dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
+		dl_se->runtime = pi_of(dl_se)->dl_runtime;
+	}
 
 	if (dl_se->dl_yielded && dl_se->runtime > 0)
 		dl_se->runtime = 0;
@@ -873,6 +894,44 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 		dl_se->dl_yielded = 0;
 	if (dl_se->dl_throttled)
 		dl_se->dl_throttled = 0;
+
+	/*
+	 * If this is the replenishment of a deferred reservation,
+	 * clear the flag and return.
+	 */
+	if (dl_se->dl_defer_armed) {
+		dl_se->dl_defer_armed = 0;
+		return;
+	}
+
+	/*
+	 * At this point, if the deferred server is not armed, and the deadline
+	 * is in the future, if it is not running already, throttle the server
+	 * and arm the defer timer.
+	 */
+	if (dl_se->dl_defer && !dl_se->dl_defer_running &&
+	    dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
+		if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
+
+			/*
+			 * Set dl_se->dl_defer_armed and dl_throttled variables to
+			 * inform the start_dl_timer() that this is a deferred
+			 * activation.
+			 */
+			dl_se->dl_defer_armed = 1;
+			dl_se->dl_throttled = 1;
+			if (!start_dl_timer(dl_se)) {
+				/*
+				 * If for whatever reason (delays), a previous timer was
+				 * queued but not serviced, cancel it and clean the
+				 * deferrable server variables intended for start_dl_timer().
+				 */
+				hrtimer_try_to_cancel(&dl_se->dl_timer);
+				dl_se->dl_defer_armed = 0;
+				dl_se->dl_throttled = 0;
+			}
+		}
+	}
 }
 
 /*
@@ -1023,6 +1082,15 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 		}
 
 		replenish_dl_new_period(dl_se, rq);
+	} else if (dl_server(dl_se) && dl_se->dl_defer) {
+		/*
+		 * The server can still use its previous deadline, so check if
+		 * it left the dl_defer_running state.
+		 */
+		if (!dl_se->dl_defer_running) {
+			dl_se->dl_defer_armed = 1;
+			dl_se->dl_throttled = 1;
+		}
 	}
 }
 
@@ -1055,8 +1123,21 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	 * We want the timer to fire at the deadline, but considering
 	 * that it is actually coming from rq->clock and not from
 	 * hrtimer's time base reading.
+	 *
+	 * The deferred reservation will have its timer set to
+	 * (deadline - runtime). At that point, the CBS rule will decide
+	 * if the current deadline can be used, or if a replenishment is
+	 * required to avoid adding too much pressure on the system
+	 * (current u > U).
 	 */
-	act = ns_to_ktime(dl_next_period(dl_se));
+	if (dl_se->dl_defer_armed) {
+		WARN_ON_ONCE(!dl_se->dl_throttled);
+		act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
+	} else {
+		/* act = deadline - rel-deadline + period */
+		act = ns_to_ktime(dl_next_period(dl_se));
+	}
+
 	now = hrtimer_cb_get_time(timer);
 	delta = ktime_to_ns(now) - rq_clock(rq);
 	act = ktime_add_ns(act, delta);
@@ -1106,6 +1187,62 @@ static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
 #endif
 }
 
+/* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
+static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
+
+static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
+{
+	struct rq *rq = rq_of_dl_se(dl_se);
+	u64 fw;
+
+	scoped_guard (rq_lock, rq) {
+		struct rq_flags *rf = &scope.rf;
+
+		if (!dl_se->dl_throttled || !dl_se->dl_runtime)
+			return HRTIMER_NORESTART;
+
+		sched_clock_tick();
+		update_rq_clock(rq);
+
+		if (!dl_se->dl_runtime)
+			return HRTIMER_NORESTART;
+
+		if (!dl_se->server_has_tasks(dl_se)) {
+			replenish_dl_entity(dl_se);
+			return HRTIMER_NORESTART;
+		}
+
+		if (dl_se->dl_defer_armed) {
+			/*
+			 * First check if the server could consume runtime in background.
+			 * If so, it is possible to push the defer timer for this amount
+			 * of time. The dl_server_min_res serves as a limit to avoid
+			 * forwarding the timer for a too small amount of time.
+			 */
+			if (dl_time_before(rq_clock(dl_se->rq),
+					   (dl_se->deadline - dl_se->runtime - dl_server_min_res))) {
+
+				/* reset the defer timer */
+				fw = dl_se->deadline - rq_clock(dl_se->rq) - dl_se->runtime;
+
+				hrtimer_forward_now(timer, ns_to_ktime(fw));
+				return HRTIMER_RESTART;
+			}
+
+			dl_se->dl_defer_running = 1;
+		}
+
+		enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+
+		if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &dl_se->rq->curr->dl))
+			resched_curr(rq);
+
+		__push_dl_task(rq, rf);
+	}
+
+	return HRTIMER_NORESTART;
+}
+
 /*
  * This is the bandwidth enforcement timer callback. If here, we know
  * a task is not on its dl_rq, since the fact that the timer was running
@@ -1128,28 +1265,8 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	struct rq_flags rf;
 	struct rq *rq;
 
-	if (dl_server(dl_se)) {
-		struct rq *rq = rq_of_dl_se(dl_se);
-		struct rq_flags rf;
-
-		rq_lock(rq, &rf);
-		if (dl_se->dl_throttled) {
-			sched_clock_tick();
-			update_rq_clock(rq);
-
-			if (dl_se->server_has_tasks(dl_se)) {
-				enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
-				resched_curr(rq);
-				__push_dl_task(rq, &rf);
-			} else {
-				replenish_dl_entity(dl_se);
-			}
-
-		}
-		rq_unlock(rq, &rf);
-
-		return HRTIMER_NORESTART;
-	}
+	if (dl_server(dl_se))
+		return dl_server_timer(timer, dl_se);
 
 	p = dl_task_of(dl_se);
 	rq = task_rq_lock(p, &rf);
@@ -1319,22 +1436,10 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 	return (delta * u_act) >> BW_SHIFT;
 }
 
-static inline void
-update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
-                        int flags);
-static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
+s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
 {
 	s64 scaled_delta_exec;
 
-	if (unlikely(delta_exec <= 0)) {
-		if (unlikely(dl_se->dl_yielded))
-			goto throttle;
-		return;
-	}
-
-	if (dl_entity_is_special(dl_se))
-		return;
-
 	/*
 	 * For tasks that participate in GRUB, we implement GRUB-PA: the
 	 * spare reclaimed bandwidth is used to clock down frequency.
@@ -1353,8 +1458,64 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 		scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
 	}
 
+	return scaled_delta_exec;
+}
+
+static inline void
+update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
+			int flags);
+static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
+{
+	s64 scaled_delta_exec;
+
+	if (unlikely(delta_exec <= 0)) {
+		if (unlikely(dl_se->dl_yielded))
+			goto throttle;
+		return;
+	}
+
+	if (dl_server(dl_se) && dl_se->dl_throttled && !dl_se->dl_defer)
+		return;
+
+	if (dl_entity_is_special(dl_se))
+		return;
+
+	scaled_delta_exec = dl_scaled_delta_exec(rq, dl_se, delta_exec);
+
 	dl_se->runtime -= scaled_delta_exec;
 
+	/*
+	 * The fair server can consume its runtime while throttled (not queued/
+	 * running as regular CFS).
+	 *
+	 * If the server consumes its entire runtime in this state, the server
+	 * is not required for the current period. Thus, reset the server by
+	 * starting a new period, pushing the activation.
+	 */
+	if (dl_se->dl_defer && dl_se->dl_throttled && dl_runtime_exceeded(dl_se)) {
+		/*
+		 * If the server was previously activated - the starving condition
+		 * took place - at this point it went away because the fair scheduler
+		 * was able to get runtime in background. So return to the initial
+		 * state.
+		 */
+		dl_se->dl_defer_running = 0;
+
+		hrtimer_try_to_cancel(&dl_se->dl_timer);
+
+		replenish_dl_new_period(dl_se, dl_se->rq);
+
+		/*
+		 * Not being able to start the timer seems problematic. If it could not
+		 * be started for whatever reason, we need to "unthrottle" the DL server
+		 * and queue right away. Otherwise nothing might queue it. That's similar
+		 * to what enqueue_dl_entity() does on start_dl_timer==0. For now, just warn.
+		 */
+		WARN_ON_ONCE(!start_dl_timer(dl_se));
+
+		return;
+	}
+
 throttle:
 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
 		dl_se->dl_throttled = 1;
@@ -1414,9 +1575,46 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 	}
 }
 
+/*
+ * In the non-defer mode, the idle time is not accounted, as the
+ * server provides a guarantee.
+ *
+ * If the dl_server is in defer mode, the idle time is also considered
+ * as time available for the fair server, avoiding a penalty for the
+ * rt scheduler that did not consume that time.
+ */
+void dl_server_update_idle_time(struct rq *rq, struct task_struct *p)
+{
+	s64 delta_exec, scaled_delta_exec;
+
+	if (!rq->fair_server.dl_defer)
+		return;
+
+	/* no need to discount more */
+	if (rq->fair_server.runtime < 0)
+		return;
+
+	delta_exec = rq_clock_task(rq) - p->se.exec_start;
+	if (delta_exec < 0)
+		return;
+
+	scaled_delta_exec = dl_scaled_delta_exec(rq, &rq->fair_server, delta_exec);
+
+	rq->fair_server.runtime -= scaled_delta_exec;
+
+	if (rq->fair_server.runtime < 0) {
+		rq->fair_server.dl_defer_running = 0;
+		rq->fair_server.runtime = 0;
+	}
+
+	p->se.exec_start = rq_clock_task(rq);
+}
+
 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
-	update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+	/* 0 runtime = fair server disabled */
+	if (dl_se->dl_runtime)
+		update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
 }
 
 void dl_server_start(struct sched_dl_entity *dl_se)
@@ -1430,6 +1628,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 		dl_se->dl_period = 1000 * NSEC_PER_MSEC;
 
 		dl_se->dl_server = 1;
+		dl_se->dl_defer = 1;
 		setup_new_dl_entity(dl_se);
 	}
 
@@ -1447,6 +1646,9 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 		return;
 
 	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+	hrtimer_try_to_cancel(&dl_se->dl_timer);
+	dl_se->dl_defer_armed = 0;
+	dl_se->dl_throttled = 0;
 }
 
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
@@ -1758,7 +1960,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	 * be counted in the active utilization; hence, we need to call
 	 * add_running_bw().
 	 */
-	if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+	if (!dl_se->dl_defer && dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
 		if (flags & ENQUEUE_WAKEUP)
 			task_contending(dl_se, flags);
 
@@ -1780,6 +1982,25 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 		setup_new_dl_entity(dl_se);
 	}
 
+	/*
+	 * If the reservation is still throttled, e.g., it got replenished but is a
+	 * deferred task and still got to wait, don't enqueue.
+	 */
+	if (dl_se->dl_throttled && start_dl_timer(dl_se))
+		return;
+
+	/*
+	 * We're about to enqueue, make sure we're not ->dl_throttled!
+	 * In case the timer was not started, say because the defer time
+	 * has passed, mark as not throttled and mark unarmed.
+	 * Also cancel earlier timers, since letting those run is pointless.
+	 */
+	if (dl_se->dl_throttled) {
+		hrtimer_try_to_cancel(&dl_se->dl_timer);
+		dl_se->dl_defer_armed = 0;
+		dl_se->dl_throttled = 0;
+	}
+
 	__enqueue_dl_entity(dl_se);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d5d3e6c1e72..20e8b02c5cb3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1156,12 +1156,13 @@ s64 update_curr_common(struct rq *rq)
 static void update_curr(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
+	struct rq *rq = rq_of(cfs_rq);
 	s64 delta_exec;
 
 	if (unlikely(!curr))
 		return;
 
-	delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+	delta_exec = update_curr_se(rq, curr);
 	if (unlikely(delta_exec <= 0))
 		return;
 
@@ -1169,8 +1170,19 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	update_deadline(cfs_rq, curr);
 	update_min_vruntime(cfs_rq);
 
-	if (entity_is_task(curr))
-		update_curr_task(task_of(curr), delta_exec);
+	if (entity_is_task(curr)) {
+		struct task_struct *p = task_of(curr);
+
+		update_curr_task(p, delta_exec);
+
+		/*
+		 * Any fair task that runs outside of fair_server should
+		 * account against fair_server such that it can account for
+		 * this time and possibly avoid running this period.
+		 */
+		if (p->dl_server != &rq->fair_server)
+			dl_server_update(&rq->fair_server, delta_exec);
+	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
@@ -6769,8 +6781,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
-	if (!throttled_hierarchy(task_cfs_rq(p)) && !rq->cfs.h_nr_running)
+	if (!throttled_hierarchy(task_cfs_rq(p)) && !rq->cfs.h_nr_running) {
+		/* Account for idle runtime */
+		if (!rq->nr_running)
+			dl_server_update_idle_time(rq, rq->curr);
 		dl_server_start(&rq->fair_server);
+	}
 
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6135fbe83d68..5f8806bc6924 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -458,12 +458,14 @@ static void wakeup_preempt_idle(struct rq *rq, struct task_struct *p, int flags)
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
+	dl_server_update_idle_time(rq, prev);
 }
 
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+	next->se.exec_start = rq_clock_task(rq);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 39c9669b23a7..76751b945474 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -328,7 +328,7 @@ extern bool __checkparam_dl(const struct sched_attr *attr);
 extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
 extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern int  dl_bw_check_overflow(int cpu);
-
+extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec);
 /*
  * SCHED_DEADLINE supports servers (nested scheduling) with the following
  * interface:
@@ -356,6 +356,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick);
 
+extern void dl_server_update_idle_time(struct rq *rq,
+		    struct task_struct *p);
 extern void fair_server_init(struct rq *rq);
 
 #ifdef CONFIG_CGROUP_SCHED
-- 
2.45.1



* [PATCH V7 6/9] sched/fair: Fair server interface
  2024-05-27 12:06 [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (4 preceding siblings ...)
  2024-05-27 12:06 ` [PATCH V7 5/9] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
@ 2024-05-27 12:06 ` Daniel Bristot de Oliveira
  2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Daniel Bristot de Oliveira
  2024-05-27 12:06 ` [PATCH V7 7/9] sched/core: Fix priority checking for DL server picks Daniel Bristot de Oliveira
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-05-27 12:06 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld, Suleiman Souhlal,
	Youssef Esmat

Add an interface for fair server setup on debugfs.

Each CPU has two files under /debug/sched/fair_server/cpu{ID}:

 - runtime: set runtime in ns
 - period:  set period in ns

This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to set
bounds on admission control.

The interface also adds the server to the dl bandwidth accounting.
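
As a worked example of the accounting, a runtime of 50000000 ns over a
period of 1000000000 ns reserves runtime/period = 0.05 (5%) of the CPU
bandwidth for the fair server. Writes that do not fit are rejected: a
runtime > period or a period outside the [100 us, ~4 s] range returns
-EINVAL, and a bandwidth that overflows the deadline bandwidth of the
CPU returns -EBUSY (see dl_server_apply_params()).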

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/deadline.c | 103 +++++++++++++++++++++-----
 kernel/sched/debug.c    | 159 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h    |   3 +
 kernel/sched/topology.c |   8 ++
 4 files changed, 256 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index eddfe18d9762..f8afe0a69c1e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -320,19 +320,12 @@ void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 		__sub_running_bw(dl_se->dl_bw, dl_rq);
 }
 
-static void dl_change_utilization(struct task_struct *p, u64 new_bw)
+static void dl_rq_change_utilization(struct rq *rq, struct sched_dl_entity *dl_se, u64 new_bw)
 {
-	struct rq *rq;
-
-	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
-
-	if (task_on_rq_queued(p))
-		return;
+	if (dl_se->dl_non_contending) {
+		sub_running_bw(dl_se, &rq->dl);
+		dl_se->dl_non_contending = 0;
 
-	rq = task_rq(p);
-	if (p->dl.dl_non_contending) {
-		sub_running_bw(&p->dl, &rq->dl);
-		p->dl.dl_non_contending = 0;
 		/*
 		 * If the timer handler is currently running and the
 		 * timer cannot be canceled, inactive_task_timer()
@@ -340,13 +333,25 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
 		 * will not touch the rq's active utilization,
 		 * so we are still safe.
 		 */
-		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
-			put_task_struct(p);
+		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+			if (!dl_server(dl_se))
+				put_task_struct(dl_task_of(dl_se));
+		}
 	}
-	__sub_rq_bw(p->dl.dl_bw, &rq->dl);
+	__sub_rq_bw(dl_se->dl_bw, &rq->dl);
 	__add_rq_bw(new_bw, &rq->dl);
 }
 
+static void dl_change_utilization(struct task_struct *p, u64 new_bw)
+{
+	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
+
+	if (task_on_rq_queued(p))
+		return;
+
+	dl_rq_change_utilization(task_rq(p), &p->dl, new_bw);
+}
+
 static void __dl_clear_params(struct sched_dl_entity *dl_se);
 
 /*
@@ -1621,11 +1626,17 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 {
 	struct rq *rq = dl_se->rq;
 
+	/*
+	 * XXX: the apply does not work fine at the init phase for the
+	 * fair server because things are not yet set. We need to improve
+	 * this before getting generic.
+	 */
 	if (!dl_server(dl_se)) {
 		/* Disabled */
-		dl_se->dl_runtime = 0;
-		dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
-		dl_se->dl_period = 1000 * NSEC_PER_MSEC;
+		u64 runtime = 0;
+		u64 period = 1000 * NSEC_PER_MSEC;
+
+		dl_server_apply_params(dl_se, runtime, period, 1);
 
 		dl_se->dl_server = 1;
 		dl_se->dl_defer = 1;
@@ -1660,6 +1671,64 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 	dl_se->server_pick = pick;
 }
 
+void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
+{
+	u64 new_bw = dl_se->dl_bw;
+	int cpu = cpu_of(rq);
+	struct dl_bw *dl_b;
+
+	dl_b = dl_bw_of(cpu_of(rq));
+	guard(raw_spinlock)(&dl_b->lock);
+
+	if (!dl_bw_cpus(cpu))
+		return;
+
+	__dl_add(dl_b, new_bw, dl_bw_cpus(cpu));
+}
+
+int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
+{
+	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+	u64 new_bw = to_ratio(period, runtime);
+	struct rq *rq = dl_se->rq;
+	int cpu = cpu_of(rq);
+	struct dl_bw *dl_b;
+	unsigned long cap;
+	int retval = 0;
+	int cpus;
+
+	dl_b = dl_bw_of(cpu);
+	guard(raw_spinlock)(&dl_b->lock);
+
+	cpus = dl_bw_cpus(cpu);
+	cap = dl_bw_capacity(cpu);
+
+	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
+		return -EBUSY;
+
+	if (init) {
+		__add_rq_bw(new_bw, &rq->dl);
+		__dl_add(dl_b, new_bw, cpus);
+	} else {
+		__dl_sub(dl_b, dl_se->dl_bw, cpus);
+		__dl_add(dl_b, new_bw, cpus);
+
+		dl_rq_change_utilization(rq, dl_se, new_bw);
+	}
+
+	dl_se->dl_runtime = runtime;
+	dl_se->dl_deadline = period;
+	dl_se->dl_period = period;
+
+	dl_se->runtime = 0;
+	dl_se->deadline = 0;
+
+	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+	dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
+
+	return retval;
+}
+
 /*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c1eb9a1afd13..b14ffb100867 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -333,8 +333,165 @@ static const struct file_operations sched_debug_fops = {
 	.release	= seq_release,
 };
 
+enum dl_param {
+	DL_RUNTIME = 0,
+	DL_PERIOD,
+};
+
+static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
+static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
+
+static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf,
+				       size_t cnt, loff_t *ppos, enum dl_param param)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+	u64 runtime, period;
+	size_t err;
+	int retval;
+	u64 value;
+
+	err = kstrtoull_from_user(ubuf, cnt, 10, &value);
+	if (err)
+		return err;
+
+	scoped_guard (rq_lock_irqsave, rq) {
+		runtime  = rq->fair_server.dl_runtime;
+		period = rq->fair_server.dl_period;
+
+		switch (param) {
+		case DL_RUNTIME:
+			if (runtime == value)
+				break;
+			runtime = value;
+			break;
+		case DL_PERIOD:
+			if (value == period)
+				break;
+			period = value;
+			break;
+		}
+
+		if (runtime > period ||
+		    period > fair_server_period_max ||
+		    period < fair_server_period_min) {
+			return  -EINVAL;
+		}
+
+		if (rq->cfs.h_nr_running) {
+			update_rq_clock(rq);
+			dl_server_stop(&rq->fair_server);
+		}
+
+		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
+		if (retval)
+			cnt = retval;
+
+		if (!runtime)
+			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
+					cpu_of(rq));
+
+		if (rq->cfs.h_nr_running)
+			dl_server_start(&rq->fair_server);
+	}
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+	u64 value;
+
+	switch (param) {
+	case DL_RUNTIME:
+		value = rq->fair_server.dl_runtime;
+		break;
+	case DL_PERIOD:
+		value = rq->fair_server.dl_period;
+		break;
+	}
+
+	seq_printf(m, "%llu\n", value);
+	return 0;
+
+}
+
+static ssize_t
+sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
+				size_t cnt, loff_t *ppos)
+{
+	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME);
+}
+
+static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
+{
+	return sched_fair_server_show(m, v, DL_RUNTIME);
+}
+
+static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_runtime_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_runtime_fops = {
+	.open		= sched_fair_server_runtime_open,
+	.write		= sched_fair_server_runtime_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static ssize_t
+sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
+			       size_t cnt, loff_t *ppos)
+{
+	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD);
+}
+
+static int sched_fair_server_period_show(struct seq_file *m, void *v)
+{
+	return sched_fair_server_show(m, v, DL_PERIOD);
+}
+
+static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_period_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_period_fops = {
+	.open		= sched_fair_server_period_open,
+	.write		= sched_fair_server_period_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static struct dentry *debugfs_sched;
 
+static void debugfs_fair_server_init(void)
+{
+	struct dentry *d_fair;
+	unsigned long cpu;
+
+	d_fair = debugfs_create_dir("fair_server", debugfs_sched);
+	if (!d_fair)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct dentry *d_cpu;
+		char buf[32];
+
+		snprintf(buf, sizeof(buf), "cpu%lu", cpu);
+		d_cpu = debugfs_create_dir(buf, d_fair);
+
+		debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
+		debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
+	}
+}
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;
@@ -374,6 +531,8 @@ static __init int sched_init_debug(void)
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
+	debugfs_fair_server_init();
+
 	return 0;
 }
 late_initcall(sched_init_debug);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76751b945474..ed7be7b085af 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -359,6 +359,9 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 extern void dl_server_update_idle_time(struct rq *rq,
 		    struct task_struct *p);
 extern void fair_server_init(struct rq *rq);
+extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
+extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
+		    u64 runtime, u64 period, bool init);
 
 #ifdef CONFIG_CGROUP_SCHED
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a6994a1fcc90..a172ed4d95af 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -516,6 +516,14 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
 		set_rq_online(rq);
 
+	/*
+	 * Because the rq is not a task, dl_add_task_root_domain() did not
+	 * move the fair server bw to the rd if it already started.
+	 * Add it now.
+	 */
+	if (rq->fair_server.dl_server)
+		__dl_server_attach_root(&rq->fair_server, rq);
+
 	rq_unlock_irqrestore(rq, &rf);
 
 	if (old_rd)
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread
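
For reference, here is a minimal usage sketch of the interface added by this
patch (not part of the patch itself). It assumes debugfs is mounted at
/sys/kernel/debug and uses the per-CPU directories created by
debugfs_fair_server_init(); all values are in nanoseconds:

  # Read the current fair server parameters for CPU 0:
  cat /sys/kernel/debug/sched/fair_server/cpu0/runtime
  cat /sys/kernel/debug/sched/fair_server/cpu0/period

  # Reserve 25ms of CFS runtime every 500ms on CPU 0; note that
  # sched_fair_server_write() rejects runtime > period and periods
  # outside [fair_server_period_min, fair_server_period_max]:
  echo 500000000 > /sys/kernel/debug/sched/fair_server/cpu0/period
  echo 25000000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime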

* [PATCH V7 7/9] sched/core: Fix priority checking for DL server picks
  2024-05-27 12:06 [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (5 preceding siblings ...)
  2024-05-27 12:06 ` [PATCH V7 6/9] sched/fair: Fair server interface Daniel Bristot de Oliveira
@ 2024-05-27 12:06 ` Daniel Bristot de Oliveira
  2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
  2024-05-27 12:06 ` [PATCH V7 8/9] sched/core: Fix picking of tasks for core scheduling with DL server Daniel Bristot de Oliveira
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-05-27 12:06 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld, Suleiman Souhlal,
	Youssef Esmat

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>

In core scheduling, a DL server pick (which is a CFS task) should be
given higher priority than tasks in other classes.

Not doing so causes CFS starvation: without this, a CFS task competing
with RT tasks can be completely starved and the DL server's boosting
completely ignored. A kselftest is added later to demonstrate this.

Fix these problems.

Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org>
Reported-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/core.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 53f0470a1d0a..01336277eac9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -162,6 +162,9 @@ static inline int __task_prio(const struct task_struct *p)
 	if (p->sched_class == &stop_sched_class) /* trumps deadline */
 		return -2;
 
+	if (p->dl_server)
+		return -1; /* deadline */
+
 	if (rt_prio(p->prio)) /* includes deadline */
 		return p->prio; /* [-1, 99] */
 
@@ -191,8 +194,24 @@ static inline bool prio_less(const struct task_struct *a,
 	if (-pb < -pa)
 		return false;
 
-	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
-		return !dl_time_before(a->dl.deadline, b->dl.deadline);
+	if (pa == -1) { /* dl_prio() doesn't work because of stop_class above */
+		const struct sched_dl_entity *a_dl, *b_dl;
+
+		a_dl = &a->dl;
+		/*
+	 * Since 'a' and 'b' can be CFS tasks served by the DL server,
+		 * __task_prio() can return -1 (for DL) even for those. In that
+		 * case, get to the dl_server's DL entity.
+		 */
+		if (a->dl_server)
+			a_dl = a->dl_server;
+
+		b_dl = &b->dl;
+		if (b->dl_server)
+			b_dl = b->dl_server;
+
+		return !dl_time_before(a_dl->deadline, b_dl->deadline);
+	}
 
 	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
 		return cfs_prio_less(a, b, in_fi);
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread
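
To make the new comparison concrete, here is a small user-space sketch
(an illustration, not kernel code) of the EDF rule the hunk above
implements: between two deadline entities, the one with the earlier
absolute deadline wins. The wrap-safe comparison mirrors the kernel's
dl_time_before():

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Wrap-safe "a before b", same expression as the kernel's dl_time_before(). */
  static bool dl_time_before(uint64_t a, uint64_t b)
  {
          return (int64_t)(a - b) < 0;
  }

  /* prio_less(a, b) for two deadline entities: 'a' is lower priority
   * exactly when a's deadline is NOT earlier than b's. */
  static bool deadline_prio_less(uint64_t a_deadline, uint64_t b_deadline)
  {
          return !dl_time_before(a_deadline, b_deadline);
  }

  int main(void)
  {
          /* A server with deadline 150ms loses to one with deadline 100ms. */
          printf("%d\n", deadline_prio_less(150000000ULL, 100000000ULL)); /* 1 */
          printf("%d\n", deadline_prio_less(100000000ULL, 150000000ULL)); /* 0 */
          return 0;
  }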

* [PATCH V7 8/9] sched/core: Fix picking of tasks for core scheduling with DL server
  2024-05-27 12:06 [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (6 preceding siblings ...)
  2024-05-27 12:06 ` [PATCH V7 7/9] sched/core: Fix priority checking for DL server picks Daniel Bristot de Oliveira
@ 2024-05-27 12:06 ` Daniel Bristot de Oliveira
  2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
  2024-05-27 12:06 ` [PATCH V7 9/9] sched/rt: Remove default bandwidth control Daniel Bristot de Oliveira
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-05-27 12:06 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld, Suleiman Souhlal,
	Youssef Esmat

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>

* Use simple CFS pick_task for DL pick_task

  DL server's pick_task calls CFS's pick_next_task_fair(), which is wrong
  because core scheduling's pick_task only calls CFS's pick_task() for
  evaluation / checking of the CFS task (comparing across CPUs), not for
  actually picking the next task. This causes RB tree corruption issues
  in CFS that were found by syzbot.

* Make pick_task_fair clear DL server

  A DL task pick might set ->dl_server, but it is possible the task will
  never run (say the other HT has a stop task). If the CFS task is later
  picked directly (say without the DL server), ->dl_server will still be
  set. So clear it in pick_task_fair().

This fixes the KASAN issue reported by syzbot in set_next_entity().

(DL refactoring suggestions by Vineeth Pillai).

Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org>
Reported-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 include/linux/sched.h   |  3 ++-
 kernel/sched/deadline.c | 27 ++++++++++++++++++++++-----
 kernel/sched/fair.c     | 23 +++++++++++++++++++++--
 kernel/sched/sched.h    |  3 ++-
 4 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 611771fec4df..eb8f8b7929c8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -684,7 +684,8 @@ struct sched_dl_entity {
 	 */
 	struct rq			*rq;
 	dl_server_has_tasks_f		server_has_tasks;
-	dl_server_pick_f		server_pick;
+	dl_server_pick_f		server_pick_next;
+	dl_server_pick_f		server_pick_task;
 
 #ifdef CONFIG_RT_MUTEXES
 	/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f8afe0a69c1e..0dbb42cf7fe6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1664,11 +1664,13 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
-		    dl_server_pick_f pick)
+		    dl_server_pick_f pick_next,
+		    dl_server_pick_f pick_task)
 {
 	dl_se->rq = rq;
 	dl_se->server_has_tasks = has_tasks;
-	dl_se->server_pick = pick;
+	dl_se->server_pick_next = pick_next;
+	dl_se->server_pick_task = pick_task;
 }
 
 void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
@@ -2394,7 +2396,12 @@ static struct sched_dl_entity *pick_next_dl_entity(struct dl_rq *dl_rq)
 	return __node_2_dle(left);
 }
 
-static struct task_struct *pick_task_dl(struct rq *rq)
+/*
+ * __pick_next_task_dl - Helper to pick the next -deadline task to run.
+ * @rq: The runqueue to pick the next task from.
+ * @peek: If true, just peek at the next task. Only relevant for dlserver.
+ */
+static struct task_struct *__pick_next_task_dl(struct rq *rq, bool peek)
 {
 	struct sched_dl_entity *dl_se;
 	struct dl_rq *dl_rq = &rq->dl;
@@ -2408,7 +2415,10 @@ static struct task_struct *pick_task_dl(struct rq *rq)
 	WARN_ON_ONCE(!dl_se);
 
 	if (dl_server(dl_se)) {
-		p = dl_se->server_pick(dl_se);
+		if (IS_ENABLED(CONFIG_SMP) && peek)
+			p = dl_se->server_pick_task(dl_se);
+		else
+			p = dl_se->server_pick_next(dl_se);
 		if (!p) {
 			WARN_ON_ONCE(1);
 			dl_se->dl_yielded = 1;
@@ -2423,11 +2433,18 @@ static struct task_struct *pick_task_dl(struct rq *rq)
 	return p;
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_dl(struct rq *rq)
+{
+	return __pick_next_task_dl(rq, true);
+}
+#endif
+
 static struct task_struct *pick_next_task_dl(struct rq *rq)
 {
 	struct task_struct *p;
 
-	p = pick_task_dl(rq);
+	p = __pick_next_task_dl(rq, false);
 	if (!p)
 		return p;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 20e8b02c5cb3..14ec002bb4f9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8484,6 +8484,14 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
+	/*
+	 * This can be called directly from CFS's ->pick_task() or indirectly
+	 * from DL's ->pick_task() when the fair server is enabled. In the
+	 * indirect case, DL will set ->dl_server just after this call, so it
+	 * is OK to clear. In the direct case, we must clear it ourselves.
+	 */
+	task_of(se)->dl_server = NULL;
+
 	return task_of(se);
 }
 #endif
@@ -8643,7 +8651,16 @@ static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
 	return !!dl_se->rq->cfs.nr_running;
 }
 
-static struct task_struct *fair_server_pick(struct sched_dl_entity *dl_se)
+static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
+{
+#ifdef CONFIG_SMP
+	return pick_task_fair(dl_se->rq);
+#else
+	return NULL;
+#endif
+}
+
+static struct task_struct *fair_server_pick_next(struct sched_dl_entity *dl_se)
 {
 	return pick_next_task_fair(dl_se->rq, NULL, NULL);
 }
@@ -8654,7 +8671,9 @@ void fair_server_init(struct rq *rq)
 
 	init_dl_entity(dl_se);
 
-	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_next,
+		       fair_server_pick_task);
+
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ed7be7b085af..3b8684b5ec8e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -354,7 +354,8 @@ extern void dl_server_start(struct sched_dl_entity *dl_se);
 extern void dl_server_stop(struct sched_dl_entity *dl_se);
 extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
-		    dl_server_pick_f pick);
+		    dl_server_pick_f pick_next,
+		    dl_server_pick_f pick_task);
 
 extern void dl_server_update_idle_time(struct rq *rq,
 		    struct task_struct *p);
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread
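
The contract behind this split can be summarized with a toy model
(illustrative only; these are not the kernel's types): core scheduling may
call pick_task() repeatedly just to compare candidates across SMT
siblings, so that path must be free of side effects, while
pick_next_task() is the single point allowed to commit state:

  #include <stdio.h>

  struct task { const char *name; };
  struct server { struct task *queued; };

  /* Peek: may run many times while candidates are compared across
   * siblings; must not change any scheduler state. */
  static struct task *server_pick_task(struct server *s)
  {
          return s->queued;
  }

  /* Commit: runs once the candidate is actually chosen; the only place
   * allowed to mutate state (here, dequeueing the task). */
  static struct task *server_pick_next(struct server *s)
  {
          struct task *t = s->queued;

          s->queued = NULL;
          return t;
  }

  int main(void)
  {
          struct task t = { "cfs-task" };
          struct server srv = { &t };

          /* Peeking twice is safe... */
          printf("peek: %s\n", server_pick_task(&srv)->name);
          printf("peek: %s\n", server_pick_task(&srv)->name);
          /* ...committing is not idempotent. */
          printf("pick: %s\n", server_pick_next(&srv)->name);
          return 0;
  }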

* [PATCH V7 9/9] sched/rt: Remove default bandwidth control
  2024-05-27 12:06 [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (7 preceding siblings ...)
  2024-05-27 12:06 ` [PATCH V7 8/9] sched/core: Fix picking of tasks for core scheduling with DL server Daniel Bristot de Oliveira
@ 2024-05-27 12:06 ` Daniel Bristot de Oliveira
  2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2024-11-27 10:55   ` [PATCH V7 9/9] " Michal Koutný
  2024-06-21 13:37 ` [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Juri Lelli
  2024-06-21 14:41 ` Vineeth Remanan Pillai
  10 siblings, 2 replies; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-05-27 12:06 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld, Suleiman Souhlal,
	Youssef Esmat

From: Peter Zijlstra <peterz@infradead.org>

Now that fair_server exists, we no longer need RT bandwidth control
unless RT_GROUP_SCHED.

Enable fair_server with parameters equivalent to RT throttling.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/core.c     |   9 +-
 kernel/sched/deadline.c |   5 +-
 kernel/sched/debug.c    |   3 +
 kernel/sched/rt.c       | 242 ++++++++++++++++++----------------------
 kernel/sched/sched.h    |   3 +-
 5 files changed, 120 insertions(+), 142 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01336277eac9..8439c2f992db 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9987,8 +9987,6 @@ void __init sched_init(void)
 #endif /* CONFIG_RT_GROUP_SCHED */
 	}
 
-	init_rt_bandwidth(&def_rt_bandwidth, global_rt_period(), global_rt_runtime());
-
 #ifdef CONFIG_SMP
 	init_defrootdomain();
 #endif
@@ -10043,8 +10041,13 @@ void __init sched_init(void)
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-		rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
 #ifdef CONFIG_RT_GROUP_SCHED
+		/*
+		 * This is required for the init CPU because
+		 * rt.c:__enable_runtime() only starts working once
+		 * scheduler_running is set, which is not yet the case here.
+		 */
+		rq->rt.rt_runtime = global_rt_runtime();
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
 #endif
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0dbb42cf7fe6..7df8179bfa08 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1554,6 +1554,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 	if (dl_se == &rq->fair_server)
 		return;
 
+#ifdef CONFIG_RT_GROUP_SCHED
 	/*
 	 * Because -- for now -- we share the rt bandwidth, we need to
 	 * account our runtime there too, otherwise actual rt tasks
@@ -1578,6 +1579,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 			rt_rq->rt_time += delta_exec;
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
+#endif
 }
 
 /*
@@ -1632,8 +1634,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 	 * this before getting generic.
 	 */
 	if (!dl_server(dl_se)) {
-		/* Disabled */
-		u64 runtime = 0;
+		u64 runtime = 50 * NSEC_PER_MSEC;
 		u64 period = 1000 * NSEC_PER_MSEC;
 
 		dl_server_apply_params(dl_se, runtime, period, 1);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index b14ffb100867..2d5851d65c67 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -889,9 +889,12 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(rt_rq->x))
 
 	PU(rt_nr_running);
+
+#ifdef CONFIG_RT_GROUP_SCHED
 	P(rt_throttled);
 	PN(rt_time);
 	PN(rt_runtime);
+#endif
 
 #undef PN
 #undef PU
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index aa4c1c874fa4..fb591b98d71f 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -8,10 +8,6 @@ int sched_rr_timeslice = RR_TIMESLICE;
 /* More than 4 hours if BW_SHIFT equals 20. */
 static const u64 max_rt_runtime = MAX_BW;
 
-static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
-
-struct rt_bandwidth def_rt_bandwidth;
-
 /*
  * period over which we measure -rt task CPU usage in us.
  * default: 1s
@@ -66,6 +62,40 @@ static int __init sched_rt_sysctl_init(void)
 late_initcall(sched_rt_sysctl_init);
 #endif
 
+void init_rt_rq(struct rt_rq *rt_rq)
+{
+	struct rt_prio_array *array;
+	int i;
+
+	array = &rt_rq->active;
+	for (i = 0; i < MAX_RT_PRIO; i++) {
+		INIT_LIST_HEAD(array->queue + i);
+		__clear_bit(i, array->bitmap);
+	}
+	/* delimiter for bitsearch: */
+	__set_bit(MAX_RT_PRIO, array->bitmap);
+
+#if defined CONFIG_SMP
+	rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
+	rt_rq->highest_prio.next = MAX_RT_PRIO-1;
+	rt_rq->overloaded = 0;
+	plist_head_init(&rt_rq->pushable_tasks);
+#endif /* CONFIG_SMP */
+	/* We start in dequeued state, because no RT tasks are queued */
+	rt_rq->rt_queued = 0;
+
+#ifdef CONFIG_RT_GROUP_SCHED
+	rt_rq->rt_time = 0;
+	rt_rq->rt_throttled = 0;
+	rt_rq->rt_runtime = 0;
+	raw_spin_lock_init(&rt_rq->rt_runtime_lock);
+#endif
+}
+
+#ifdef CONFIG_RT_GROUP_SCHED
+
+static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
+
 static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
 {
 	struct rt_bandwidth *rt_b =
@@ -130,35 +160,6 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 	do_start_rt_bandwidth(rt_b);
 }
 
-void init_rt_rq(struct rt_rq *rt_rq)
-{
-	struct rt_prio_array *array;
-	int i;
-
-	array = &rt_rq->active;
-	for (i = 0; i < MAX_RT_PRIO; i++) {
-		INIT_LIST_HEAD(array->queue + i);
-		__clear_bit(i, array->bitmap);
-	}
-	/* delimiter for bitsearch: */
-	__set_bit(MAX_RT_PRIO, array->bitmap);
-
-#if defined CONFIG_SMP
-	rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
-	rt_rq->highest_prio.next = MAX_RT_PRIO-1;
-	rt_rq->overloaded = 0;
-	plist_head_init(&rt_rq->pushable_tasks);
-#endif /* CONFIG_SMP */
-	/* We start is dequeued state, because no RT tasks are queued */
-	rt_rq->rt_queued = 0;
-
-	rt_rq->rt_time = 0;
-	rt_rq->rt_throttled = 0;
-	rt_rq->rt_runtime = 0;
-	raw_spin_lock_init(&rt_rq->rt_runtime_lock);
-}
-
-#ifdef CONFIG_RT_GROUP_SCHED
 static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
 {
 	hrtimer_cancel(&rt_b->rt_period_timer);
@@ -195,7 +196,6 @@ void unregister_rt_sched_group(struct task_group *tg)
 {
 	if (tg->rt_se)
 		destroy_rt_bandwidth(&tg->rt_bandwidth);
-
 }
 
 void free_rt_sched_group(struct task_group *tg)
@@ -253,8 +253,7 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
 	if (!tg->rt_se)
 		goto err;
 
-	init_rt_bandwidth(&tg->rt_bandwidth,
-			ktime_to_ns(def_rt_bandwidth.rt_period), 0);
+	init_rt_bandwidth(&tg->rt_bandwidth, ktime_to_ns(global_rt_period()), 0);
 
 	for_each_possible_cpu(i) {
 		rt_rq = kzalloc_node(sizeof(struct rt_rq),
@@ -604,70 +603,6 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
 	return &rt_rq->tg->rt_bandwidth;
 }
 
-#else /* !CONFIG_RT_GROUP_SCHED */
-
-static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
-{
-	return rt_rq->rt_runtime;
-}
-
-static inline u64 sched_rt_period(struct rt_rq *rt_rq)
-{
-	return ktime_to_ns(def_rt_bandwidth.rt_period);
-}
-
-typedef struct rt_rq *rt_rq_iter_t;
-
-#define for_each_rt_rq(rt_rq, iter, rq) \
-	for ((void) iter, rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
-
-#define for_each_sched_rt_entity(rt_se) \
-	for (; rt_se; rt_se = NULL)
-
-static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se)
-{
-	return NULL;
-}
-
-static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
-{
-	struct rq *rq = rq_of_rt_rq(rt_rq);
-
-	if (!rt_rq->rt_nr_running)
-		return;
-
-	enqueue_top_rt_rq(rt_rq);
-	resched_curr(rq);
-}
-
-static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
-{
-	dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
-}
-
-static inline int rt_rq_throttled(struct rt_rq *rt_rq)
-{
-	return rt_rq->rt_throttled;
-}
-
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-
-static inline
-struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
-{
-	return &cpu_rq(cpu)->rt;
-}
-
-static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
-{
-	return &def_rt_bandwidth;
-}
-
-#endif /* CONFIG_RT_GROUP_SCHED */
-
 bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
 {
 	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
@@ -859,7 +794,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 	const struct cpumask *span;
 
 	span = sched_rt_period_mask();
-#ifdef CONFIG_RT_GROUP_SCHED
+
 	/*
 	 * FIXME: isolated CPUs should really leave the root task group,
 	 * whether they are isolcpus or were isolated via cpusets, lest
@@ -871,7 +806,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 	 */
 	if (rt_b == &root_task_group.rt_bandwidth)
 		span = cpu_online_mask;
-#endif
+
 	for_each_cpu(i, span) {
 		int enqueue = 0;
 		struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
@@ -938,18 +873,6 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 	return idle;
 }
 
-static inline int rt_se_prio(struct sched_rt_entity *rt_se)
-{
-#ifdef CONFIG_RT_GROUP_SCHED
-	struct rt_rq *rt_rq = group_rt_rq(rt_se);
-
-	if (rt_rq)
-		return rt_rq->highest_prio.curr;
-#endif
-
-	return rt_task_of(rt_se)->prio;
-}
-
 static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 {
 	u64 runtime = sched_rt_runtime(rt_rq);
@@ -993,6 +916,72 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 	return 0;
 }
 
+#else /* !CONFIG_RT_GROUP_SCHED */
+
+typedef struct rt_rq *rt_rq_iter_t;
+
+#define for_each_rt_rq(rt_rq, iter, rq) \
+	for ((void) iter, rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
+
+#define for_each_sched_rt_entity(rt_se) \
+	for (; rt_se; rt_se = NULL)
+
+static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se)
+{
+	return NULL;
+}
+
+static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
+{
+	struct rq *rq = rq_of_rt_rq(rt_rq);
+
+	if (!rt_rq->rt_nr_running)
+		return;
+
+	enqueue_top_rt_rq(rt_rq);
+	resched_curr(rq);
+}
+
+static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
+{
+	dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
+}
+
+static inline int rt_rq_throttled(struct rt_rq *rt_rq)
+{
+	return false;
+}
+
+static inline const struct cpumask *sched_rt_period_mask(void)
+{
+	return cpu_online_mask;
+}
+
+static inline
+struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
+{
+	return &cpu_rq(cpu)->rt;
+}
+
+#ifdef CONFIG_SMP
+static void __enable_runtime(struct rq *rq) { }
+static void __disable_runtime(struct rq *rq) { }
+#endif
+
+#endif /* CONFIG_RT_GROUP_SCHED */
+
+static inline int rt_se_prio(struct sched_rt_entity *rt_se)
+{
+#ifdef CONFIG_RT_GROUP_SCHED
+	struct rt_rq *rt_rq = group_rt_rq(rt_se);
+
+	if (rt_rq)
+		return rt_rq->highest_prio.curr;
+#endif
+
+	return rt_task_of(rt_se)->prio;
+}
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -1000,7 +989,6 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 static void update_curr_rt(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
-	struct sched_rt_entity *rt_se = &curr->rt;
 	s64 delta_exec;
 
 	if (curr->sched_class != &rt_sched_class)
@@ -1010,6 +998,9 @@ static void update_curr_rt(struct rq *rq)
 	if (unlikely(delta_exec <= 0))
 		return;
 
+#ifdef CONFIG_RT_GROUP_SCHED
+	struct sched_rt_entity *rt_se = &curr->rt;
+
 	if (!rt_bandwidth_enabled())
 		return;
 
@@ -1028,6 +1019,7 @@ static void update_curr_rt(struct rq *rq)
 				do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
 		}
 	}
+#endif
 }
 
 static void
@@ -1184,7 +1176,6 @@ dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 static void
 inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
-	start_rt_bandwidth(&def_rt_bandwidth);
 }
 
 static inline
@@ -2912,19 +2903,6 @@ int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 #ifdef CONFIG_SYSCTL
 static int sched_rt_global_constraints(void)
 {
-	unsigned long flags;
-	int i;
-
-	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
-	for_each_possible_cpu(i) {
-		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
-
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		rt_rq->rt_runtime = global_rt_runtime();
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
-	}
-	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
-
 	return 0;
 }
 #endif /* CONFIG_SYSCTL */
@@ -2944,12 +2922,6 @@ static int sched_rt_global_validate(void)
 
 static void sched_rt_do_global(void)
 {
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
-	def_rt_bandwidth.rt_runtime = global_rt_runtime();
-	def_rt_bandwidth.rt_period = ns_to_ktime(global_rt_period());
-	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
 }
 
 static int sched_rt_handler(struct ctl_table *table, int write, void *buffer,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3b8684b5ec8e..c93de331171b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -729,13 +729,13 @@ struct rt_rq {
 #endif /* CONFIG_SMP */
 	int			rt_queued;
 
+#ifdef CONFIG_RT_GROUP_SCHED
 	int			rt_throttled;
 	u64			rt_time;
 	u64			rt_runtime;
 	/* Nests inside the rq lock: */
 	raw_spinlock_t		rt_runtime_lock;
 
-#ifdef CONFIG_RT_GROUP_SCHED
 	unsigned int		rt_nr_boosted;
 
 	struct rq		*rq;
@@ -2478,7 +2478,6 @@ extern void reweight_task(struct task_struct *p, int prio);
 extern void resched_curr(struct rq *rq);
 extern void resched_cpu(int cpu);
 
-extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread
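
As a sanity check on the new dl_server_start() defaults above: RT
throttling's historical defaults (sched_rt_runtime_us = 950000 out of
sched_rt_period_us = 1000000) implicitly left 50ms per second for non-RT
work, and the 50ms/1s fair server reservation matches that. A quick
back-of-the-envelope sketch, assuming those historical defaults:

  #include <stdio.h>

  int main(void)
  {
          /* Historical RT throttling defaults (us). */
          const double rt_runtime_us = 950000.0, rt_period_us = 1000000.0;
          /* Fair server defaults from dl_server_start() above (ns). */
          const double fs_runtime_ns = 50.0e6, fs_period_ns = 1000.0e6;

          printf("non-RT share under old throttling: %.1f%%\n",
                 100.0 * (1.0 - rt_runtime_us / rt_period_us)); /* 5.0% */
          printf("fair server share:                 %.1f%%\n",
                 100.0 * (fs_runtime_ns / fs_period_ns));       /* 5.0% */
          return 0;
  }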

* Re: [PATCH V7 0/9] SCHED_DEADLINE server infrastructure
  2024-05-27 12:06 [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (8 preceding siblings ...)
  2024-05-27 12:06 ` [PATCH V7 9/9] sched/rt: Remove default bandwidth control Daniel Bristot de Oliveira
@ 2024-06-21 13:37 ` Juri Lelli
  2024-06-21 13:43   ` Daniel Bristot de Oliveira
  2024-06-21 14:41 ` Vineeth Remanan Pillai
  10 siblings, 1 reply; 34+ messages in thread
From: Juri Lelli @ 2024-06-21 13:37 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld, Suleiman Souhlal,
	Youssef Esmat

Hi Daniel,

On 27/05/24 14:06, Daniel Bristot de Oliveira wrote:
> This is v7 of Peter's SCHED_DEADLINE server infrastructure
> implementation [1].

I finally managed to give this a go and can report that it works great
for what I've seen. :)

So, please consider this reply a

Tested-by: Juri Lelli <juri.lelli@redhat.com>

> SCHED_DEADLINE servers can help fixing starvation issues of low priority
> tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
> cycles. Today we have RT Throttling; DEADLINE servers should be able to
> replace and improve that.

...

> The problem with a DL-server-only implementation is that FIFO tasks might
> suffer preemption from NORMAL even when spare CPU cycles are available.
> In fact, the fair deadline server is enqueued right away when NORMAL tasks
> wake up and they are first scheduled by the server, thus potentially
> preempting a well-behaved FIFO task. This is of course not ideal.
> 
> We had discussions about it, and one of the possibilities would be
> using a different scheduling algorithm for this. But IMHO that is
> overkill.
> 
> Juri and I discussed this and thought about delaying the server
> activation for the (period - runtime), thus enabling the server
> only if the fair scheduler is about to starve. We called it
> the defer server.
> 
> We defer the server start to the (absolute deadline - runtime)
> point in time. This is achieved by starting the dl server throttled,
> with a next replenishing time set to activate the server at
> (absolute deadline - runtime).
> 
> The server is enqueued with the runtime replenished. As the fair
> scheduler runs without boost, its runtime is consumed. If the
> fair server consumes its runtime before the (absolute deadline -
> runtime) time, a new period is set, and the timer is armed for the
> new deadline.

I also wanted to pay particular attention to this part implementing the
deferred server, but failed to find enough focus time for now. I will
keep trying. One thing that I wondered though is if this change (and the
move towards this replacing current RT throttling) would call for a Doc
update. What do you think?

Thanks!
Juri


^ permalink raw reply	[flat|nested] 34+ messages in thread
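
The deferral described in the quoted text can be made concrete with a
small sketch, using the 50ms/1s fair server parameters as an assumed
example: a server started throttled at time t gets absolute deadline
t + period, and its replenishment timer is armed at (absolute deadline -
runtime), so it only begins boosting CFS for the final 50ms of the
period, and only if fair tasks are still pending by then:

  #include <stdint.h>
  #include <stdio.h>

  #define NSEC_PER_MSEC 1000000ULL   /* mirrors the kernel constant */

  int main(void)
  {
          /* Illustrative parameters matching the fair server defaults. */
          uint64_t runtime  = 50 * NSEC_PER_MSEC;
          uint64_t period   = 1000 * NSEC_PER_MSEC;
          uint64_t now      = 0;                  /* server start      */
          uint64_t deadline = now + period;       /* absolute deadline */
          uint64_t activate = deadline - runtime; /* deferred wakeup   */

          printf("deadline: %llu ms\n",
                 (unsigned long long)(deadline / NSEC_PER_MSEC)); /* 1000 */
          printf("activate: %llu ms\n",
                 (unsigned long long)(activate / NSEC_PER_MSEC)); /*  950 */
          return 0;
  }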

* Re: [PATCH V7 0/9] SCHED_DEADLINE server infrastructure
  2024-06-21 13:37 ` [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Juri Lelli
@ 2024-06-21 13:43   ` Daniel Bristot de Oliveira
  2024-06-21 13:50     ` Juri Lelli
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-06-21 13:43 UTC (permalink / raw)
  To: Juri Lelli, Peter Zijlstra, Ingo Molnar
  Cc: Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, Phil Auld,
	Suleiman Souhlal, Youssef Esmat

On 6/21/24 15:37, Juri Lelli wrote:
> Hi Daniel,
> 
> On 27/05/24 14:06, Daniel Bristot de Oliveira wrote:
>> This is v7 of Peter's SCHED_DEADLINE server infrastructure
>> implementation [1].
> 
> I finally managed to give this a go and can report that it works great
> for what I've seen. :)
> 
> So, please consider this reply a
> 
> Tested-by: Juri Lelli <juri.lelli@redhat.com>

Thanks!

>> SCHED_DEADLINE servers can help fixing starvation issues of low priority
>> tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
>> cycles. Today we have RT Throttling; DEADLINE servers should be able to
>> replace and improve that.
> 
> ...
> 
>> The problem with a DL-server-only implementation is that FIFO tasks might
>> suffer preemption from NORMAL even when spare CPU cycles are available.
>> In fact, the fair deadline server is enqueued right away when NORMAL tasks
>> wake up and they are first scheduled by the server, thus potentially
>> preempting a well-behaved FIFO task. This is of course not ideal.
>>
>> We had discussions about it, and one of the possibilities would be
>> using a different scheduling algorithm for this. But IMHO that is
>> overkill.
>>
>> Juri and I discussed this and thought about delaying the server
>> activation for the (period - runtime), thus enabling the server
>> only if the fair scheduler is about to starve. We called it
>> the defer server.
>>
>> We defer the server start to the (absolute deadline - runtime)
>> point in time. This is achieved by starting the dl server throttled,
>> with a next replenishing time set to activate the server at
>> (absolute deadline - runtime).
>>
>> The server is enqueued with the runtime replenished. As the fair
>> scheduler runs without boost, its runtime is consumed. If the
>> fair server consumes its runtime before the (absolute deadline -
>> runtime) time, a new period is set, and the timer is armed for the
>> new deadline.
> 
> I also wanted to pay particular attention to this part implementing the
> deferred server, but failed to find enough focus time for now. I will
> keep trying. One thing that I wondered though is if this change (and the
> move towards this replacing current RT throttling) would call for a Doc
> update. What do you think?

Yeah, I am planning a v8 for next week. It has no code changes, just a rebase
and the addition of documentation.

I am not mentioning RT throttling in the documentation. Instead, I am treating
this as a new feature on its own, which is in line with the comments in the code.

I will add an rv monitor to it, extending the documentation, but I will do that
in another series... once we get this done.

Thoughts?

Peter/Ingo, which branch should I rebase it?

-- Daniel

> Thanks!
> Juri
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH V7 0/9] SCHED_DEADLINE server infrastructure
  2024-06-21 13:43   ` Daniel Bristot de Oliveira
@ 2024-06-21 13:50     ` Juri Lelli
  0 siblings, 0 replies; 34+ messages in thread
From: Juri Lelli @ 2024-06-21 13:50 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld, Suleiman Souhlal,
	Youssef Esmat

On 21/06/24 15:43, Daniel Bristot de Oliveira wrote:
> On 6/21/24 15:37, Juri Lelli wrote:
> > Hi Daniel,
> > 
> > On 27/05/24 14:06, Daniel Bristot de Oliveira wrote:
> >> This is v7 of Peter's SCHED_DEADLINE server infrastructure
> >> implementation [1].
> > 
> > I finally managed to give this a go and can report that it works great
> > for what I've seen. :)
> > 
> > So, please consider this reply a
> > 
> > Tested-by: Juri Lelli <juri.lelli@redhat.com>
> 
> Thanks!
> 
> >> SCHED_DEADLINE servers can help fixing starvation issues of low priority
> >> tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
> >> cycles. Today we have RT Throttling; DEADLINE servers should be able to
> >> replace and improve that.
> > 
> > ...
> > 
> >> The problem with a DL-server-only implementation is that FIFO tasks might
> >> suffer preemption from NORMAL even when spare CPU cycles are available.
> >> In fact, the fair deadline server is enqueued right away when NORMAL tasks
> >> wake up and they are first scheduled by the server, thus potentially
> >> preempting a well-behaved FIFO task. This is of course not ideal.
> >>
> >> We had discussions about it, and one of the possibilities would be
> >> using a different scheduling algorithm for this. But IMHO that is
> >> overkill.
> >>
> >> Juri and I discussed this and thought about delaying the server
> >> activation for the (period - runtime), thus enabling the server
> >> only if the fair scheduler is about to starve. We called it
> >> the defer server.
> >>
> >> We defer the server start to the (absolute deadline - runtime)
> >> point in time. This is achieved by starting the dl server throttled,
> >> with a next replenishing time set to activate the server at
> >> (absolute deadline - runtime).
> >>
> >> The server is enqueued with the runtime replenished. As the fair
> >> scheduler runs without boost, its runtime is consumed. If the
> >> fair server consumes its runtime before the (absolute deadline -
> >> runtime) time, a new period is set, and the timer is armed for the
> >> new deadline.
> > 
> > I also wanted to pay particular attention to this part implementing the
> > deferred server, but failed to find enough focus time for now. I will
> > keep trying. One thing that I wondered though is if this change (and the
> > move towards this replacing current RT throttling) would call for a Doc
> > update. What do you think?
> 
> Yeah, am I planning a v8 for the next week. It has no code changes, just a rebase
> and the addition of documentation.
> 
> I am not mentioning the RT throttling in the documentation. Instead, I am treating
> this as a new feature on its own, which is inline with the comments over the code.
> 
> I will add an rv monitor to it, extending the documentation, but I will do it
> on another series... once we get this done.
> 
> Thoughts?

Works for me! Guess we can deal with the RT throttling references in the
future when that gets eventually pruned.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH V7 0/9] SCHED_DEADLINE server infrastructure
  2024-05-27 12:06 [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (9 preceding siblings ...)
  2024-06-21 13:37 ` [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Juri Lelli
@ 2024-06-21 14:41 ` Vineeth Remanan Pillai
  2024-06-21 14:59   ` Daniel Bristot de Oliveira
  2024-07-29 10:32   ` Peter Zijlstra
  10 siblings, 2 replies; 34+ messages in thread
From: Vineeth Remanan Pillai @ 2024-06-21 14:41 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Shuah Khan, Phil Auld, Suleiman Souhlal, Youssef Esmat

Hi Daniel,


>
> This is v7 of Peter's SCHED_DEADLINE server infrastructure
> implementation [1].
>
Thanks for the v7 :-)

Sorry that I could not get to reviewing and testing this revision. In
v6 we experienced a minor bug where suspend/resume had issues with the
dlserver. Since suspend does not dequeue tasks, the dlserver is not
stopped, and this causes premature wakeups. I haven't looked at v7 in
detail, but I think the issue might still be present. We have a
workaround patch for this in our 5.15 kernel, based on v5, which I am
attaching for reference. It might not apply cleanly on v7 and is
possibly not the best solution, but I thought of sharing it to give
some insight into the issue.

Thanks,
Vineeth

Attached patch
-----------------------

Subject: [PATCH] sched/dlserver: Freeze dlserver on system suspend.

The dlserver is stopped only if a dequeue or cfs rq throttle results in
no runnable cfs tasks. Neither happens during a system suspend, which
can leave the dl server active and break suspend/resume.

Freeze the dlserver on system suspend. Freezing stops the dlserver
while maintaining the dl_server_active state, so as not to confuse the
enqueue/dequeue path.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/5528103
Reviewed-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Vineeth Pillai <vineethrp@google.com>
Commit-Queue: Vineeth Pillai <vineethrp@google.com>
---
 include/linux/sched.h   | 27 +++++++++++++
 kernel/power/suspend.c  |  3 ++
 kernel/sched/deadline.c | 87 +++++++++++++++++++++++++++++++++++++----
 3 files changed, 110 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 123ef7804d95..23beff5e48a2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -650,6 +650,15 @@ struct sched_dl_entity {
        unsigned int                    dl_defer          : 1;
        unsigned int                    dl_defer_armed    : 1;
        unsigned int                    dl_server_active  : 1;
+       /*
+        * dl_server is marked as frozen when the system suspends. Frozen
+        * means that dl_server is stopped, but the dl_server_active state
+        * is maintained so that the enqueue/dequeue path is not confused.
+        * We need a separate state besides dl_server_active because
+        * suspend doesn't dequeue tasks and hence does not stop the
+        * dl_server, which may lead to spurious resumes.
+        */
+       unsigned int                    dl_server_frozen  : 1;

        /*
         * Bandwidth enforcement timer. Each -deadline task has its
@@ -690,6 +699,24 @@ struct sched_dl_entity {
 #endif
 };

+/*
+ * Power management related actions for dl_server
+ */
+enum dl_server_pm_action {
+       dl_server_pm_freeze = 0,
+       dl_server_pm_thaw = 1
+};
+extern void freeze_thaw_dl_server(enum dl_server_pm_action action);
+static inline void freeze_dl_server(void)
+{
+       freeze_thaw_dl_server(dl_server_pm_freeze);
+}
+static inline void thaw_dl_server(void)
+{
+       freeze_thaw_dl_server(dl_server_pm_thaw);
+}
+
+
 #ifdef CONFIG_UCLAMP_TASK
 /* Number of utilization clamp buckets (shorter alias) */
 #define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 55235bf52c7e..a6d5f8f3072e 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -592,6 +592,8 @@ static int enter_state(suspend_state_t state)
        if (error)
                goto Unlock;

+       freeze_dl_server();
+
        if (suspend_test(TEST_FREEZER))
                goto Finish;

@@ -602,6 +604,7 @@ static int enter_state(suspend_state_t state)
        pm_restore_gfp_mask();

  Finish:
+       thaw_dl_server();
        events_check_enabled = false;
        pm_pr_dbg("Finishing wakeup.\n");
        suspend_finish();
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5fc40caf20d7..f95a375af329 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1478,10 +1478,11 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64

 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
-       update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+       if (!dl_se->dl_server_frozen)
+               update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
 }

-void dl_server_start(struct sched_dl_entity *dl_se)
+static inline void __dl_server_start(struct sched_dl_entity *dl_se)
 {
        /*
         * XXX: the apply do not work fine at the init phase for the
@@ -1500,25 +1501,97 @@ void dl_server_start(struct sched_dl_entity *dl_se)
                setup_new_dl_entity(dl_se);
        }

+       enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
+}
+
+void dl_server_start(struct sched_dl_entity *dl_se)
+{
+       if (dl_se->dl_server_frozen)
+               goto set_active;
+
        if (WARN_ON_ONCE(dl_se->dl_server_active))
                return;

-       enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
+       __dl_server_start(dl_se);
+
+set_active:
        dl_se->dl_server_active = 1;
 }

-void dl_server_stop(struct sched_dl_entity *dl_se)
+static inline void __dl_server_stop(struct sched_dl_entity *dl_se)
 {
-       if (WARN_ON_ONCE(!dl_se->dl_server_active))
-               return;
-
        dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
        hrtimer_try_to_cancel(&dl_se->dl_timer);
        dl_se->dl_defer_armed = 0;
        dl_se->dl_throttled = 0;
+}
+
+void dl_server_stop(struct sched_dl_entity *dl_se)
+{
+       if (dl_se->dl_server_frozen)
+               goto reset_active;
+
+       if (WARN_ON_ONCE(!dl_se->dl_server_active))
+               return;
+
+       __dl_server_stop(dl_se);
+
+reset_active:
        dl_se->dl_server_active = 0;
 }

+void dl_server_freeze(struct sched_dl_entity *dl_se)
+{
+       if (dl_se->dl_server_active) {
+               update_rq_clock(dl_se->rq);
+               __dl_server_stop(dl_se);
+       }
+       dl_se->dl_server_frozen = 1;
+}
+
+void dl_server_thaw(struct sched_dl_entity *dl_se)
+{
+       if (dl_se->dl_server_active) {
+               update_rq_clock(dl_se->rq);
+               __dl_server_start(dl_se);
+       }
+       dl_se->dl_server_frozen = 0;
+}
+
+void freeze_thaw_dl_server(enum dl_server_pm_action action)
+{
+       int cpu;
+
+       cpus_read_lock();
+       for_each_online_cpu(cpu) {
+               struct rq_flags rf;
+               struct rq *rq = cpu_rq(cpu);
+               struct sched_dl_entity *dl_se;
+
+               sched_clock_tick();
+               rq_lock_irqsave(rq, &rf);
+               dl_se = &rq->fair_server;
+               switch (action) {
+               case dl_server_pm_freeze:
+                       if (WARN_ON_ONCE(dl_se->dl_server_frozen))
+                               break;
+
+                       dl_server_freeze(dl_se);
+                       break;
+               case dl_server_pm_thaw:
+                       if (WARN_ON_ONCE(!dl_se->dl_server_frozen))
+                               break;
+
+                       dl_server_thaw(dl_se);
+                       break;
+               default:
+                       WARN_ON_ONCE(1);
+               }
+               rq_unlock_irqrestore(rq, &rf);
+       }
+       cpus_read_unlock();
+}
+
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
                    dl_server_has_tasks_f has_tasks,
                    dl_server_pick_f pick_next,
-- 
2.45.2.741.gdbec12cfda-goog

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH V7 0/9] SCHED_DEADLINE server infrastructure
  2024-06-21 14:41 ` Vineeth Remanan Pillai
@ 2024-06-21 14:59   ` Daniel Bristot de Oliveira
  2024-06-21 15:09     ` Vineeth Remanan Pillai
  2024-07-29 10:32   ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-06-21 14:59 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Shuah Khan, Phil Auld, Suleiman Souhlal, Youssef Esmat

On 6/21/24 16:41, Vineeth Remanan Pillai wrote:
> Sorry that I could not get to reviewing and testing this revision. In
> v6 we had experienced a minor bug where suspend/resume had issues with
> dlserver. Since suspend does not do dequeue, dlserver is not stopped
> and this causes the premature wakeups.

Ouch! I will have a look at this next week. Do you guys know of any other bugs?

an early report, even without a fix/workaround, is a good thing: it lets
us try to reproduce it/think about it as early as we can...

-- Daniel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH V7 0/9] SCHED_DEADLINE server infrastructure
  2024-06-21 14:59   ` Daniel Bristot de Oliveira
@ 2024-06-21 15:09     ` Vineeth Remanan Pillai
  2024-06-21 15:16       ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 34+ messages in thread
From: Vineeth Remanan Pillai @ 2024-06-21 15:09 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Shuah Khan, Phil Auld, Suleiman Souhlal, Youssef Esmat

On Fri, Jun 21, 2024 at 10:59 AM Daniel Bristot de Oliveira
<bristot@kernel.org> wrote:
>
> On 6/21/24 16:41, Vineeth Remanan Pillai wrote:
> > Sorry that I could not get to reviewing and testing this revision. In
> > v6 we experienced a minor bug where suspend/resume had issues with the
> > dlserver. Since suspend does not dequeue tasks, the dlserver is not
> > stopped, and this causes premature wakeups.
>
> Ouch! I will have a look next week on this. Do you guys know any other bug?
>
> an earlier report without necessarily a fix/work around is a good thing
> for us to try to reproduce it/think about it as earlier as we can...
>
Sorry, my mistake; I was buried in other things and missed reporting
this earlier.

There was another minor regression seen lately after we fixed the
above issue: idle CPUs were spending more time in C7 than C10 with the
dlserver changes. This was reported very recently and we haven't
investigated it much yet. Just a heads up; I will keep you posted
as we know more.

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH V7 0/9] SCHED_DEADLINE server infrastructure
  2024-06-21 15:09     ` Vineeth Remanan Pillai
@ 2024-06-21 15:16       ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-06-21 15:16 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Shuah Khan, Phil Auld, Suleiman Souhlal, Youssef Esmat

On 6/21/24 17:09, Vineeth Remanan Pillai wrote:
> On Fri, Jun 21, 2024 at 10:59 AM Daniel Bristot de Oliveira
> <bristot@kernel.org> wrote:
>>
>> On 6/21/24 16:41, Vineeth Remanan Pillai wrote:
>>> Sorry that I could not get to reviewing and testing this revision. In
>>> v6 we experienced a minor bug where suspend/resume had issues with the
>>> dlserver. Since suspend does not dequeue tasks, the dlserver is not
>>> stopped, and this causes premature wakeups.
>>
>> Ouch! I will have a look next week on this. Do you guys know any other bug?
>>
>> an earlier report without necessarily a fix/work around is a good thing
>> for us to try to reproduce it/think about it as earlier as we can...
>>
> Sorry my mistake, I was buried in other things and missed reporting
> this earlier.

no worries.

> 
> There was another minor regression seen lately after we fixed the
> above issue: idle CPUs were spending more time in C7 than C10 with the
> dlserver changes. This was reported very recently and we haven't
> investigated it much yet. Just a heads up; I will keep you posted
> as we know more.

Maybe that is an expected side effect of the server's timer; AFAIR you
guys are using a short period and large runtime (25ms/50ms)?

> Thanks,
> Vineeth


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH V7 0/9] SCHED_DEADLINE server infrastructure
  2024-06-21 14:41 ` Vineeth Remanan Pillai
  2024-06-21 14:59   ` Daniel Bristot de Oliveira
@ 2024-07-29 10:32   ` Peter Zijlstra
  2024-07-29 20:42     ` Vineeth Remanan Pillai
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2024-07-29 10:32 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Shuah Khan, Phil Auld, Suleiman Souhlal, Youssef Esmat

On Fri, Jun 21, 2024 at 10:41:35AM -0400, Vineeth Remanan Pillai wrote:

> Sorry that I could not get to reviewing and testing this revision. In
> v6 we experienced a minor bug where suspend/resume had issues with the
> dlserver. Since suspend does not dequeue tasks, the dlserver is not
> stopped, and this causes premature wakeups. I haven't looked at v7 in
> detail, but I think the issue might still be present.

It is not.

> We have a workaround patch for this in our 5.15 kernel 

That is the problem... your necro kernel doesn't yet have the freezer
rewrite I imagine:

  f5d39b020809 ("freezer,sched: Rewrite core freezer logic")

That would cause all frozen tasks to be dequeued, and once all tasks
are dequeued, the deadline server stops itself too.

Juri did some testing to double check and no suspend / resume issues
were found.

Anyway, I've merged the lot into tip/sched/core.

Thanks all!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/core: Fix picking of tasks for core scheduling with DL server
  2024-05-27 12:06 ` [PATCH V7 8/9] sched/core: Fix picking of tasks for core scheduling with DL server Daniel Bristot de Oliveira
@ 2024-07-29 10:34   ` tip-bot2 for Joel Fernandes (Google)
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Joel Fernandes (Google) @ 2024-07-29 10:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Suleiman Souhlal, Joel Fernandes (Google),
	Daniel Bristot de Oliveira, Peter Zijlstra (Intel),
	Vineeth Pillai, Juri Lelli, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     c8a85394cfdb4696b4e2f8a0f3066a1c921af426
Gitweb:        https://git.kernel.org/tip/c8a85394cfdb4696b4e2f8a0f3066a1c921af426
Author:        Joel Fernandes (Google) <joel@joelfernandes.org>
AuthorDate:    Mon, 27 May 2024 14:06:54 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 29 Jul 2024 12:22:37 +02:00

sched/core: Fix picking of tasks for core scheduling with DL server

* Use simple CFS pick_task for DL pick_task

  DL server's pick_task calls CFS's pick_next_task_fair(), which is wrong
  because core scheduling's pick_task only calls CFS's pick_task() for
  evaluation / checking of the CFS task (comparing across CPUs), not for
  actually picking the next task. This causes RB tree corruption issues
  in CFS that were found by syzbot.

* Make pick_task_fair clear DL server

  A DL task pick might set ->dl_server, but it is possible the task will
  never run (say the other HT has a stop task). If the CFS task is later
  picked directly (say without the DL server), ->dl_server will still be
  set. So clear it in pick_task_fair().

This fixes the KASAN issue reported by syzbot in set_next_entity().

(DL refactoring suggestions by Vineeth Pillai).

Reported-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/b10489ab1f03d23e08e6097acea47442e7d6466f.1716811044.git.bristot@kernel.org
---
 include/linux/sched.h   |  3 ++-
 kernel/sched/deadline.c | 27 ++++++++++++++++++++++-----
 kernel/sched/fair.c     | 23 +++++++++++++++++++++--
 kernel/sched/sched.h    |  3 ++-
 4 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4edd7e2..2c1b4ee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -686,7 +686,8 @@ struct sched_dl_entity {
 	 */
 	struct rq			*rq;
 	dl_server_has_tasks_f		server_has_tasks;
-	dl_server_pick_f		server_pick;
+	dl_server_pick_f		server_pick_next;
+	dl_server_pick_f		server_pick_task;
 
 #ifdef CONFIG_RT_MUTEXES
 	/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 747c0c5..8571bc9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1664,11 +1664,13 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
-		    dl_server_pick_f pick)
+		    dl_server_pick_f pick_next,
+		    dl_server_pick_f pick_task)
 {
 	dl_se->rq = rq;
 	dl_se->server_has_tasks = has_tasks;
-	dl_se->server_pick = pick;
+	dl_se->server_pick_next = pick_next;
+	dl_se->server_pick_task = pick_task;
 }
 
 void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
@@ -2399,7 +2401,12 @@ static struct sched_dl_entity *pick_next_dl_entity(struct dl_rq *dl_rq)
 	return __node_2_dle(left);
 }
 
-static struct task_struct *pick_task_dl(struct rq *rq)
+/*
+ * __pick_next_task_dl - Helper to pick the next -deadline task to run.
+ * @rq: The runqueue to pick the next task from.
+ * @peek: If true, just peek at the next task. Only relevant for dlserver.
+ */
+static struct task_struct *__pick_next_task_dl(struct rq *rq, bool peek)
 {
 	struct sched_dl_entity *dl_se;
 	struct dl_rq *dl_rq = &rq->dl;
@@ -2413,7 +2420,10 @@ again:
 	WARN_ON_ONCE(!dl_se);
 
 	if (dl_server(dl_se)) {
-		p = dl_se->server_pick(dl_se);
+		if (IS_ENABLED(CONFIG_SMP) && peek)
+			p = dl_se->server_pick_task(dl_se);
+		else
+			p = dl_se->server_pick_next(dl_se);
 		if (!p) {
 			WARN_ON_ONCE(1);
 			dl_se->dl_yielded = 1;
@@ -2428,11 +2438,18 @@ again:
 	return p;
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_dl(struct rq *rq)
+{
+	return __pick_next_task_dl(rq, true);
+}
+#endif
+
 static struct task_struct *pick_next_task_dl(struct rq *rq)
 {
 	struct task_struct *p;
 
-	p = pick_task_dl(rq);
+	p = __pick_next_task_dl(rq, false);
 	if (!p)
 		return p;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ea5ec8..ee251ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8479,6 +8479,14 @@ again:
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
+	/*
+	 * This can be called directly from CFS's ->pick_task() or indirectly
+	 * from DL's ->pick_task() when the fair server is enabled. In the
+	 * indirect case, DL will set ->dl_server just after this call, so it
+	 * is OK to clear. In the direct case, we must clear it ourselves.
+	 */
+	task_of(se)->dl_server = NULL;
+
 	return task_of(se);
 }
 #endif
@@ -8638,7 +8646,16 @@ static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
 	return !!dl_se->rq->cfs.nr_running;
 }
 
-static struct task_struct *fair_server_pick(struct sched_dl_entity *dl_se)
+static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
+{
+#ifdef CONFIG_SMP
+	return pick_task_fair(dl_se->rq);
+#else
+	return NULL;
+#endif
+}
+
+static struct task_struct *fair_server_pick_next(struct sched_dl_entity *dl_se)
 {
 	return pick_next_task_fair(dl_se->rq, NULL, NULL);
 }
@@ -8649,7 +8666,9 @@ void fair_server_init(struct rq *rq)
 
 	init_dl_entity(dl_se);
 
-	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_next,
+		       fair_server_pick_task);
+
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b777ac3..f7e028b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -361,7 +361,8 @@ extern void dl_server_start(struct sched_dl_entity *dl_se);
 extern void dl_server_stop(struct sched_dl_entity *dl_se);
 extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
-		    dl_server_pick_f pick);
+		    dl_server_pick_f pick_next,
+		    dl_server_pick_f pick_task);
 
 extern void dl_server_update_idle_time(struct rq *rq,
 		    struct task_struct *p);
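
To summarize the split visible above: ->server_pick_task() is the
side-effect-free "peek" used by pick_task_dl() on CONFIG_SMP (the core
scheduling pick path), while ->server_pick_next() commits the selection.
A condensed sketch of the resulting contract, using the fair server
callbacks from the hunks above (surrounding kernel wiring elided):

	/* Peek only: reached via pick_task_dl() under CONFIG_SMP. */
	static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
	{
		return pick_task_fair(dl_se->rq);	/* selects, commits nothing */
	}

	/* Commit: reached via pick_next_task_dl() on the regular pick path. */
	static struct task_struct *fair_server_pick_next(struct sched_dl_entity *dl_se)
	{
		return pick_next_task_fair(dl_se->rq, NULL, NULL);
	}

	/* fair_server_init() now registers both callbacks: */
	dl_server_init(&rq->fair_server, rq, fair_server_has_tasks,
		       fair_server_pick_next, fair_server_pick_task);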

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/rt: Remove default bandwidth control
  2024-05-27 12:06 ` [PATCH V7 9/9] sched/rt: Remove default bandwidth control Daniel Bristot de Oliveira
@ 2024-07-29 10:34   ` tip-bot2 for Peter Zijlstra
  2024-11-27 10:55   ` [PATCH V7 9/9] " Michal Koutný
  1 sibling, 0 replies; 34+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-07-29 10:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Daniel Bristot de Oliveira,
	Vineeth Pillai (Google), Juri Lelli, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5f6bd380c7bdbe10f7b4e8ddcceed60ce0714c6d
Gitweb:        https://git.kernel.org/tip/5f6bd380c7bdbe10f7b4e8ddcceed60ce0714c6d
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 27 May 2024 14:06:55 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 29 Jul 2024 12:22:37 +02:00

sched/rt: Remove default bandwidth control

Now that fair_server exists, we no longer need RT bandwidth control
unless CONFIG_RT_GROUP_SCHED is enabled.

Enable fair_server with parameters equivalent to RT throttling.
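
For reference, the default throttling bounds are sched_rt_period_us =
1000000 and sched_rt_runtime_us = 950000, leaving 1000 ms - 950 ms =
50 ms per second for fair tasks; the fair server defaults set below
(runtime = 50 ms, period = 1000 ms, i.e. a 5% reservation) preserve
exactly that guarantee.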

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/14d562db55df5c3c780d91940743acb166895ef7.1716811044.git.bristot@kernel.org
---
 kernel/sched/core.c     |   9 +-
 kernel/sched/deadline.c |   5 +-
 kernel/sched/debug.c    |   3 +-
 kernel/sched/rt.c       | 242 +++++++++++++++++----------------------
 kernel/sched/sched.h    |   3 +-
 5 files changed, 120 insertions(+), 142 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 11abfcd..29fde99 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8266,8 +8266,6 @@ void __init sched_init(void)
 #endif /* CONFIG_RT_GROUP_SCHED */
 	}
 
-	init_rt_bandwidth(&def_rt_bandwidth, global_rt_period(), global_rt_runtime());
-
 #ifdef CONFIG_SMP
 	init_defrootdomain();
 #endif
@@ -8322,8 +8320,13 @@ void __init sched_init(void)
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-		rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
 #ifdef CONFIG_RT_GROUP_SCHED
+		/*
+		 * This is required for the init CPU because rt.c:__enable_runtime()
+		 * only starts working after scheduler_running is set, which is not
+		 * yet the case here.
+		 */
+		rq->rt.rt_runtime = global_rt_runtime();
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
 #endif
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8571bc9..c5f1cc7 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1554,6 +1554,7 @@ throttle:
 	if (dl_se == &rq->fair_server)
 		return;
 
+#ifdef CONFIG_RT_GROUP_SCHED
 	/*
 	 * Because -- for now -- we share the rt bandwidth, we need to
 	 * account our runtime there too, otherwise actual rt tasks
@@ -1578,6 +1579,7 @@ throttle:
 			rt_rq->rt_time += delta_exec;
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
+#endif
 }
 
 /*
@@ -1632,8 +1634,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 	 * this before getting generic.
 	 */
 	if (!dl_server(dl_se)) {
-		/* Disabled */
-		u64 runtime = 0;
+		u64 runtime = 50 * NSEC_PER_MSEC;
 		u64 period = 1000 * NSEC_PER_MSEC;
 
 		dl_server_apply_params(dl_se, runtime, period, 1);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 72f2715..e75914e 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -885,9 +885,12 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(rt_rq->x))
 
 	PU(rt_nr_running);
+
+#ifdef CONFIG_RT_GROUP_SCHED
 	P(rt_throttled);
 	PN(rt_time);
 	PN(rt_runtime);
+#endif
 
 #undef PN
 #undef PU
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 310523c..a8731da 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -8,10 +8,6 @@ int sched_rr_timeslice = RR_TIMESLICE;
 /* More than 4 hours if BW_SHIFT equals 20. */
 static const u64 max_rt_runtime = MAX_BW;
 
-static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
-
-struct rt_bandwidth def_rt_bandwidth;
-
 /*
  * period over which we measure -rt task CPU usage in us.
  * default: 1s
@@ -66,6 +62,40 @@ static int __init sched_rt_sysctl_init(void)
 late_initcall(sched_rt_sysctl_init);
 #endif
 
+void init_rt_rq(struct rt_rq *rt_rq)
+{
+	struct rt_prio_array *array;
+	int i;
+
+	array = &rt_rq->active;
+	for (i = 0; i < MAX_RT_PRIO; i++) {
+		INIT_LIST_HEAD(array->queue + i);
+		__clear_bit(i, array->bitmap);
+	}
+	/* delimiter for bitsearch: */
+	__set_bit(MAX_RT_PRIO, array->bitmap);
+
+#if defined CONFIG_SMP
+	rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
+	rt_rq->highest_prio.next = MAX_RT_PRIO-1;
+	rt_rq->overloaded = 0;
+	plist_head_init(&rt_rq->pushable_tasks);
+#endif /* CONFIG_SMP */
+	/* We start in dequeued state, because no RT tasks are queued */
+	rt_rq->rt_queued = 0;
+
+#ifdef CONFIG_RT_GROUP_SCHED
+	rt_rq->rt_time = 0;
+	rt_rq->rt_throttled = 0;
+	rt_rq->rt_runtime = 0;
+	raw_spin_lock_init(&rt_rq->rt_runtime_lock);
+#endif
+}
+
+#ifdef CONFIG_RT_GROUP_SCHED
+
+static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
+
 static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
 {
 	struct rt_bandwidth *rt_b =
@@ -130,35 +160,6 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 	do_start_rt_bandwidth(rt_b);
 }
 
-void init_rt_rq(struct rt_rq *rt_rq)
-{
-	struct rt_prio_array *array;
-	int i;
-
-	array = &rt_rq->active;
-	for (i = 0; i < MAX_RT_PRIO; i++) {
-		INIT_LIST_HEAD(array->queue + i);
-		__clear_bit(i, array->bitmap);
-	}
-	/* delimiter for bit-search: */
-	__set_bit(MAX_RT_PRIO, array->bitmap);
-
-#if defined CONFIG_SMP
-	rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
-	rt_rq->highest_prio.next = MAX_RT_PRIO-1;
-	rt_rq->overloaded = 0;
-	plist_head_init(&rt_rq->pushable_tasks);
-#endif /* CONFIG_SMP */
-	/* We start is dequeued state, because no RT tasks are queued */
-	rt_rq->rt_queued = 0;
-
-	rt_rq->rt_time = 0;
-	rt_rq->rt_throttled = 0;
-	rt_rq->rt_runtime = 0;
-	raw_spin_lock_init(&rt_rq->rt_runtime_lock);
-}
-
-#ifdef CONFIG_RT_GROUP_SCHED
 static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
 {
 	hrtimer_cancel(&rt_b->rt_period_timer);
@@ -195,7 +196,6 @@ void unregister_rt_sched_group(struct task_group *tg)
 {
 	if (tg->rt_se)
 		destroy_rt_bandwidth(&tg->rt_bandwidth);
-
 }
 
 void free_rt_sched_group(struct task_group *tg)
@@ -253,8 +253,7 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
 	if (!tg->rt_se)
 		goto err;
 
-	init_rt_bandwidth(&tg->rt_bandwidth,
-			ktime_to_ns(def_rt_bandwidth.rt_period), 0);
+	init_rt_bandwidth(&tg->rt_bandwidth, ktime_to_ns(global_rt_period()), 0);
 
 	for_each_possible_cpu(i) {
 		rt_rq = kzalloc_node(sizeof(struct rt_rq),
@@ -604,70 +603,6 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
 	return &rt_rq->tg->rt_bandwidth;
 }
 
-#else /* !CONFIG_RT_GROUP_SCHED */
-
-static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
-{
-	return rt_rq->rt_runtime;
-}
-
-static inline u64 sched_rt_period(struct rt_rq *rt_rq)
-{
-	return ktime_to_ns(def_rt_bandwidth.rt_period);
-}
-
-typedef struct rt_rq *rt_rq_iter_t;
-
-#define for_each_rt_rq(rt_rq, iter, rq) \
-	for ((void) iter, rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
-
-#define for_each_sched_rt_entity(rt_se) \
-	for (; rt_se; rt_se = NULL)
-
-static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se)
-{
-	return NULL;
-}
-
-static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
-{
-	struct rq *rq = rq_of_rt_rq(rt_rq);
-
-	if (!rt_rq->rt_nr_running)
-		return;
-
-	enqueue_top_rt_rq(rt_rq);
-	resched_curr(rq);
-}
-
-static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
-{
-	dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
-}
-
-static inline int rt_rq_throttled(struct rt_rq *rt_rq)
-{
-	return rt_rq->rt_throttled;
-}
-
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-
-static inline
-struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
-{
-	return &cpu_rq(cpu)->rt;
-}
-
-static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
-{
-	return &def_rt_bandwidth;
-}
-
-#endif /* CONFIG_RT_GROUP_SCHED */
-
 bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
 {
 	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
@@ -859,7 +794,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 	const struct cpumask *span;
 
 	span = sched_rt_period_mask();
-#ifdef CONFIG_RT_GROUP_SCHED
+
 	/*
 	 * FIXME: isolated CPUs should really leave the root task group,
 	 * whether they are isolcpus or were isolated via cpusets, lest
@@ -871,7 +806,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 	 */
 	if (rt_b == &root_task_group.rt_bandwidth)
 		span = cpu_online_mask;
-#endif
+
 	for_each_cpu(i, span) {
 		int enqueue = 0;
 		struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
@@ -938,18 +873,6 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 	return idle;
 }
 
-static inline int rt_se_prio(struct sched_rt_entity *rt_se)
-{
-#ifdef CONFIG_RT_GROUP_SCHED
-	struct rt_rq *rt_rq = group_rt_rq(rt_se);
-
-	if (rt_rq)
-		return rt_rq->highest_prio.curr;
-#endif
-
-	return rt_task_of(rt_se)->prio;
-}
-
 static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 {
 	u64 runtime = sched_rt_runtime(rt_rq);
@@ -993,6 +916,72 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 	return 0;
 }
 
+#else /* !CONFIG_RT_GROUP_SCHED */
+
+typedef struct rt_rq *rt_rq_iter_t;
+
+#define for_each_rt_rq(rt_rq, iter, rq) \
+	for ((void) iter, rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
+
+#define for_each_sched_rt_entity(rt_se) \
+	for (; rt_se; rt_se = NULL)
+
+static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se)
+{
+	return NULL;
+}
+
+static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
+{
+	struct rq *rq = rq_of_rt_rq(rt_rq);
+
+	if (!rt_rq->rt_nr_running)
+		return;
+
+	enqueue_top_rt_rq(rt_rq);
+	resched_curr(rq);
+}
+
+static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
+{
+	dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
+}
+
+static inline int rt_rq_throttled(struct rt_rq *rt_rq)
+{
+	return false;
+}
+
+static inline const struct cpumask *sched_rt_period_mask(void)
+{
+	return cpu_online_mask;
+}
+
+static inline
+struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
+{
+	return &cpu_rq(cpu)->rt;
+}
+
+#ifdef CONFIG_SMP
+static void __enable_runtime(struct rq *rq) { }
+static void __disable_runtime(struct rq *rq) { }
+#endif
+
+#endif /* CONFIG_RT_GROUP_SCHED */
+
+static inline int rt_se_prio(struct sched_rt_entity *rt_se)
+{
+#ifdef CONFIG_RT_GROUP_SCHED
+	struct rt_rq *rt_rq = group_rt_rq(rt_se);
+
+	if (rt_rq)
+		return rt_rq->highest_prio.curr;
+#endif
+
+	return rt_task_of(rt_se)->prio;
+}
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -1000,7 +989,6 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 static void update_curr_rt(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
-	struct sched_rt_entity *rt_se = &curr->rt;
 	s64 delta_exec;
 
 	if (curr->sched_class != &rt_sched_class)
@@ -1010,6 +998,9 @@ static void update_curr_rt(struct rq *rq)
 	if (unlikely(delta_exec <= 0))
 		return;
 
+#ifdef CONFIG_RT_GROUP_SCHED
+	struct sched_rt_entity *rt_se = &curr->rt;
+
 	if (!rt_bandwidth_enabled())
 		return;
 
@@ -1028,6 +1019,7 @@ static void update_curr_rt(struct rq *rq)
 				do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
 		}
 	}
+#endif
 }
 
 static void
@@ -1184,7 +1176,6 @@ dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 static void
 inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
-	start_rt_bandwidth(&def_rt_bandwidth);
 }
 
 static inline
@@ -2912,19 +2903,6 @@ int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 #ifdef CONFIG_SYSCTL
 static int sched_rt_global_constraints(void)
 {
-	unsigned long flags;
-	int i;
-
-	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
-	for_each_possible_cpu(i) {
-		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
-
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		rt_rq->rt_runtime = global_rt_runtime();
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
-	}
-	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
-
 	return 0;
 }
 #endif /* CONFIG_SYSCTL */
@@ -2944,12 +2922,6 @@ static int sched_rt_global_validate(void)
 
 static void sched_rt_do_global(void)
 {
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
-	def_rt_bandwidth.rt_runtime = global_rt_runtime();
-	def_rt_bandwidth.rt_period = ns_to_ktime(global_rt_period());
-	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
 }
 
 static int sched_rt_handler(const struct ctl_table *table, int write, void *buffer,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f7e028b..1e1d1b4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -729,13 +729,13 @@ struct rt_rq {
 #endif /* CONFIG_SMP */
 	int			rt_queued;
 
+#ifdef CONFIG_RT_GROUP_SCHED
 	int			rt_throttled;
 	u64			rt_time;
 	u64			rt_runtime;
 	/* Nests inside the rq lock: */
 	raw_spinlock_t		rt_runtime_lock;
 
-#ifdef CONFIG_RT_GROUP_SCHED
 	unsigned int		rt_nr_boosted;
 
 	struct rq		*rq;
@@ -2519,7 +2519,6 @@ extern void reweight_task(struct task_struct *p, const struct load_weight *lw);
 extern void resched_curr(struct rq *rq);
 extern void resched_cpu(int cpu);
 
-extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/core: Fix priority checking for DL server picks
  2024-05-27 12:06 ` [PATCH V7 7/9] sched/core: Fix priority checking for DL server picks Daniel Bristot de Oliveira
@ 2024-07-29 10:34   ` tip-bot2 for Joel Fernandes (Google)
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Joel Fernandes (Google) @ 2024-07-29 10:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Suleiman Souhlal, Joel Fernandes (Google),
	Daniel Bristot de Oliveira, Peter Zijlstra (Intel),
	Vineeth Pillai, Juri Lelli, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     4b26cfdd395638918e370f62bd2c33e6f63ffb5e
Gitweb:        https://git.kernel.org/tip/4b26cfdd395638918e370f62bd2c33e6f63ffb5e
Author:        Joel Fernandes (Google) <joel@joelfernandes.org>
AuthorDate:    Mon, 27 May 2024 14:06:53 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 29 Jul 2024 12:22:36 +02:00

sched/core: Fix priority checking for DL server picks

In core scheduling, a DL server pick (which is a CFS task) should be
given higher priority than tasks in other classes.

Not doing so causes CFS starvation: without this fix, a CFS task
competing with RT tasks can be completely starved and the DL server's
boosting completely ignored. A kselftest demonstrating this is added
later in the series.

Fix these problems.

Reported-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/48b78521d86f3b33c24994d843c1aad6b987dda9.1716811044.git.bristot@kernel.org
---
 kernel/sched/core.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f95600c..11abfcd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -163,6 +163,9 @@ static inline int __task_prio(const struct task_struct *p)
 	if (p->sched_class == &stop_sched_class) /* trumps deadline */
 		return -2;
 
+	if (p->dl_server)
+		return -1; /* deadline */
+
 	if (rt_prio(p->prio)) /* includes deadline */
 		return p->prio; /* [-1, 99] */
 
@@ -192,8 +195,24 @@ static inline bool prio_less(const struct task_struct *a,
 	if (-pb < -pa)
 		return false;
 
-	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
-		return !dl_time_before(a->dl.deadline, b->dl.deadline);
+	if (pa == -1) { /* dl_prio() doesn't work because of stop_class above */
+		const struct sched_dl_entity *a_dl, *b_dl;
+
+		a_dl = &a->dl;
+		/*
+		 * Since 'a' and 'b' can be CFS tasks served by a DL server,
+		 * __task_prio() can return -1 (for DL) even for those. In that
+		 * case, get to the dl_server's DL entity.
+		 */
+		if (a->dl_server)
+			a_dl = a->dl_server;
+
+		b_dl = &b->dl;
+		if (b->dl_server)
+			b_dl = b->dl_server;
+
+		return !dl_time_before(a_dl->deadline, b_dl->deadline);
+	}
 
 	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
 		return cfs_prio_less(a, b, in_fi);
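
A concrete illustration of the new comparison (numbers invented for the
example): under core scheduling, suppose candidate 'a' is a CFS task
currently served by its runqueue's fair server whose absolute deadline
is 2 ms away, while 'b' is a plain SCHED_DEADLINE task with a deadline
5 ms away. Both now yield __task_prio() == -1, and prio_less() compares
the 2 ms and 5 ms deadlines through a->dl_server rather than through
a's own dl entity (meaningless for a CFS task), so the served CFS task
is preferred.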

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Fair server interface
  2024-05-27 12:06 ` [PATCH V7 6/9] sched/fair: Fair server interface Daniel Bristot de Oliveira
@ 2024-07-29 10:34   ` tip-bot2 for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Daniel Bristot de Oliveira @ 2024-07-29 10:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Daniel Bristot de Oliveira, Peter Zijlstra (Intel), Juri Lelli,
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     d741f297bceaf52aa1b97b997708fc0cd853c73e
Gitweb:        https://git.kernel.org/tip/d741f297bceaf52aa1b97b997708fc0cd853c73e
Author:        Daniel Bristot de Oliveira <bristot@kernel.org>
AuthorDate:    Mon, 27 May 2024 14:06:52 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 29 Jul 2024 12:22:36 +02:00

sched/fair: Fair server interface

Add an interface for fair server setup on debugfs.

Each CPU has two files under /debug/sched/fair_server/cpu{ID}:

 - runtime: set runtime in ns
 - period:  set period in ns

This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to set
bounds on admission control.

The interface also adds the server to the dl bandwidth accounting.
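
As an illustration of the intended usage (paths assume debugfs mounted
at /sys/kernel/debug; the values are examples, not defaults guaranteed
by this patch):

 # cat /sys/kernel/debug/sched/fair_server/cpu0/runtime
 50000000
 # echo 25000000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
 # cat /sys/kernel/debug/sched/fair_server/cpu0/period
 1000000000

Per the checks in sched_fair_server_write() below, a runtime larger
than the period, or a period outside [100 us, ~4 s], is rejected with
-EINVAL.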

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/a9ef9fc69bcedb44bddc9bc34f2b313296052819.1716811044.git.bristot@kernel.org
---
 kernel/sched/deadline.c | 103 ++++++++++++++++++++-----
 kernel/sched/debug.c    | 159 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h    |   3 +-
 kernel/sched/topology.c |   8 ++-
 4 files changed, 256 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1b29531..747c0c5 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -320,19 +320,12 @@ void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 		__sub_running_bw(dl_se->dl_bw, dl_rq);
 }
 
-static void dl_change_utilization(struct task_struct *p, u64 new_bw)
+static void dl_rq_change_utilization(struct rq *rq, struct sched_dl_entity *dl_se, u64 new_bw)
 {
-	struct rq *rq;
-
-	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
-
-	if (task_on_rq_queued(p))
-		return;
+	if (dl_se->dl_non_contending) {
+		sub_running_bw(dl_se, &rq->dl);
+		dl_se->dl_non_contending = 0;
 
-	rq = task_rq(p);
-	if (p->dl.dl_non_contending) {
-		sub_running_bw(&p->dl, &rq->dl);
-		p->dl.dl_non_contending = 0;
 		/*
 		 * If the timer handler is currently running and the
 		 * timer cannot be canceled, inactive_task_timer()
@@ -340,13 +333,25 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
 		 * will not touch the rq's active utilization,
 		 * so we are still safe.
 		 */
-		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
-			put_task_struct(p);
+		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+			if (!dl_server(dl_se))
+				put_task_struct(dl_task_of(dl_se));
+		}
 	}
-	__sub_rq_bw(p->dl.dl_bw, &rq->dl);
+	__sub_rq_bw(dl_se->dl_bw, &rq->dl);
 	__add_rq_bw(new_bw, &rq->dl);
 }
 
+static void dl_change_utilization(struct task_struct *p, u64 new_bw)
+{
+	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
+
+	if (task_on_rq_queued(p))
+		return;
+
+	dl_rq_change_utilization(task_rq(p), &p->dl, new_bw);
+}
+
 static void __dl_clear_params(struct sched_dl_entity *dl_se);
 
 /*
@@ -1621,11 +1626,17 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 {
 	struct rq *rq = dl_se->rq;
 
+	/*
+	 * XXX: the apply do not work fine at the init phase for the
+	 * fair server because things are not yet set. We need to improve
+	 * this before getting generic.
+	 */
 	if (!dl_server(dl_se)) {
 		/* Disabled */
-		dl_se->dl_runtime = 0;
-		dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
-		dl_se->dl_period = 1000 * NSEC_PER_MSEC;
+		u64 runtime = 0;
+		u64 period = 1000 * NSEC_PER_MSEC;
+
+		dl_server_apply_params(dl_se, runtime, period, 1);
 
 		dl_se->dl_server = 1;
 		dl_se->dl_defer = 1;
@@ -1660,6 +1671,64 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 	dl_se->server_pick = pick;
 }
 
+void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
+{
+	u64 new_bw = dl_se->dl_bw;
+	int cpu = cpu_of(rq);
+	struct dl_bw *dl_b;
+
+	dl_b = dl_bw_of(cpu_of(rq));
+	guard(raw_spinlock)(&dl_b->lock);
+
+	if (!dl_bw_cpus(cpu))
+		return;
+
+	__dl_add(dl_b, new_bw, dl_bw_cpus(cpu));
+}
+
+int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
+{
+	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+	u64 new_bw = to_ratio(period, runtime);
+	struct rq *rq = dl_se->rq;
+	int cpu = cpu_of(rq);
+	struct dl_bw *dl_b;
+	unsigned long cap;
+	int retval = 0;
+	int cpus;
+
+	dl_b = dl_bw_of(cpu);
+	guard(raw_spinlock)(&dl_b->lock);
+
+	cpus = dl_bw_cpus(cpu);
+	cap = dl_bw_capacity(cpu);
+
+	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
+		return -EBUSY;
+
+	if (init) {
+		__add_rq_bw(new_bw, &rq->dl);
+		__dl_add(dl_b, new_bw, cpus);
+	} else {
+		__dl_sub(dl_b, dl_se->dl_bw, cpus);
+		__dl_add(dl_b, new_bw, cpus);
+
+		dl_rq_change_utilization(rq, dl_se, new_bw);
+	}
+
+	dl_se->dl_runtime = runtime;
+	dl_se->dl_deadline = period;
+	dl_se->dl_period = period;
+
+	dl_se->runtime = 0;
+	dl_se->deadline = 0;
+
+	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+	dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
+
+	return retval;
+}
+
 /*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 90c4a99..72f2715 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -333,8 +333,165 @@ static const struct file_operations sched_debug_fops = {
 	.release	= seq_release,
 };
 
+enum dl_param {
+	DL_RUNTIME = 0,
+	DL_PERIOD,
+};
+
+static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
+static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
+
+static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf,
+				       size_t cnt, loff_t *ppos, enum dl_param param)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+	u64 runtime, period;
+	size_t err;
+	int retval;
+	u64 value;
+
+	err = kstrtoull_from_user(ubuf, cnt, 10, &value);
+	if (err)
+		return err;
+
+	scoped_guard (rq_lock_irqsave, rq) {
+		runtime  = rq->fair_server.dl_runtime;
+		period = rq->fair_server.dl_period;
+
+		switch (param) {
+		case DL_RUNTIME:
+			if (runtime == value)
+				break;
+			runtime = value;
+			break;
+		case DL_PERIOD:
+			if (value == period)
+				break;
+			period = value;
+			break;
+		}
+
+		if (runtime > period ||
+		    period > fair_server_period_max ||
+		    period < fair_server_period_min) {
+			return -EINVAL;
+		}
+
+		if (rq->cfs.h_nr_running) {
+			update_rq_clock(rq);
+			dl_server_stop(&rq->fair_server);
+		}
+
+		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
+		if (retval)
+			cnt = retval;
+
+		if (!runtime)
+			printk_deferred("Fair server disabled on CPU %d, system may crash due to starvation.\n",
+					cpu_of(rq));
+
+		if (rq->cfs.h_nr_running)
+			dl_server_start(&rq->fair_server);
+	}
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+	u64 value;
+
+	switch (param) {
+	case DL_RUNTIME:
+		value = rq->fair_server.dl_runtime;
+		break;
+	case DL_PERIOD:
+		value = rq->fair_server.dl_period;
+		break;
+	}
+
+	seq_printf(m, "%llu\n", value);
+	return 0;
+}
+
+static ssize_t
+sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
+				size_t cnt, loff_t *ppos)
+{
+	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME);
+}
+
+static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
+{
+	return sched_fair_server_show(m, v, DL_RUNTIME);
+}
+
+static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_runtime_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_runtime_fops = {
+	.open		= sched_fair_server_runtime_open,
+	.write		= sched_fair_server_runtime_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static ssize_t
+sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
+			       size_t cnt, loff_t *ppos)
+{
+	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD);
+}
+
+static int sched_fair_server_period_show(struct seq_file *m, void *v)
+{
+	return sched_fair_server_show(m, v, DL_PERIOD);
+}
+
+static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_period_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_period_fops = {
+	.open		= sched_fair_server_period_open,
+	.write		= sched_fair_server_period_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static struct dentry *debugfs_sched;
 
+static void debugfs_fair_server_init(void)
+{
+	struct dentry *d_fair;
+	unsigned long cpu;
+
+	d_fair = debugfs_create_dir("fair_server", debugfs_sched);
+	if (!d_fair)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct dentry *d_cpu;
+		char buf[32];
+
+		snprintf(buf, sizeof(buf), "cpu%lu", cpu);
+		d_cpu = debugfs_create_dir(buf, d_fair);
+
+		debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
+		debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
+	}
+}
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;
@@ -374,6 +531,8 @@ static __init int sched_init_debug(void)
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
+	debugfs_fair_server_init();
+
 	return 0;
 }
 late_initcall(sched_init_debug);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 64fb677..b777ac3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -366,6 +366,9 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 extern void dl_server_update_idle_time(struct rq *rq,
 		    struct task_struct *p);
 extern void fair_server_init(struct rq *rq);
+extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
+extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
+		    u64 runtime, u64 period, bool init);
 
 #ifdef CONFIG_CGROUP_SCHED
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 76504b7..9748a4c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -516,6 +516,14 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
 		set_rq_online(rq);
 
+	/*
+	 * Because the rq is not a task, dl_add_task_root_domain() did not
+	 * move the fair server's bandwidth to the rd if the server had
+	 * already started. Add it now.
+	 */
+	if (rq->fair_server.dl_server)
+		__dl_server_attach_root(&rq->fair_server, rq);
+
 	rq_unlock_irqrestore(rq, &rf);
 
 	if (old_rd)
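
For a sense of the admission check in dl_server_apply_params():
assuming the kernel's usual fixed-point to_ratio(period, runtime) =
(runtime << BW_SHIFT) / period with BW_SHIFT = 20, a 50 ms runtime in
a 1000 ms period gives new_bw = (50000000 << 20) / 1000000000 ~=
0.05 * 2^20 = 52428, i.e. 5% of the unit bandwidth; __dl_overflow()
then verifies that swapping old_bw for new_bw still fits the root
domain capacity before the new parameters are committed.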

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/deadline: Deferrable dl server
  2024-05-27 12:06 ` [PATCH V7 5/9] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
@ 2024-07-29 10:34   ` tip-bot2 for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Daniel Bristot de Oliveira @ 2024-07-29 10:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Daniel Bristot de Oliveira, Peter Zijlstra (Intel), Juri Lelli,
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     a110a81c52a9de73e2e57e826dd3bf0fd4c22226
Gitweb:        https://git.kernel.org/tip/a110a81c52a9de73e2e57e826dd3bf0fd4c22226
Author:        Daniel Bristot de Oliveira <bristot@kernel.org>
AuthorDate:    Mon, 27 May 2024 14:06:51 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 29 Jul 2024 12:22:36 +02:00

sched/deadline: Deferrable dl server

Among the motivations for the DL servers is the real-time throttling
mechanism. This mechanism works by throttling the rt_rq after
running for a long period without leaving space for fair tasks.

The base dl server avoids this problem by boosting fair tasks instead
of throttling the rt_rq. The point is that it boosts without waiting
for potential starvation, causing some non-intuitive cases.

For example, an IRQ dispatches two tasks on an idle system, a fair
and an RT. The DL server will be activated, running the fair task
before the RT one. This problem can be avoided by deferring the
dl server activation.

By setting the defer option, the dl_server will dispatch a
SCHED_DEADLINE reservation with replenished runtime, but throttled.

The dl_timer will be set to fire at the defer time, i.e. (period -
runtime) ns from the start time, thus boosting the fair rq at the
defer time.

If the fair scheduler has the opportunity to run while waiting
for the defer time, the dl server runtime will be consumed. If
the runtime is completely consumed before the defer time, the
server will be replenished while still in a throttled state. Then,
the dl_timer will be reset to the new defer time.

If the fair server reaches the defer time without consuming
its runtime, the server will start running, following CBS rules
(thus without breaking SCHED_DEADLINE). The server will then
continue in the running state (without deferring) until its fair
tasks are able to execute under the regular fair scheduler (end of
the starvation).
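
A worked timeline, using 50 ms runtime / 1000 ms period for
illustration: the server is activated throttled at t = 0 with its
dl_timer armed at t = period - runtime = 950 ms. If fair tasks manage
to run in the background and consume 20 ms of budget before the timer
fires, the defer point moves out to deadline - runtime = 970 ms; if
they consume the full 50 ms, the server is replenished, stays
throttled, and the timer is reset to the next period's defer time.
Only when the defer time arrives with runtime still available does the
server actually start boosting.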

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/dd175943c72533cd9f0b87767c6499204879cc38.1716811044.git.bristot@kernel.org
---
 include/linux/sched.h   |  12 ++-
 kernel/sched/deadline.c | 301 +++++++++++++++++++++++++++++++++------
 kernel/sched/fair.c     |  24 ++-
 kernel/sched/idle.c     |   2 +-
 kernel/sched/sched.h    |   4 +-
 5 files changed, 298 insertions(+), 45 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c771ea..4edd7e2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -641,12 +641,24 @@ struct sched_dl_entity {
 	 * overruns.
 	 *
 	 * @dl_server tells if this is a server entity.
+	 *
+	 * @dl_defer tells if this is a deferred or regular server. For
+	 * now, only the defer server exists.
+	 *
+	 * @dl_defer_armed tells if the deferrable server is waiting
+	 * for the replenishment timer to activate it.
+	 *
+	 * @dl_defer_running tells if the deferrable server is actually
+	 * running, skipping the defer phase.
 	 */
 	unsigned int			dl_throttled      : 1;
 	unsigned int			dl_yielded        : 1;
 	unsigned int			dl_non_contending : 1;
 	unsigned int			dl_overrun	  : 1;
 	unsigned int			dl_server         : 1;
+	unsigned int			dl_defer	  : 1;
+	unsigned int			dl_defer_armed	  : 1;
+	unsigned int			dl_defer_running  : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f5b5313..1b29531 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -771,6 +771,15 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
 	/* for non-boosted task, pi_of(dl_se) == dl_se */
 	dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
 	dl_se->runtime = pi_of(dl_se)->dl_runtime;
+
+	/*
+	 * If it is a deferred reservation, and the server
+	 * is not handling a starvation case, defer it.
+	 */
+	if (dl_se->dl_defer && !dl_se->dl_defer_running) {
+		dl_se->dl_throttled = 1;
+		dl_se->dl_defer_armed = 1;
+	}
 }
 
 /*
@@ -809,6 +818,9 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
 	replenish_dl_new_period(dl_se, rq);
 }
 
+static int start_dl_timer(struct sched_dl_entity *dl_se);
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
+
 /*
  * Pure Earliest Deadline First (EDF) scheduling does not deal with the
  * possibility of a entity lasting more than what it declared, and thus
@@ -837,9 +849,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	/*
 	 * This could be the case for a !-dl task that is boosted.
 	 * Just go with full inherited parameters.
+	 *
+	 * Or, it could be the case of a deferred reservation that
+	 * was not able to consume its runtime in the background and
+	 * reached this point with current u > U.
+	 *
+	 * In both cases, set a new period.
 	 */
-	if (dl_se->dl_deadline == 0)
-		replenish_dl_new_period(dl_se, rq);
+	if (dl_se->dl_deadline == 0 ||
+	    (dl_se->dl_defer_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
+		dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
+		dl_se->runtime = pi_of(dl_se)->dl_runtime;
+	}
 
 	if (dl_se->dl_yielded && dl_se->runtime > 0)
 		dl_se->runtime = 0;
@@ -873,6 +894,44 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 		dl_se->dl_yielded = 0;
 	if (dl_se->dl_throttled)
 		dl_se->dl_throttled = 0;
+
+	/*
+	 * If this is the replenishment of a deferred reservation,
+	 * clear the flag and return.
+	 */
+	if (dl_se->dl_defer_armed) {
+		dl_se->dl_defer_armed = 0;
+		return;
+	}
+
+	/*
+	 * At this point, if the deferred server is not armed, the deadline
+	 * is in the future, and it is not already running, throttle the server
+	 * and arm the defer timer.
+	 */
+	if (dl_se->dl_defer && !dl_se->dl_defer_running &&
+	    dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
+		if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
+
+			/*
+			 * Set dl_se->dl_defer_armed and dl_throttled variables to
+			 * inform the start_dl_timer() that this is a deferred
+			 * activation.
+			 */
+			dl_se->dl_defer_armed = 1;
+			dl_se->dl_throttled = 1;
+			if (!start_dl_timer(dl_se)) {
+				/*
+				 * If for whatever reason (delays), a previous timer was
+				 * queued but not serviced, cancel it and clean the
+				 * deferrable server variables intended for start_dl_timer().
+				 */
+				hrtimer_try_to_cancel(&dl_se->dl_timer);
+				dl_se->dl_defer_armed = 0;
+				dl_se->dl_throttled = 0;
+			}
+		}
+	}
 }
 
 /*
@@ -1023,6 +1082,15 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 		}
 
 		replenish_dl_new_period(dl_se, rq);
+	} else if (dl_server(dl_se) && dl_se->dl_defer) {
+		/*
+		 * The server can still use its previous deadline, so check if
+		 * it left the dl_defer_running state.
+		 */
+		if (!dl_se->dl_defer_running) {
+			dl_se->dl_defer_armed = 1;
+			dl_se->dl_throttled = 1;
+		}
 	}
 }
 
@@ -1055,8 +1123,21 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	 * We want the timer to fire at the deadline, but considering
 	 * that it is actually coming from rq->clock and not from
 	 * hrtimer's time base reading.
+	 *
+	 * The deferred reservation will have its timer set to
+	 * (deadline - runtime). At that point, the CBS rule will decide
+	 * if the current deadline can be used, or if a replenishment is
+	 * required to avoid adding too much pressure on the system
+	 * (current u > U).
 	 */
-	act = ns_to_ktime(dl_next_period(dl_se));
+	if (dl_se->dl_defer_armed) {
+		WARN_ON_ONCE(!dl_se->dl_throttled);
+		act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
+	} else {
+		/* act = deadline - rel-deadline + period */
+		act = ns_to_ktime(dl_next_period(dl_se));
+	}
+
 	now = hrtimer_cb_get_time(timer);
 	delta = ktime_to_ns(now) - rq_clock(rq);
 	act = ktime_add_ns(act, delta);
@@ -1106,6 +1187,62 @@ static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
 #endif
 }
 
+/* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
+static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
+
+static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
+{
+	struct rq *rq = rq_of_dl_se(dl_se);
+	u64 fw;
+
+	scoped_guard (rq_lock, rq) {
+		struct rq_flags *rf = &scope.rf;
+
+		if (!dl_se->dl_throttled || !dl_se->dl_runtime)
+			return HRTIMER_NORESTART;
+
+		sched_clock_tick();
+		update_rq_clock(rq);
+
+		if (!dl_se->dl_runtime)
+			return HRTIMER_NORESTART;
+
+		if (!dl_se->server_has_tasks(dl_se)) {
+			replenish_dl_entity(dl_se);
+			return HRTIMER_NORESTART;
+		}
+
+		if (dl_se->dl_defer_armed) {
+			/*
+			 * First check if the server could consume runtime in the
+			 * background. If so, it is possible to push the defer timer
+			 * forward by this amount of time. The dl_server_min_res serves
+			 * as a limit to avoid forwarding the timer by too small an
+			 * amount of time.
+			 */
+			if (dl_time_before(rq_clock(dl_se->rq),
+					   (dl_se->deadline - dl_se->runtime - dl_server_min_res))) {
+
+				/* reset the defer timer */
+				fw = dl_se->deadline - rq_clock(dl_se->rq) - dl_se->runtime;
+
+				hrtimer_forward_now(timer, ns_to_ktime(fw));
+				return HRTIMER_RESTART;
+			}
+
+			dl_se->dl_defer_running = 1;
+		}
+
+		enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+
+		if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &dl_se->rq->curr->dl))
+			resched_curr(rq);
+
+		__push_dl_task(rq, rf);
+	}
+
+	return HRTIMER_NORESTART;
+}
+
 /*
  * This is the bandwidth enforcement timer callback. If here, we know
  * a task is not on its dl_rq, since the fact that the timer was running
@@ -1128,28 +1265,8 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	struct rq_flags rf;
 	struct rq *rq;
 
-	if (dl_server(dl_se)) {
-		struct rq *rq = rq_of_dl_se(dl_se);
-		struct rq_flags rf;
-
-		rq_lock(rq, &rf);
-		if (dl_se->dl_throttled) {
-			sched_clock_tick();
-			update_rq_clock(rq);
-
-			if (dl_se->server_has_tasks(dl_se)) {
-				enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
-				resched_curr(rq);
-				__push_dl_task(rq, &rf);
-			} else {
-				replenish_dl_entity(dl_se);
-			}
-
-		}
-		rq_unlock(rq, &rf);
-
-		return HRTIMER_NORESTART;
-	}
+	if (dl_server(dl_se))
+		return dl_server_timer(timer, dl_se);
 
 	p = dl_task_of(dl_se);
 	rq = task_rq_lock(p, &rf);
@@ -1319,22 +1436,10 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 	return (delta * u_act) >> BW_SHIFT;
 }
 
-static inline void
-update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
-                        int flags);
-static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
+s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
 {
 	s64 scaled_delta_exec;
 
-	if (unlikely(delta_exec <= 0)) {
-		if (unlikely(dl_se->dl_yielded))
-			goto throttle;
-		return;
-	}
-
-	if (dl_entity_is_special(dl_se))
-		return;
-
 	/*
 	 * For tasks that participate in GRUB, we implement GRUB-PA: the
 	 * spare reclaimed bandwidth is used to clock down frequency.
@@ -1353,8 +1458,64 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 
 		scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
 	}
 
+	return scaled_delta_exec;
+}
+
+static inline void
+update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
+			int flags);
+static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
+{
+	s64 scaled_delta_exec;
+
+	if (unlikely(delta_exec <= 0)) {
+		if (unlikely(dl_se->dl_yielded))
+			goto throttle;
+		return;
+	}
+
+	if (dl_server(dl_se) && dl_se->dl_throttled && !dl_se->dl_defer)
+		return;
+
+	if (dl_entity_is_special(dl_se))
+		return;
+
+	scaled_delta_exec = dl_scaled_delta_exec(rq, dl_se, delta_exec);
+
 	dl_se->runtime -= scaled_delta_exec;
 
+	/*
+	 * The fair server can consume its runtime while throttled (not queued/
+	 * running as regular CFS).
+	 *
+	 * If the server consumes its entire runtime in this state, the server
+	 * is not required for the current period. Thus, reset the server by
+	 * starting a new period, pushing the activation.
+	 */
+	if (dl_se->dl_defer && dl_se->dl_throttled && dl_runtime_exceeded(dl_se)) {
+		/*
+		 * If the server was previously activated - the starving condition
+		 * took place - at this point it went away because the fair scheduler
+		 * was able to get runtime in the background. So return to the initial
+		 * state.
+		 */
+		dl_se->dl_defer_running = 0;
+
+		hrtimer_try_to_cancel(&dl_se->dl_timer);
+
+		replenish_dl_new_period(dl_se, dl_se->rq);
+
+		/*
+		 * Not being able to start the timer seems problematic. If it could not
+		 * be started for whatever reason, we need to "unthrottle" the DL server
+		 * and queue right away. Otherwise nothing might queue it. That's similar
+		 * to what enqueue_dl_entity() does on start_dl_timer==0. For now, just warn.
+		 */
+		WARN_ON_ONCE(!start_dl_timer(dl_se));
+
+		return;
+	}
+
 throttle:
 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
 		dl_se->dl_throttled = 1;
@@ -1414,9 +1575,46 @@ throttle:
 	}
 }
 
+/*
+ * In the non-defer mode, the idle time is not accounted, as the
+ * server provides a guarantee.
+ *
+ * If the dl_server is in defer mode, the idle time is also considered
+ * as time available for the fair server, avoiding a penalty for the
+ * rt scheduler that did not consume that time.
+ */
+void dl_server_update_idle_time(struct rq *rq, struct task_struct *p)
+{
+	s64 delta_exec, scaled_delta_exec;
+
+	if (!rq->fair_server.dl_defer)
+		return;
+
+	/* no need to discount more */
+	if (rq->fair_server.runtime < 0)
+		return;
+
+	delta_exec = rq_clock_task(rq) - p->se.exec_start;
+	if (delta_exec < 0)
+		return;
+
+	scaled_delta_exec = dl_scaled_delta_exec(rq, &rq->fair_server, delta_exec);
+
+	rq->fair_server.runtime -= scaled_delta_exec;
+
+	if (rq->fair_server.runtime < 0) {
+		rq->fair_server.dl_defer_running = 0;
+		rq->fair_server.runtime = 0;
+	}
+
+	p->se.exec_start = rq_clock_task(rq);
+}
+
 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
-	update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+	/* 0 runtime = fair server disabled */
+	if (dl_se->dl_runtime)
+		update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
 }
 
 void dl_server_start(struct sched_dl_entity *dl_se)
@@ -1430,6 +1628,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 		dl_se->dl_period = 1000 * NSEC_PER_MSEC;
 
 		dl_se->dl_server = 1;
+		dl_se->dl_defer = 1;
 		setup_new_dl_entity(dl_se);
 	}
 
@@ -1447,6 +1646,9 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 		return;
 
 	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+	hrtimer_try_to_cancel(&dl_se->dl_timer);
+	dl_se->dl_defer_armed = 0;
+	dl_se->dl_throttled = 0;
 }
 
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
@@ -1758,7 +1960,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	 * be counted in the active utilization; hence, we need to call
 	 * add_running_bw().
 	 */
-	if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+	if (!dl_se->dl_defer && dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
 		if (flags & ENQUEUE_WAKEUP)
 			task_contending(dl_se, flags);
 
@@ -1780,6 +1982,25 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 		setup_new_dl_entity(dl_se);
 	}
 
+	/*
+	 * If the reservation is still throttled, e.g., it got replenished but is a
+	 * deferred task and still has to wait, don't enqueue.
+	 */
+	if (dl_se->dl_throttled && start_dl_timer(dl_se))
+		return;
+
+	/*
+	 * We're about to enqueue, make sure we're not ->dl_throttled!
+	 * In case the timer was not started, say because the defer time
+	 * has passed, mark as not throttled and mark unarmed.
+	 * Also cancel earlier timers, since letting those run is pointless.
+	 */
+	if (dl_se->dl_throttled) {
+		hrtimer_try_to_cancel(&dl_se->dl_timer);
+		dl_se->dl_defer_armed = 0;
+		dl_se->dl_throttled = 0;
+	}
+
 	__enqueue_dl_entity(dl_se);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aba23b0..1ea5ec8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1156,12 +1156,13 @@ s64 update_curr_common(struct rq *rq)
 static void update_curr(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
+	struct rq *rq = rq_of(cfs_rq);
 	s64 delta_exec;
 
 	if (unlikely(!curr))
 		return;
 
-	delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+	delta_exec = update_curr_se(rq, curr);
 	if (unlikely(delta_exec <= 0))
 		return;
 
@@ -1169,8 +1170,19 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	update_deadline(cfs_rq, curr);
 	update_min_vruntime(cfs_rq);
 
-	if (entity_is_task(curr))
-		update_curr_task(task_of(curr), delta_exec);
+	if (entity_is_task(curr)) {
+		struct task_struct *p = task_of(curr);
+
+		update_curr_task(p, delta_exec);
+
+		/*
+		 * Any fair task that runs outside of fair_server should
+		 * account against fair_server such that it can account for
+		 * this time and possibly avoid running this period.
+		 */
+		if (p->dl_server != &rq->fair_server)
+			dl_server_update(&rq->fair_server, delta_exec);
+	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
@@ -6768,8 +6780,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
-	if (!throttled_hierarchy(task_cfs_rq(p)) && !rq->cfs.h_nr_running)
+	if (!throttled_hierarchy(task_cfs_rq(p)) && !rq->cfs.h_nr_running) {
+		/* Account for idle runtime */
+		if (!rq->nr_running)
+			dl_server_update_idle_time(rq, rq->curr);
 		dl_server_start(&rq->fair_server);
+	}
 
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6e78d07..d560f7f 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -452,12 +452,14 @@ static void wakeup_preempt_idle(struct rq *rq, struct task_struct *p, int flags)
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
+	dl_server_update_idle_time(rq, prev);
 }
 
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+	next->se.exec_start = rq_clock_task(rq);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7416bcd..64fb677 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -335,7 +335,7 @@ extern bool __checkparam_dl(const struct sched_attr *attr);
 extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
 extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern int  dl_bw_check_overflow(int cpu);
-
+extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec);
 /*
  * SCHED_DEADLINE supports servers (nested scheduling) with the following
  * interface:
@@ -363,6 +363,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick);
 
+extern void dl_server_update_idle_time(struct rq *rq,
+		    struct task_struct *p);
 extern void fair_server_init(struct rq *rq);
 
 #ifdef CONFIG_CGROUP_SCHED
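
To make the idle-time rule concrete (illustrative numbers): with the
server deferred and 50 ms of runtime, a CPU that sits idle for 20 ms
while waiting for the defer time has those 20 ms charged (scaled like
regular runtime) against the server by dl_server_update_idle_time(),
leaving 30 ms of potential boosting; if idle time alone exhausts the
budget, the runtime is clamped to 0 and dl_defer_running is cleared,
since the fair tasks evidently were not starved.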

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/core: Clear prev->dl_server in CFS pick fast path
  2024-05-27 12:06 ` [PATCH V7 3/9] sched/core: Clear prev->dl_server in CFS pick fast path Daniel Bristot de Oliveira
@ 2024-07-29 10:34   ` tip-bot2 for Youssef Esmat
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Youssef Esmat @ 2024-07-29 10:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Youssef Esmat, Daniel Bristot de Oliveira, Peter Zijlstra (Intel),
	Juri Lelli, stable, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     a741b82423f41501e301eb6f9820b45ca202e877
Gitweb:        https://git.kernel.org/tip/a741b82423f41501e301eb6f9820b45ca202e877
Author:        Youssef Esmat <youssefesmat@google.com>
AuthorDate:    Mon, 27 May 2024 14:06:49 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 29 Jul 2024 12:22:35 +02:00

sched/core: Clear prev->dl_server in CFS pick fast path

In case the previous pick was a DL server pick, ->dl_server might be
set. Clear it in the fast path as well.

Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Youssef Esmat <youssefesmat@google.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/7f7381ccba09efcb4a1c1ff808ed58385eccc222.1716811044.git.bristot@kernel.org
---
 kernel/sched/core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e61da3b..1074ae8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5840,6 +5840,13 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 
 		/*
+		 * This is a normal CFS pick, but the previous could be a DL pick.
+		 * Clear it as previous is no longer picked.
+		 */
+		if (prev->dl_server)
+			prev->dl_server = NULL;
+
+		/*
 		 * This is the fast path; it cannot be a DL server pick;
 		 * therefore even if @p == @prev, ->dl_server must be NULL.
 		 */
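
The consequence of a stale pointer (reasoning from the accounting path,
not spelled out in the patch): update_curr_task() charges runtime to
curr->dl_server whenever it is set, so a leftover pointer from an
earlier server pick would keep draining the fair server's budget for a
task that was subsequently picked through the ordinary CFS fast path.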

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Add trivial fair server
  2024-05-27 12:06 ` [PATCH V7 4/9] sched/fair: Add trivial fair server Daniel Bristot de Oliveira
@ 2024-07-29 10:34   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-07-29 10:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Daniel Bristot de Oliveira, Juri Lelli,
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     557a6bfc662c4d560f909b78adb1270c9862efa8
Gitweb:        https://git.kernel.org/tip/557a6bfc662c4d560f909b78adb1270c9862efa8
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 27 May 2024 14:06:50 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 29 Jul 2024 12:22:36 +02:00

sched/fair: Add trivial fair server

Use deadline servers to service fair tasks.

This patch adds a fair_server deadline entity which acts as a container
for fair entities and can be used to fix starvation when higher priority
(wrt fair) tasks are monopolizing CPU(s).
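
As an illustration of the starvation being addressed (the command line
is an example, not from the patch), a pinned busy-looping FIFO task
such as:

 # chrt -f 99 taskset -c 0 yes > /dev/null &

monopolizes CPU 0, leaving runnable SCHED_OTHER tasks there with no CPU
time unless something throttles it. With this patch, the per-runqueue
fair_server entity is started whenever CFS tasks become runnable, so
they can receive the server's reserved runtime every period. Note that
at this point in the series the server's default runtime is 0
(disabled); the defaults and the tuning interface come in later
patches.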

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/b6b0bcefaf25391bcf5b6ecdb9f1218de402d42e.1716811044.git.bristot@kernel.org
---
 kernel/sched/core.c     |  1 +
 kernel/sched/deadline.c | 23 +++++++++++++++++++++++
 kernel/sched/fair.c     | 34 ++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h    |  4 ++++
 4 files changed, 62 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1074ae8..f95600c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8336,6 +8336,7 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+		fair_server_init(rq);
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = rq;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f59e5c1..f5b5313 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1382,6 +1382,13 @@ throttle:
 	}
 
 	/*
+	 * The fair server (sole dl_server) does not account for real-time
+	 * workload because it is running fair work.
+	 */
+	if (dl_se == &rq->fair_server)
+		return;
+
+	/*
 	 * Because -- for now -- we share the rt bandwidth, we need to
 	 * account our runtime there too, otherwise actual rt tasks
 	 * would be able to exceed the shared quota.
@@ -1414,15 +1421,31 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 
 void dl_server_start(struct sched_dl_entity *dl_se)
 {
+	struct rq *rq = dl_se->rq;
+
 	if (!dl_server(dl_se)) {
+		/* Disabled */
+		dl_se->dl_runtime = 0;
+		dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
+		dl_se->dl_period = 1000 * NSEC_PER_MSEC;
+
 		dl_se->dl_server = 1;
 		setup_new_dl_entity(dl_se);
 	}
+
+	if (!dl_se->dl_runtime)
+		return;
+
 	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
+	if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
+		resched_curr(dl_se->rq);
 }
 
 void dl_server_stop(struct sched_dl_entity *dl_se)
 {
+	if (!dl_se->dl_runtime)
+		return;
+
 	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99c80ab..aba23b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5765,6 +5765,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 	long task_delta, idle_task_delta, dequeue = 1;
+	long rq_h_nr_running = rq->cfs.h_nr_running;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -5837,6 +5838,9 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	sub_nr_running(rq, task_delta);
 
 done:
+	/* Stop the fair server if throttling resulted in no runnable tasks */
+	if (rq_h_nr_running && !rq->cfs.h_nr_running)
+		dl_server_stop(&rq->fair_server);
 	/*
 	 * Note: distribution will already see us throttled via the
 	 * throttled-list.  rq->lock protects completion.
@@ -5854,6 +5858,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 	long task_delta, idle_task_delta;
+	long rq_h_nr_running = rq->cfs.h_nr_running;
 
 	se = cfs_rq->tg->se[cpu_of(rq)];
 
@@ -5929,6 +5934,10 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 unthrottle_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
+	/* Start the fair server if un-throttling resulted in new runnable tasks */
+	if (!rq_h_nr_running && rq->cfs.h_nr_running)
+		dl_server_start(&rq->fair_server);
+
 	/* Determine whether we need to wake up potentially idle CPU: */
 	if (rq->curr == rq->idle && rq->cfs.nr_running)
 		resched_curr(rq);
@@ -6759,6 +6768,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
+	if (!throttled_hierarchy(task_cfs_rq(p)) && !rq->cfs.h_nr_running)
+		dl_server_start(&rq->fair_server);
+
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
 	 * utilization updates, so do it here explicitly with the IOWAIT flag
@@ -6903,6 +6915,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		rq->next_balance = jiffies;
 
 dequeue_throttle:
+	if (!throttled_hierarchy(task_cfs_rq(p)) && !rq->cfs.h_nr_running)
+		dl_server_stop(&rq->fair_server);
+
 	util_est_update(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
 }
@@ -8602,6 +8617,25 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq)
 	return pick_next_task_fair(rq, NULL, NULL);
 }
 
+static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
+{
+	return !!dl_se->rq->cfs.nr_running;
+}
+
+static struct task_struct *fair_server_pick(struct sched_dl_entity *dl_se)
+{
+	return pick_next_task_fair(dl_se->rq, NULL, NULL);
+}
+
+void fair_server_init(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se = &rq->fair_server;
+
+	init_dl_entity(dl_se);
+
+	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+}
+
 /*
  * Account for a descheduled task:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8a07102..7416bcd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -363,6 +363,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick);
 
+extern void fair_server_init(struct rq *rq);
+
 #ifdef CONFIG_CGROUP_SCHED
 
 extern struct list_head task_groups;
@@ -1039,6 +1041,8 @@ struct rq {
 	struct rt_rq		rt;
 	struct dl_rq		dl;
 
+	struct sched_dl_entity	fair_server;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this CPU: */
 	struct list_head	leaf_cfs_rq_list;
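
For readers following the thread without the earlier patches at hand,
here is a rough sketch of how the two callbacks registered above are
consumed (simplified; pick_earliest_dl_entity() is a hypothetical
stand-in for the actual pick path added by "sched/deadline: Introduce
deadline servers"):

	/*
	 * Sketch only: a server entity has no task of its own, so when
	 * it wins the EDF pick, the deadline class asks the class it
	 * serves for a task instead.
	 */
	static struct task_struct *dl_pick_sketch(struct rq *rq)
	{
		struct sched_dl_entity *dl_se = pick_earliest_dl_entity(rq);

		if (!dl_se)
			return NULL;

		if (dl_server(dl_se))
			/* e.g. &rq->fair_server: fair tasks on a DL budget */
			return dl_se->server_pick(dl_se);	/* fair_server_pick() */

		return dl_task_of(dl_se);
	}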


* [tip: sched/core] sched/core: Add clearing of ->dl_server in put_prev_task_balance()
  2024-05-27 12:06 ` [PATCH V7 2/9] sched/core: Add clearing of ->dl_server in put_prev_task_balance() Daniel Bristot de Oliveira
@ 2024-07-29 10:34   ` tip-bot2 for Joel Fernandes (Google)
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Joel Fernandes (Google) @ 2024-07-29 10:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Joel Fernandes (Google), Daniel Bristot de Oliveira,
	Peter Zijlstra (Intel), Juri Lelli, stable, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     c245910049d04fbfa85bb2f5acd591c24e9907c7
Gitweb:        https://git.kernel.org/tip/c245910049d04fbfa85bb2f5acd591c24e9907c7
Author:        Joel Fernandes (Google) <joel@joelfernandes.org>
AuthorDate:    Mon, 27 May 2024 14:06:48 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 29 Jul 2024 12:22:35 +02:00

sched/core: Add clearing of ->dl_server in put_prev_task_balance()

Paths using put_prev_task_balance() need to do a pick shortly
after. Make sure they also clear the ->dl_server on prev as a
part of that.

Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/d184d554434bedbad0581cb34656582d78655150.1716811044.git.bristot@kernel.org
---
 kernel/sched/core.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0a71050..e61da3b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5801,6 +5801,14 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
 #endif
 
 	put_prev_task(rq, prev);
+
+	/*
+	 * We've updated @prev and no longer need the server link, clear it.
+	 * Must be done before ->pick_next_task() because that can (re)set
+	 * ->dl_server.
+	 */
+	if (prev->dl_server)
+		prev->dl_server = NULL;
 }
 
 /*
@@ -5844,14 +5852,6 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 restart:
 	put_prev_task_balance(rq, prev, rf);
 
-	/*
-	 * We've updated @prev and no longer need the server link, clear it.
-	 * Must be done before ->pick_next_task() because that can (re)set
-	 * ->dl_server.
-	 */
-	if (prev->dl_server)
-		prev->dl_server = NULL;
-
 	for_each_class(class) {
 		p = class->pick_next_task(rq);
 		if (p)
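
To make the ordering constraint concrete, an illustrative (not verbatim)
view of the sequence that results from this change:

	put_prev_task(rq, prev);
	prev->dl_server = NULL;			/* 1) drop the stale server link */
	...
	p = class->pick_next_task(rq);		/* 2) may (re)set ->dl_server    */

Had the clearing been done after the pick, a pick that re-selects prev
through a server (re-establishing its ->dl_server link) would have that
fresh link wiped, and the following runtime would be charged to the task
instead of the server.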


* [tip: sched/core] sched/deadline: Comment sched_dl_entity::dl_server variable
  2024-05-27 12:06 ` [PATCH V7 1/9] sched/deadline: Comment sched_dl_entity::dl_server variable Daniel Bristot de Oliveira
@ 2024-07-29 10:34   ` tip-bot2 for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Daniel Bristot de Oliveira @ 2024-07-29 10:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Daniel Bristot de Oliveira, Peter Zijlstra (Intel), Juri Lelli,
	stable, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     f23c042ce34ba265cf3129d530702b5d218e3f4b
Gitweb:        https://git.kernel.org/tip/f23c042ce34ba265cf3129d530702b5d218e3f4b
Author:        Daniel Bristot de Oliveira <bristot@kernel.org>
AuthorDate:    Mon, 27 May 2024 14:06:47 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 29 Jul 2024 12:22:35 +02:00

sched/deadline: Comment sched_dl_entity::dl_server variable

Add an explanation for the newly added variable.

Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/147f7aa8cb8fd925f36aa8059af6a35aad08b45a.1716811044.git.bristot@kernel.org
---
 include/linux/sched.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f8d1503..1c771ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -639,6 +639,8 @@ struct sched_dl_entity {
 	 *
 	 * @dl_overrun tells if the task asked to be informed about runtime
 	 * overruns.
+	 *
+	 * @dl_server tells if this is a server entity.
 	 */
 	unsigned int			dl_throttled      : 1;
 	unsigned int			dl_yielded        : 1;


* Re: [PATCH V7 0/9] SCHED_DEADLINE server infrastructure
  2024-07-29 10:32   ` Peter Zijlstra
@ 2024-07-29 20:42     ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 34+ messages in thread
From: Vineeth Remanan Pillai @ 2024-07-29 20:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Shuah Khan, Phil Auld, Suleiman Souhlal, Youssef Esmat

On Mon, Jul 29, 2024 at 6:32 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jun 21, 2024 at 10:41:35AM -0400, Vineeth Remanan Pillai wrote:
>
> > Sorry that I could not get to reviewing and testing this revision. In
> > v6 we experienced a minor bug where suspend/resume had issues with the
> > dl_server. Since suspend does not dequeue tasks, the dl_server is not
> > stopped, and this causes premature wakeups. I haven't looked at v7 in
> > detail, but I think the issue might still be present.
>
> It is not.
>
> > We have a workaround patch for this in our 5.15 kernel
>
> That is the problem... your necro kernel doesn't yet have the freezer
> rewrite I imagine:
>
>   f5d39b020809 ("freezer,sched: Rewrite core freezer logic")
>
> That would cause all frozen tasks to be dequeued, and once all tasks
> are dequeued, the deadline server stops itself too.
>
You're right, we are on the 5.15 kernel and do not have this fix. Thanks
for pointing this out.
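
For reference, the flow on a kernel with that rewrite, pieced together
from the fair-server patch earlier in this thread (an illustrative call
chain, not verbatim code):

	freezer freezes the task
	  -> the task is dequeued
	    -> dequeue_task_fair()
	      -> rq->cfs.h_nr_running reaches 0
	        -> dl_server_stop(&rq->fair_server)

Once the last fair task on a CPU is frozen, the server is stopped with
it, so it can no longer cause a premature wakeup across suspend.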

> Juri did some testing to double check and no suspend / resume issues
> were found.
>
> Anyway, I've merged the lot into tip/sched/core.
>
Thanks, I shall port it to the chromeos kernel, run it through the usual
round of tests, and update the details soon.

Thanks,
Vineeth


* Re: [PATCH V7 9/9] sched/rt: Remove default bandwidth control
  2024-05-27 12:06 ` [PATCH V7 9/9] sched/rt: Remove default bandwidth control Daniel Bristot de Oliveira
  2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2024-11-27 10:55   ` Michal Koutný
  2024-11-27 15:35     ` Juri Lelli
  1 sibling, 1 reply; 34+ messages in thread
From: Michal Koutný @ 2024-11-27 10:55 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld, Suleiman Souhlal,
	Youssef Esmat

Hello.

(I'm replying only now, as I have just installed v6.12 and this message
has the relevant context.)

On Mon, May 27, 2024 at 02:06:55PM GMT, Daniel Bristot de Oliveira <bristot@kernel.org> wrote:
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 0dbb42cf7fe6..7df8179bfa08 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1554,6 +1554,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
>  	if (dl_se == &rq->fair_server)
>  		return;
>  
> +#ifdef CONFIG_RT_GROUP_SCHED
>  	/*
>  	 * Because -- for now -- we share the rt bandwidth, we need to
>  	 * account our runtime there too, otherwise actual rt tasks
> @@ -1578,6 +1579,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
>  			rt_rq->rt_time += delta_exec;
>  		raw_spin_unlock(&rt_rq->rt_runtime_lock);
>  	}
> +#endif
>  }
>  
>  /*
> @@ -1632,8 +1634,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
>  	 * this before getting generic.
>  	 */
>  	if (!dl_server(dl_se)) {
> -		/* Disabled */
> -		u64 runtime = 0;
> +		u64 runtime =  50 * NSEC_PER_MSEC;
>  		u64 period = 1000 * NSEC_PER_MSEC;
>  
>  		dl_server_apply_params(dl_se, runtime, period, 1);

The global_rt_runtime() also applies to the deadline class when a CPU's
DL bandwidth is initialized in init_dl_rq_bw_ratio().

The default DL bandwidth is thus 95%. The fair server is given 5%.
Is that 5% of those 95%?

Or is it meant to be complementary? (Perhaps not, as I can configure
rt_runtime_us/rt_period_us > 95% without an error. But then I don't
understand what the global rt_runtime_us (without CONFIG_RT_GROUP_SCHED)
configures.)

Thanks for some hints,
Michal



* Re: [PATCH V7 9/9] sched/rt: Remove default bandwidth control
  2024-11-27 10:55   ` [PATCH V7 9/9] " Michal Koutný
@ 2024-11-27 15:35     ` Juri Lelli
  2024-11-29 10:02       ` Michal Koutný
  0 siblings, 1 reply; 34+ messages in thread
From: Juri Lelli @ 2024-11-27 15:35 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld, Suleiman Souhlal,
	Youssef Esmat

Hi Michal,

On 27/11/24 11:55, Michal Koutný wrote:
> Hello.
> 
> (I'm replying only now, as I have just installed v6.12 and this message
> has the relevant context.)
> 
> On Mon, May 27, 2024 at 02:06:55PM GMT, Daniel Bristot de Oliveira <bristot@kernel.org> wrote:
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 0dbb42cf7fe6..7df8179bfa08 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -1554,6 +1554,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
> >  	if (dl_se == &rq->fair_server)
> >  		return;
> >  
> > +#ifdef CONFIG_RT_GROUP_SCHED
> >  	/*
> >  	 * Because -- for now -- we share the rt bandwidth, we need to
> >  	 * account our runtime there too, otherwise actual rt tasks
> > @@ -1578,6 +1579,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
> >  			rt_rq->rt_time += delta_exec;
> >  		raw_spin_unlock(&rt_rq->rt_runtime_lock);
> >  	}
> > +#endif
> >  }
> >  
> >  /*
> > @@ -1632,8 +1634,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
> >  	 * this before getting generic.
> >  	 */
> >  	if (!dl_server(dl_se)) {
> > -		/* Disabled */
> > -		u64 runtime = 0;
> > +		u64 runtime =  50 * NSEC_PER_MSEC;
> >  		u64 period = 1000 * NSEC_PER_MSEC;
> >  
> >  		dl_server_apply_params(dl_se, runtime, period, 1);
> 
> The global_rt_runtime() also applies to the deadline class when a CPU's
> DL bandwidth is initialized in init_dl_rq_bw_ratio().
> 
> > The default DL bandwidth is thus 95%. The fair server is given 5%.
> Is that 5% of those 95%?

Yes, it is indeed.
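
To put rough numbers on it with the defaults mentioned in this thread
(a back-of-the-envelope view, per CPU):

	fair server bandwidth = 50 ms / 1000 ms        =  5%
	DL admission cap      = 950000 us / 1000000 us = 95%
	left for user -dl     = 95% - 5%               = 90%

The server's reservation is admitted against the same 95% budget, so
user SCHED_DEADLINE tasks can claim up to roughly 90% of each CPU.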

Best,
Juri



* Re: [PATCH V7 9/9] sched/rt: Remove default bandwidth control
  2024-11-27 15:35     ` Juri Lelli
@ 2024-11-29 10:02       ` Michal Koutný
  2024-11-29 14:44         ` Juri Lelli
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Koutný @ 2024-11-29 10:02 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld, Suleiman Souhlal,
	Youssef Esmat

On Wed, Nov 27, 2024 at 04:35:39PM GMT, Juri Lelli <juri.lelli@redhat.com> wrote:
> > The default DL bandwidth is thus 95%. The fair server is given 5%.
> > Is that 5% of those 95%?
> 
> Yes, it is indeed.

Thanks for the guidance. I have follow-up questions about
/proc/sys/kernel/sched_rt_runtime_us / /proc/sys/kernel/sched_rt_period_us
(a ratio, without CONFIG_RT_GROUP_SCHED):

- 0
  - disables DL (not RT, so RT tasks can monopolize a CPU)
- 1
  - DL tasks can monopolize a CPU; SCHED_NORMAL tasks have 5% thanks to
    the fair_server
- 1-Δ
  - SCHED_NORMAL tasks have
    - (1-Δ)*5% on behalf of DL (above RT)
    - Δ regularly (below RT)

Is this breakdown correct?
I'm wondering if different values of Δ mean anything or how they can be
used.

Regards,
Michal


* Re: [PATCH V7 9/9] sched/rt: Remove default bandwidth control
  2024-11-29 10:02       ` Michal Koutný
@ 2024-11-29 14:44         ` Juri Lelli
  2024-11-29 20:21           ` Michal Koutný
  0 siblings, 1 reply; 34+ messages in thread
From: Juri Lelli @ 2024-11-29 14:44 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, Phil Auld,
	Suleiman Souhlal

On 29/11/24 11:02, Michal Koutný wrote:
> On Wed, Nov 27, 2024 at 04:35:39PM GMT, Juri Lelli <juri.lelli@redhat.com> wrote:
> > > The default DL bandwidth is thus 95%. The fair server is given 5%.
> > > Is that 5% of those 95%?
> > 
> > Yes, it is indeed.
> 
> Thanks for the guidance. I have follow-up questions about
> /proc/sys/kernel/sched_rt_runtime_us / /proc/sys/kernel/sched_rt_period_us
> (a ratio, without CONFIG_RT_GROUP_SCHED):
> 
> - 0
>   - disables DL (not RT, so RT tasks can monopolize a CPU)
> - 1
>   - DL tasks can monopolize a CPU; SCHED_NORMAL tasks have 5% thanks to
>     the fair_server
> - 1-Δ
>   - SCHED_NORMAL tasks have
>     - (1-Δ)*5% on behalf of DL (above RT)
>     - Δ regularly (below RT)
> 
> Is this breakdown correct?

So, the sched_rt_runtime_us/sched_rt_period_us ratio only applies to
admission control (for DEADLINE tasks, including dl_servers). The actual
parameters controlling the dl_server for SCHED_NORMAL live under sched
debugfs as per-CPU values, e.g.:

# grep . /sys/kernel/debug/sched/fair_server/cpu*/*
/sys/kernel/debug/sched/fair_server/cpu0/period:1000000000
/sys/kernel/debug/sched/fair_server/cpu0/runtime:50000000
/sys/kernel/debug/sched/fair_server/cpu1/period:1000000000
/sys/kernel/debug/sched/fair_server/cpu1/runtime:50000000
/sys/kernel/debug/sched/fair_server/cpu2/period:1000000000
/sys/kernel/debug/sched/fair_server/cpu2/runtime:50000000
/sys/kernel/debug/sched/fair_server/cpu3/period:1000000000
/sys/kernel/debug/sched/fair_server/cpu3/runtime:50000000
/sys/kernel/debug/sched/fair_server/cpu4/period:1000000000
/sys/kernel/debug/sched/fair_server/cpu4/runtime:50000000
/sys/kernel/debug/sched/fair_server/cpu5/period:1000000000
/sys/kernel/debug/sched/fair_server/cpu5/runtime:50000000
/sys/kernel/debug/sched/fair_server/cpu6/period:1000000000
/sys/kernel/debug/sched/fair_server/cpu6/runtime:50000000
/sys/kernel/debug/sched/fair_server/cpu7/period:1000000000
/sys/kernel/debug/sched/fair_server/cpu7/runtime:50000000

You can disable admission control by echoing -1 into sched_rt_runtime_us
and still have the dl_server working for SCHED_NORMAL tasks. By
disabling admission control, SCHED_DEADLINE can indeed monopolize a CPU
(over-subscription).
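
For completeness, a hypothetical session bumping the reservation on one
CPU through the interface above (the values are only an example; the
runtime must stay below the period and pass DL admission control):

	# echo 100000000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
	# cat /sys/kernel/debug/sched/fair_server/cpu0/runtime
	100000000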

Best,
Juri



* Re: [PATCH V7 9/9] sched/rt: Remove default bandwidth control
  2024-11-29 14:44         ` Juri Lelli
@ 2024-11-29 20:21           ` Michal Koutný
  2024-12-02  9:39             ` Juri Lelli
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Koutný @ 2024-11-29 20:21 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, Phil Auld,
	Suleiman Souhlal

On Fri, Nov 29, 2024 at 03:44:53PM GMT, Juri Lelli <juri.lelli@redhat.com> wrote:
> You can disable admission control by echoing -1 into sched_rt_runtime_us
> and still have the dl_server working for SCHED_NORMAL tasks. By
> disabling admission control, SCHED_DEADLINE can indeed monopolize a CPU
> (over-subscription).

| I'm wondering if different values of Δ mean anything or how they can
| be used.

Aha, so the knob remains relevant for DL admission control.

Putting it together, I would write it down like this:

-- 8< --
Subject: [PATCH] sched/RT: Update paragraphs about RT bandwidth control

This has changed slightly with the introduction of the fair server.
Update the most relevant parts.

Link: https://lore.kernel.org/r/Z0c8S8i3qt7SEU14@jlelli-thinkpadt14gen4.remote.csb/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 Documentation/scheduler/sched-deadline.rst | 13 +++++++------
 Documentation/scheduler/sched-rt-group.rst |  8 ++++----
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/Documentation/scheduler/sched-deadline.rst b/Documentation/scheduler/sched-deadline.rst
index 22838ed8e13aa..a727827b8dd52 100644
--- a/Documentation/scheduler/sched-deadline.rst
+++ b/Documentation/scheduler/sched-deadline.rst
@@ -591,12 +591,13 @@ Deadline Task Scheduling
 
  The system wide settings are configured under the /proc virtual file system.
 
- For now the -rt knobs are used for -deadline admission control and the
- -deadline runtime is accounted against the -rt runtime. We realize that this
- isn't entirely desirable; however, it is better to have a small interface for
- now, and be able to change it easily later. The ideal situation (see 5.) is to
- run -rt tasks from a -deadline server; in which case the -rt bandwidth is a
- direct subset of dl_bw.
+ For now the -rt knobs are used for -deadline admission control and with
+ CONFIG_RT_GROUP_SCHED the -deadline runtime is accounted against the (root)
+ -rt runtime. With !CONFIG_RT_GROUP_SCHED the knob only serves for the -dl
+ admission control. We realize that this isn't entirely desirable; however, it
+ is better to have a small interface for now, and be able to change it easily
+ later. The ideal situation (see 5.) is to run -rt tasks from a -deadline
+ server; in which case the -rt bandwidth is a direct subset of dl_bw.
 
  This means that, for a root_domain comprising M CPUs, -deadline tasks
  can be created while the sum of their bandwidths stays below:
diff --git a/Documentation/scheduler/sched-rt-group.rst b/Documentation/scheduler/sched-rt-group.rst
index d685609ed3d7a..80b05a3009ea2 100644
--- a/Documentation/scheduler/sched-rt-group.rst
+++ b/Documentation/scheduler/sched-rt-group.rst
@@ -92,10 +92,10 @@ The system wide settings are configured under the /proc virtual file system:
 /proc/sys/kernel/sched_rt_runtime_us:
   A global limit on how much time real-time scheduling may use. This is always
   less or equal to the period_us, as it denotes the time allocated from the
-  period_us for the real-time tasks. Even without CONFIG_RT_GROUP_SCHED enabled,
-  this will limit time reserved to real-time processes. With
-  CONFIG_RT_GROUP_SCHED=y it signifies the total bandwidth available to all
-  real-time groups.
+  period_us for the real-time tasks. Without CONFIG_RT_GROUP_SCHED enabled,
+  this only serves for admission control of deadline tasks. With
+  CONFIG_RT_GROUP_SCHED=y it also signifies the total bandwidth available to
+  all real-time groups.
 
   * Time is specified in us because the interface is s32. This gives an
     operating range from 1us to about 35 minutes.
-- 
2.47.0


* Re: [PATCH V7 9/9] sched/rt: Remove default bandwidth control
  2024-11-29 20:21           ` Michal Koutný
@ 2024-12-02  9:39             ` Juri Lelli
  0 siblings, 0 replies; 34+ messages in thread
From: Juri Lelli @ 2024-12-02  9:39 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, Phil Auld,
	Suleiman Souhlal

On 29/11/24 21:21, Michal Koutný wrote:
> On Fri, Nov 29, 2024 at 03:44:53PM GMT, Juri Lelli <juri.lelli@redhat.com> wrote:
> > You can disable admission control by echoing -1 into sched_rt_runtime_us
> > and still have the dl_server working for SCHED_NORMAL tasks. By
> > disabling admission control, SCHED_DEADLINE can indeed monopolize a CPU
> > (over-subscription).
> 
> | I'm wondering if different values of Δ mean anything or how they can
> | be used.
> 
> Aha, so the knob remains relevant for DL admission control.
> 
> Putting it together, I would write it down like this:

Ah, yes sounds good to me, thanks for writing it down properly! :)

If you send it out as a separate patch, please feel free to add

Acked-by: Juri Lelli <juri.lelli@redhat.com>

Best,
Juri



end of thread

Thread overview: 34+ messages
2024-05-27 12:06 [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
2024-05-27 12:06 ` [PATCH V7 1/9] sched/deadline: Comment sched_dl_entity::dl_server variable Daniel Bristot de Oliveira
2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Daniel Bristot de Oliveira
2024-05-27 12:06 ` [PATCH V7 2/9] sched/core: Add clearing of ->dl_server in put_prev_task_balance() Daniel Bristot de Oliveira
2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
2024-05-27 12:06 ` [PATCH V7 3/9] sched/core: Clear prev->dl_server in CFS pick fast path Daniel Bristot de Oliveira
2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Youssef Esmat
2024-05-27 12:06 ` [PATCH V7 4/9] sched/fair: Add trivial fair server Daniel Bristot de Oliveira
2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-05-27 12:06 ` [PATCH V7 5/9] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Daniel Bristot de Oliveira
2024-05-27 12:06 ` [PATCH V7 6/9] sched/fair: Fair server interface Daniel Bristot de Oliveira
2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Daniel Bristot de Oliveira
2024-05-27 12:06 ` [PATCH V7 7/9] sched/core: Fix priority checking for DL server picks Daniel Bristot de Oliveira
2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
2024-05-27 12:06 ` [PATCH V7 8/9] sched/core: Fix picking of tasks for core scheduling with DL server Daniel Bristot de Oliveira
2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
2024-05-27 12:06 ` [PATCH V7 9/9] sched/rt: Remove default bandwidth control Daniel Bristot de Oliveira
2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-11-27 10:55   ` [PATCH V7 9/9] " Michal Koutný
2024-11-27 15:35     ` Juri Lelli
2024-11-29 10:02       ` Michal Koutný
2024-11-29 14:44         ` Juri Lelli
2024-11-29 20:21           ` Michal Koutný
2024-12-02  9:39             ` Juri Lelli
2024-06-21 13:37 ` [PATCH V7 0/9] SCHED_DEADLINE server infrastructure Juri Lelli
2024-06-21 13:43   ` Daniel Bristot de Oliveira
2024-06-21 13:50     ` Juri Lelli
2024-06-21 14:41 ` Vineeth Remanan Pillai
2024-06-21 14:59   ` Daniel Bristot de Oliveira
2024-06-21 15:09     ` Vineeth Remanan Pillai
2024-06-21 15:16       ` Daniel Bristot de Oliveira
2024-07-29 10:32   ` Peter Zijlstra
2024-07-29 20:42     ` Vineeth Remanan Pillai
