* [RFC PATCH v2 00/25]  Hierarchical Constant Bandwidth Server
@ 2025-07-31 10:55 Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 01/25] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting Yuri Andriaccio
                   ` (25 more replies)
  0 siblings, 26 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

Hello,

This is v2 of the Hierarchical Constant Bandwidth Server, which aims at
replacing the current RT_GROUP_SCHED mechanism with something more robust and
theoretically sound. The patchset has been presented at OSPM25
(https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
be found at https://lwn.net/Articles/1021332/ . The v1 of this patchset, linked
at the bottom of this mail, explains in more detail what the patchset is about
and how it is implemented.

The big update in this v2 is the addition of migration code, which allows tasks
to be migrated between different CPUs (following, of course, their affinity
settings).

As requested, we've split the big patches into smaller chunks to improve
readability. Additionally, the series has been rebased on the latest tip/master
to keep up with the latest scheduler updates and the new dl_server features.

Last but not least, the first patch, which has been posted separately at
https://lore.kernel.org/all/20250725164412.35912-1-yurand2000@gmail.com/ , is
necessary to fully utilize the deadline bandwidth while keeping the fair-servers
active; refer to that link for details. The issue addressed by that patch also
affects HCBS: in the current kernel, by default, 5% of the realtime bandwidth is
reserved for fair-servers, 5% is not usable, and only the remaining 90% can be
used by deadline tasks or, in our case, by HCBS dl_servers. The first patch
fixes this and allows rt-tasks/servers to fully utilize the default 95% of the
bandwidth.
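
As a back-of-the-envelope check of the figures above: the kernel represents
these bandwidths as fixed-point fractions of BW_UNIT (1 << BW_SHIFT, with
BW_SHIFT = 20), computed by to_ratio(period, runtime). The user-space sketch
below (an illustration, not kernel code) reproduces the 95%/90% split using the
default sysctl and fair-server values:

#include <stdint.h>
#include <stdio.h>

#define BW_SHIFT 20
#define BW_UNIT  (1ULL << BW_SHIFT)

/* same idea as the kernel's to_ratio(): runtime/period as a fixed-point
 * fraction of BW_UNIT */
static uint64_t to_ratio(uint64_t period_us, uint64_t runtime_us)
{
	return (runtime_us << BW_SHIFT) / period_us;
}

int main(void)
{
	/* defaults: sched_rt_runtime_us = 950000, sched_rt_period_us = 1000000 */
	uint64_t rt_cap  = to_ratio(1000000, 950000);	/* 95% rt/dl limit */
	uint64_t fair_bw = to_ratio(1000000,  50000);	/* 5% fair-server  */

	/* without the first patch, fair-servers are charged against the 95% */
	printf("dl bandwidth before: %.0f%%\n",
	       100.0 * (double)(rt_cap - fair_bw) / BW_UNIT);	/* ~90% */
	/* with the first patch, fair-servers are accounted separately */
	printf("dl bandwidth after:  %.0f%%\n",
	       100.0 * (double)rt_cap / BW_UNIT);		/* ~95% */
	return 0;
}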

Summary of the patches:
     1) Account the fair-servers' bandwidth separately from that of other dl
        tasks and dl-servers.
   2-5) Preparation patches, so that the RT class code can be used both
        for normal and cgroup scheduling.
  6-15) Implementation of HCBS, without migration and with a one-level
        hierarchy only. The old RT_GROUP_SCHED code is removed.
 16-18) Remove cgroups v1 in favour of v2.
    19) Add support for deeper hierarchies.
 20-25) Add support for tasks migration.

Updates from v1:
- Rebase to tip/master.
- Add migration code.
- Split the big patches for better readability.
- Refactor code to use guarded locks where applicable.
- Drop patches from v1 that are no longer necessary, as they have been
  addressed differently by mainline updates.
- Remove unnecessary checks and general code cleanup.

Notes:
Task migration support needs some extra work to reduce its invasiveness,
especially patches 22-23.

Testing v2:
The HCBS mechanism has been further evaluated on two fully-fledged distros,
rather than on virtual machines, demonstrating the stability of this latest
version. A small suite of regression tests shows that the newly added mechanism
does not break fair-servers and other scheduling mechanisms. Stress tests show
that our implementation is robust, while time-based tests demonstrate that the
implementation matches the theoretical analysis of real-time tasksets.

The tests can be found at https://github.com/Yurand2000/HCBS-rust-initrd . The
executables are essentially the same as the ones mentioned in v1, minus some
updates; you can refer to that posting for additional details.

Future Work:

We want to further test this patchset and provide a better documented
description of the test suite, so that other people can also run it in a fully
automated fashion. Additionally, we will finish the currently partial and
untested implementation of HCBS with different runtimes per CPU (instead of
allocating the same runtime on all CPUs), to include it in a future RFC.

Future patches:
 - HCBS with different runtimes per CPU.
 - capacity aware bandwidth reservation.
 - enable/disable dl_servers when a CPU goes online/offline.

Have a nice day,
Yuri

v1: https://lore.kernel.org/all/20250605071412.139240-1-yurand2000@gmail.com/

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Yuri Andriaccio (6):
  sched/deadline: Remove fair-servers from real-time task's bandwidth
    accounting
  sched/rt: Disable RT_GROUP_SCHED
  sched/deadline: Account rt-cgroups bandwidth in
    sched_dl_global_validate
  sched/rt: Remove support for cgroups-v1
  sched/rt: Zero rt-cgroups default bandwidth
  sched/core: Execute enqueued balance callbacks when migrating tasks
    between cgroups

luca abeni (19):
  sched/deadline: Do not access dl_se->rq directly
  sched/deadline: Distinguish between dl_rq and my_q
  sched/rt: Pass an rt_rq instead of an rq where needed
  sched/rt: Move some functions from rt.c to sched.h
  sched/rt: Introduce HCBS specific structs in task_group
  sched/deadline: Account rt-cgroups bandwidth in deadline tasks
    schedulability tests.
  sched/core: Initialize root_task_group
  sched/deadline: Add dl_init_tg
  sched/rt: Add {alloc/free}_rt_sched_group and dl_server specific
    functions
  sched/rt: Add HCBS related checks and operations for rt tasks
  sched/rt: Update rt-cgroup schedulability checks
  sched/rt: Remove old RT_GROUP_SCHED data structures
  sched/core: Cgroup v2 support
  sched/deadline: Allow deeper hierarchies of RT cgroups
  sched/rt: Add rt-cgroup migration
  sched/rt: add HCBS migration related checks and function calls
  sched/deadline: Make rt-cgroup's servers pull tasks on timer
    replenishment
  sched/deadline: Fix HCBS migrations on server stop
  sched/core: Execute enqueued balance callbacks when changing allowed
    CPUs

 include/linux/sched.h    |   10 +-
 kernel/sched/autogroup.c |    4 +-
 kernel/sched/core.c      |   68 +-
 kernel/sched/deadline.c  |  311 ++--
 kernel/sched/debug.c     |    6 -
 kernel/sched/fair.c      |    6 +-
 kernel/sched/rt.c        | 3024 ++++++++++++++++++--------------------
 kernel/sched/sched.h     |  140 +-
 kernel/sched/syscalls.c  |    6 +-
 kernel/sched/topology.c  |    8 -
 10 files changed, 1829 insertions(+), 1754 deletions(-)

-- 
2.50.1


* [RFC PATCH v2 01/25] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 02/25] sched/deadline: Do not access dl_se->rq directly Yuri Andriaccio
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

Fair-servers are currently used in place of the old RT_THROTTLING mechanism to
prevent the starvation of SCHED_OTHER (and other lower priority) tasks when
real-time FIFO/RR processes try to fully utilize the CPU. To implement the
RT_THROTTLING mechanism, the maximum allocatable bandwidth for real-time tasks
was limited to 95% of the CPU time.

The RT_THROTTLING mechanism has now been removed in favor of fair-servers,
which are currently set to use, as expected, 5% of the CPU time. They are,
however, still accounted against the same bandwidth that is reserved for
real-time tasks, which remains set to 95% of the total CPU time. This means
that, with RT_THROTTLING removed, the bandwidth remaining for SCHED_DEADLINE
tasks and other dl-servers (FIFO/RR tasks are not affected) is only 90%.

This patch reclaims the 5% lost CPU time: that share is definitely reserved for
SCHED_OTHER tasks, but it should not be accounted together with the bandwidth of
real-time tasks. More generally, the fair-servers' bandwidth must not be
accounted together with that of other real-time tasks.

Updates:
- Make the fair-servers' bandwidth not be accounted into the total allocated
  bandwidth for real-time tasks.
- Remove the admission control test when allocating a fair-server.
- Do not account for fair-servers in the GRUB's bandwidth reclaiming mechanism.
- Limit the max bandwidth to (BW_UNIT - max_rt_bw) when changing the parameters
  of a fair-server, preventing overcommitment.
- Add dl_bw_fair, which computes the total allocated bandwidth of the
  fair-servers in the given root-domain.
- Update admission tests (in sched_dl_global_validate) when changing the
  maximum allocatable bandwidth for real-time tasks, preventing overcommitment.

Since a fair-server's bandwidth can be changed through debugfs, it is not
enforced that it always equals (BW_UNIT - max_rt_bw); rather, it must be less
than or equal to this value. This allows retaining the fair-server settings
configured through debugfs when changing max_rt_bw.

This also means that, in order to increase the maximum bandwidth for real-time
tasks, the bandwidth of the fair-servers must first be decreased through
debugfs, otherwise the admission tests will fail; vice versa, to increase the
bandwidth of the fair-servers, the bandwidth of real-time tasks must be reduced
beforehand.
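
For illustration, the sketch below shows this ordering on a single CPU,
ignoring capacity scaling; it is a user-space approximation, not kernel code
(the actual checks live in dl_server_apply_params() and
sched_dl_global_validate() in the diff below):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BW_SHIFT 20
#define BW_UNIT  (1ULL << BW_SHIFT)

/* a new global rt limit is admissible only if it leaves room for the
 * currently configured fair-server bandwidth */
static bool rt_limit_ok(uint64_t new_rt_bw, uint64_t fair_bw)
{
	return new_rt_bw + fair_bw <= BW_UNIT;
}

/* a fair-server update is admissible only within what the rt limit leaves free */
static bool fair_bw_ok(uint64_t new_fair_bw, uint64_t max_rt_bw)
{
	return new_fair_bw <= BW_UNIT - max_rt_bw;
}

int main(void)
{
	uint64_t pct2  = BW_UNIT *  2 / 100;
	uint64_t pct5  = BW_UNIT *  5 / 100;
	uint64_t pct95 = BW_UNIT * 95 / 100;
	uint64_t pct98 = BW_UNIT * 98 / 100;

	/* raising the rt limit to 98% fails while the fair-server keeps 5% */
	printf("rt -> 98%%, fair = 5%%: %s\n",
	       rt_limit_ok(pct98, pct5) ? "ok" : "-EBUSY");
	/* shrink the fair-server first (through debugfs), then raise the limit */
	printf("fair -> 2%%, rt = 95%%: %s\n",
	       fair_bw_ok(pct2, pct95) ? "ok" : "-EBUSY");
	printf("rt -> 98%%, fair = 2%%: %s\n",
	       rt_limit_ok(pct98, pct2) ? "ok" : "-EBUSY");
	return 0;
}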

This v2 version addresses the compilation error on i386 reported at:
https://lore.kernel.org/oe-kbuild-all/202507220727.BmA1Osdg-lkp@intel.com/

v1: https://lore.kernel.org/all/20250721111131.309388-1-yurand2000@gmail.com/

Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 kernel/sched/deadline.c | 66 ++++++++++++++++++-----------------------
 kernel/sched/sched.h    |  1 -
 kernel/sched/topology.c |  8 -----
 3 files changed, 29 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e2d51f4306b..8ba6bf3ef68 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -141,6 +141,24 @@ static inline int dl_bw_cpus(int i)
 	return cpus;
 }
 
+static inline u64 dl_bw_fair(int i)
+{
+	struct root_domain *rd = cpu_rq(i)->rd;
+	u64 fair_server_bw = 0;
+
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+			 "sched RCU must be held");
+
+	if (cpumask_subset(rd->span, cpu_active_mask))
+		i = cpumask_first(rd->span);
+
+	for_each_cpu_and(i, rd->span, cpu_active_mask) {
+		fair_server_bw += cpu_rq(i)->fair_server.dl_bw;
+	}
+
+	return fair_server_bw;
+}
+
 static inline unsigned long __dl_bw_capacity(const struct cpumask *mask)
 {
 	unsigned long cap = 0;
@@ -1657,25 +1675,9 @@ void sched_init_dl_servers(void)
 	}
 }
 
-void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
-{
-	u64 new_bw = dl_se->dl_bw;
-	int cpu = cpu_of(rq);
-	struct dl_bw *dl_b;
-
-	dl_b = dl_bw_of(cpu_of(rq));
-	guard(raw_spinlock)(&dl_b->lock);
-
-	if (!dl_bw_cpus(cpu))
-		return;
-
-	__dl_add(dl_b, new_bw, dl_bw_cpus(cpu));
-}
-
 int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
 {
-	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
-	u64 new_bw = to_ratio(period, runtime);
+	u64 max_bw, new_bw = to_ratio(period, runtime);
 	struct rq *rq = dl_se->rq;
 	int cpu = cpu_of(rq);
 	struct dl_bw *dl_b;
@@ -1688,17 +1690,14 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 
 	cpus = dl_bw_cpus(cpu);
 	cap = dl_bw_capacity(cpu);
+	max_bw = div64_ul(cap_scale(BW_UNIT - dl_b->bw, cap), (unsigned long)cpus);
 
-	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
+	if (new_bw > max_bw)
 		return -EBUSY;
 
 	if (init) {
 		__add_rq_bw(new_bw, &rq->dl);
-		__dl_add(dl_b, new_bw, cpus);
 	} else {
-		__dl_sub(dl_b, dl_se->dl_bw, cpus);
-		__dl_add(dl_b, new_bw, cpus);
-
 		dl_rq_change_utilization(rq, dl_se, new_bw);
 	}
 
@@ -2939,17 +2938,6 @@ void dl_clear_root_domain(struct root_domain *rd)
 	rd->dl_bw.total_bw = 0;
 	for_each_cpu(i, rd->span)
 		cpu_rq(i)->dl.extra_bw = cpu_rq(i)->dl.max_bw;
-
-	/*
-	 * dl_servers are not tasks. Since dl_add_task_root_domain ignores
-	 * them, we need to account for them here explicitly.
-	 */
-	for_each_cpu(i, rd->span) {
-		struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
-
-		if (dl_server(dl_se) && cpu_active(i))
-			__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
-	}
 }
 
 void dl_clear_root_domain_cpu(int cpu)
@@ -3133,9 +3121,10 @@ int sched_dl_global_validate(void)
 	u64 period = global_rt_period();
 	u64 new_bw = to_ratio(period, runtime);
 	u64 cookie = ++dl_cookie;
+	u64 fair_bw;
 	struct dl_bw *dl_b;
-	int cpu, cpus, ret = 0;
-	unsigned long flags;
+	int cpu, ret = 0;
+	unsigned long cap, flags;
 
 	/*
 	 * Here we want to check the bandwidth not being set to some
@@ -3149,10 +3138,13 @@ int sched_dl_global_validate(void)
 			goto next;
 
 		dl_b = dl_bw_of(cpu);
-		cpus = dl_bw_cpus(cpu);
+		cap = dl_bw_capacity(cpu);
+		fair_bw = dl_bw_fair(cpu);
 
 		raw_spin_lock_irqsave(&dl_b->lock, flags);
-		if (new_bw * cpus < dl_b->total_bw)
+		if (cap_scale(new_bw, cap) < dl_b->total_bw)
+			ret = -EBUSY;
+		if (cap_scale(new_bw, cap) + fair_bw > cap_scale(BW_UNIT, cap))
 			ret = -EBUSY;
 		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d3f33d10c58..8719ab8a817 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -390,7 +390,6 @@ extern void sched_init_dl_servers(void);
 extern void dl_server_update_idle_time(struct rq *rq,
 		    struct task_struct *p);
 extern void fair_server_init(struct rq *rq);
-extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 977e133bb8a..4ea3365984a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -500,14 +500,6 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
 		set_rq_online(rq);
 
-	/*
-	 * Because the rq is not a task, dl_add_task_root_domain() did not
-	 * move the fair server bw to the rd if it already started.
-	 * Add it now.
-	 */
-	if (rq->fair_server.dl_server)
-		__dl_server_attach_root(&rq->fair_server, rq);
-
 	rq_unlock_irqrestore(rq, &rf);
 
 	if (old_rd)
-- 
2.50.1


* [RFC PATCH v2 02/25] sched/deadline: Do not access dl_se->rq directly
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 01/25] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-08-14  8:30   ` Juri Lelli
  2025-07-31 10:55 ` [RFC PATCH v2 03/25] sched/deadline: Distinguish between dl_rq and my_q Yuri Andriaccio
                   ` (23 subsequent siblings)
  25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Make the deadline.c code access a scheduling entity's runqueue through
rq_of_dl_se() (or an already resolved rq) instead of dereferencing the pointer
saved in the sched_dl_entity data structure directly. This allows future
patches to store runqueues other than the global ones in sched_dl_entity.

Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 kernel/sched/deadline.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8ba6bf3ef68..46b9b78cca2 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -892,7 +892,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 * and arm the defer timer.
 	 */
 	if (dl_se->dl_defer && !dl_se->dl_defer_running &&
-	    dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
+	    dl_time_before(rq_clock(rq), dl_se->deadline - dl_se->runtime)) {
 		if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
 
 			/*
@@ -1202,11 +1202,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 			 * of time. The dl_server_min_res serves as a limit to avoid
 			 * forwarding the timer for a too small amount of time.
 			 */
-			if (dl_time_before(rq_clock(dl_se->rq),
+			if (dl_time_before(rq_clock(rq),
 					   (dl_se->deadline - dl_se->runtime - dl_server_min_res))) {
 
 				/* reset the defer timer */
-				fw = dl_se->deadline - rq_clock(dl_se->rq) - dl_se->runtime;
+				fw = dl_se->deadline - rq_clock(rq) - dl_se->runtime;
 
 				hrtimer_forward_now(timer, ns_to_ktime(fw));
 				return HRTIMER_RESTART;
@@ -1217,7 +1217,7 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 
 		enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
 
-		if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &dl_se->rq->curr->dl))
+		if (!dl_task(rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
 			resched_curr(rq);
 
 		__push_dl_task(rq, rf);
@@ -1485,7 +1485,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 
 		hrtimer_try_to_cancel(&dl_se->dl_timer);
 
-		replenish_dl_new_period(dl_se, dl_se->rq);
+		replenish_dl_new_period(dl_se, rq);
 
 		/*
 		 * Not being able to start the timer seems problematic. If it could not
@@ -1597,21 +1597,22 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 	/* 0 runtime = fair server disabled */
 	if (dl_se->dl_runtime) {
 		dl_se->dl_server_idle = 0;
-		update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+		update_curr_dl_se(rq_of_dl_se(dl_se), dl_se, delta_exec);
 	}
 }
 
 void dl_server_start(struct sched_dl_entity *dl_se)
 {
-	struct rq *rq = dl_se->rq;
+	struct rq *rq;
 
 	if (!dl_server(dl_se) || dl_se->dl_server_active)
 		return;
 
 	dl_se->dl_server_active = 1;
 	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
-	if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
-		resched_curr(dl_se->rq);
+	rq = rq_of_dl_se(dl_se);
+	if (!dl_task(rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
+		resched_curr(rq);
 }
 
 void dl_server_stop(struct sched_dl_entity *dl_se)
@@ -1667,9 +1668,9 @@ void sched_init_dl_servers(void)
 
 		WARN_ON(dl_server(dl_se));
 
-		dl_server_apply_params(dl_se, runtime, period, 1);
-
 		dl_se->dl_server = 1;
+		BUG_ON(dl_server_apply_params(dl_se, runtime, period, 1));
+
 		dl_se->dl_defer = 1;
 		setup_new_dl_entity(dl_se);
 	}
@@ -1678,7 +1679,7 @@ void sched_init_dl_servers(void)
 int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
 {
 	u64 max_bw, new_bw = to_ratio(period, runtime);
-	struct rq *rq = dl_se->rq;
+	struct rq *rq = rq_of_dl_se(dl_se);
 	int cpu = cpu_of(rq);
 	struct dl_bw *dl_b;
 	unsigned long cap;
@@ -1752,7 +1753,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 		p = dl_task_of(dl_se);
 		rq = task_rq_lock(p, &rf);
 	} else {
-		rq = dl_se->rq;
+		rq = rq_of_dl_se(dl_se);
 		rq_lock(rq, &rf);
 	}
 
-- 
2.50.1


* [RFC PATCH v2 03/25] sched/deadline: Distinguish between dl_rq and my_q
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 01/25] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 02/25] sched/deadline: Do not access dl_se->rq directly Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 04/25] sched/rt: Pass an rt_rq instead of an rq where needed Yuri Andriaccio
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Replace the single runqueue pointer in sched_dl_entity with two fields, to
distinguish between the runqueue on which the dl_server is queued (dl_rq) and
the runqueue which it serves (my_q).
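
For illustration, here is a minimal user-space sketch of the new wiring,
modelled on the fair_server_init()/rq_of_dl_se() hunks below; the structures
are stand-ins, not the kernel ones, and a group server passing its own
runqueue as served_rq is only how HCBS is expected to use this later:

#include <stddef.h>
#include <stdio.h>

/* stand-ins for the kernel structures, just to show the wiring */
struct dl_rq { int nr_running; };
struct rq    { struct dl_rq dl; int cpu; };

struct sched_dl_entity {
	struct dl_rq *dl_rq;	/* where this entity is (to be) queued */
	struct rq    *my_q;	/* the runqueue served by this entity  */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* mirrors the new dl_server_init() split: queueing point vs. served rq */
static void dl_server_init(struct sched_dl_entity *dl_se,
			   struct dl_rq *dl_rq, struct rq *served_rq)
{
	dl_se->dl_rq = dl_rq;
	dl_se->my_q  = served_rq;
}

/* mirrors the updated rq_of_dl_se() for the dl_server case */
static struct rq *rq_of_dl_se(struct sched_dl_entity *dl_se)
{
	return container_of(dl_se->dl_rq, struct rq, dl);
}

int main(void)
{
	struct rq rq0 = { .cpu = 0 };
	struct sched_dl_entity fair_server;

	/* the fair server is queued on rq0's dl_rq and also serves rq0;
	 * a group server would pass the group's own runqueue as served_rq */
	dl_server_init(&fair_server, &rq0.dl, &rq0);
	printf("queued on cpu %d, serving cpu %d\n",
	       rq_of_dl_se(&fair_server)->cpu, fair_server.my_q->cpu);
	return 0;
}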

Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 include/linux/sched.h   |  6 ++++--
 kernel/sched/deadline.c | 11 +++++++----
 kernel/sched/fair.c     |  6 +++---
 kernel/sched/sched.h    |  3 ++-
 4 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 40d2fa90df4..f0c8229afd1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -724,12 +724,14 @@ struct sched_dl_entity {
 	 * Bits for DL-server functionality. Also see the comment near
 	 * dl_server_update().
 	 *
-	 * @rq the runqueue this server is for
+	 * @dl_rq the runqueue on which this entity is (to be) queued
+	 * @my_q  the runqueue "owned" by this entity
 	 *
 	 * @server_has_tasks() returns true if @server_pick return a
 	 * runnable task.
 	 */
-	struct rq			*rq;
+	struct dl_rq			*dl_rq;
+	struct rq			*my_q;
 	dl_server_has_tasks_f		server_has_tasks;
 	dl_server_pick_f		server_pick_task;
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 46b9b78cca2..73ca5c0a086 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -75,11 +75,12 @@ static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
 
 static inline struct rq *rq_of_dl_se(struct sched_dl_entity *dl_se)
 {
-	struct rq *rq = dl_se->rq;
+	struct rq *rq;
 
 	if (!dl_server(dl_se))
 		rq = task_rq(dl_task_of(dl_se));
-
+	else
+		rq = container_of(dl_se->dl_rq, struct rq, dl);
 	return rq;
 }
 
@@ -1641,11 +1642,13 @@ static bool dl_server_stopped(struct sched_dl_entity *dl_se)
 	return false;
 }
 
-void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
+		    struct rq *served_rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task)
 {
-	dl_se->rq = rq;
+	dl_se->dl_rq = dl_rq;
+	dl_se->my_q  = served_rq;
 	dl_se->server_has_tasks = has_tasks;
 	dl_se->server_pick_task = pick_task;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315..2723086538b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8861,12 +8861,12 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_stru
 
 static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
 {
-	return !!dl_se->rq->cfs.nr_queued;
+	return !!dl_se->my_q->cfs.nr_queued;
 }
 
 static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
 {
-	return pick_task_fair(dl_se->rq);
+	return pick_task_fair(dl_se->my_q);
 }
 
 void fair_server_init(struct rq *rq)
@@ -8875,7 +8875,7 @@ void fair_server_init(struct rq *rq)
 
 	init_dl_entity(dl_se);
 
-	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_task);
+	dl_server_init(dl_se, &rq->dl, rq, fair_server_has_tasks, fair_server_pick_task);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8719ab8a817..a8073d0824d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -382,7 +382,8 @@ extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s6
 extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
 extern void dl_server_start(struct sched_dl_entity *dl_se);
 extern void dl_server_stop(struct sched_dl_entity *dl_se);
-extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
+		    struct rq *served_rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task);
 extern void sched_init_dl_servers(void);
-- 
2.50.1


* [RFC PATCH v2 04/25] sched/rt: Pass an rt_rq instead of an rq where needed
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (2 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 03/25] sched/deadline: Distinguish between dl_rq and my_q Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-08-14  8:46   ` Juri Lelli
  2025-07-31 10:55 ` [RFC PATCH v2 05/25] sched/rt: Move some functions from rt.c to sched.h Yuri Andriaccio
                   ` (21 subsequent siblings)
  25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Make the rt.c code access the runqueue through the rt_rq data structure rather
than passing an rq pointer directly. This allows future patches to define rt_rq
data structures which refer not only to the global runqueue but also to local
cgroup runqueues (rt_rq is not always equal to &rq->rt).
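
For illustration, a minimal user-space sketch of the convention (stand-in
types, not the kernel structures): helpers operate on the rt_rq they are given
and recover the owning rq via rq_of_rt_rq() only where needed:

#include <stdbool.h>
#include <stdio.h>

struct rq;
struct rt_rq {
	struct rq *rq;		/* owning CPU runqueue                  */
	int        nr_pushable;	/* stand-in for the pushable_tasks list */
};
struct rq {
	struct rt_rq rt;	/* the global rt_rq; a group rt_rq would
				 * live elsewhere but still set ->rq    */
	int cpu;
};

/* operates on any rt_rq, not just &rq->rt */
static bool has_pushable_tasks(struct rt_rq *rt_rq)
{
	return rt_rq->nr_pushable != 0;
}

static struct rq *rq_of_rt_rq(struct rt_rq *rt_rq)
{
	return rt_rq->rq;
}

int main(void)
{
	struct rq rq0 = { .cpu = 0 };

	rq0.rt.rq = &rq0;
	rq0.rt.nr_pushable = 1;

	if (has_pushable_tasks(&rq0.rt))
		printf("cpu %d has pushable rt tasks\n",
		       rq_of_rt_rq(&rq0.rt)->cpu);
	return 0;
}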

Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 kernel/sched/rt.c | 83 +++++++++++++++++++++++++----------------------
 1 file changed, 44 insertions(+), 39 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7936d433373..945e3d705cc 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -370,9 +370,9 @@ static inline void rt_clear_overload(struct rq *rq)
 	cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask);
 }
 
-static inline int has_pushable_tasks(struct rq *rq)
+static inline int has_pushable_tasks(struct rt_rq *rt_rq)
 {
-	return !plist_head_empty(&rq->rt.pushable_tasks);
+	return !plist_head_empty(&rt_rq->pushable_tasks);
 }
 
 static DEFINE_PER_CPU(struct balance_callback, rt_push_head);
@@ -383,7 +383,7 @@ static void pull_rt_task(struct rq *);
 
 static inline void rt_queue_push_tasks(struct rq *rq)
 {
-	if (!has_pushable_tasks(rq))
+	if (!has_pushable_tasks(&rq->rt))
 		return;
 
 	queue_balance_callback(rq, &per_cpu(rt_push_head, rq->cpu), push_rt_tasks);
@@ -394,37 +394,37 @@ static inline void rt_queue_pull_task(struct rq *rq)
 	queue_balance_callback(rq, &per_cpu(rt_pull_head, rq->cpu), pull_rt_task);
 }
 
-static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
+static void enqueue_pushable_task(struct rt_rq *rt_rq, struct task_struct *p)
 {
-	plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
+	plist_del(&p->pushable_tasks, &rt_rq->pushable_tasks);
 	plist_node_init(&p->pushable_tasks, p->prio);
-	plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks);
+	plist_add(&p->pushable_tasks, &rt_rq->pushable_tasks);
 
 	/* Update the highest prio pushable task */
-	if (p->prio < rq->rt.highest_prio.next)
-		rq->rt.highest_prio.next = p->prio;
+	if (p->prio < rt_rq->highest_prio.next)
+		rt_rq->highest_prio.next = p->prio;
 
-	if (!rq->rt.overloaded) {
-		rt_set_overload(rq);
-		rq->rt.overloaded = 1;
+	if (!rt_rq->overloaded) {
+		rt_set_overload(rq_of_rt_rq(rt_rq));
+		rt_rq->overloaded = 1;
 	}
 }
 
-static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
+static void dequeue_pushable_task(struct rt_rq *rt_rq, struct task_struct *p)
 {
-	plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
+	plist_del(&p->pushable_tasks, &rt_rq->pushable_tasks);
 
 	/* Update the new highest prio pushable task */
-	if (has_pushable_tasks(rq)) {
-		p = plist_first_entry(&rq->rt.pushable_tasks,
+	if (has_pushable_tasks(rt_rq)) {
+		p = plist_first_entry(&rt_rq->pushable_tasks,
 				      struct task_struct, pushable_tasks);
-		rq->rt.highest_prio.next = p->prio;
+		rt_rq->highest_prio.next = p->prio;
 	} else {
-		rq->rt.highest_prio.next = MAX_RT_PRIO-1;
+		rt_rq->highest_prio.next = MAX_RT_PRIO-1;
 
-		if (rq->rt.overloaded) {
-			rt_clear_overload(rq);
-			rq->rt.overloaded = 0;
+		if (rt_rq->overloaded) {
+			rt_clear_overload(rq_of_rt_rq(rt_rq));
+			rt_rq->overloaded = 0;
 		}
 	}
 }
@@ -1431,6 +1431,7 @@ static void
 enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct sched_rt_entity *rt_se = &p->rt;
+	struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
 
 	if (flags & ENQUEUE_WAKEUP)
 		rt_se->timeout = 0;
@@ -1444,17 +1445,18 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 		return;
 
 	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
-		enqueue_pushable_task(rq, p);
+		enqueue_pushable_task(rt_rq, p);
 }
 
 static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct sched_rt_entity *rt_se = &p->rt;
+	struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
 
 	update_curr_rt(rq);
 	dequeue_rt_entity(rt_se, flags);
 
-	dequeue_pushable_task(rq, p);
+	dequeue_pushable_task(rt_rq, p);
 
 	return true;
 }
@@ -1639,14 +1641,14 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
 static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first)
 {
 	struct sched_rt_entity *rt_se = &p->rt;
-	struct rt_rq *rt_rq = &rq->rt;
+	struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
 
 	p->se.exec_start = rq_clock_task(rq);
 	if (on_rt_rq(&p->rt))
 		update_stats_wait_end_rt(rt_rq, rt_se);
 
 	/* The running task is never eligible for pushing */
-	dequeue_pushable_task(rq, p);
+	dequeue_pushable_task(rt_rq, p);
 
 	if (!first)
 		return;
@@ -1710,7 +1712,7 @@ static struct task_struct *pick_task_rt(struct rq *rq)
 static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_struct *next)
 {
 	struct sched_rt_entity *rt_se = &p->rt;
-	struct rt_rq *rt_rq = &rq->rt;
+	struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
 
 	if (on_rt_rq(&p->rt))
 		update_stats_wait_start_rt(rt_rq, rt_se);
@@ -1726,7 +1728,7 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_s
 	 * if it is still active
 	 */
 	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
-		enqueue_pushable_task(rq, p);
+		enqueue_pushable_task(rt_rq, p);
 }
 
 /* Only try algorithms three times */
@@ -1736,16 +1738,16 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_s
  * Return the highest pushable rq's task, which is suitable to be executed
  * on the CPU, NULL otherwise
  */
-static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
+static struct task_struct *pick_highest_pushable_task(struct rt_rq *rt_rq, int cpu)
 {
-	struct plist_head *head = &rq->rt.pushable_tasks;
+	struct plist_head *head = &rt_rq->pushable_tasks;
 	struct task_struct *p;
 
-	if (!has_pushable_tasks(rq))
+	if (!has_pushable_tasks(rt_rq))
 		return NULL;
 
 	plist_for_each_entry(p, head, pushable_tasks) {
-		if (task_is_pushable(rq, p, cpu))
+		if (task_is_pushable(rq_of_rt_rq(rt_rq), p, cpu))
 			return p;
 	}
 
@@ -1845,14 +1847,15 @@ static int find_lowest_rq(struct task_struct *task)
 	return -1;
 }
 
-static struct task_struct *pick_next_pushable_task(struct rq *rq)
+static struct task_struct *pick_next_pushable_task(struct rt_rq *rt_rq)
 {
+	struct rq *rq = rq_of_rt_rq(rt_rq);
 	struct task_struct *p;
 
-	if (!has_pushable_tasks(rq))
+	if (!has_pushable_tasks(rt_rq))
 		return NULL;
 
-	p = plist_first_entry(&rq->rt.pushable_tasks,
+	p = plist_first_entry(&rt_rq->pushable_tasks,
 			      struct task_struct, pushable_tasks);
 
 	BUG_ON(rq->cpu != task_cpu(p));
@@ -1905,7 +1908,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 			 */
 			if (unlikely(is_migration_disabled(task) ||
 				     !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
-				     task != pick_next_pushable_task(rq))) {
+				     task != pick_next_pushable_task(&rq->rt))) {
 
 				double_unlock_balance(rq, lowest_rq);
 				lowest_rq = NULL;
@@ -1939,7 +1942,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 	if (!rq->rt.overloaded)
 		return 0;
 
-	next_task = pick_next_pushable_task(rq);
+	next_task = pick_next_pushable_task(&rq->rt);
 	if (!next_task)
 		return 0;
 
@@ -2014,7 +2017,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 		 * run-queue and is also still the next task eligible for
 		 * pushing.
 		 */
-		task = pick_next_pushable_task(rq);
+		task = pick_next_pushable_task(&rq->rt);
 		if (task == next_task) {
 			/*
 			 * The task hasn't migrated, and is still the next
@@ -2202,7 +2205,7 @@ void rto_push_irq_work_func(struct irq_work *work)
 	 * We do not need to grab the lock to check for has_pushable_tasks.
 	 * When it gets updated, a check is made if a push is possible.
 	 */
-	if (has_pushable_tasks(rq)) {
+	if (has_pushable_tasks(&rq->rt)) {
 		raw_spin_rq_lock(rq);
 		while (push_rt_task(rq, true))
 			;
@@ -2231,6 +2234,7 @@ static void pull_rt_task(struct rq *this_rq)
 	int this_cpu = this_rq->cpu, cpu;
 	bool resched = false;
 	struct task_struct *p, *push_task;
+	struct rt_rq *src_rt_rq;
 	struct rq *src_rq;
 	int rt_overload_count = rt_overloaded(this_rq);
 
@@ -2260,6 +2264,7 @@ static void pull_rt_task(struct rq *this_rq)
 			continue;
 
 		src_rq = cpu_rq(cpu);
+		src_rt_rq = &src_rq->rt;
 
 		/*
 		 * Don't bother taking the src_rq->lock if the next highest
@@ -2268,7 +2273,7 @@ static void pull_rt_task(struct rq *this_rq)
 		 * logically higher, the src_rq will push this task away.
 		 * And if its going logically lower, we do not care
 		 */
-		if (src_rq->rt.highest_prio.next >=
+		if (src_rt_rq->highest_prio.next >=
 		    this_rq->rt.highest_prio.curr)
 			continue;
 
@@ -2284,7 +2289,7 @@ static void pull_rt_task(struct rq *this_rq)
 		 * We can pull only a task, which is pushable
 		 * on its rq, and no others.
 		 */
-		p = pick_highest_pushable_task(src_rq, this_cpu);
+		p = pick_highest_pushable_task(src_rt_rq, this_cpu);
 
 		/*
 		 * Do we have an RT task that preempts
-- 
2.50.1


* [RFC PATCH v2 05/25] sched/rt: Move some functions from rt.c to sched.h
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (3 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 04/25] sched/rt: Pass an rt_rq instead of an rq where needed Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-08-14  8:50   ` Juri Lelli
  2025-07-31 10:55 ` [RFC PATCH v2 06/25] sched/rt: Disable RT_GROUP_SCHED Yuri Andriaccio
                   ` (20 subsequent siblings)
  25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Move the following functions from rt.c to sched.h, so that they can also be
used by other source files:
- rt_task_of()
- rq_of_rt_rq()
- rt_rq_of_se()
- rq_of_rt_se()

There are no functional changes. This is needed by future patches.

Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 kernel/sched/rt.c    | 52 --------------------------------------------
 kernel/sched/sched.h | 51 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 52 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 945e3d705cc..3ea92b08a0e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -168,34 +168,6 @@ static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
 
 #define rt_entity_is_task(rt_se) (!(rt_se)->my_q)
 
-static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
-{
-	WARN_ON_ONCE(!rt_entity_is_task(rt_se));
-
-	return container_of(rt_se, struct task_struct, rt);
-}
-
-static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq)
-{
-	/* Cannot fold with non-CONFIG_RT_GROUP_SCHED version, layout */
-	WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
-	return rt_rq->rq;
-}
-
-static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
-{
-	WARN_ON(!rt_group_sched_enabled() && rt_se->rt_rq->tg != &root_task_group);
-	return rt_se->rt_rq;
-}
-
-static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
-{
-	struct rt_rq *rt_rq = rt_se->rt_rq;
-
-	WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
-	return rt_rq->rq;
-}
-
 void unregister_rt_sched_group(struct task_group *tg)
 {
 	if (!rt_group_sched_enabled())
@@ -296,30 +268,6 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
 
 #define rt_entity_is_task(rt_se) (1)
 
-static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
-{
-	return container_of(rt_se, struct task_struct, rt);
-}
-
-static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq)
-{
-	return container_of(rt_rq, struct rq, rt);
-}
-
-static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
-{
-	struct task_struct *p = rt_task_of(rt_se);
-
-	return task_rq(p);
-}
-
-static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
-{
-	struct rq *rq = rq_of_rt_se(rt_se);
-
-	return &rq->rt;
-}
-
 void unregister_rt_sched_group(struct task_group *tg) { }
 
 void free_rt_sched_group(struct task_group *tg) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a8073d0824d..a1e6d2852ca 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3042,6 +3042,57 @@ extern void set_rq_offline(struct rq *rq);
 
 extern bool sched_smp_initialized;
 
+#ifdef CONFIG_RT_GROUP_SCHED
+static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
+{
+#ifdef CONFIG_SCHED_DEBUG
+	WARN_ON_ONCE(rt_se->my_q);
+#endif
+	return container_of(rt_se, struct task_struct, rt);
+}
+
+static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq)
+{
+	return rt_rq->rq;
+}
+
+static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
+{
+	return rt_se->rt_rq;
+}
+
+static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
+{
+	struct rt_rq *rt_rq = rt_se->rt_rq;
+
+	return rt_rq->rq;
+}
+#else
+static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
+{
+	return container_of(rt_se, struct task_struct, rt);
+}
+
+static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq)
+{
+	return container_of(rt_rq, struct rq, rt);
+}
+
+static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
+{
+	struct task_struct *p = rt_task_of(rt_se);
+
+	return task_rq(p);
+}
+
+static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
+{
+	struct rq *rq = rq_of_rt_se(rt_se);
+
+	return &rq->rt;
+}
+#endif
+
 DEFINE_LOCK_GUARD_2(double_rq_lock, struct rq,
 		    double_rq_lock(_T->lock, _T->lock2),
 		    double_rq_unlock(_T->lock, _T->lock2))
-- 
2.50.1


* [RFC PATCH v2 06/25] sched/rt: Disable RT_GROUP_SCHED
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (4 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 05/25] sched/rt: Move some functions from rt.c to sched.h Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 07/25] sched/rt: Introduce HCBS specific structs in task_group Yuri Andriaccio
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

Disable the old RT_GROUP_SCHED scheduler. Note that this does not completely
remove all the RT_GROUP_SCHED functionality, just unhooks it and removes most of
the relevant functions. Some of the RT_GROUP_SCHED functions are kept because
they will be adapted for HCBS scheduling.

Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 kernel/sched/core.c     |   6 -
 kernel/sched/deadline.c |  34 --
 kernel/sched/debug.c    |   6 -
 kernel/sched/rt.c       | 848 ++--------------------------------------
 kernel/sched/sched.h    |   9 +-
 kernel/sched/syscalls.c |  13 -
 6 files changed, 26 insertions(+), 890 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ec00d08d46..42587a3c71f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8714,11 +8714,6 @@ void __init sched_init(void)
 
 	init_defrootdomain();
 
-#ifdef CONFIG_RT_GROUP_SCHED
-	init_rt_bandwidth(&root_task_group.rt_bandwidth,
-			global_rt_period(), global_rt_runtime());
-#endif /* CONFIG_RT_GROUP_SCHED */
-
 #ifdef CONFIG_CGROUP_SCHED
 	task_group_cache = KMEM_CACHE(task_group, 0);
 
@@ -8770,7 +8765,6 @@ void __init sched_init(void)
 		 * starts working after scheduler_running, which is not the case
 		 * yet.
 		 */
-		rq->rt.rt_runtime = global_rt_runtime();
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
 #endif
 		rq->sd = NULL;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 73ca5c0a086..0640d0ca45b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1524,40 +1524,6 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 		if (!is_leftmost(dl_se, &rq->dl))
 			resched_curr(rq);
 	}
-
-	/*
-	 * The fair server (sole dl_server) does not account for real-time
-	 * workload because it is running fair work.
-	 */
-	if (dl_se == &rq->fair_server)
-		return;
-
-#ifdef CONFIG_RT_GROUP_SCHED
-	/*
-	 * Because -- for now -- we share the rt bandwidth, we need to
-	 * account our runtime there too, otherwise actual rt tasks
-	 * would be able to exceed the shared quota.
-	 *
-	 * Account to the root rt group for now.
-	 *
-	 * The solution we're working towards is having the RT groups scheduled
-	 * using deadline servers -- however there's a few nasties to figure
-	 * out before that can happen.
-	 */
-	if (rt_bandwidth_enabled()) {
-		struct rt_rq *rt_rq = &rq->rt;
-
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		/*
-		 * We'll let actual RT tasks worry about the overflow here, we
-		 * have our own CBS to keep us inline; only account when RT
-		 * bandwidth is relevant.
-		 */
-		if (sched_rt_bandwidth_account(rt_rq))
-			rt_rq->rt_time += delta_exec;
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
-	}
-#endif /* CONFIG_RT_GROUP_SCHED */
 }
 
 /*
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 3f06ab84d53..f05decde708 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -892,12 +892,6 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 
 	PU(rt_nr_running);
 
-#ifdef CONFIG_RT_GROUP_SCHED
-	P(rt_throttled);
-	PN(rt_time);
-	PN(rt_runtime);
-#endif
-
 #undef PN
 #undef PU
 #undef P
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3ea92b08a0e..a6282784978 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1,3 +1,4 @@
+#pragma GCC diagnostic ignored "-Wunused-function"
 // SPDX-License-Identifier: GPL-2.0
 /*
  * Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR
@@ -82,117 +83,18 @@ void init_rt_rq(struct rt_rq *rt_rq)
 	rt_rq->highest_prio.next = MAX_RT_PRIO-1;
 	rt_rq->overloaded = 0;
 	plist_head_init(&rt_rq->pushable_tasks);
-	/* We start is dequeued state, because no RT tasks are queued */
-	rt_rq->rt_queued = 0;
-
-#ifdef CONFIG_RT_GROUP_SCHED
-	rt_rq->rt_time = 0;
-	rt_rq->rt_throttled = 0;
-	rt_rq->rt_runtime = 0;
-	raw_spin_lock_init(&rt_rq->rt_runtime_lock);
-	rt_rq->tg = &root_task_group;
-#endif
 }
 
 #ifdef CONFIG_RT_GROUP_SCHED
 
-static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
-
-static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
-{
-	struct rt_bandwidth *rt_b =
-		container_of(timer, struct rt_bandwidth, rt_period_timer);
-	int idle = 0;
-	int overrun;
-
-	raw_spin_lock(&rt_b->rt_runtime_lock);
-	for (;;) {
-		overrun = hrtimer_forward_now(timer, rt_b->rt_period);
-		if (!overrun)
-			break;
-
-		raw_spin_unlock(&rt_b->rt_runtime_lock);
-		idle = do_sched_rt_period_timer(rt_b, overrun);
-		raw_spin_lock(&rt_b->rt_runtime_lock);
-	}
-	if (idle)
-		rt_b->rt_period_active = 0;
-	raw_spin_unlock(&rt_b->rt_runtime_lock);
-
-	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
-}
-
-void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
-{
-	rt_b->rt_period = ns_to_ktime(period);
-	rt_b->rt_runtime = runtime;
-
-	raw_spin_lock_init(&rt_b->rt_runtime_lock);
-
-	hrtimer_setup(&rt_b->rt_period_timer, sched_rt_period_timer, CLOCK_MONOTONIC,
-		      HRTIMER_MODE_REL_HARD);
-}
-
-static inline void do_start_rt_bandwidth(struct rt_bandwidth *rt_b)
-{
-	raw_spin_lock(&rt_b->rt_runtime_lock);
-	if (!rt_b->rt_period_active) {
-		rt_b->rt_period_active = 1;
-		/*
-		 * SCHED_DEADLINE updates the bandwidth, as a run away
-		 * RT task with a DL task could hog a CPU. But DL does
-		 * not reset the period. If a deadline task was running
-		 * without an RT task running, it can cause RT tasks to
-		 * throttle when they start up. Kick the timer right away
-		 * to update the period.
-		 */
-		hrtimer_forward_now(&rt_b->rt_period_timer, ns_to_ktime(0));
-		hrtimer_start_expires(&rt_b->rt_period_timer,
-				      HRTIMER_MODE_ABS_PINNED_HARD);
-	}
-	raw_spin_unlock(&rt_b->rt_runtime_lock);
-}
-
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
-{
-	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
-		return;
-
-	do_start_rt_bandwidth(rt_b);
-}
-
-static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
-{
-	hrtimer_cancel(&rt_b->rt_period_timer);
-}
-
-#define rt_entity_is_task(rt_se) (!(rt_se)->my_q)
-
 void unregister_rt_sched_group(struct task_group *tg)
 {
-	if (!rt_group_sched_enabled())
-		return;
-
-	if (tg->rt_se)
-		destroy_rt_bandwidth(&tg->rt_bandwidth);
 }
 
 void free_rt_sched_group(struct task_group *tg)
 {
-	int i;
-
 	if (!rt_group_sched_enabled())
 		return;
-
-	for_each_possible_cpu(i) {
-		if (tg->rt_rq)
-			kfree(tg->rt_rq[i]);
-		if (tg->rt_se)
-			kfree(tg->rt_se[i]);
-	}
-
-	kfree(tg->rt_rq);
-	kfree(tg->rt_se);
 }
 
 void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
@@ -202,72 +104,23 @@ void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
 	struct rq *rq = cpu_rq(cpu);
 
 	rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
-	rt_rq->rt_nr_boosted = 0;
 	rt_rq->rq = rq;
 	rt_rq->tg = tg;
 
 	tg->rt_rq[cpu] = rt_rq;
 	tg->rt_se[cpu] = rt_se;
-
-	if (!rt_se)
-		return;
-
-	if (!parent)
-		rt_se->rt_rq = &rq->rt;
-	else
-		rt_se->rt_rq = parent->my_q;
-
-	rt_se->my_q = rt_rq;
-	rt_se->parent = parent;
-	INIT_LIST_HEAD(&rt_se->run_list);
 }
 
 int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
 {
-	struct rt_rq *rt_rq;
-	struct sched_rt_entity *rt_se;
-	int i;
-
 	if (!rt_group_sched_enabled())
 		return 1;
 
-	tg->rt_rq = kcalloc(nr_cpu_ids, sizeof(rt_rq), GFP_KERNEL);
-	if (!tg->rt_rq)
-		goto err;
-	tg->rt_se = kcalloc(nr_cpu_ids, sizeof(rt_se), GFP_KERNEL);
-	if (!tg->rt_se)
-		goto err;
-
-	init_rt_bandwidth(&tg->rt_bandwidth, ktime_to_ns(global_rt_period()), 0);
-
-	for_each_possible_cpu(i) {
-		rt_rq = kzalloc_node(sizeof(struct rt_rq),
-				     GFP_KERNEL, cpu_to_node(i));
-		if (!rt_rq)
-			goto err;
-
-		rt_se = kzalloc_node(sizeof(struct sched_rt_entity),
-				     GFP_KERNEL, cpu_to_node(i));
-		if (!rt_se)
-			goto err_free_rq;
-
-		init_rt_rq(rt_rq);
-		rt_rq->rt_runtime = tg->rt_bandwidth.rt_runtime;
-		init_tg_rt_entry(tg, rt_rq, rt_se, i, parent->rt_se[i]);
-	}
-
 	return 1;
-
-err_free_rq:
-	kfree(rt_rq);
-err:
-	return 0;
 }
 
 #else /* !CONFIG_RT_GROUP_SCHED: */
 
-#define rt_entity_is_task(rt_se) (1)
-
 void unregister_rt_sched_group(struct task_group *tg) { }
 
 void free_rt_sched_group(struct task_group *tg) { }
@@ -377,9 +230,6 @@ static void dequeue_pushable_task(struct rt_rq *rt_rq, struct task_struct *p)
 	}
 }
 
-static void enqueue_top_rt_rq(struct rt_rq *rt_rq);
-static void dequeue_top_rt_rq(struct rt_rq *rt_rq, unsigned int count);
-
 static inline int on_rt_rq(struct sched_rt_entity *rt_se)
 {
 	return rt_se->on_rq;
@@ -426,16 +276,6 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
 
 #ifdef CONFIG_RT_GROUP_SCHED
 
-static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
-{
-	return rt_rq->rt_runtime;
-}
-
-static inline u64 sched_rt_period(struct rt_rq *rt_rq)
-{
-	return ktime_to_ns(rt_rq->tg->rt_bandwidth.rt_period);
-}
-
 typedef struct task_group *rt_rq_iter_t;
 
 static inline struct task_group *next_task_group(struct task_group *tg)
@@ -461,457 +301,20 @@ static inline struct task_group *next_task_group(struct task_group *tg)
 		iter && (rt_rq = iter->rt_rq[cpu_of(rq)]);		\
 		iter = next_task_group(iter))
 
-#define for_each_sched_rt_entity(rt_se) \
-	for (; rt_se; rt_se = rt_se->parent)
-
-static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se)
-{
-	return rt_se->my_q;
-}
-
 static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags);
 static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags);
 
-static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
-{
-	struct task_struct *donor = rq_of_rt_rq(rt_rq)->donor;
-	struct rq *rq = rq_of_rt_rq(rt_rq);
-	struct sched_rt_entity *rt_se;
-
-	int cpu = cpu_of(rq);
-
-	rt_se = rt_rq->tg->rt_se[cpu];
-
-	if (rt_rq->rt_nr_running) {
-		if (!rt_se)
-			enqueue_top_rt_rq(rt_rq);
-		else if (!on_rt_rq(rt_se))
-			enqueue_rt_entity(rt_se, 0);
-
-		if (rt_rq->highest_prio.curr < donor->prio)
-			resched_curr(rq);
-	}
-}
-
-static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
-{
-	struct sched_rt_entity *rt_se;
-	int cpu = cpu_of(rq_of_rt_rq(rt_rq));
-
-	rt_se = rt_rq->tg->rt_se[cpu];
-
-	if (!rt_se) {
-		dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
-		/* Kick cpufreq (see the comment in kernel/sched/sched.h). */
-		cpufreq_update_util(rq_of_rt_rq(rt_rq), 0);
-	}
-	else if (on_rt_rq(rt_se))
-		dequeue_rt_entity(rt_se, 0);
-}
-
-static inline int rt_rq_throttled(struct rt_rq *rt_rq)
-{
-	return rt_rq->rt_throttled && !rt_rq->rt_nr_boosted;
-}
-
-static int rt_se_boosted(struct sched_rt_entity *rt_se)
-{
-	struct rt_rq *rt_rq = group_rt_rq(rt_se);
-	struct task_struct *p;
-
-	if (rt_rq)
-		return !!rt_rq->rt_nr_boosted;
-
-	p = rt_task_of(rt_se);
-	return p->prio != p->normal_prio;
-}
-
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return this_rq()->rd->span;
-}
-
-static inline
-struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
-{
-	return container_of(rt_b, struct task_group, rt_bandwidth)->rt_rq[cpu];
-}
-
-static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
-{
-	return &rt_rq->tg->rt_bandwidth;
-}
-
-bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
-{
-	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
-
-	return (hrtimer_active(&rt_b->rt_period_timer) ||
-		rt_rq->rt_time < rt_b->rt_runtime);
-}
-
-/*
- * We ran out of runtime, see if we can borrow some from our neighbours.
- */
-static void do_balance_runtime(struct rt_rq *rt_rq)
-{
-	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
-	struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;
-	int i, weight;
-	u64 rt_period;
-
-	weight = cpumask_weight(rd->span);
-
-	raw_spin_lock(&rt_b->rt_runtime_lock);
-	rt_period = ktime_to_ns(rt_b->rt_period);
-	for_each_cpu(i, rd->span) {
-		struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
-		s64 diff;
-
-		if (iter == rt_rq)
-			continue;
-
-		raw_spin_lock(&iter->rt_runtime_lock);
-		/*
-		 * Either all rqs have inf runtime and there's nothing to steal
-		 * or __disable_runtime() below sets a specific rq to inf to
-		 * indicate its been disabled and disallow stealing.
-		 */
-		if (iter->rt_runtime == RUNTIME_INF)
-			goto next;
-
-		/*
-		 * From runqueues with spare time, take 1/n part of their
-		 * spare time, but no more than our period.
-		 */
-		diff = iter->rt_runtime - iter->rt_time;
-		if (diff > 0) {
-			diff = div_u64((u64)diff, weight);
-			if (rt_rq->rt_runtime + diff > rt_period)
-				diff = rt_period - rt_rq->rt_runtime;
-			iter->rt_runtime -= diff;
-			rt_rq->rt_runtime += diff;
-			if (rt_rq->rt_runtime == rt_period) {
-				raw_spin_unlock(&iter->rt_runtime_lock);
-				break;
-			}
-		}
-next:
-		raw_spin_unlock(&iter->rt_runtime_lock);
-	}
-	raw_spin_unlock(&rt_b->rt_runtime_lock);
-}
-
-/*
- * Ensure this RQ takes back all the runtime it lend to its neighbours.
- */
-static void __disable_runtime(struct rq *rq)
-{
-	struct root_domain *rd = rq->rd;
-	rt_rq_iter_t iter;
-	struct rt_rq *rt_rq;
-
-	if (unlikely(!scheduler_running))
-		return;
-
-	for_each_rt_rq(rt_rq, iter, rq) {
-		struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
-		s64 want;
-		int i;
-
-		raw_spin_lock(&rt_b->rt_runtime_lock);
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		/*
-		 * Either we're all inf and nobody needs to borrow, or we're
-		 * already disabled and thus have nothing to do, or we have
-		 * exactly the right amount of runtime to take out.
-		 */
-		if (rt_rq->rt_runtime == RUNTIME_INF ||
-				rt_rq->rt_runtime == rt_b->rt_runtime)
-			goto balanced;
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
-
-		/*
-		 * Calculate the difference between what we started out with
-		 * and what we current have, that's the amount of runtime
-		 * we lend and now have to reclaim.
-		 */
-		want = rt_b->rt_runtime - rt_rq->rt_runtime;
-
-		/*
-		 * Greedy reclaim, take back as much as we can.
-		 */
-		for_each_cpu(i, rd->span) {
-			struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
-			s64 diff;
-
-			/*
-			 * Can't reclaim from ourselves or disabled runqueues.
-			 */
-			if (iter == rt_rq || iter->rt_runtime == RUNTIME_INF)
-				continue;
-
-			raw_spin_lock(&iter->rt_runtime_lock);
-			if (want > 0) {
-				diff = min_t(s64, iter->rt_runtime, want);
-				iter->rt_runtime -= diff;
-				want -= diff;
-			} else {
-				iter->rt_runtime -= want;
-				want -= want;
-			}
-			raw_spin_unlock(&iter->rt_runtime_lock);
-
-			if (!want)
-				break;
-		}
-
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		/*
-		 * We cannot be left wanting - that would mean some runtime
-		 * leaked out of the system.
-		 */
-		WARN_ON_ONCE(want);
-balanced:
-		/*
-		 * Disable all the borrow logic by pretending we have inf
-		 * runtime - in which case borrowing doesn't make sense.
-		 */
-		rt_rq->rt_runtime = RUNTIME_INF;
-		rt_rq->rt_throttled = 0;
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
-		raw_spin_unlock(&rt_b->rt_runtime_lock);
-
-		/* Make rt_rq available for pick_next_task() */
-		sched_rt_rq_enqueue(rt_rq);
-	}
-}
-
-static void __enable_runtime(struct rq *rq)
-{
-	rt_rq_iter_t iter;
-	struct rt_rq *rt_rq;
-
-	if (unlikely(!scheduler_running))
-		return;
-
-	/*
-	 * Reset each runqueue's bandwidth settings
-	 */
-	for_each_rt_rq(rt_rq, iter, rq) {
-		struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
-
-		raw_spin_lock(&rt_b->rt_runtime_lock);
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		rt_rq->rt_runtime = rt_b->rt_runtime;
-		rt_rq->rt_time = 0;
-		rt_rq->rt_throttled = 0;
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
-		raw_spin_unlock(&rt_b->rt_runtime_lock);
-	}
-}
-
-static void balance_runtime(struct rt_rq *rt_rq)
-{
-	if (!sched_feat(RT_RUNTIME_SHARE))
-		return;
-
-	if (rt_rq->rt_time > rt_rq->rt_runtime) {
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
-		do_balance_runtime(rt_rq);
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-	}
-}
-
-static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
-{
-	int i, idle = 1, throttled = 0;
-	const struct cpumask *span;
-
-	span = sched_rt_period_mask();
-
-	/*
-	 * FIXME: isolated CPUs should really leave the root task group,
-	 * whether they are isolcpus or were isolated via cpusets, lest
-	 * the timer run on a CPU which does not service all runqueues,
-	 * potentially leaving other CPUs indefinitely throttled.  If
-	 * isolation is really required, the user will turn the throttle
-	 * off to kill the perturbations it causes anyway.  Meanwhile,
-	 * this maintains functionality for boot and/or troubleshooting.
-	 */
-	if (rt_b == &root_task_group.rt_bandwidth)
-		span = cpu_online_mask;
-
-	for_each_cpu(i, span) {
-		int enqueue = 0;
-		struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
-		struct rq *rq = rq_of_rt_rq(rt_rq);
-		struct rq_flags rf;
-		int skip;
-
-		/*
-		 * When span == cpu_online_mask, taking each rq->lock
-		 * can be time-consuming. Try to avoid it when possible.
-		 */
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		if (!sched_feat(RT_RUNTIME_SHARE) && rt_rq->rt_runtime != RUNTIME_INF)
-			rt_rq->rt_runtime = rt_b->rt_runtime;
-		skip = !rt_rq->rt_time && !rt_rq->rt_nr_running;
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
-		if (skip)
-			continue;
-
-		rq_lock(rq, &rf);
-		update_rq_clock(rq);
-
-		if (rt_rq->rt_time) {
-			u64 runtime;
-
-			raw_spin_lock(&rt_rq->rt_runtime_lock);
-			if (rt_rq->rt_throttled)
-				balance_runtime(rt_rq);
-			runtime = rt_rq->rt_runtime;
-			rt_rq->rt_time -= min(rt_rq->rt_time, overrun*runtime);
-			if (rt_rq->rt_throttled && rt_rq->rt_time < runtime) {
-				rt_rq->rt_throttled = 0;
-				enqueue = 1;
-
-				/*
-				 * When we're idle and a woken (rt) task is
-				 * throttled wakeup_preempt() will set
-				 * skip_update and the time between the wakeup
-				 * and this unthrottle will get accounted as
-				 * 'runtime'.
-				 */
-				if (rt_rq->rt_nr_running && rq->curr == rq->idle)
-					rq_clock_cancel_skipupdate(rq);
-			}
-			if (rt_rq->rt_time || rt_rq->rt_nr_running)
-				idle = 0;
-			raw_spin_unlock(&rt_rq->rt_runtime_lock);
-		} else if (rt_rq->rt_nr_running) {
-			idle = 0;
-			if (!rt_rq_throttled(rt_rq))
-				enqueue = 1;
-		}
-		if (rt_rq->rt_throttled)
-			throttled = 1;
-
-		if (enqueue)
-			sched_rt_rq_enqueue(rt_rq);
-		rq_unlock(rq, &rf);
-	}
-
-	if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
-		return 1;
-
-	return idle;
-}
-
-static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
-{
-	u64 runtime = sched_rt_runtime(rt_rq);
-
-	if (rt_rq->rt_throttled)
-		return rt_rq_throttled(rt_rq);
-
-	if (runtime >= sched_rt_period(rt_rq))
-		return 0;
-
-	balance_runtime(rt_rq);
-	runtime = sched_rt_runtime(rt_rq);
-	if (runtime == RUNTIME_INF)
-		return 0;
-
-	if (rt_rq->rt_time > runtime) {
-		struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
-
-		/*
-		 * Don't actually throttle groups that have no runtime assigned
-		 * but accrue some time due to boosting.
-		 */
-		if (likely(rt_b->rt_runtime)) {
-			rt_rq->rt_throttled = 1;
-			printk_deferred_once("sched: RT throttling activated\n");
-		} else {
-			/*
-			 * In case we did anyway, make it go away,
-			 * replenishment is a joke, since it will replenish us
-			 * with exactly 0 ns.
-			 */
-			rt_rq->rt_time = 0;
-		}
-
-		if (rt_rq_throttled(rt_rq)) {
-			sched_rt_rq_dequeue(rt_rq);
-			return 1;
-		}
-	}
-
-	return 0;
-}
-
-#else /* !CONFIG_RT_GROUP_SCHED: */
+#else /* !CONFIG_RT_GROUP_SCHED */
 
 typedef struct rt_rq *rt_rq_iter_t;
 
 #define for_each_rt_rq(rt_rq, iter, rq) \
 	for ((void) iter, rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
 
-#define for_each_sched_rt_entity(rt_se) \
-	for (; rt_se; rt_se = NULL)
-
-static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se)
-{
-	return NULL;
-}
-
-static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
-{
-	struct rq *rq = rq_of_rt_rq(rt_rq);
-
-	if (!rt_rq->rt_nr_running)
-		return;
-
-	enqueue_top_rt_rq(rt_rq);
-	resched_curr(rq);
-}
-
-static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
-{
-	dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
-}
-
-static inline int rt_rq_throttled(struct rt_rq *rt_rq)
-{
-	return false;
-}
-
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-
-static inline
-struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
-{
-	return &cpu_rq(cpu)->rt;
-}
-
-static void __enable_runtime(struct rq *rq) { }
-static void __disable_runtime(struct rq *rq) { }
-
-#endif /* !CONFIG_RT_GROUP_SCHED */
+#endif /* CONFIG_RT_GROUP_SCHED */
 
 static inline int rt_se_prio(struct sched_rt_entity *rt_se)
 {
-#ifdef CONFIG_RT_GROUP_SCHED
-	struct rt_rq *rt_rq = group_rt_rq(rt_se);
-
-	if (rt_rq)
-		return rt_rq->highest_prio.curr;
-#endif
-
 	return rt_task_of(rt_se)->prio;
 }
 
@@ -931,67 +334,8 @@ static void update_curr_rt(struct rq *rq)
 	if (unlikely(delta_exec <= 0))
 		return;
 
-#ifdef CONFIG_RT_GROUP_SCHED
-	struct sched_rt_entity *rt_se = &donor->rt;
-
 	if (!rt_bandwidth_enabled())
 		return;
-
-	for_each_sched_rt_entity(rt_se) {
-		struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
-		int exceeded;
-
-		if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
-			raw_spin_lock(&rt_rq->rt_runtime_lock);
-			rt_rq->rt_time += delta_exec;
-			exceeded = sched_rt_runtime_exceeded(rt_rq);
-			if (exceeded)
-				resched_curr(rq);
-			raw_spin_unlock(&rt_rq->rt_runtime_lock);
-			if (exceeded)
-				do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
-		}
-	}
-#endif /* CONFIG_RT_GROUP_SCHED */
-}
-
-static void
-dequeue_top_rt_rq(struct rt_rq *rt_rq, unsigned int count)
-{
-	struct rq *rq = rq_of_rt_rq(rt_rq);
-
-	BUG_ON(&rq->rt != rt_rq);
-
-	if (!rt_rq->rt_queued)
-		return;
-
-	BUG_ON(!rq->nr_running);
-
-	sub_nr_running(rq, count);
-	rt_rq->rt_queued = 0;
-
-}
-
-static void
-enqueue_top_rt_rq(struct rt_rq *rt_rq)
-{
-	struct rq *rq = rq_of_rt_rq(rt_rq);
-
-	BUG_ON(&rq->rt != rt_rq);
-
-	if (rt_rq->rt_queued)
-		return;
-
-	if (rt_rq_throttled(rt_rq))
-		return;
-
-	if (rt_rq->rt_nr_running) {
-		add_nr_running(rq, rt_rq->rt_nr_running);
-		rt_rq->rt_queued = 1;
-	}
-
-	/* Kick cpufreq (see the comment in kernel/sched/sched.h). */
-	cpufreq_update_util(rq, 0);
 }
 
 static void
@@ -1062,58 +406,17 @@ dec_rt_prio(struct rt_rq *rt_rq, int prio)
 	dec_rt_prio_smp(rt_rq, prio, prev_prio);
 }
 
-#ifdef CONFIG_RT_GROUP_SCHED
-
-static void
-inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
-{
-	if (rt_se_boosted(rt_se))
-		rt_rq->rt_nr_boosted++;
-
-	start_rt_bandwidth(&rt_rq->tg->rt_bandwidth);
-}
-
-static void
-dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
-{
-	if (rt_se_boosted(rt_se))
-		rt_rq->rt_nr_boosted--;
-
-	WARN_ON(!rt_rq->rt_nr_running && rt_rq->rt_nr_boosted);
-}
-
-#else /* !CONFIG_RT_GROUP_SCHED: */
-
-static void
-inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
-{
-}
-
-static inline
-void dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) {}
-
-#endif /* !CONFIG_RT_GROUP_SCHED */
-
 static inline
 unsigned int rt_se_nr_running(struct sched_rt_entity *rt_se)
 {
-	struct rt_rq *group_rq = group_rt_rq(rt_se);
-
-	if (group_rq)
-		return group_rq->rt_nr_running;
-	else
-		return 1;
+	return 1;
 }
 
 static inline
 unsigned int rt_se_rr_nr_running(struct sched_rt_entity *rt_se)
 {
-	struct rt_rq *group_rq = group_rt_rq(rt_se);
 	struct task_struct *tsk;
 
-	if (group_rq)
-		return group_rq->rr_nr_running;
-
 	tsk = rt_task_of(rt_se);
 
 	return (tsk->policy == SCHED_RR) ? 1 : 0;
@@ -1122,26 +425,21 @@ unsigned int rt_se_rr_nr_running(struct sched_rt_entity *rt_se)
 static inline
 void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
-	int prio = rt_se_prio(rt_se);
-
-	WARN_ON(!rt_prio(prio));
+	WARN_ON(!rt_prio(rt_se_prio(rt_se)));
 	rt_rq->rt_nr_running += rt_se_nr_running(rt_se);
 	rt_rq->rr_nr_running += rt_se_rr_nr_running(rt_se);
 
-	inc_rt_prio(rt_rq, prio);
-	inc_rt_group(rt_se, rt_rq);
+	inc_rt_prio(rt_rq, rt_se_prio(rt_se));
 }
 
 static inline
 void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
 	WARN_ON(!rt_prio(rt_se_prio(rt_se)));
-	WARN_ON(!rt_rq->rt_nr_running);
 	rt_rq->rt_nr_running -= rt_se_nr_running(rt_se);
 	rt_rq->rr_nr_running -= rt_se_rr_nr_running(rt_se);
 
 	dec_rt_prio(rt_rq, rt_se_prio(rt_se));
-	dec_rt_group(rt_se, rt_rq);
 }
 
 /*
@@ -1170,10 +468,6 @@ static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_arr
 static inline struct sched_statistics *
 __schedstats_from_rt_se(struct sched_rt_entity *rt_se)
 {
-	/* schedstats is not supported for rt group. */
-	if (!rt_entity_is_task(rt_se))
-		return NULL;
-
 	return &rt_task_of(rt_se)->stats;
 }
 
@@ -1186,9 +480,7 @@ update_stats_wait_start_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
 	if (!schedstat_enabled())
 		return;
 
-	if (rt_entity_is_task(rt_se))
-		p = rt_task_of(rt_se);
-
+	p = rt_task_of(rt_se);
 	stats = __schedstats_from_rt_se(rt_se);
 	if (!stats)
 		return;
@@ -1205,9 +497,7 @@ update_stats_enqueue_sleeper_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_
 	if (!schedstat_enabled())
 		return;
 
-	if (rt_entity_is_task(rt_se))
-		p = rt_task_of(rt_se);
-
+	p = rt_task_of(rt_se);
 	stats = __schedstats_from_rt_se(rt_se);
 	if (!stats)
 		return;
@@ -1235,9 +525,7 @@ update_stats_wait_end_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
 	if (!schedstat_enabled())
 		return;
 
-	if (rt_entity_is_task(rt_se))
-		p = rt_task_of(rt_se);
-
+	p = rt_task_of(rt_se);
 	stats = __schedstats_from_rt_se(rt_se);
 	if (!stats)
 		return;
@@ -1254,9 +542,7 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
 	if (!schedstat_enabled())
 		return;
 
-	if (rt_entity_is_task(rt_se))
-		p = rt_task_of(rt_se);
-
+	p = rt_task_of(rt_se);
 	if ((flags & DEQUEUE_SLEEP) && p) {
 		unsigned int state;
 
@@ -1275,21 +561,8 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
 {
 	struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
 	struct rt_prio_array *array = &rt_rq->active;
-	struct rt_rq *group_rq = group_rt_rq(rt_se);
 	struct list_head *queue = array->queue + rt_se_prio(rt_se);
 
-	/*
-	 * Don't enqueue the group if its throttled, or when empty.
-	 * The latter is a consequence of the former when a child group
-	 * get throttled and the current group doesn't have any other
-	 * active members.
-	 */
-	if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running)) {
-		if (rt_se->on_list)
-			__delist_rt_entity(rt_se, array);
-		return;
-	}
-
 	if (move_entity(flags)) {
 		WARN_ON_ONCE(rt_se->on_list);
 		if (flags & ENQUEUE_HEAD)
@@ -1319,57 +592,18 @@ static void __dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
 	dec_rt_tasks(rt_se, rt_rq);
 }
 
-/*
- * Because the prio of an upper entry depends on the lower
- * entries, we must remove entries top - down.
- */
-static void dequeue_rt_stack(struct sched_rt_entity *rt_se, unsigned int flags)
-{
-	struct sched_rt_entity *back = NULL;
-	unsigned int rt_nr_running;
-
-	for_each_sched_rt_entity(rt_se) {
-		rt_se->back = back;
-		back = rt_se;
-	}
-
-	rt_nr_running = rt_rq_of_se(back)->rt_nr_running;
-
-	for (rt_se = back; rt_se; rt_se = rt_se->back) {
-		if (on_rt_rq(rt_se))
-			__dequeue_rt_entity(rt_se, flags);
-	}
-
-	dequeue_top_rt_rq(rt_rq_of_se(back), rt_nr_running);
-}
-
 static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
 {
-	struct rq *rq = rq_of_rt_se(rt_se);
-
 	update_stats_enqueue_rt(rt_rq_of_se(rt_se), rt_se, flags);
 
-	dequeue_rt_stack(rt_se, flags);
-	for_each_sched_rt_entity(rt_se)
-		__enqueue_rt_entity(rt_se, flags);
-	enqueue_top_rt_rq(&rq->rt);
+	__enqueue_rt_entity(rt_se, flags);
 }
 
 static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
 {
-	struct rq *rq = rq_of_rt_se(rt_se);
-
 	update_stats_dequeue_rt(rt_rq_of_se(rt_se), rt_se, flags);
 
-	dequeue_rt_stack(rt_se, flags);
-
-	for_each_sched_rt_entity(rt_se) {
-		struct rt_rq *rt_rq = group_rt_rq(rt_se);
-
-		if (rt_rq && rt_rq->rt_nr_running)
-			__enqueue_rt_entity(rt_se, flags);
-	}
-	enqueue_top_rt_rq(&rq->rt);
+	__dequeue_rt_entity(rt_se, flags);
 }
 
 /*
@@ -1432,10 +666,8 @@ static void requeue_task_rt(struct rq *rq, struct task_struct *p, int head)
 	struct sched_rt_entity *rt_se = &p->rt;
 	struct rt_rq *rt_rq;
 
-	for_each_sched_rt_entity(rt_se) {
-		rt_rq = rt_rq_of_se(rt_se);
-		requeue_rt_entity(rt_rq, rt_se, head);
-	}
+	rt_rq = rt_rq_of_se(rt_se);
+	requeue_rt_entity(rt_rq, rt_se, head);
 }
 
 static void yield_task_rt(struct rq *rq)
@@ -1632,17 +864,7 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rt_rq *rt_rq)
 
 static struct task_struct *_pick_next_task_rt(struct rq *rq)
 {
-	struct sched_rt_entity *rt_se;
-	struct rt_rq *rt_rq  = &rq->rt;
-
-	do {
-		rt_se = pick_next_rt_entity(rt_rq);
-		if (unlikely(!rt_se))
-			return NULL;
-		rt_rq = group_rt_rq(rt_se);
-	} while (rt_rq);
-
-	return rt_task_of(rt_se);
+	return NULL;
 }
 
 static struct task_struct *pick_task_rt(struct rq *rq)
@@ -2311,8 +1533,6 @@ static void rq_online_rt(struct rq *rq)
 	if (rq->rt.overloaded)
 		rt_set_overload(rq);
 
-	__enable_runtime(rq);
-
 	cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio.curr);
 }
 
@@ -2322,8 +1542,6 @@ static void rq_offline_rt(struct rq *rq)
 	if (rq->rt.overloaded)
 		rt_clear_overload(rq);
 
-	__disable_runtime(rq);
-
 	cpupri_set(&rq->rd->cpupri, rq->cpu, CPUPRI_INVALID);
 }
 
@@ -2481,12 +1699,12 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 	 * Requeue to the end of queue if we (and all of our ancestors) are not
 	 * the only element on the queue
 	 */
-	for_each_sched_rt_entity(rt_se) {
-		if (rt_se->run_list.prev != rt_se->run_list.next) {
-			requeue_task_rt(rq, p, 0);
-			resched_curr(rq);
-			return;
-		}
+	if (rt_se->run_list.prev != rt_se->run_list.next) {
+		requeue_task_rt(rq, p, 0);
+		resched_curr(rq);
+		// set_tsk_need_resched(p);
+
+		return;
 	}
 }
 
@@ -2504,16 +1722,7 @@ static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
 #ifdef CONFIG_SCHED_CORE
 static int task_is_throttled_rt(struct task_struct *p, int cpu)
 {
-	struct rt_rq *rt_rq;
-
-#ifdef CONFIG_RT_GROUP_SCHED // XXX maybe add task_rt_rq(), see also sched_rt_period_rt_rq
-	rt_rq = task_group(p)->rt_rq[cpu];
-	WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
-#else
-	rt_rq = &cpu_rq(cpu)->rt;
-#endif
-
-	return rt_rq_throttled(rt_rq);
+	return 0;
 }
 #endif /* CONFIG_SCHED_CORE */
 
@@ -2761,13 +1970,7 @@ long sched_group_rt_period(struct task_group *tg)
 #ifdef CONFIG_SYSCTL
 static int sched_rt_global_constraints(void)
 {
-	int ret = 0;
-
-	mutex_lock(&rt_constraints_mutex);
-	ret = __rt_schedulable(NULL, 0, 0);
-	mutex_unlock(&rt_constraints_mutex);
-
-	return ret;
+	return 0;
 }
 #endif /* CONFIG_SYSCTL */
 
@@ -2802,10 +2005,6 @@ static int sched_rt_global_validate(void)
 	return 0;
 }
 
-static void sched_rt_do_global(void)
-{
-}
-
 static int sched_rt_handler(const struct ctl_table *table, int write, void *buffer,
 		size_t *lenp, loff_t *ppos)
 {
@@ -2833,7 +2032,6 @@ static int sched_rt_handler(const struct ctl_table *table, int write, void *buff
 		if (ret)
 			goto undo;
 
-		sched_rt_do_global();
 		sched_dl_do_global();
 	}
 	if (0) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a1e6d2852ca..2f9035cb9e5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -789,7 +789,7 @@ struct scx_rq {
 
 static inline int rt_bandwidth_enabled(void)
 {
-	return sysctl_sched_rt_runtime >= 0;
+	return 0;
 }
 
 /* RT IPI pull logic requires IRQ_WORK */
@@ -829,7 +829,7 @@ struct rt_rq {
 
 static inline bool rt_rq_is_runnable(struct rt_rq *rt_rq)
 {
-	return rt_rq->rt_queued && rt_rq->rt_nr_running;
+	return rt_rq->rt_nr_running;
 }
 
 /* Deadline class' related fields in a runqueue */
@@ -2545,7 +2545,7 @@ static inline bool sched_dl_runnable(struct rq *rq)
 
 static inline bool sched_rt_runnable(struct rq *rq)
 {
-	return rq->rt.rt_queued > 0;
+	return rq->rt.rt_nr_running > 0;
 }
 
 static inline bool sched_fair_runnable(struct rq *rq)
@@ -2656,9 +2656,6 @@ extern void resched_curr(struct rq *rq);
 extern void resched_curr_lazy(struct rq *rq);
 extern void resched_cpu(int cpu);
 
-extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
-extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
-
 extern void init_dl_entity(struct sched_dl_entity *dl_se);
 
 #define BW_SHIFT		20
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 77ae87f36e8..93a9c03b28e 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -626,19 +626,6 @@ int __sched_setscheduler(struct task_struct *p,
 change:
 
 	if (user) {
-#ifdef CONFIG_RT_GROUP_SCHED
-		/*
-		 * Do not allow real-time tasks into groups that have no runtime
-		 * assigned.
-		 */
-		if (rt_group_sched_enabled() &&
-				rt_bandwidth_enabled() && rt_policy(policy) &&
-				task_group(p)->rt_bandwidth.rt_runtime == 0 &&
-				!task_group_is_autogroup(task_group(p))) {
-			retval = -EPERM;
-			goto unlock;
-		}
-#endif /* CONFIG_RT_GROUP_SCHED */
 		if (dl_bandwidth_enabled() && dl_policy(policy) &&
 				!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
 			cpumask_t *span = rq->rd->span;
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 07/25] sched/rt: Introduce HCBS specific structs in task_group
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (5 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 06/25] sched/rt: Disable RT_GROUP_SCHED Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-08-14 12:51   ` Juri Lelli
  2025-07-31 10:55 ` [RFC PATCH v2 08/25] sched/deadline: Account rt-cgroups bandwidth in deadline tasks schedulability tests Yuri Andriaccio
                   ` (18 subsequent siblings)
  25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Each task_group now manages a number of new objects:
- a sched_dl_entity/dl_server for each CPU
- a dl_bandwidth object that keeps track of the group's allocated bandwidth
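
For illustration, this is how the new per-CPU servers and the group's
reservation are meant to be reachable (a minimal sketch; tg_dl_runtime_us()
is a made-up helper, not part of this patch):

/* Hypothetical helper: report a group's per-CPU reservation, in us. */
static u64 tg_dl_runtime_us(struct task_group *tg, int cpu)
{
	struct sched_dl_entity *dl_se = tg->dl_se[cpu];	/* per-CPU dl_server */

	return div_u64(dl_se->dl_runtime, NSEC_PER_USEC);	/* ns -> us */
}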

Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 kernel/sched/sched.h | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2f9035cb9e5..2a7601d400c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -319,6 +319,13 @@ struct rt_bandwidth {
 	unsigned int		rt_period_active;
 };
 
+struct dl_bandwidth {
+	raw_spinlock_t          dl_runtime_lock;
+	u64                     dl_runtime;
+	u64                     dl_period;
+};
+
+
 static inline int dl_bandwidth_enabled(void)
 {
 	return sysctl_sched_rt_runtime >= 0;
@@ -467,9 +474,15 @@ struct task_group {
 
 #ifdef CONFIG_RT_GROUP_SCHED
 	struct sched_rt_entity	**rt_se;
+	/*
+	 * Each CPU of the task group is served by its own sched_dl_entity
+	 * (dl_server); all of them share the same dl_bandwidth.
+	 */
+	struct sched_dl_entity	**dl_se;
 	struct rt_rq		**rt_rq;
 
 	struct rt_bandwidth	rt_bandwidth;
+	struct dl_bandwidth	dl_bandwidth;
 #endif
 
 #ifdef CONFIG_EXT_GROUP_SCHED
@@ -819,12 +832,12 @@ struct rt_rq {
 	raw_spinlock_t		rt_runtime_lock;
 
 	unsigned int		rt_nr_boosted;
-
-	struct rq		*rq; /* this is always top-level rq, cache? */
 #endif
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group	*tg; /* this tg has "this" rt_rq on given CPU for runnable entities */
 #endif
+
+	struct rq		*rq; /* cgroup's runqueue if the rt_rq entity belongs to a cgroup, otherwise top-level rq */
 };
 
 static inline bool rt_rq_is_runnable(struct rt_rq *rt_rq)
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 08/25] sched/deadline: Account rt-cgroups bandwidth in deadline tasks schedulability tests.
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (6 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 07/25] sched/rt: Introduce HCBS specific structs in task_group Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-08-14 13:03   ` Juri Lelli
  2025-07-31 10:55 ` [RFC PATCH v2 09/25] sched/deadline: Account rt-cgroups bandwidth in sched_dl_global_validate Yuri Andriaccio
                   ` (17 subsequent siblings)
  25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Account for the rt-cgroup hierarchy's reserved bandwidth in the schedulability
test of deadline entities. This makes it possible to fully reserve a portion of
the rt-bandwidth for rt-cgroups even if they do not use all of it.
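
To make the effect concrete, here is a stand-alone sketch (not kernel code)
that mirrors to_ratio() and cap_scale(), with BW_SHIFT = 20 and
SCHED_CAPACITY_SHIFT = 10 as in sched.h; all numbers are made up. Without the
25% group reservation the example task (35%) would be admitted on top of the
already-allocated 40%; with it, the 95% global limit is exceeded:

#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT		20
#define SCHED_CAPACITY_SHIFT	10

static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	return (runtime << BW_SHIFT) / period;
}

static uint64_t cap_scale(uint64_t bw, uint64_t cap)
{
	return (bw * cap) >> SCHED_CAPACITY_SHIFT;
}

int main(void)
{
	uint64_t cap = 1024;				/* full-capacity CPU   */
	uint64_t dl_bw = to_ratio(100, 95);		/* 95% global rt limit */
	uint64_t total_bw = to_ratio(100, 40);		/* admitted DL tasks   */
	uint64_t new_bw = to_ratio(100, 35);		/* task being admitted */
	uint64_t groups_root = to_ratio(100, 25);	/* root rt-cgroup bw   */

	/* old_bw is 0 here: the task had no previous reservation. */
	int overflow = cap_scale(dl_bw, cap) <
		       total_bw + new_bw + cap_scale(groups_root, cap);

	printf("admission %s\n", overflow ? "rejected" : "granted");
	return 0;
}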

Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 kernel/sched/deadline.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0640d0ca45b..43af48038b9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -231,8 +231,15 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
 static inline bool
 __dl_overflow(struct dl_bw *dl_b, unsigned long cap, u64 old_bw, u64 new_bw)
 {
+	u64 dl_groups_root = 0;
+
+#ifdef CONFIG_RT_GROUP_SCHED
+	dl_groups_root = to_ratio(root_task_group.dl_bandwidth.dl_period,
+				  root_task_group.dl_bandwidth.dl_runtime);
+#endif
 	return dl_b->bw != -1 &&
-	       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
+	       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw
+					+ cap_scale(dl_groups_root, cap);
 }
 
 static inline
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 09/25] sched/deadline: Account rt-cgroups bandwidth in sched_dl_global_validate
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (7 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 08/25] sched/deadline: Account rt-cgroups bandwidth in deadline tasks schedulability tests Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 10/25] sched/core: Initialize root_task_group Yuri Andriaccio
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

sched_dl_global_validate() must, similarly to the previous patch, take the
rt-cgroups' reserved bandwidth into account.
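
In other words, lowering the global rt runtime is now refused whenever, on
some CPU,

	cap_scale(new_bw, cap) < dl_b->total_bw + cap_scale(dl_groups_root, cap)

i.e. whenever the new global limit would no longer cover both the
already-admitted deadline bandwidth and the bandwidth reserved to the
rt-cgroup hierarchy (dl_groups_root, computed from the root group's
runtime/period as in the previous patch).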

Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 kernel/sched/deadline.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 43af48038b9..55b7f883815 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3098,11 +3098,17 @@ int sched_dl_global_validate(void)
 	u64 period = global_rt_period();
 	u64 new_bw = to_ratio(period, runtime);
 	u64 cookie = ++dl_cookie;
+	u64 dl_groups_root = 0;
 	u64 fair_bw;
 	struct dl_bw *dl_b;
 	int cpu, ret = 0;
 	unsigned long cap, flags;
 
+#ifdef CONFIG_RT_GROUP_SCHED
+	dl_groups_root = to_ratio(root_task_group.dl_bandwidth.dl_period,
+				  root_task_group.dl_bandwidth.dl_runtime);
+#endif
+
 	/*
 	 * Here we want to check the bandwidth not being set to some
 	 * value smaller than the currently allocated bandwidth in
@@ -3119,7 +3125,7 @@ int sched_dl_global_validate(void)
 		fair_bw = dl_bw_fair(cpu);
 
 		raw_spin_lock_irqsave(&dl_b->lock, flags);
-		if (cap_scale(new_bw, cap) < dl_b->total_bw)
+		if (cap_scale(new_bw, cap) < dl_b->total_bw + cap_scale(dl_groups_root, cap))
 			ret = -EBUSY;
 		if (cap_scale(new_bw, cap) + fair_bw > cap_scale(BW_UNIT, cap))
 			ret = -EBUSY;
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 10/25] sched/core: Initialize root_task_group
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (8 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 09/25] sched/deadline: Account rt-cgroups bandwidth in sched_dl_global_validate Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 11/25] sched/deadline: Add dl_init_tg Yuri Andriaccio
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Add the initialization function for the task_group's dl_servers.
Initialize the default bandwidth for rt-cgroups and the root control group.
Add utility functions to check whether an rt_rq belongs to a real-time cgroup
and, if so, to get the corresponding dl_server.
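
A minimal sketch of how the two helpers are meant to be used (the enclosing
function is hypothetical; the real call is wired into update_curr_rt() by a
later patch):

/* Hypothetical caller: charge executed time to the owning group's server. */
static void charge_group_runtime(struct rt_rq *rt_rq, s64 delta_exec)
{
	if (!is_dl_group(rt_rq))	/* rt_rq belongs to the root task group */
		return;

	dl_server_update(dl_group_of(rt_rq), delta_exec);
}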

Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 kernel/sched/autogroup.c |  4 ++--
 kernel/sched/core.c      |  9 +++++++--
 kernel/sched/deadline.c  |  8 ++++++++
 kernel/sched/rt.c        | 18 ++++++++----------
 kernel/sched/sched.h     | 32 +++++++++++++++++++++++++++++---
 5 files changed, 54 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index cdea931aae3..017eadc0a0a 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -52,7 +52,7 @@ static inline void autogroup_destroy(struct kref *kref)
 
 #ifdef CONFIG_RT_GROUP_SCHED
 	/* We've redirected RT tasks to the root task group... */
-	ag->tg->rt_se = NULL;
+	ag->tg->dl_se = NULL;
 	ag->tg->rt_rq = NULL;
 #endif
 	sched_release_group(ag->tg);
@@ -109,7 +109,7 @@ static inline struct autogroup *autogroup_create(void)
 	 * the policy change to proceed.
 	 */
 	free_rt_sched_group(tg);
-	tg->rt_se = root_task_group.rt_se;
+	tg->dl_se = root_task_group.dl_se;
 	tg->rt_rq = root_task_group.rt_rq;
 #endif /* CONFIG_RT_GROUP_SCHED */
 	tg->autogroup = ag;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42587a3c71f..3a69cb906c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8703,7 +8703,7 @@ void __init sched_init(void)
 		scx_tg_init(&root_task_group);
 #endif /* CONFIG_EXT_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
-		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
+		root_task_group.dl_se = (struct sched_dl_entity **)ptr;
 		ptr += nr_cpu_ids * sizeof(void **);
 
 		root_task_group.rt_rq = (struct rt_rq **)ptr;
@@ -8714,6 +8714,11 @@ void __init sched_init(void)
 
 	init_defrootdomain();
 
+#ifdef CONFIG_RT_GROUP_SCHED
+	init_dl_bandwidth(&root_task_group.dl_bandwidth,
+			global_rt_period(), global_rt_runtime());
+#endif /* CONFIG_RT_GROUP_SCHED */
+
 #ifdef CONFIG_CGROUP_SCHED
 	task_group_cache = KMEM_CACHE(task_group, 0);
 
@@ -8765,7 +8770,7 @@ void __init sched_init(void)
 		 * starts working after scheduler_running, which is not the case
 		 * yet.
 		 */
-		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
+		init_tg_rt_entry(&root_task_group, rq, NULL, i, NULL);
 #endif
 		rq->sd = NULL;
 		rq->rd = NULL;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 55b7f883815..b8228f553fe 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -538,6 +538,14 @@ static inline int is_leftmost(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq
 
 static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq);
 
+void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
+{
+	raw_spin_lock_init(&dl_b->dl_runtime_lock);
+	dl_b->dl_period = period;
+	dl_b->dl_runtime = runtime;
+}
+
+
 void init_dl_bw(struct dl_bw *dl_b)
 {
 	raw_spin_lock_init(&dl_b->lock);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a6282784978..38178003184 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -97,18 +97,16 @@ void free_rt_sched_group(struct task_group *tg)
 		return;
 }
 
-void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
-		struct sched_rt_entity *rt_se, int cpu,
-		struct sched_rt_entity *parent)
+void init_tg_rt_entry(struct task_group *tg, struct rq *served_rq,
+		struct sched_dl_entity *dl_se, int cpu,
+		struct sched_dl_entity *parent)
 {
-	struct rq *rq = cpu_rq(cpu);
+	served_rq->rt.highest_prio.curr = MAX_RT_PRIO-1;
+	served_rq->rt.rq = cpu_rq(cpu);
+	served_rq->rt.tg = tg;
 
-	rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
-	rt_rq->rq = rq;
-	rt_rq->tg = tg;
-
-	tg->rt_rq[cpu] = rt_rq;
-	tg->rt_se[cpu] = rt_se;
+	tg->rt_rq[cpu] = &served_rq->rt;
+	tg->dl_se[cpu] = dl_se;
 }
 
 int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2a7601d400c..3283d824859 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -577,9 +577,9 @@ extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 extern bool cfs_task_bw_constrained(struct task_struct *p);
 
-extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
-		struct sched_rt_entity *rt_se, int cpu,
-		struct sched_rt_entity *parent);
+extern void init_tg_rt_entry(struct task_group *tg, struct rq *s_rq,
+		struct sched_dl_entity *rt_se, int cpu,
+		struct sched_dl_entity *parent);
 extern int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us);
 extern int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us);
 extern long sched_group_rt_runtime(struct task_group *tg);
@@ -2669,6 +2669,7 @@ extern void resched_curr(struct rq *rq);
 extern void resched_curr_lazy(struct rq *rq);
 extern void resched_cpu(int cpu);
 
+void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
 extern void init_dl_entity(struct sched_dl_entity *dl_se);
 
 #define BW_SHIFT		20
@@ -3077,6 +3078,21 @@ static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
 
 	return rt_rq->rq;
 }
+
+static inline int is_dl_group(struct rt_rq *rt_rq)
+{
+	return rt_rq->tg != &root_task_group;
+}
+
+/*
+ * Return the scheduling entity of this group of tasks.
+ */
+static inline struct sched_dl_entity *dl_group_of(struct rt_rq *rt_rq)
+{
+	BUG_ON(!is_dl_group(rt_rq));
+
+	return rt_rq->tg->dl_se[cpu_of(rt_rq->rq)];
+}
 #else
 static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
 {
@@ -3101,6 +3117,16 @@ static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
 
 	return &rq->rt;
 }
+
+static inline int is_dl_group(struct rt_rq *rt_rq)
+{
+	return 0;
+}
+
+static inline struct sched_dl_entity *dl_group_of(struct rt_rq *rt_rq)
+{
+	return NULL;
+}
 #endif
 
 DEFINE_LOCK_GUARD_2(double_rq_lock, struct rq,
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 11/25] sched/deadline: Add dl_init_tg
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (9 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 10/25] sched/core: Initialize root_task_group Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-08-14 13:10   ` Juri Lelli
  2025-07-31 10:55 ` [RFC PATCH v2 12/25] sched/rt: Add {alloc/free}_rt_sched_group and dl_server specific functions Yuri Andriaccio
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Add dl_init_tg(), which initializes and/or updates an rt-cgroup's dl_server,
also accounting for the allocated bandwidth.
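
A hedged sketch of a possible caller, applying a new runtime/period pair to
every per-CPU server of a group (the helper name is made up; in this series
the natural callers would be sched_group_set_rt_runtime() and
sched_group_set_rt_period()):

/* Hypothetical: propagate new parameters to each per-CPU dl_server. */
static void tg_apply_rt_params(struct task_group *tg, u64 runtime, u64 period)
{
	int cpu;

	for_each_possible_cpu(cpu)
		dl_init_tg(tg->dl_se[cpu], runtime, period);
}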

Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++++
 kernel/sched/sched.h    |  1 +
 2 files changed, 34 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b8228f553fe..264838c4a85 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -365,6 +365,39 @@ void cancel_inactive_timer(struct sched_dl_entity *dl_se)
 	cancel_dl_timer(dl_se, &dl_se->inactive_timer);
 }
 
+#ifdef CONFIG_RT_GROUP_SCHED
+void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
+{
+	struct rq *rq = container_of(dl_se->dl_rq, struct rq, dl);
+	int is_active;
+	u64 new_bw;
+
+	raw_spin_rq_lock_irq(rq);
+	is_active = dl_se->my_q->rt.rt_nr_running > 0;
+
+	update_rq_clock(rq);
+	dl_server_stop(dl_se);
+
+	new_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+	dl_rq_change_utilization(rq, dl_se, new_bw);
+
+	dl_se->dl_runtime  = rt_runtime;
+	dl_se->dl_deadline = rt_period;
+	dl_se->dl_period   = rt_period;
+
+	dl_se->runtime = 0;
+	dl_se->deadline = 0;
+
+	dl_se->dl_bw = new_bw;
+	dl_se->dl_density = new_bw;
+
+	if (is_active)
+		dl_server_start(dl_se);
+
+	raw_spin_rq_unlock_irq(rq);
+}
+#endif
+
 static void dl_change_utilization(struct task_struct *p, u64 new_bw)
 {
 	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3283d824859..611e3757fea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -394,6 +394,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task);
 extern void sched_init_dl_servers(void);
+extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
 
 extern void dl_server_update_idle_time(struct rq *rq,
 		    struct task_struct *p);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 12/25] sched/rt: Add {alloc/free}_rt_sched_group and dl_server specific functions
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (10 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 11/25] sched/deadline: Add dl_init_tg Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-08-14 13:42   ` Juri Lelli
  2025-07-31 10:55 ` [RFC PATCH v2 13/25] sched/rt: Add HCBS related checks and operations for rt tasks Yuri Andriaccio
                   ` (13 subsequent siblings)
  25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Add allocation and deallocation code for rt-cgroups. Add the dl_server-specific
callbacks that report whether a group has runnable tasks and pick the next
eligible rt task to run.
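
alloc_rt_sched_group() keeps the usual convention of returning 1 on success
and 0 on failure, so the (pre-existing) caller side stays as simple as this
simplified sketch:

	/* In sched_create_group(), simplified: */
	if (!alloc_rt_sched_group(tg, parent))
		goto err;	/* per-CPU rt_rq/dl_server setup failed, -ENOMEM */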

Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 kernel/sched/rt.c | 107 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 104 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 38178003184..9c4ac6875a2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -93,8 +93,39 @@ void unregister_rt_sched_group(struct task_group *tg)
 
 void free_rt_sched_group(struct task_group *tg)
 {
+	int i;
+
 	if (!rt_group_sched_enabled())
 		return;
+
+	for_each_possible_cpu(i) {
+		if (tg->dl_se) {
+			unsigned long flags;
+
+			/*
+			 * Since the dl timer is going to be cancelled,
+			 * we risk to never decrease the running bw...
+			 * Fix this issue by changing the group runtime
+			 * to 0 immediately before freeing it.
+			 */
+			dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
+			raw_spin_rq_lock_irqsave(cpu_rq(i), flags);
+			BUG_ON(tg->rt_rq[i]->rt_nr_running);
+			raw_spin_rq_unlock_irqrestore(cpu_rq(i), flags);
+
+			hrtimer_cancel(&tg->dl_se[i]->dl_timer);
+			kfree(tg->dl_se[i]);
+		}
+		if (tg->rt_rq) {
+			struct rq *served_rq;
+
+			served_rq = container_of(tg->rt_rq[i], struct rq, rt);
+			kfree(served_rq);
+		}
+	}
+
+	kfree(tg->rt_rq);
+	kfree(tg->dl_se);
 }
 
 void init_tg_rt_entry(struct task_group *tg, struct rq *served_rq,
@@ -109,12 +140,77 @@ void init_tg_rt_entry(struct task_group *tg, struct rq *served_rq,
 	tg->dl_se[cpu] = dl_se;
 }
 
+static bool rt_server_has_tasks(struct sched_dl_entity *dl_se)
+{
+	return !!dl_se->my_q->rt.rt_nr_running;
+}
+
+static struct task_struct *_pick_next_task_rt(struct rt_rq *rt_rq);
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first);
+static struct task_struct *rt_server_pick(struct sched_dl_entity *dl_se)
+{
+	struct rt_rq *rt_rq = &dl_se->my_q->rt;
+	struct rq *rq = rq_of_rt_rq(rt_rq);
+	struct task_struct *p;
+
+	if (dl_se->my_q->rt.rt_nr_running == 0)
+		return NULL;
+
+	p = _pick_next_task_rt(rt_rq);
+	set_next_task_rt(rq, p, true);
+
+	return p;
+}
+
 int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
 {
+	struct rq *s_rq;
+	struct sched_dl_entity *dl_se;
+	int i;
+
 	if (!rt_group_sched_enabled())
 		return 1;
 
+	tg->rt_rq = kcalloc(nr_cpu_ids, sizeof(struct rt_rq *), GFP_KERNEL);
+	if (!tg->rt_rq)
+		goto err;
+	tg->dl_se = kcalloc(nr_cpu_ids, sizeof(dl_se), GFP_KERNEL);
+	if (!tg->dl_se)
+		goto err;
+
+	init_dl_bandwidth(&tg->dl_bandwidth, 0, 0);
+
+	for_each_possible_cpu(i) {
+		s_rq = kzalloc_node(sizeof(struct rq),
+				     GFP_KERNEL, cpu_to_node(i));
+		if (!s_rq)
+			goto err;
+
+		dl_se = kzalloc_node(sizeof(struct sched_dl_entity),
+				     GFP_KERNEL, cpu_to_node(i));
+		if (!dl_se)
+			goto err_free_rq;
+
+		init_rt_rq(&s_rq->rt);
+		init_dl_entity(dl_se);
+		dl_se->dl_runtime = tg->dl_bandwidth.dl_runtime;
+		dl_se->dl_period = tg->dl_bandwidth.dl_period;
+		dl_se->dl_deadline = dl_se->dl_period;
+		dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+		dl_se->dl_density = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+		dl_se->dl_server = 1;
+
+		dl_server_init(dl_se, &cpu_rq(i)->dl, s_rq, rt_server_has_tasks, rt_server_pick);
+
+		init_tg_rt_entry(tg, s_rq, dl_se, i, parent->dl_se[i]);
+	}
+
 	return 1;
+
+err_free_rq:
+	kfree(s_rq);
+err:
+	return 0;
 }
 
 #else /* !CONFIG_RT_GROUP_SCHED: */
@@ -860,9 +956,14 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rt_rq *rt_rq)
 	return next;
 }
 
-static struct task_struct *_pick_next_task_rt(struct rq *rq)
+static struct task_struct *_pick_next_task_rt(struct rt_rq *rt_rq)
 {
-	return NULL;
+	struct sched_rt_entity *rt_se;
+
+	rt_se = pick_next_rt_entity(rt_rq);
+	BUG_ON(!rt_se);
+
+	return rt_task_of(rt_se);
 }
 
 static struct task_struct *pick_task_rt(struct rq *rq)
@@ -872,7 +973,7 @@ static struct task_struct *pick_task_rt(struct rq *rq)
 	if (!sched_rt_runnable(rq))
 		return NULL;
 
-	p = _pick_next_task_rt(rq);
+	p = _pick_next_task_rt(&rq->rt);
 
 	return p;
 }
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 13/25] sched/rt: Add HCBS related checks and operations for rt tasks
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (11 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 12/25] sched/rt: Add {alloc/free}_rt_sched_group and dl_server specific functions Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-08-14 14:01   ` Juri Lelli
  2025-07-31 10:55 ` [RFC PATCH v2 14/25] sched/rt: Update rt-cgroup schedulability checks Yuri Andriaccio
                   ` (12 subsequent siblings)
  25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Add checks for whether a task belongs to the root cgroup or to an rt-cgroup,
since HCBS reuses the rt class's scheduler, and operate accordingly where
needed.
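
One consequence of the nr_running accounting changes below is easiest to see
as a sketch (the helper is hypothetical, condensing the inc/dec paths): tasks
queued in a throttled group remain on their rt_rq but are not counted at the
rq level.

/* Hypothetical: how many of rt_rq's tasks are visible in rq->nr_running. */
static unsigned int visible_nr_running(struct rt_rq *rt_rq)
{
	if (is_dl_group(rt_rq) && dl_group_of(rt_rq)->dl_throttled)
		return 0;	/* throttled group: queued but not runnable */

	return rt_rq->rt_nr_running;
}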

Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 kernel/sched/core.c     |   3 +
 kernel/sched/deadline.c |  16 ++++-
 kernel/sched/rt.c       | 147 +++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h    |   6 +-
 kernel/sched/syscalls.c |  13 ++++
 5 files changed, 171 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3a69cb906c3..6173684a02b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2196,6 +2196,9 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct task_struct *donor = rq->donor;
 
+	if (is_dl_group(rt_rq_of_se(&p->rt)) && task_has_rt_policy(p))
+		resched_curr(rq);
+
 	if (p->sched_class == donor->sched_class)
 		donor->sched_class->wakeup_preempt(rq, p, flags);
 	else if (sched_class_above(p->sched_class, donor->sched_class))
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 264838c4a85..b948000f29f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1866,7 +1866,13 @@ void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 	u64 deadline = dl_se->deadline;
 
 	dl_rq->dl_nr_running++;
-	add_nr_running(rq_of_dl_rq(dl_rq), 1);
+	if (!dl_server(dl_se) || dl_se == &rq_of_dl_rq(dl_rq)->fair_server) {
+		add_nr_running(rq_of_dl_rq(dl_rq), 1);
+	} else {
+		struct rt_rq *rt_rq = &dl_se->my_q->rt;
+
+		add_nr_running(rq_of_dl_rq(dl_rq), rt_rq->rt_nr_running);
+	}
 
 	inc_dl_deadline(dl_rq, deadline);
 }
@@ -1876,7 +1882,13 @@ void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
 	WARN_ON(!dl_rq->dl_nr_running);
 	dl_rq->dl_nr_running--;
-	sub_nr_running(rq_of_dl_rq(dl_rq), 1);
+	if (!dl_server(dl_se) || dl_se == &rq_of_dl_rq(dl_rq)->fair_server) {
+		sub_nr_running(rq_of_dl_rq(dl_rq), 1);
+	} else {
+		struct rt_rq *rt_rq = &dl_se->my_q->rt;
+
+		sub_nr_running(rq_of_dl_rq(dl_rq), rt_rq->rt_nr_running);
+	}
 
 	dec_dl_deadline(dl_rq, dl_se->deadline);
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9c4ac6875a2..83695e11db4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -419,6 +419,7 @@ static inline int rt_se_prio(struct sched_rt_entity *rt_se)
 static void update_curr_rt(struct rq *rq)
 {
 	struct task_struct *donor = rq->donor;
+	struct rt_rq *rt_rq;
 	s64 delta_exec;
 
 	if (donor->sched_class != &rt_sched_class)
@@ -428,8 +429,18 @@ static void update_curr_rt(struct rq *rq)
 	if (unlikely(delta_exec <= 0))
 		return;
 
-	if (!rt_bandwidth_enabled())
+	if (!rt_group_sched_enabled())
 		return;
+
+	if (!dl_bandwidth_enabled())
+		return;
+
+	rt_rq = rt_rq_of_se(&donor->rt);
+	if (is_dl_group(rt_rq)) {
+		struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+		dl_server_update(dl_se, delta_exec);
+	}
 }
 
 static void
@@ -440,7 +451,7 @@ inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
 	/*
 	 * Change rq's cpupri only if rt_rq is the top queue.
 	 */
-	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && &rq->rt != rt_rq)
+	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq))
 		return;
 
 	if (rq->online && prio < prev_prio)
@@ -455,7 +466,7 @@ dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
 	/*
 	 * Change rq's cpupri only if rt_rq is the top queue.
 	 */
-	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && &rq->rt != rt_rq)
+	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq))
 		return;
 
 	if (rq->online && rt_rq->highest_prio.curr != prev_prio)
@@ -524,6 +535,15 @@ void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 	rt_rq->rr_nr_running += rt_se_rr_nr_running(rt_se);
 
 	inc_rt_prio(rt_rq, rt_se_prio(rt_se));
+
+	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq)) {
+		struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+		if (!dl_se->dl_throttled)
+			add_nr_running(rq_of_rt_rq(rt_rq), 1);
+	} else {
+		add_nr_running(rq_of_rt_rq(rt_rq), 1);
+	}
 }
 
 static inline
@@ -534,6 +554,15 @@ void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 	rt_rq->rr_nr_running -= rt_se_rr_nr_running(rt_se);
 
 	dec_rt_prio(rt_rq, rt_se_prio(rt_se));
+
+	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq)) {
+		struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+		if (!dl_se->dl_throttled)
+			sub_nr_running(rq_of_rt_rq(rt_rq), 1);
+	} else {
+		sub_nr_running(rq_of_rt_rq(rt_rq), 1);
+	}
 }
 
 /*
@@ -715,6 +744,14 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	check_schedstat_required();
 	update_stats_wait_start_rt(rt_rq_of_se(rt_se), rt_se);
 
+	/* Task arriving in an idle group of tasks. */
+	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) &&
+	    is_dl_group(rt_rq) && rt_rq->rt_nr_running == 0) {
+		struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+		dl_server_start(dl_se);
+	}
+
 	enqueue_rt_entity(rt_se, flags);
 
 	if (task_is_blocked(p))
@@ -734,6 +771,14 @@ static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 
 	dequeue_pushable_task(rt_rq, p);
 
+	/* Last task of the task group. */
+	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) &&
+	    is_dl_group(rt_rq) && rt_rq->rt_nr_running == 0) {
+		struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+		dl_server_stop(dl_se);
+	}
+
 	return true;
 }
 
@@ -891,6 +936,34 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct task_struct *donor = rq->donor;
 
+	if (!rt_group_sched_enabled())
+		goto no_group_sched;
+
+	if (is_dl_group(rt_rq_of_se(&p->rt)) &&
+	    is_dl_group(rt_rq_of_se(&rq->curr->rt))) {
+		struct sched_dl_entity *dl_se, *curr_dl_se;
+
+		dl_se = dl_group_of(rt_rq_of_se(&p->rt));
+		curr_dl_se = dl_group_of(rt_rq_of_se(&rq->curr->rt));
+
+		if (dl_entity_preempt(dl_se, curr_dl_se)) {
+			resched_curr(rq);
+			return;
+		} else if (!dl_entity_preempt(curr_dl_se, dl_se)) {
+			if (p->prio < rq->curr->prio) {
+				resched_curr(rq);
+				return;
+			}
+		}
+		return;
+	} else if (is_dl_group(rt_rq_of_se(&p->rt))) {
+		resched_curr(rq);
+		return;
+	} else if (is_dl_group(rt_rq_of_se(&rq->curr->rt))) {
+		return;
+	}
+
+no_group_sched:
 	if (p->prio < donor->prio) {
 		resched_curr(rq);
 		return;
@@ -1609,12 +1682,36 @@ static void pull_rt_task(struct rq *this_rq)
 		resched_curr(this_rq);
 }
 
+#ifdef CONFIG_RT_GROUP_SCHED
+static int group_push_rt_task(struct rt_rq *rt_rq)
+{
+	struct rq *rq = rq_of_rt_rq(rt_rq);
+
+	if (is_dl_group(rt_rq))
+		return 0;
+
+	return push_rt_task(rq, false);
+}
+
+static void group_push_rt_tasks(struct rt_rq *rt_rq)
+{
+	while (group_push_rt_task(rt_rq))
+		;
+}
+#else
+static void group_push_rt_tasks(struct rt_rq *rt_rq)
+{
+	push_rt_tasks(rq_of_rt_rq(rt_rq));
+}
+#endif
+
 /*
  * If we are not running and we are not going to reschedule soon, we should
  * try to push tasks away now
  */
 static void task_woken_rt(struct rq *rq, struct task_struct *p)
 {
+	struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
 	bool need_to_push = !task_on_cpu(rq, p) &&
 			    !test_tsk_need_resched(rq->curr) &&
 			    p->nr_cpus_allowed > 1 &&
@@ -1623,7 +1720,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 			     rq->donor->prio <= p->prio);
 
 	if (need_to_push)
-		push_rt_tasks(rq);
+		group_push_rt_tasks(rt_rq);
 }
 
 /* Assumes rq->lock is held */
@@ -1632,6 +1729,7 @@ static void rq_online_rt(struct rq *rq)
 	if (rq->rt.overloaded)
 		rt_set_overload(rq);
 
+	/*FIXME: Enable the dl server! */
 	cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio.curr);
 }
 
@@ -1641,6 +1739,7 @@ static void rq_offline_rt(struct rq *rq)
 	if (rq->rt.overloaded)
 		rt_clear_overload(rq);
 
+	/* FIXME: Disable the dl server! */
 	cpupri_set(&rq->rd->cpupri, rq->cpu, CPUPRI_INVALID);
 }
 
@@ -1650,6 +1749,8 @@ static void rq_offline_rt(struct rq *rq)
  */
 static void switched_from_rt(struct rq *rq, struct task_struct *p)
 {
+	struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
+
 	/*
 	 * If there are other RT tasks then we will reschedule
 	 * and the scheduling of the other RT tasks will handle
@@ -1657,10 +1758,11 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
 	 * we may need to handle the pulling of RT tasks
 	 * now.
 	 */
-	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+	if (!task_on_rq_queued(p) || rt_rq->rt_nr_running)
 		return;
 
-	rt_queue_pull_task(rq);
+	if (!IS_ENABLED(CONFIG_RT_GROUP_SCHED))
+		rt_queue_pull_task(rq);
 }
 
 void __init init_sched_rt_class(void)
@@ -1695,8 +1797,17 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
 	 * then see if we can move to another run queue.
 	 */
 	if (task_on_rq_queued(p)) {
+
+#ifndef CONFIG_RT_GROUP_SCHED
 		if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
 			rt_queue_push_tasks(rq);
+#else
+		if (rt_rq_of_se(&p->rt)->overloaded) {
+		} else {
+			if (p->prio < rq->curr->prio)
+				resched_curr(rq);
+		}
+#endif
 		if (p->prio < rq->donor->prio && cpu_online(cpu_of(rq)))
 			resched_curr(rq);
 	}
@@ -1709,6 +1820,8 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
 static void
 prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 {
+	struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
+
 	if (!task_on_rq_queued(p))
 		return;
 
@@ -1717,16 +1830,25 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 		 * If our priority decreases while running, we
 		 * may need to pull tasks to this runqueue.
 		 */
-		if (oldprio < p->prio)
+		if (!IS_ENABLED(CONFIG_RT_GROUP_SCHED) && oldprio < p->prio)
 			rt_queue_pull_task(rq);
 
 		/*
 		 * If there's a higher priority task waiting to run
 		 * then reschedule.
 		 */
-		if (p->prio > rq->rt.highest_prio.curr)
+		if (p->prio > rt_rq->highest_prio.curr)
 			resched_curr(rq);
 	} else {
+		/*
+		 * This task is not running, thus we check against the currently
+		 * running task for preemption. We can preempt only if both tasks are
+		 * in the same cgroup or on the global runqueue.
+		 */
+		if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) &&
+		    rt_rq_of_se(&p->rt)->tg != rt_rq_of_se(&rq->curr->rt)->tg)
+			return;
+
 		/*
 		 * This task is not running, but if it is
 		 * greater than the current running task
@@ -1821,7 +1943,16 @@ static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
 #ifdef CONFIG_SCHED_CORE
 static int task_is_throttled_rt(struct task_struct *p, int cpu)
 {
+#ifdef CONFIG_RT_GROUP_SCHED
+	struct rt_rq *rt_rq;
+
+	rt_rq = task_group(p)->rt_rq[cpu];
+	WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
+
+	return dl_group_of(rt_rq)->dl_throttled;
+#else
 	return 0;
+#endif
 }
 #endif /* CONFIG_SCHED_CORE */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 611e3757fea..8bf8af7064f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2171,7 +2171,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 	if (!rt_group_sched_enabled())
 		tg = &root_task_group;
 	p->rt.rt_rq  = tg->rt_rq[cpu];
-	p->rt.parent = tg->rt_se[cpu];
+	p->dl.dl_rq  = &cpu_rq(cpu)->dl;
 #endif /* CONFIG_RT_GROUP_SCHED */
 }
 
@@ -2727,6 +2727,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 
 static inline void sub_nr_running(struct rq *rq, unsigned count)
 {
+	BUG_ON(rq->nr_running < count);
 	rq->nr_running -= count;
 	if (trace_sched_update_nr_running_tp_enabled()) {
 		call_trace_sched_update_nr_running(rq, -count);
@@ -3057,9 +3058,6 @@ extern bool sched_smp_initialized;
 #ifdef CONFIG_RT_GROUP_SCHED
 static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
 {
-#ifdef CONFIG_SCHED_DEBUG
-	WARN_ON_ONCE(rt_se->my_q);
-#endif
 	return container_of(rt_se, struct task_struct, rt);
 }
 
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 93a9c03b28e..7c1f7649477 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -626,6 +626,19 @@ int __sched_setscheduler(struct task_struct *p,
 change:
 
 	if (user) {
+#ifdef CONFIG_RT_GROUP_SCHED
+		/*
+		 * Do not allow real-time tasks into groups that have no runtime
+		 * assigned.
+		 */
+		if (rt_group_sched_enabled() &&
+				dl_bandwidth_enabled() && rt_policy(policy) &&
+				task_group(p)->dl_bandwidth.dl_runtime == 0 &&
+				!task_group_is_autogroup(task_group(p))) {
+			retval = -EPERM;
+			goto unlock;
+		}
+#endif
 		if (dl_bandwidth_enabled() && dl_policy(policy) &&
 				!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
 			cpumask_t *span = rq->rd->span;
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 14/25] sched/rt: Update rt-cgroup schedulability checks
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (12 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 13/25] sched/rt: Add HCBS related checks and operations for rt tasks Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 15/25] sched/rt: Remove old RT_GROUP_SCHED data structures Yuri Andriaccio
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Update schedulability checks and setup of runtime/period for rt-cgroups.
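
As a worked example of the dl_check_tg() test added below (all numbers made
up, expressed as fractions of a CPU): with a 95% global limit (dl_b->bw ~ 0.95),
4 CPUs in the root domain, 1.2 CPUs' worth of deadline bandwidth already
admitted (dl_b->total_bw ~ 1.2) and a requested root-group reservation of 0.7
per CPU, the check compares 0.95 * 4 = 3.8 against 1.2 + 0.7 * 4 = 4.0 and the
write is rejected with -EBUSY; a 0.6 per-CPU reservation (1.2 + 2.4 = 3.6)
would be accepted.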

Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 kernel/sched/core.c     |  6 ++++
 kernel/sched/deadline.c | 46 ++++++++++++++++++++++----
 kernel/sched/rt.c       | 72 +++++++++++++++++++++++------------------
 kernel/sched/sched.h    |  1 +
 4 files changed, 88 insertions(+), 37 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6173684a02b..63cb9271052 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9277,6 +9277,12 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return &root_task_group.css;
 	}
 
+	/* Do not allow cpu_cgroup hierarchies with depth greater than 2. */
+#ifdef CONFIG_RT_GROUP_SCHED
+	if (parent != &root_task_group)
+		return ERR_PTR(-EINVAL);
+#endif
+
 	tg = sched_create_group(parent);
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b948000f29f..7b131630743 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -365,7 +365,47 @@ void cancel_inactive_timer(struct sched_dl_entity *dl_se)
 	cancel_dl_timer(dl_se, &dl_se->inactive_timer);
 }
 
+/*
+ * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
+ * sched_domains_mutex.
+ */
+u64 dl_cookie;
+
 #ifdef CONFIG_RT_GROUP_SCHED
+int dl_check_tg(unsigned long total)
+{
+	unsigned long flags;
+	int which_cpu;
+	int cpus;
+	struct dl_bw *dl_b;
+	u64 gen = ++dl_cookie;
+
+	for_each_possible_cpu(which_cpu) {
+		rcu_read_lock_sched();
+
+		if (!dl_bw_visited(which_cpu, gen)) {
+			cpus = dl_bw_cpus(which_cpu);
+			dl_b = dl_bw_of(which_cpu);
+
+			raw_spin_lock_irqsave(&dl_b->lock, flags);
+
+			if (dl_b->bw != -1 &&
+			    dl_b->bw * cpus < dl_b->total_bw + total * cpus) {
+				raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+				rcu_read_unlock_sched();
+
+				return 0;
+			}
+
+			raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+		}
+
+		rcu_read_unlock_sched();
+	}
+
+	return 1;
+}
+
 void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
 {
 	struct rq *rq = container_of(dl_se->dl_rq, struct rq, dl);
@@ -3139,12 +3179,6 @@ DEFINE_SCHED_CLASS(dl) = {
 #endif
 };
 
-/*
- * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
- * sched_domains_mutex.
- */
-u64 dl_cookie;
-
 int sched_dl_global_validate(void)
 {
 	u64 runtime = global_rt_runtime();
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 83695e11db4..bd11f4a03f7 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1996,11 +1996,6 @@ DEFINE_SCHED_CLASS(rt) = {
 };
 
 #ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
 static inline int tg_has_rt_tasks(struct task_group *tg)
 {
 	struct task_struct *task;
@@ -2034,8 +2029,8 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	unsigned long total, sum = 0;
 	u64 period, runtime;
 
-	period = ktime_to_ns(tg->rt_bandwidth.rt_period);
-	runtime = tg->rt_bandwidth.rt_runtime;
+	period  = tg->dl_bandwidth.dl_period;
+	runtime = tg->dl_bandwidth.dl_runtime;
 
 	if (tg == d->tg) {
 		period = d->rt_period;
@@ -2051,8 +2046,7 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	/*
 	 * Ensure we don't starve existing RT tasks if runtime turns zero.
 	 */
-	if (rt_bandwidth_enabled() && !runtime &&
-	    tg->rt_bandwidth.rt_runtime && tg_has_rt_tasks(tg))
+	if (dl_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
 		return -EBUSY;
 
 	if (WARN_ON(!rt_group_sched_enabled() && tg != &root_task_group))
@@ -2066,12 +2060,17 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	if (total > to_ratio(global_rt_period(), global_rt_runtime()))
 		return -EINVAL;
 
+	if (tg == &root_task_group) {
+		if (!dl_check_tg(total))
+			return -EBUSY;
+	}
+
 	/*
 	 * The sum of our children's runtime should not exceed our own.
 	 */
 	list_for_each_entry_rcu(child, &tg->children, siblings) {
-		period = ktime_to_ns(child->rt_bandwidth.rt_period);
-		runtime = child->rt_bandwidth.rt_runtime;
+		period  = child->dl_bandwidth.dl_period;
+		runtime = child->dl_bandwidth.dl_runtime;
 
 		if (child == d->tg) {
 			period = d->rt_period;
@@ -2097,6 +2096,20 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 		.rt_runtime = runtime,
 	};
 
+	/*
+	* Since we truncate DL_SCALE bits, make sure we're at least
+	* that big.
+	*/
+	if (runtime != 0 && runtime < (1ULL << DL_SCALE))
+		return -EINVAL;
+
+	/*
+	* Since we use the MSB for wrap-around and sign issues, make
+	* sure it's not set (mind that period can be equal to zero).
+	*/
+	if (period & (1ULL << 63))
+		return -EINVAL;
+
 	rcu_read_lock();
 	ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
 	rcu_read_unlock();
@@ -2107,6 +2120,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
+	static DEFINE_MUTEX(rt_constraints_mutex);
 	int i, err = 0;
 
 	/*
@@ -2126,34 +2140,30 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 	if (rt_runtime != RUNTIME_INF && rt_runtime > max_rt_runtime)
 		return -EINVAL;
 
-	mutex_lock(&rt_constraints_mutex);
+	guard(mutex)(&rt_constraints_mutex);
 	err = __rt_schedulable(tg, rt_period, rt_runtime);
 	if (err)
-		goto unlock;
+		return err;
 
-	raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock);
-	tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);
-	tg->rt_bandwidth.rt_runtime = rt_runtime;
+	guard(raw_spinlock_irq)(&tg->dl_bandwidth.dl_runtime_lock);
+	tg->dl_bandwidth.dl_period  = rt_period;
+	tg->dl_bandwidth.dl_runtime = rt_runtime;
 
-	for_each_possible_cpu(i) {
-		struct rt_rq *rt_rq = tg->rt_rq[i];
+	if (tg == &root_task_group)
+		return 0;
 
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		rt_rq->rt_runtime = rt_runtime;
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
+	for_each_possible_cpu(i) {
+		dl_init_tg(tg->dl_se[i], rt_runtime, rt_period);
 	}
-	raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
-unlock:
-	mutex_unlock(&rt_constraints_mutex);
 
-	return err;
+	return 0;
 }
 
 int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
 {
 	u64 rt_runtime, rt_period;
 
-	rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);
+	rt_period  = tg->dl_bandwidth.dl_period;
 	rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
@@ -2167,10 +2177,10 @@ long sched_group_rt_runtime(struct task_group *tg)
 {
 	u64 rt_runtime_us;
 
-	if (tg->rt_bandwidth.rt_runtime == RUNTIME_INF)
+	if (tg->dl_bandwidth.dl_runtime == RUNTIME_INF)
 		return -1;
 
-	rt_runtime_us = tg->rt_bandwidth.rt_runtime;
+	rt_runtime_us = tg->dl_bandwidth.dl_runtime;
 	do_div(rt_runtime_us, NSEC_PER_USEC);
 	return rt_runtime_us;
 }
@@ -2183,7 +2193,7 @@ int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us)
 		return -EINVAL;
 
 	rt_period = rt_period_us * NSEC_PER_USEC;
-	rt_runtime = tg->rt_bandwidth.rt_runtime;
+	rt_runtime = tg->dl_bandwidth.dl_runtime;
 
 	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
@@ -2192,7 +2202,7 @@ long sched_group_rt_period(struct task_group *tg)
 {
 	u64 rt_period_us;
 
-	rt_period_us = ktime_to_ns(tg->rt_bandwidth.rt_period);
+	rt_period_us = tg->dl_bandwidth.dl_period;
 	do_div(rt_period_us, NSEC_PER_USEC);
 	return rt_period_us;
 }
@@ -2207,7 +2217,7 @@ static int sched_rt_global_constraints(void)
 int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 {
 	/* Don't accept real-time tasks when there is no way for them to run */
-	if (rt_group_sched_enabled() && rt_task(tsk) && tg->rt_bandwidth.rt_runtime == 0)
+	if (rt_group_sched_enabled() && rt_task(tsk) && tg->dl_bandwidth.dl_runtime == 0)
 		return 0;
 
 	return 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8bf8af7064f..9f235df4bf1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -394,6 +394,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task);
 extern void sched_init_dl_servers(void);
+extern int dl_check_tg(unsigned long total);
 extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
 
 extern void dl_server_update_idle_time(struct rq *rq,
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 15/25] sched/rt: Remove old RT_GROUP_SCHED data structures
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (13 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 14/25] sched/rt: Update rt-cgroup schedulability checks Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 16/25] sched/core: Cgroup v2 support Yuri Andriaccio
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Completely remove the old RT_GROUP_SCHED functions and data structures.

Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 include/linux/sched.h |  4 ----
 kernel/sched/rt.c     |  1 -
 kernel/sched/sched.h  | 26 --------------------------
 3 files changed, 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f0c8229afd1..343e8ef5ba1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -621,13 +621,9 @@ struct sched_rt_entity {
 	unsigned short			on_rq;
 	unsigned short			on_list;
 
-	struct sched_rt_entity		*back;
 #ifdef CONFIG_RT_GROUP_SCHED
-	struct sched_rt_entity		*parent;
 	/* rq on which this entity is (to be) queued: */
 	struct rt_rq			*rt_rq;
-	/* rq "owned" by this entity/group: */
-	struct rt_rq			*my_q;
 #endif
 } __randomize_layout;
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index bd11f4a03f7..f37ac9100d1 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1,4 +1,3 @@
-#pragma GCC diagnostic ignored "-Wunused-function"
 // SPDX-License-Identifier: GPL-2.0
 /*
  * Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f235df4bf1..4a1bbda3720 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -310,15 +310,6 @@ struct rt_prio_array {
 	struct list_head queue[MAX_RT_PRIO];
 };
 
-struct rt_bandwidth {
-	/* nests inside the rq lock: */
-	raw_spinlock_t		rt_runtime_lock;
-	ktime_t			rt_period;
-	u64			rt_runtime;
-	struct hrtimer		rt_period_timer;
-	unsigned int		rt_period_active;
-};
-
 struct dl_bandwidth {
 	raw_spinlock_t          dl_runtime_lock;
 	u64                     dl_runtime;
@@ -483,7 +474,6 @@ struct task_group {
 	struct sched_dl_entity	**dl_se;
 	struct rt_rq		**rt_rq;
 
-	struct rt_bandwidth	rt_bandwidth;
 	struct dl_bandwidth	dl_bandwidth;
 #endif
 
@@ -802,11 +792,6 @@ struct scx_rq {
 };
 #endif /* CONFIG_SCHED_CLASS_EXT */
 
-static inline int rt_bandwidth_enabled(void)
-{
-	return 0;
-}
-
 /* RT IPI pull logic requires IRQ_WORK */
 #if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
 # define HAVE_RT_PUSH_IPI
@@ -824,17 +809,6 @@ struct rt_rq {
 	bool			overloaded;
 	struct plist_head	pushable_tasks;
 
-	int			rt_queued;
-
-#ifdef CONFIG_RT_GROUP_SCHED
-	int			rt_throttled;
-	u64			rt_time; /* consumed RT time, goes up in update_curr_rt */
-	u64			rt_runtime; /* allotted RT time, "slice" from rt_bandwidth, RT sharing/balancing */
-	/* Nests inside the rq lock: */
-	raw_spinlock_t		rt_runtime_lock;
-
-	unsigned int		rt_nr_boosted;
-#endif
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group	*tg; /* this tg has "this" rt_rq on given CPU for runnable entities */
 #endif
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 16/25] sched/core: Cgroup v2 support
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (14 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 15/25] sched/rt: Remove old RT_GROUP_SCHED data structures Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 17/25] sched/rt: Remove support for cgroups-v1 Yuri Andriaccio
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Make the rt_runtime_us and rt_period_us virtual files accessible also through
the cgroup v2 cpu controller, effectively enabling the RT_GROUP_SCHED mechanism
on cgroups v2.
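
A usage sketch (paths and values are only an example; they assume cgroup v2
mounted at /sys/fs/cgroup with the cpu controller enabled via
cgroup.subtree_control):

  mkdir /sys/fs/cgroup/rtgroup
  echo 1000000 > /sys/fs/cgroup/rtgroup/cpu.rt_period_us
  echo 100000 > /sys/fs/cgroup/rtgroup/cpu.rt_runtime_us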

Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 kernel/sched/core.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 63cb9271052..465f44d7235 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10300,6 +10300,18 @@ static struct cftype cpu_files[] = {
 		.write = cpu_uclamp_max_write,
 	},
 #endif /* CONFIG_UCLAMP_TASK_GROUP */
+#ifdef CONFIG_RT_GROUP_SCHED
+	{
+		.name = "rt_runtime_us",
+		.read_s64 = cpu_rt_runtime_read,
+		.write_s64 = cpu_rt_runtime_write,
+	},
+	{
+		.name = "rt_period_us",
+		.read_u64 = cpu_rt_period_read_uint,
+		.write_u64 = cpu_rt_period_write_uint,
+	},
+#endif /* CONFIG_RT_GROUP_SCHED */
 	{ }	/* terminate */
 };
 
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 17/25] sched/rt: Remove support for cgroups-v1
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (15 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 16/25] sched/core: Cgroup v2 support Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 18/25] sched/rt: Zero rt-cgroups default bandwidth Yuri Andriaccio
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

Disable the rt-group control files for cgroups-v1, allowing only cgroups-v2.
This should simplify code maintenance, also considering that cgroups-v1 is
deprecated.

Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 kernel/sched/core.c | 18 ------------------
 1 file changed, 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 465f44d7235..85950e10bb1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10039,20 +10039,6 @@ static struct cftype cpu_legacy_files[] = {
 };
 
 #ifdef CONFIG_RT_GROUP_SCHED
-static struct cftype rt_group_files[] = {
-	{
-		.name = "rt_runtime_us",
-		.read_s64 = cpu_rt_runtime_read,
-		.write_s64 = cpu_rt_runtime_write,
-	},
-	{
-		.name = "rt_period_us",
-		.read_u64 = cpu_rt_period_read_uint,
-		.write_u64 = cpu_rt_period_write_uint,
-	},
-	{ }	/* Terminate */
-};
-
 # ifdef CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED
 DEFINE_STATIC_KEY_FALSE(rt_group_sched);
 # else
@@ -10078,10 +10064,6 @@ __setup("rt_group_sched=", setup_rt_group_sched);
 
 static int __init cpu_rt_group_init(void)
 {
-	if (!rt_group_sched_enabled())
-		return 0;
-
-	WARN_ON(cgroup_add_legacy_cftypes(&cpu_cgrp_subsys, rt_group_files));
 	return 0;
 }
 subsys_initcall(cpu_rt_group_init);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 18/25] sched/rt: Zero rt-cgroups default bandwidth
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (16 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 17/25] sched/rt: Remove support for cgroups-v1 Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-08-14 14:13   ` Juri Lelli
  2025-07-31 10:55 ` [RFC PATCH v2 19/25] sched/deadline: Allow deeper hierarchies of RT cgroups Yuri Andriaccio
                   ` (7 subsequent siblings)
  25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

Set the default rt-cgroups runtime to zero, otherwise a cgroup-v1 kernel will
not be able to start SCHED_DEADLINE tasks. The bandwidth for rt-cgroups must
then be manually assigned after the kernel boots.

Allow zeroing the runtime of the root control group. This runtime only affects
the available bandwidth of the rt-cgroup hierarchy but not the SCHED_FIFO /
SCHED_RR tasks on the global runqueue.

Notes:
Disabling the root control group bandwidth should not cause any side effect, as
SCHED_FIFO / SCHED_RR tasks do not depend on it since the introduction of
fair_servers.

Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 kernel/sched/core.c     |  4 ++--
 kernel/sched/rt.c       | 13 +++++--------
 kernel/sched/syscalls.c |  2 +-
 3 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 85950e10bb1..3ac65c6af70 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8719,7 +8719,7 @@ void __init sched_init(void)
 
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_dl_bandwidth(&root_task_group.dl_bandwidth,
-			global_rt_period(), global_rt_runtime());
+			global_rt_period(), 0);
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_CGROUP_SCHED
@@ -9348,7 +9348,7 @@ static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
 		goto scx_check;
 
 	cgroup_taskset_for_each(task, css, tset) {
-		if (!sched_rt_can_attach(css_tg(css), task))
+		if (rt_task(task) && !sched_rt_can_attach(css_tg(css), task))
 			return -EINVAL;
 	}
 scx_check:
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f37ac9100d1..75a6860c2e2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2122,13 +2122,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 	static DEFINE_MUTEX(rt_constraints_mutex);
 	int i, err = 0;
 
-	/*
-	 * Disallowing the root group RT runtime is BAD, it would disallow the
-	 * kernel creating (and or operating) RT threads.
-	 */
-	if (tg == &root_task_group && rt_runtime == 0)
-		return -EINVAL;
-
 	/* No period doesn't make any sense. */
 	if (rt_period == 0)
 		return -EINVAL;
@@ -2215,8 +2208,12 @@ static int sched_rt_global_constraints(void)
 
 int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 {
+	/* Allow executing in the root cgroup regardless of allowed bandwidth */
+	if (tg == &root_task_group)
+		return 1;
+
 	/* Don't accept real-time tasks when there is no way for them to run */
-	if (rt_group_sched_enabled() && rt_task(tsk) && tg->dl_bandwidth.dl_runtime == 0)
+	if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0)
 		return 0;
 
 	return 1;
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 7c1f7649477..71f20be6f29 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -633,7 +633,7 @@ int __sched_setscheduler(struct task_struct *p,
 		 */
 		if (rt_group_sched_enabled() &&
 				dl_bandwidth_enabled() && rt_policy(policy) &&
-				task_group(p)->dl_bandwidth.dl_runtime == 0 &&
+				!sched_rt_can_attach(task_group(p), p) &&
 				!task_group_is_autogroup(task_group(p))) {
 			retval = -EPERM;
 			goto unlock;
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 19/25] sched/deadline: Allow deeper hierarchies of RT cgroups
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (17 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 18/25] sched/rt: Zero rt-cgroups default bandwidth Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-08-14 14:29   ` Juri Lelli
  2025-07-31 10:55 ` [RFC PATCH v2 20/25] sched/rt: Add rt-cgroup migration Yuri Andriaccio
                   ` (6 subsequent siblings)
  25 siblings, 1 reply; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Allow the creation of cgroup hierarchies with depth greater than two.
Add a check to prevent attaching tasks to a child cgroup of an active cgroup
(i.e. one with a running FIFO/RR task).
Add a check to prevent attaching tasks to cgroups which have children with
non-zero runtime.
Update the allocated-bandwidth accounting of rt-cgroups for nested cgroup
hierarchies.
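
A hypothetical example of the new constraints, for a hierarchy root/A/B:
giving B a non-zero runtime is rejected while A already has FIFO/RR tasks, and
attaching an RT task to A is rejected once B has been given a non-zero
runtime; as the accounting changes below suggest, once B becomes active, A's
own per-CPU reservation is dropped from the runqueue accounting so that the
hierarchy is not charged twice for the same bandwidth.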

Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 kernel/sched/core.c     |  6 -----
 kernel/sched/deadline.c | 51 +++++++++++++++++++++++++++++++++++++----
 kernel/sched/rt.c       | 25 +++++++++++++++++---
 kernel/sched/sched.h    |  2 +-
 4 files changed, 70 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ac65c6af70..eb9de8c7b1f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9277,12 +9277,6 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return &root_task_group.css;
 	}
 
-	/* Do not allow cpu_cgroup hierarchies with depth greater than 2. */
-#ifdef CONFIG_RT_GROUP_SCHED
-	if (parent != &root_task_group)
-		return ERR_PTR(-EINVAL);
-#endif
-
 	tg = sched_create_group(parent);
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7b131630743..e263abcdc04 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -406,11 +406,42 @@ int dl_check_tg(unsigned long total)
 	return 1;
 }
 
-void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
+static inline bool is_active_sched_group(struct task_group *tg)
 {
+	struct task_group *child;
+	bool is_active = 1;
+
+	// if there are no children, this is a leaf group, thus it is active
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		if (child->dl_bandwidth.dl_runtime > 0) {
+			is_active = 0;
+		}
+	}
+	return is_active;
+}
+
+static inline bool sched_group_has_active_siblings(struct task_group *tg)
+{
+	struct task_group *child;
+	bool has_active_siblings = 0;
+
+	// check whether any sibling of tg currently has a non-zero runtime
+	list_for_each_entry_rcu(child, &tg->parent->children, siblings) {
+		if (child != tg && child->dl_bandwidth.dl_runtime > 0) {
+			has_active_siblings = 1;
+		}
+	}
+	return has_active_siblings;
+}
+
+void dl_init_tg(struct task_group *tg, int cpu, u64 rt_runtime, u64 rt_period)
+{
+	struct sched_dl_entity *dl_se = tg->dl_se[cpu];
 	struct rq *rq = container_of(dl_se->dl_rq, struct rq, dl);
-	int is_active;
-	u64 new_bw;
+	int is_active, is_active_group;
+	u64 old_runtime, new_bw;
+
+	is_active_group = is_active_sched_group(tg);
 
 	raw_spin_rq_lock_irq(rq);
 	is_active = dl_se->my_q->rt.rt_nr_running > 0;
@@ -418,8 +449,10 @@ void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
 	update_rq_clock(rq);
 	dl_server_stop(dl_se);
 
+	old_runtime = dl_se->dl_runtime;
 	new_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
-	dl_rq_change_utilization(rq, dl_se, new_bw);
+	if (is_active_group)
+		dl_rq_change_utilization(rq, dl_se, new_bw);
 
 	dl_se->dl_runtime  = rt_runtime;
 	dl_se->dl_deadline = rt_period;
@@ -431,6 +464,16 @@ void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
 	dl_se->dl_bw = new_bw;
 	dl_se->dl_density = new_bw;
 
+	// add/remove the parent's bw
+	if (tg->parent && tg->parent != &root_task_group)
+	{
+		if (rt_runtime == 0 && old_runtime != 0 && !sched_group_has_active_siblings(tg)) {
+			__add_rq_bw(tg->parent->dl_se[cpu]->dl_bw, dl_se->dl_rq);
+		} else if (rt_runtime != 0 && old_runtime == 0 && !sched_group_has_active_siblings(tg)) {
+			__sub_rq_bw(tg->parent->dl_se[cpu]->dl_bw, dl_se->dl_rq);
+		}
+	}
+
 	if (is_active)
 		dl_server_start(dl_se);
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 75a6860c2e2..29b51251fdc 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -107,7 +107,8 @@ void free_rt_sched_group(struct task_group *tg)
 			 * Fix this issue by changing the group runtime
 			 * to 0 immediately before freeing it.
 			 */
-			dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
+			if (tg->dl_se[i]->dl_runtime)
+				dl_init_tg(tg, i, 0, tg->dl_se[i]->dl_period);
 			raw_spin_rq_lock_irqsave(cpu_rq(i), flags);
 			BUG_ON(tg->rt_rq[i]->rt_nr_running);
 			raw_spin_rq_unlock_irqrestore(cpu_rq(i), flags);
@@ -2122,6 +2123,14 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 	static DEFINE_MUTEX(rt_constraints_mutex);
 	int i, err = 0;
 
+	/*
+	 * Do not allow to set a RT runtime > 0 if the parent has RT tasks
+	 * (and is not the root group)
+	 */
+	if (rt_runtime && (tg != &root_task_group) && (tg->parent != &root_task_group) && tg_has_rt_tasks(tg->parent)) {
+		return -EINVAL;
+	}
+
 	/* No period doesn't make any sense. */
 	if (rt_period == 0)
 		return -EINVAL;
@@ -2145,7 +2154,7 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 		return 0;
 
 	for_each_possible_cpu(i) {
-		dl_init_tg(tg->dl_se[i], rt_runtime, rt_period);
+		dl_init_tg(tg, i, rt_runtime, rt_period);
 	}
 
 	return 0;
@@ -2208,6 +2217,9 @@ static int sched_rt_global_constraints(void)
 
 int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 {
+	struct task_group *child;
+	int can_attach = 1;
+
 	/* Allow executing in the root cgroup regardless of allowed bandwidth */
 	if (tg == &root_task_group)
 		return 1;
@@ -2216,7 +2228,14 @@ int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 	if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0)
 		return 0;
 
-	return 1;
+	/* If one of the children has runtime > 0, cannot attach RT tasks! */
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		if (child->dl_bandwidth.dl_runtime) {
+			can_attach = 0;
+		}
+	}
+
+	return can_attach;
 }
 
 #else /* !CONFIG_RT_GROUP_SCHED: */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a1bbda3720..3dd2ede6d35 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -386,7 +386,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
 		    dl_server_pick_f pick_task);
 extern void sched_init_dl_servers(void);
 extern int dl_check_tg(unsigned long total);
-extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
+extern void dl_init_tg(struct task_group *tg, int cpu, u64 rt_runtime, u64 rt_period);
 
 extern void dl_server_update_idle_time(struct rq *rq,
 		    struct task_struct *p);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 20/25] sched/rt: Add rt-cgroup migration
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (18 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 19/25] sched/deadline: Allow deeper hierarchies of RT cgroups Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 21/25] sched/rt: add HCBS migration related checks and function calls Yuri Andriaccio
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

When the runtime of an rt-cgroup is exhausted on a CPU, the scheduler looks for
another, non-throttled runqueue of the same group and, if one is available,
migrates the tasks there.

The bandwidth (runtime/period) chosen for a given cgroup is replicated on every
core of the system; therefore, on an SMP system with M cores, the total
available bandwidth is the given runtime/period multiplied by M.
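
For instance (numbers purely illustrative), a cgroup configured with
runtime = 10 ms and period = 100 ms on a 4-core system reserves 10% of each
core, i.e. an aggregate bandwidth of 4 * 10/100 = 40% of a single CPU's
capacity, which the migration code lets runnable tasks of the group exploit by
moving them to a non-throttled per-CPU runqueue of the same cgroup.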

Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 kernel/sched/rt.c    | 471 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   5 +
 2 files changed, 468 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 29b51251fdc..2fdb2657554 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1,3 +1,4 @@
+#pragma GCC diagnostic ignored "-Wunused-function"
 // SPDX-License-Identifier: GPL-2.0
 /*
  * Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR
@@ -84,6 +85,8 @@ void init_rt_rq(struct rt_rq *rt_rq)
 	plist_head_init(&rt_rq->pushable_tasks);
 }
 
+static void group_pull_rt_task(struct rt_rq *this_rt_rq);
+
 #ifdef CONFIG_RT_GROUP_SCHED
 
 void unregister_rt_sched_group(struct task_group *tg)
@@ -289,6 +292,45 @@ static inline void rt_queue_pull_task(struct rq *rq)
 	queue_balance_callback(rq, &per_cpu(rt_pull_head, rq->cpu), pull_rt_task);
 }
 
+#ifdef CONFIG_RT_GROUP_SCHED
+static DEFINE_PER_CPU(struct balance_callback, rt_group_push_head);
+static DEFINE_PER_CPU(struct balance_callback, rt_group_pull_head);
+static void push_group_rt_tasks(struct rq *);
+static void pull_group_rt_tasks(struct rq *);
+
+static void rt_queue_push_from_group(struct rq *rq, struct rt_rq *rt_rq)
+{
+	BUG_ON(rt_rq == NULL);
+	BUG_ON(rt_rq->rq != rq);
+
+	if (rq->rq_to_push_from)
+		return;
+
+	rq->rq_to_push_from = container_of(rt_rq, struct rq, rt);
+	queue_balance_callback(rq, &per_cpu(rt_group_push_head, rq->cpu),
+			       push_group_rt_tasks);
+}
+
+static void rt_queue_pull_to_group(struct rq *rq, struct rt_rq *rt_rq)
+{
+	struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+	BUG_ON(rt_rq == NULL);
+	BUG_ON(!is_dl_group(rt_rq));
+	BUG_ON(rt_rq->rq != rq);
+
+	if (dl_se->dl_throttled || rq->rq_to_pull_to)
+		return;
+
+	rq->rq_to_pull_to = container_of(rt_rq, struct rq, rt);
+	queue_balance_callback(rq, &per_cpu(rt_group_pull_head, rq->cpu),
+			       pull_group_rt_tasks);
+}
+#else
+static inline void rt_queue_push_from_group(struct rq *rq, struct rt_rq *rt_rq) {};
+static inline void rt_queue_pull_to_group(struct rq *rq, struct rt_rq *rt_rq) {};
+#endif
+
 static void enqueue_pushable_task(struct rt_rq *rt_rq, struct task_struct *p)
 {
 	plist_del(&p->pushable_tasks, &rt_rq->pushable_tasks);
@@ -1277,6 +1319,8 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
  */
 static int push_rt_task(struct rq *rq, bool pull)
 {
+	BUG_ON(is_dl_group(&rq->rt));
+
 	struct task_struct *next_task;
 	struct rq *lowest_rq;
 	int ret = 0;
@@ -1573,6 +1617,8 @@ void rto_push_irq_work_func(struct irq_work *work)
 
 static void pull_rt_task(struct rq *this_rq)
 {
+	BUG_ON(is_dl_group(&this_rq->rt));
+
 	int this_cpu = this_rq->cpu, cpu;
 	bool resched = false;
 	struct task_struct *p, *push_task;
@@ -1683,27 +1729,436 @@ static void pull_rt_task(struct rq *this_rq)
 }
 
 #ifdef CONFIG_RT_GROUP_SCHED
-static int group_push_rt_task(struct rt_rq *rt_rq)
+/*
+ * Find the lowest priority runqueue among the runqueues of the same
+ * task group. Unlike find_lowest_rt(), this does not mean that the
+ * lowest priority cpu is running tasks from this runqueue.
+ */
+static int group_find_lowest_rt_rq(struct task_struct *task, struct rt_rq* task_rt_rq)
+{
+	struct sched_domain *sd;
+	struct cpumask mask, *lowest_mask = &mask;
+	struct sched_dl_entity *dl_se;
+	struct rt_rq *rt_rq;
+	int prio, lowest_prio;
+	int cpu, this_cpu = smp_processor_id();
+
+	BUG_ON(task->sched_task_group != task_rt_rq->tg);
+
+	if (task->nr_cpus_allowed == 1)
+		return -1; /* No other targets possible */
+
+	lowest_prio = task->prio - 1;
+	cpumask_clear(lowest_mask);
+	for_each_cpu_and(cpu, cpu_online_mask, task->cpus_ptr) {
+		dl_se = task_rt_rq->tg->dl_se[cpu];
+		rt_rq = &dl_se->my_q->rt;
+		prio = rt_rq->highest_prio.curr;
+
+		/*
+		 * If we're on asym system ensure we consider the different capacities
+		 * of the CPUs when searching for the lowest_mask.
+		 */
+		if (dl_se->dl_throttled || !rt_task_fits_capacity(task, cpu))
+			continue;
+
+		if (prio >= lowest_prio) {
+			if (prio > lowest_prio) {
+				cpumask_clear(lowest_mask);
+				lowest_prio = prio;
+			}
+
+			cpumask_set_cpu(cpu, lowest_mask);
+		}
+	}
+
+	if (cpumask_empty(lowest_mask))
+		return -1;
+
+	/*
+	 * At this point we have built a mask of CPUs representing the
+	 * lowest priority tasks in the system.  Now we want to elect
+	 * the best one based on our affinity and topology.
+	 *
+	 * We prioritize the last CPU that the task executed on since
+	 * it is most likely cache-hot in that location.
+	 */
+	cpu = task_cpu(task);
+	if (cpumask_test_cpu(cpu, lowest_mask))
+		return cpu;
+
+	/*
+	 * Otherwise, we consult the sched_domains span maps to figure
+	 * out which CPU is logically closest to our hot cache data.
+	 */
+	if (!cpumask_test_cpu(this_cpu, lowest_mask))
+		this_cpu = -1; /* Skip this_cpu opt if not among lowest */
+
+	rcu_read_lock();
+	for_each_domain(cpu, sd) {
+		if (sd->flags & SD_WAKE_AFFINE) {
+			int best_cpu;
+
+			/*
+			 * "this_cpu" is cheaper to preempt than a
+			 * remote processor.
+			 */
+			if (this_cpu != -1 &&
+			    cpumask_test_cpu(this_cpu, sched_domain_span(sd))) {
+				rcu_read_unlock();
+				return this_cpu;
+			}
+
+			best_cpu = cpumask_any_and_distribute(lowest_mask,
+							      sched_domain_span(sd));
+			if (best_cpu < nr_cpu_ids) {
+				rcu_read_unlock();
+				return best_cpu;
+			}
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * And finally, if there were no matches within the domains
+	 * just give the caller *something* to work with from the compatible
+	 * locations.
+	 */
+	if (this_cpu != -1)
+		return this_cpu;
+
+	cpu = cpumask_any_distribute(lowest_mask);
+	if (cpu < nr_cpu_ids)
+		return cpu;
+
+	return -1;
+}
+
+/*
+ * Find and lock the lowest priority runqueue among the runqueues
+ * of the same task group. Unlike find_lock_lowest_rt(), this does not
+ * mean that the lowest priority cpu is running tasks from this runqueue.
+ */
+static struct rt_rq* group_find_lock_lowest_rt_rq(struct task_struct *task, struct rt_rq *rt_rq)
+{
+	struct rq *rq = rq_of_rt_rq(rt_rq);
+	struct rq *lowest_rq;
+	struct rt_rq *lowest_rt_rq = NULL;
+	struct sched_dl_entity *lowest_dl_se;
+	int tries, cpu;
+
+	BUG_ON(task->sched_task_group != rt_rq->tg);
+
+	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
+		cpu = group_find_lowest_rt_rq(task, rt_rq);
+
+		if ((cpu == -1) || (cpu == rq->cpu))
+			break;
+
+		lowest_dl_se = rt_rq->tg->dl_se[cpu];
+		lowest_rt_rq = &lowest_dl_se->my_q->rt;
+		lowest_rq = cpu_rq(cpu);
+
+		if (lowest_rt_rq->highest_prio.curr <= task->prio) {
+			/*
+			 * Target rq has tasks of equal or higher priority,
+			 * retrying does not release any lock and is unlikely
+			 * to yield a different result.
+			 */
+			lowest_rt_rq = NULL;
+			break;
+		}
+
+		/* if the prio of this runqueue changed, try again */
+		if (double_lock_balance(rq, lowest_rq)) {
+			/*
+			 * We had to unlock the run queue. In
+			 * the mean time, task could have
+			 * migrated already or had its affinity changed.
+			 * Also make sure that it wasn't scheduled on its rq.
+			 * It is possible the task was scheduled, set
+			 * "migrate_disabled" and then got preempted, so we must
+			 * check the task migration disable flag here too.
+			 */
+			if (unlikely(is_migration_disabled(task) ||
+				     lowest_dl_se->dl_throttled ||
+				     !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
+				     task != pick_next_pushable_task(rt_rq))) {
+
+				double_unlock_balance(rq, lowest_rq);
+				lowest_rt_rq = NULL;
+				break;
+			}
+		}
+
+		/* If this rq is still suitable use it. */
+		if (lowest_rt_rq->highest_prio.curr > task->prio)
+			break;
+
+		/* try again */
+		double_unlock_balance(rq, lowest_rq);
+		lowest_rt_rq = NULL;
+	}
+
+	return lowest_rt_rq;
+}
+
+static int group_push_rt_task(struct rt_rq *rt_rq, bool pull)
 {
+	BUG_ON(!is_dl_group(rt_rq));
+
 	struct rq *rq = rq_of_rt_rq(rt_rq);
+	struct task_struct *next_task;
+	struct rq *lowest_rq;
+	struct rt_rq *lowest_rt_rq;
+	int ret = 0;
+
+	if (!rt_rq->overloaded)
+		return 0;
+
+	next_task = pick_next_pushable_task(rt_rq);
+	if (!next_task)
+		return 0;
+
+retry:
+	if (is_migration_disabled(next_task)) {
+		struct task_struct *push_task = NULL;
+		int cpu;
+
+		if (!pull || rq->push_busy)
+			return 0;
+
+		/*
+		 * If the current task does not belong to the same task group
+		 * we cannot push it away.
+		 */
+		if (rq->curr->sched_task_group != rt_rq->tg)
+			return 0;
+
+		/*
+		 * Invoking group_find_lowest_rt_rq() on anything but an RT task doesn't
+		 * make sense. Per the above priority check, curr has to
+		 * be of higher priority than next_task, so no need to
+		 * reschedule when bailing out.
+		 *
+		 * Note that the stoppers are masqueraded as SCHED_FIFO
+		 * (cf. sched_set_stop_task()), so we can't rely on rt_task().
+		 */
+		if (rq->curr->sched_class != &rt_sched_class)
+			return 0;
+
+		cpu = group_find_lowest_rt_rq(rq->curr, rt_rq);
+		if (cpu == -1 || cpu == rq->cpu)
+			return 0;
+
+		/*
+		 * Given we found a CPU with lower priority than @next_task,
+		 * therefore it should be running. However we cannot migrate it
+		 * to this other CPU, instead attempt to push the current
+		 * running task on this CPU away.
+		 */
+		push_task = get_push_task(rq);
+		if (push_task) {
+			preempt_disable();
+			raw_spin_rq_unlock(rq);
+			stop_one_cpu_nowait(rq->cpu, push_cpu_stop,
+					    push_task, &rq->push_work);
+			preempt_enable();
+			raw_spin_rq_lock(rq);
+		}
 
-	if (is_dl_group(rt_rq))
 		return 0;
+	}
+
+	if (WARN_ON(next_task == rq->curr))
+		return 0;
+
+	/* We might release rq lock */
+	get_task_struct(next_task);
+
+	/* group_find_lock_lowest_rq locks the rq if found */
+	lowest_rt_rq = group_find_lock_lowest_rt_rq(next_task, rt_rq);
+	if (!lowest_rt_rq) {
+		struct task_struct *task;
+		/*
+		 * group_find_lock_lowest_rt_rq releases rq->lock
+		 * so it is possible that next_task has migrated.
+		 *
+		 * We need to make sure that the task is still on the same
+		 * run-queue and is also still the next task eligible for
+		 * pushing.
+		 */
+		task = pick_next_pushable_task(rt_rq);
+		if (task == next_task) {
+			/*
+			 * The task hasn't migrated, and is still the next
+			 * eligible task, but we failed to find a run-queue
+			 * to push it to.  Do not retry in this case, since
+			 * other CPUs will pull from us when ready.
+			 */
+			goto out;
+		}
+
+		if (!task)
+			/* No more tasks, just exit */
+			goto out;
+
+		/*
+		 * Something has shifted, try again.
+		 */
+		put_task_struct(next_task);
+		next_task = task;
+		goto retry;
+	}
+
+	lowest_rq = rq_of_rt_rq(lowest_rt_rq);
+
+	move_queued_task_locked(rq, lowest_rq, next_task);
+	resched_curr(lowest_rq);
+	ret = 1;
+
+	double_unlock_balance(rq, lowest_rq);
+out:
+	put_task_struct(next_task);
+
+	return ret;
+}
+
+static void group_pull_rt_task(struct rt_rq *this_rt_rq)
+{
+	BUG_ON(!is_dl_group(this_rt_rq));
+
+	struct rq *this_rq = rq_of_rt_rq(this_rt_rq);
+	int this_cpu = this_rq->cpu, cpu;
+	bool resched = false;
+	struct task_struct *p, *push_task = NULL;
+	struct rt_rq *src_rt_rq;
+	struct rq *src_rq;
+	struct sched_dl_entity *src_dl_se;
+
+	for_each_online_cpu(cpu) {
+		if (this_cpu == cpu)
+			continue;
 
-	return push_rt_task(rq, false);
+		src_dl_se = this_rt_rq->tg->dl_se[cpu];
+		src_rt_rq = &src_dl_se->my_q->rt;
+
+		if (src_rt_rq->rt_nr_running <= 1 && !src_dl_se->dl_throttled)
+			continue;
+
+		src_rq = rq_of_rt_rq(src_rt_rq);
+
+		/*
+		 * Don't bother taking the src_rq->lock if the next highest
+		 * task is known to be lower-priority than our current task.
+		 * This may look racy, but if this value is about to go
+		 * logically higher, the src_rq will push this task away.
+		 * And if its going logically lower, we do not care
+		 */
+		if (src_rt_rq->highest_prio.next >=
+		    this_rt_rq->highest_prio.curr)
+			continue;
+
+		/*
+		 * We can potentially drop this_rq's lock in
+		 * double_lock_balance, and another CPU could
+		 * alter this_rq
+		 */
+		push_task = NULL;
+		double_lock_balance(this_rq, src_rq);
+
+		/*
+		 * We can pull only a task, which is pushable
+		 * on its rq, and no others.
+		 */
+		p = pick_highest_pushable_task(src_rt_rq, this_cpu);
+
+		/*
+		 * Do we have an RT task that preempts
+		 * the to-be-scheduled task?
+		 */
+		if (p && (p->prio < this_rt_rq->highest_prio.curr)) {
+			WARN_ON(p == src_rq->curr);
+			WARN_ON(!task_on_rq_queued(p));
+
+			/*
+			 * There's a chance that p is higher in priority
+			 * than what's currently running on its CPU.
+			 * This is just that p is waking up and hasn't
+			 * had a chance to schedule. We only pull
+			 * p if it is lower in priority than the
+			 * current task on the run queue
+			 */
+			if (p->prio < src_rq->curr->prio)
+				goto skip;
+
+			if (is_migration_disabled(p)) {
+				/*
+				 * If the current task does not belong to the same task group
+				 * we cannot push it away.
+				 */
+				if (src_rq->curr->sched_task_group != this_rt_rq->tg)
+					goto skip;
+
+				push_task = get_push_task(src_rq);
+			} else {
+				move_queued_task_locked(src_rq, this_rq, p);
+				resched = true;
+			}
+			/*
+			 * We continue with the search, just in
+			 * case there's an even higher prio task
+			 * in another runqueue. (low likelihood
+			 * but possible)
+			 */
+		}
+skip:
+		double_unlock_balance(this_rq, src_rq);
+
+		if (push_task) {
+			preempt_disable();
+			raw_spin_rq_unlock(this_rq);
+			stop_one_cpu_nowait(src_rq->cpu, push_cpu_stop,
+					    push_task, &src_rq->push_work);
+			preempt_enable();
+			raw_spin_rq_lock(this_rq);
+		}
+	}
+
+	if (resched)
+		resched_curr(this_rq);
 }
 
 static void group_push_rt_tasks(struct rt_rq *rt_rq)
 {
-	while (group_push_rt_task(rt_rq))
+	while (group_push_rt_task(rt_rq, false))
 		;
 }
-#else
-static void group_push_rt_tasks(struct rt_rq *rt_rq)
+
+static void push_group_rt_tasks(struct rq *rq)
 {
-	push_rt_tasks(rq_of_rt_rq(rt_rq));
+	BUG_ON(rq->rq_to_push_from == NULL);
+
+	if ((rq->rq_to_push_from->rt.rt_nr_running > 1) ||
+	    (dl_group_of(&rq->rq_to_push_from->rt)->dl_throttled == 1)) {
+		group_push_rt_task(&rq->rq_to_push_from->rt, false);
+	}
+
+	rq->rq_to_push_from = NULL;
 }
-#endif
+
+static void pull_group_rt_tasks(struct rq *rq)
+{
+	BUG_ON(rq->rq_to_pull_to == NULL);
+	BUG_ON(rq->rq_to_pull_to->rt.rq != rq);
+
+	group_pull_rt_task(&rq->rq_to_pull_to->rt);
+	rq->rq_to_pull_to = NULL;
+}
+#else /* CONFIG_RT_GROUP_SCHED */
+static void group_pull_rt_task(struct rt_rq *this_rt_rq) { }
+static void group_push_rt_tasks(struct rt_rq *rt_rq) { }
+#endif /* CONFIG_RT_GROUP_SCHED */
 
 /*
  * If we are not running and we are not going to reschedule soon, we should
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3dd2ede6d35..10e29f37f9b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1282,6 +1282,11 @@ struct rq {
 	call_single_data_t	cfsb_csd;
 	struct list_head	cfsb_csd_list;
 #endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+	struct rq		*rq_to_push_from;
+	struct rq		*rq_to_pull_to;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 21/25] sched/rt: add HCBS migration related checks and function calls
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (19 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 20/25] sched/rt: Add rt-cgroup migration Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 22/25] sched/deadline: Make rt-cgroup's servers pull tasks on timer replenishment Yuri Andriaccio
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Add HCBS-related checks and operations to allow rt-task migration,
differentiating between tasks that belong to rt-cgroups and tasks that run on
the global runqueue.
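
As a concrete scenario (derived from the hunks below): when a task is woken or
preempted inside a throttled rt-cgroup runqueue, balancing is now routed
through the group push/pull helpers (rt_queue_push_from_group() /
rt_queue_pull_to_group()) towards sibling runqueues of the same cgroup, while
tasks on the global runqueue keep using the existing push_rt_tasks() /
pull_rt_task() paths.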

Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 kernel/sched/rt.c | 63 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 48 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2fdb2657554..677ab9e8aa4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -865,6 +865,11 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 	struct rq *rq;
 	bool test;
 
+	/* Just return the task_cpu for processes inside task groups */
+	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) &&
+	    is_dl_group(rt_rq_of_se(&p->rt)))
+		goto out;
+
 	/* For anything but wake ups, just return the task_cpu */
 	if (!(flags & (WF_TTWU | WF_FORK)))
 		goto out;
@@ -964,7 +969,10 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 		 * not yet started the picking loop.
 		 */
 		rq_unpin_lock(rq, rf);
-		pull_rt_task(rq);
+		if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq_of_se(&p->rt)))
+			group_pull_rt_task(rt_rq_of_se(&p->rt));
+		else
+			pull_rt_task(rq);
 		rq_repin_lock(rq, rf);
 	}
 
@@ -1050,7 +1058,10 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
 	if (rq->donor->sched_class != &rt_sched_class)
 		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
-	rt_queue_push_tasks(rq);
+	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq))
+		rt_queue_push_from_group(rq, rt_rq);
+	else
+		rt_queue_push_tasks(rq);
 }
 
 static struct sched_rt_entity *pick_next_rt_entity(struct rt_rq *rt_rq)
@@ -1113,6 +1124,13 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_s
 	 */
 	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rt_rq, p);
+
+	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq)) {
+		struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
+
+		if (dl_se->dl_throttled)
+			rt_queue_push_from_group(rq, rt_rq);
+	}
 }
 
 /* Only try algorithms three times */
@@ -2174,8 +2192,13 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 			    (rq->curr->nr_cpus_allowed < 2 ||
 			     rq->donor->prio <= p->prio);
 
-	if (need_to_push)
+	if (!need_to_push)
+		return;
+
+	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq))
 		group_push_rt_tasks(rt_rq);
+	else
+		push_rt_tasks(rq);
 }
 
 /* Assumes rq->lock is held */
@@ -2216,7 +2239,9 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
 	if (!task_on_rq_queued(p) || rt_rq->rt_nr_running)
 		return;
 
-	if (!IS_ENABLED(CONFIG_RT_GROUP_SCHED))
+	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq))
+		rt_queue_pull_to_group(rq, rt_rq);
+	else
 		rt_queue_pull_task(rq);
 }
 
@@ -2243,6 +2268,13 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
 	 */
 	if (task_current(rq, p)) {
 		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+		if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq_of_se(&p->rt))) {
+			struct sched_dl_entity *dl_se = dl_group_of(rt_rq_of_se(&p->rt));
+
+			p->dl_server = dl_se;
+		}
+
 		return;
 	}
 
@@ -2252,17 +2284,14 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
 	 * then see if we can move to another run queue.
 	 */
 	if (task_on_rq_queued(p)) {
-
-#ifndef CONFIG_RT_GROUP_SCHED
-		if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
+		if (!is_dl_group(rt_rq_of_se(&p->rt)) && p->nr_cpus_allowed > 1 && rq->rt.overloaded) {
 			rt_queue_push_tasks(rq);
-#else
-		if (rt_rq_of_se(&p->rt)->overloaded) {
-		} else {
-			if (p->prio < rq->curr->prio)
-				resched_curr(rq);
+			return;
+		} else if (is_dl_group(rt_rq_of_se(&p->rt)) && rt_rq_of_se(&p->rt)->overloaded) {
+			rt_queue_push_from_group(rq, rt_rq_of_se(&p->rt));
+			return;
 		}
-#endif
+
 		if (p->prio < rq->donor->prio && cpu_online(cpu_of(rq)))
 			resched_curr(rq);
 	}
@@ -2285,8 +2314,12 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 		 * If our priority decreases while running, we
 		 * may need to pull tasks to this runqueue.
 		 */
-		if (!IS_ENABLED(CONFIG_RT_GROUP_SCHED) && oldprio < p->prio)
-			rt_queue_pull_task(rq);
+		if (oldprio < p->prio) {
+			if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq))
+				rt_queue_pull_to_group(rq, rt_rq);
+			else
+				rt_queue_pull_task(rq);
+		}
 
 		/*
 		 * If there's a higher priority task waiting to run
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 22/25] sched/deadline: Make rt-cgroup's servers pull tasks on timer replenishment
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (20 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 21/25] sched/rt: add HCBS migration related checks and function calls Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 23/25] sched/deadline: Fix HCBS migrations on server stop Yuri Andriaccio
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 kernel/sched/deadline.c | 6 +++++-
 kernel/sched/rt.c       | 6 +++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e263abcdc04..021d7349897 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1308,6 +1308,7 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 {
 	struct rq *rq = rq_of_dl_se(dl_se);
 	u64 fw;
+	bool has_tasks;
 
 	scoped_guard (rq_lock, rq) {
 		struct rq_flags *rf = &scope.rf;
@@ -1321,7 +1322,10 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 		if (!dl_se->dl_runtime)
 			return HRTIMER_NORESTART;
 
-		if (!dl_se->server_has_tasks(dl_se)) {
+		rq_unpin_lock(rq, rf);
+		has_tasks = dl_se->server_has_tasks(dl_se);
+		rq_repin_lock(rq, rf);
+		if (!has_tasks) {
 			replenish_dl_entity(dl_se);
 			dl_server_stopped(dl_se);
 			return HRTIMER_NORESTART;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 677ab9e8aa4..116fa0422b9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1,4 +1,3 @@
-#pragma GCC diagnostic ignored "-Wunused-function"
 // SPDX-License-Identifier: GPL-2.0
 /*
  * Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR
@@ -145,6 +144,11 @@ void init_tg_rt_entry(struct task_group *tg, struct rq *served_rq,
 
 static bool rt_server_has_tasks(struct sched_dl_entity *dl_se)
 {
+#ifdef CONFIG_SMP
+	struct rt_rq *rt_rq = &dl_se->my_q->rt;
+
+	group_pull_rt_task(rt_rq);
+#endif
 	return !!dl_se->my_q->rt.rt_nr_running;
 }
 
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 23/25] sched/deadline: Fix HCBS migrations on server stop
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (21 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 22/25] sched/deadline: Make rt-cgroup's servers pull tasks on timer replenishment Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 24/25] sched/core: Execute enqueued balance callbacks when changing allowed CPUs Yuri Andriaccio
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Do not unthrottle a non-fair-server in dl_server_stop(), since such a server
can end up being stopped while it is throttled (we try to migrate all the RT
tasks away from it).
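
A sketch of the intended behaviour (as the diff below suggests): when a group
server is throttled and all of its RT tasks have been migrated away,
dl_server_stop() is invoked on it; keeping dl_throttled set and leaving the
replenishment timer armed lets the server be replenished normally later, while
the fair-server keeps the previous cancel-and-unthrottle behaviour.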

Notes:
This is a temporary workaround, but it will be hopefully removed in
favor of less invasive code.

Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
---
 kernel/sched/deadline.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 021d7349897..d9ab209a492 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1723,9 +1723,11 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 		return;
 
 	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
-	hrtimer_try_to_cancel(&dl_se->dl_timer);
+	if (dl_se == &rq_of_dl_se(dl_se)->fair_server) {
+		hrtimer_try_to_cancel(&dl_se->dl_timer);
+		dl_se->dl_throttled = 0;
+	}
 	dl_se->dl_defer_armed = 0;
-	dl_se->dl_throttled = 0;
 	dl_se->dl_server_active = 0;
 }
 
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 24/25] sched/core: Execute enqueued balance callbacks when changing allowed CPUs
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (22 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 23/25] sched/deadline: Fix HCBS migrations on server stop Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-07-31 10:55 ` [RFC PATCH v2 25/25] sched/core: Execute enqueued balance callbacks when migrating task between cgroups Yuri Andriaccio
  2025-08-13 14:06 ` [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Juri Lelli
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

From: luca abeni <luca.abeni@santannapisa.it>

Execute balancing callbacks when setting the affinity of a task, since the HCBS
scheduler may request balancing of throttled dl_servers to fully utilize the
server's bandwidth.

Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 kernel/sched/core.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eb9de8c7b1f..c8763c46030 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2947,6 +2947,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) ||
 	    (task_current_donor(rq, p) && !task_current(rq, p))) {
 		struct task_struct *push_task = NULL;
+		struct balance_callback *head;
 
 		if ((flags & SCA_MIGRATE_ENABLE) &&
 		    (p->migration_flags & MDF_PUSH) && !rq->push_busy) {
@@ -2965,11 +2966,13 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 		}
 
 		preempt_disable();
+		head = splice_balance_callbacks(rq);
 		task_rq_unlock(rq, p, rf);
 		if (push_task) {
 			stop_one_cpu_nowait(rq->cpu, push_cpu_stop,
 					    p, &rq->push_work);
 		}
+		balance_callbacks(rq, head);
 		preempt_enable();
 
 		if (complete)
@@ -3024,6 +3027,8 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 	}
 
 	if (task_on_cpu(rq, p) || READ_ONCE(p->__state) == TASK_WAKING) {
+		struct balance_callback *head;
+
 		/*
 		 * MIGRATE_ENABLE gets here because 'p == current', but for
 		 * anything else we cannot do is_migration_disabled(), punt
@@ -3037,16 +3042,19 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 			p->migration_flags &= ~MDF_PUSH;
 
 		preempt_disable();
+		head = splice_balance_callbacks(rq);
 		task_rq_unlock(rq, p, rf);
 		if (!stop_pending) {
 			stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
 					    &pending->arg, &pending->stop_work);
 		}
+		balance_callbacks(rq, head);
 		preempt_enable();
 
 		if (flags & SCA_MIGRATE_ENABLE)
 			return 0;
 	} else {
+		struct balance_callback *head;
 
 		if (!is_migration_disabled(p)) {
 			if (task_on_rq_queued(p))
@@ -3057,7 +3065,12 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 				complete = true;
 			}
 		}
+
+		preempt_disable();
+		head = splice_balance_callbacks(rq);
 		task_rq_unlock(rq, p, rf);
+		balance_callbacks(rq, head);
+		preempt_enable();
 
 		if (complete)
 			complete_all(&pending->done);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 25/25] sched/core: Execute enqueued balance callbacks when migrating tasks between cgroups
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (23 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 24/25] sched/core: Execute enqueued balance callbacks when changing allowed CPUs Yuri Andriaccio
@ 2025-07-31 10:55 ` Yuri Andriaccio
  2025-08-13 14:06 ` [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Juri Lelli
  25 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-07-31 10:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Luca Abeni, Yuri Andriaccio

Execute balancing callbacks when migrating tasks between cgroups: as in the
previous patch, the HCBS scheduler may request balancing of throttled
dl_servers to fully utilize the server's bandwidth.

Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
 kernel/sched/core.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c8763c46030..65896f46e50 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9247,10 +9247,11 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 {
 	int queued, running, queue_flags =
 		DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+	struct balance_callback *head;
 	struct rq *rq;
+	struct rq_flags rf;
 
-	CLASS(task_rq_lock, rq_guard)(tsk);
-	rq = rq_guard.rq;
+	rq = task_rq_lock(tsk, &rf);
 
 	update_rq_clock(rq);
 
@@ -9277,6 +9278,12 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 		 */
 		resched_curr(rq);
 	}
+
+	preempt_disable();
+	head = splice_balance_callbacks(rq);
+	task_rq_unlock(rq, tsk, &rf);
+	balance_callbacks(rq, head);
+	preempt_enable();
 }
 
 static struct cgroup_subsys_state *
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 00/25]  Hierarchical Constant Bandwidth Server
  2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
                   ` (24 preceding siblings ...)
  2025-07-31 10:55 ` [RFC PATCH v2 25/25] sched/core: Execute enqueued balance callbacks when migrating tasks between cgroups Yuri Andriaccio
@ 2025-08-13 14:06 ` Juri Lelli
  2025-08-13 14:22   ` Yuri Andriaccio
  25 siblings, 1 reply; 38+ messages in thread
From: Juri Lelli @ 2025-08-13 14:06 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Yuri Andriaccio

Hi,

On 31/07/25 12:55, Yuri Andriaccio wrote:
> Hello,
> 
> This is the v2 for Hierarchical Constant Bandwidth Server, aiming at replacing
> the current RT_GROUP_SCHED mechanism with something more robust and
> theoretically sound. The patchset has been presented at OSPM25
> (https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
> be found at https://lwn.net/Articles/1021332/ . You can find the v1 of this
> patchset at the bottom of the page, which talks in more detail what this
> patchset is all about and how it is implemented.
> 
> The big update for this v2 version is the addition of migration code, which
> allows to migrate tasks between different CPUs (following of course affinity
> settings).
> 
> As requested, we've split the big patches in smaller chunks in order to improve
> in readability. Additionally, it has been rebased on the latest tip/master to
> keep up with the latest scheduler updates and new features of dl_servers.
> 
> Last but not least, the first patch, which has been presented separately at
> https://lore.kernel.org/all/20250725164412.35912-1-yurand2000@gmail.com/ , is
> necessary to fully utilize the deadline bandwidth while keeping the fair-servers
> active. You can refer to the aforementioned link for details. The issue
> presented in this patch also reflects in HCBS: in the current version of the
> kernel, by default, 5% of the realtime bandwidth is reserved for fair-servers,
> 5% is not usable, and only the remaining 90% could be used by deadline tasks, or
> in our case, by HCBS dl_servers. The first patch addresses this issue and allows
> to fully utilize the default 95% of bandwidth for rt-tasks/servers.
> 
> Summary of the patches:
>      1) Account fair-servers bw separately from other dl tasks and servers bw.
>    2-5) Preparation patches, so that the RT classes' code can be used both
>         for normal and cgroup scheduling.
>   6-15) Implementation of HCBS, no migration and only one level hierarchy.
>         The old RT_GROUP_SCHED code is removed.
>  16-18) Remove cgroups v1 in favour of v2.
>     19) Add support for deeper hierarchies.
>  20-25) Add support for tasks migration.
> 
> Updates from v1:
> - Rebase to tip/master.

Would you mind sharing the baseline sha this set applies to? It looks
like it doesn't apply cleanly anymore.

Thanks!
Juri


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 00/25]  Hierarchical Constant Bandwidth Server
  2025-08-13 14:06 ` [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Juri Lelli
@ 2025-08-13 14:22   ` Yuri Andriaccio
  0 siblings, 0 replies; 38+ messages in thread
From: Yuri Andriaccio @ 2025-08-13 14:22 UTC (permalink / raw)
  To: juri.lelli
  Cc: bsegall, dietmar.eggemann, linux-kernel, luca.abeni, mgorman,
	mingo, peterz, rostedt, vincent.guittot, vschneid,
	yuri.andriaccio

Hi,

> Would you mind sharing the baseline sha this set applies to? It looks
> like it doesn't apply cleanly anymore.

The patchset should apply cleanly on top of commit "Merge tag 'sysctl-6.17-rc1'
of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl", with sha
4b290aae788e06561754b28c6842e4080957d3f7.

Have a nice day,
Yuri

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 02/25] sched/deadline: Do not access dl_se->rq directly
  2025-07-31 10:55 ` [RFC PATCH v2 02/25] sched/deadline: Do not access dl_se->rq directly Yuri Andriaccio
@ 2025-08-14  8:30   ` Juri Lelli
  0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14  8:30 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Yuri Andriaccio

Hi,

On 31/07/25 12:55, Yuri Andriaccio wrote:

...

> @@ -1667,9 +1668,9 @@ void sched_init_dl_servers(void)
>  
>  		WARN_ON(dl_server(dl_se));
>  
> -		dl_server_apply_params(dl_se, runtime, period, 1);
> -
>  		dl_se->dl_server = 1;
> +		BUG_ON(dl_server_apply_params(dl_se, runtime, period, 1));

A WARN_ON(), possibly with a recovery strategy, is usually preferable.
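
For example, something along these lines (an untested sketch; assuming this
sits in the existing per-CPU loop, so a failure just skips that CPU's server):

	if (WARN_ON_ONCE(dl_server_apply_params(dl_se, runtime, period, 1)))
		continue;

	dl_se->dl_server = 1;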

Thanks,
Juri


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 04/25] sched/rt: Pass an rt_rq instead of an rq where needed
  2025-07-31 10:55 ` [RFC PATCH v2 04/25] sched/rt: Pass an rt_rq instead of an rq where needed Yuri Andriaccio
@ 2025-08-14  8:46   ` Juri Lelli
  0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14  8:46 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Yuri Andriaccio

Hi!

On 31/07/25 12:55, Yuri Andriaccio wrote:

...

> @@ -383,7 +383,7 @@ static void pull_rt_task(struct rq *);
>  
>  static inline void rt_queue_push_tasks(struct rq *rq)

Can't the above also take an rt_rq?

>  {
> -	if (!has_pushable_tasks(rq))
> +	if (!has_pushable_tasks(&rq->rt))
>  		return;
>  
>  	queue_balance_callback(rq, &per_cpu(rt_push_head, rq->cpu), push_rt_tasks);
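
For example (an untested sketch, assuming rq_of_rt_rq() is visible here after
the earlier patches in the series):

	static inline void rt_queue_push_tasks(struct rt_rq *rt_rq)
	{
		struct rq *rq = rq_of_rt_rq(rt_rq);

		/* Nothing to push from this rt_rq. */
		if (!has_pushable_tasks(rt_rq))
			return;

		queue_balance_callback(rq, &per_cpu(rt_push_head, rq->cpu), push_rt_tasks);
	}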

Thanks,
Juri


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 05/25] sched/rt: Move some functions from rt.c to sched.h
  2025-07-31 10:55 ` [RFC PATCH v2 05/25] sched/rt: Move some functions from rt.c to sched.h Yuri Andriaccio
@ 2025-08-14  8:50   ` Juri Lelli
  0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14  8:50 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Yuri Andriaccio

Hi!

On 31/07/25 12:55, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
> 
> Make the following functions non-static and move them to sched.h, so that
> they can also be used in other source files:
> - rt_task_of()
> - rq_of_rt_rq()
> - rt_rq_of_se()
> - rq_of_rt_se()
> 
> There are no functional changes. This is needed by future patches.
> 
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> ---
>  kernel/sched/rt.c    | 52 --------------------------------------------
>  kernel/sched/sched.h | 51 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 51 insertions(+), 52 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 945e3d705cc..3ea92b08a0e 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -168,34 +168,6 @@ static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
>  
>  #define rt_entity_is_task(rt_se) (!(rt_se)->my_q)
>  
> -static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
> -{
> -	WARN_ON_ONCE(!rt_entity_is_task(rt_se));
> -
> -	return container_of(rt_se, struct task_struct, rt);
> -}
> -
> -static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq)
> -{
> -	/* Cannot fold with non-CONFIG_RT_GROUP_SCHED version, layout */
> -	WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
> -	return rt_rq->rq;
> -}
> -
> -static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
> -{
> -	WARN_ON(!rt_group_sched_enabled() && rt_se->rt_rq->tg != &root_task_group);
> -	return rt_se->rt_rq;
> -}
> -
> -static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
> -{
> -	struct rt_rq *rt_rq = rt_se->rt_rq;
> -
> -	WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
> -	return rt_rq->rq;
> -}

Looks like we are losing WARN_ON checks with the change? Why is that?

...

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a8073d0824d..a1e6d2852ca 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3042,6 +3042,57 @@ extern void set_rq_offline(struct rq *rq);
>  
>  extern bool sched_smp_initialized;
>  
> +#ifdef CONFIG_RT_GROUP_SCHED
> +static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
> +{
> +#ifdef CONFIG_SCHED_DEBUG
> +	WARN_ON_ONCE(rt_se->my_q);
> +#endif

And gaining this one above. :)

Thanks,
Juri


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 07/25] sched/rt: Introduce HCBS specific structs in task_group
  2025-07-31 10:55 ` [RFC PATCH v2 07/25] sched/rt: Introduce HCBS specific structs in task_group Yuri Andriaccio
@ 2025-08-14 12:51   ` Juri Lelli
  0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 12:51 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Yuri Andriaccio

Hi!

On 31/07/25 12:55, Yuri Andriaccio wrote:

...

> @@ -467,9 +474,15 @@ struct task_group {
>  
>  #ifdef CONFIG_RT_GROUP_SCHED
>  	struct sched_rt_entity	**rt_se;
> +	/*
> +	 * The scheduling entities for the task group are managed as a single
> +	 * sched_dl_entity, each of them sharing the same dl_bandwidth.
> +	 */
> +	struct sched_dl_entity	**dl_se;

This is actually one sched_dl_entity per CPU, right? If that is the case,
the comment is a little misleading, I am afraid (it mentions a single
entity).
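
A possible rewording (suggestion only):

	/*
	 * One dl_server (sched_dl_entity) per CPU; all of the group's
	 * servers share the same dl_bandwidth.
	 */
	struct sched_dl_entity	**dl_se;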

Thanks,
Juri


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 08/25] sched/deadline: Account rt-cgroups bandwidth in deadline tasks schedulability tests.
  2025-07-31 10:55 ` [RFC PATCH v2 08/25] sched/deadline: Account rt-cgroups bandwidth in deadline tasks schedulability tests Yuri Andriaccio
@ 2025-08-14 13:03   ` Juri Lelli
  0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 13:03 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Yuri Andriaccio

Hi!

On 31/07/25 12:55, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
> 
> Account the rt-cgroup hierarchy's reserved bandwidth in the schedulability
> test of deadline entities. This mechanism allows completely reserving a portion
> of the rt-bandwidth for rt-cgroups even if they do not use all of it.
> 
> Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
> Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
> Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
> Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
> Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
> ---

I wonder if, from a series flow perspective, it would make sense to move
this one and the following patch after the changes that deal with group
initialization. Also, please consider a little bit of squashing.

Thanks,
Juri


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 11/25] sched/deadline: Add dl_init_tg
  2025-07-31 10:55 ` [RFC PATCH v2 11/25] sched/deadline: Add dl_init_tg Yuri Andriaccio
@ 2025-08-14 13:10   ` Juri Lelli
  0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 13:10 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Yuri Andriaccio

Hi!

On 31/07/25 12:55, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
> 
> This function is used to initialize and/or update an rt-cgroup's dl_server,
> also accounting for the allocated bandwidth.

Changelog wording like "This function is used to ..." / "This patch ..." is
usually frowned upon [1].

Thanks,
Juri

1 - https://elixir.bootlin.com/linux/v6.16/source/Documentation/process/submitting-patches.rst#L94


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 12/25] sched/rt: Add {alloc/free}_rt_sched_group and dl_server specific functions
  2025-07-31 10:55 ` [RFC PATCH v2 12/25] sched/rt: Add {alloc/free}_rt_sched_group and dl_server specific functions Yuri Andriaccio
@ 2025-08-14 13:42   ` Juri Lelli
  0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 13:42 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Yuri Andriaccio

Hi!

On 31/07/25 12:55, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
> 
> Add allocation and deallocation code for rt-cgroups. Add the rt dl_server-specific
> functions that pick the next eligible task to run.
> 
> Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
> Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
> Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
> Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
> Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
> ---
>  kernel/sched/rt.c | 107 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 104 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 38178003184..9c4ac6875a2 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -93,8 +93,39 @@ void unregister_rt_sched_group(struct task_group *tg)
>  
>  void free_rt_sched_group(struct task_group *tg)
>  {
> +	int i;
> +
>  	if (!rt_group_sched_enabled())
>  		return;
> +
> +	for_each_possible_cpu(i) {
> +		if (tg->dl_se) {
> +			unsigned long flags;
> +
> +			/*
> +			 * Since the dl timer is going to be cancelled,
> +			 * we risk to never decrease the running bw...
> +			 * Fix this issue by changing the group runtime
> +			 * to 0 immediately before freeing it.
> +			 */
> +			dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
> +			raw_spin_rq_lock_irqsave(cpu_rq(i), flags);
> +			BUG_ON(tg->rt_rq[i]->rt_nr_running);
> +			raw_spin_rq_unlock_irqrestore(cpu_rq(i), flags);

So here we always call dl_init_tg for cpu 0, is it correct?

Also, I wonder if the lock shouldn't cover both the init and the subsequent
check.
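
Something like the following, perhaps (untested, and assuming dl_init_tg()
may be called with the rq lock held, which needs checking):

	raw_spin_rq_lock_irqsave(cpu_rq(i), flags);
	dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
	BUG_ON(tg->rt_rq[i]->rt_nr_running);
	raw_spin_rq_unlock_irqrestore(cpu_rq(i), flags);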

> +
> +			hrtimer_cancel(&tg->dl_se[i]->dl_timer);
> +			kfree(tg->dl_se[i]);
> +		}
> +		if (tg->rt_rq) {
> +			struct rq *served_rq;
> +
> +			served_rq = container_of(tg->rt_rq[i], struct rq, rt);
> +			kfree(served_rq);
> +		}
> +	}
> +
> +	kfree(tg->rt_rq);
> +	kfree(tg->dl_se);
>  }
>  
>  void init_tg_rt_entry(struct task_group *tg, struct rq *served_rq,
> @@ -109,12 +140,77 @@ void init_tg_rt_entry(struct task_group *tg, struct rq *served_rq,
>  	tg->dl_se[cpu] = dl_se;
>  }
>  
> +static bool rt_server_has_tasks(struct sched_dl_entity *dl_se)
> +{
> +	return !!dl_se->my_q->rt.rt_nr_running;
> +}
> +
> +static struct task_struct *_pick_next_task_rt(struct rt_rq *rt_rq);
> +static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first);
> +static struct task_struct *rt_server_pick(struct sched_dl_entity *dl_se)
> +{
> +	struct rt_rq *rt_rq = &dl_se->my_q->rt;
> +	struct rq *rq = rq_of_rt_rq(rt_rq);
> +	struct task_struct *p;
> +
> +	if (dl_se->my_q->rt.rt_nr_running == 0)
> +		return NULL;
> +
> +	p = _pick_next_task_rt(rt_rq);
> +	set_next_task_rt(rq, p, true);
> +
> +	return p;
> +}
> +
>  int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
>  {
> +	struct rq *s_rq;
> +	struct sched_dl_entity *dl_se;
> +	int i;
> +
>  	if (!rt_group_sched_enabled())
>  		return 1;
>  
> +	tg->rt_rq = kcalloc(nr_cpu_ids, sizeof(struct rt_rq *), GFP_KERNEL);
> +	if (!tg->rt_rq)
> +		goto err;
> +	tg->dl_se = kcalloc(nr_cpu_ids, sizeof(dl_se), GFP_KERNEL);
> +	if (!tg->dl_se)
> +		goto err;

Don't we need to free the array allocated above if we fail here?
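
For example (a sketch; alternatively the err path could take care of it, if
it already frees the partially allocated group):

	tg->dl_se = kcalloc(nr_cpu_ids, sizeof(dl_se), GFP_KERNEL);
	if (!tg->dl_se) {
		kfree(tg->rt_rq);
		tg->rt_rq = NULL;
		goto err;
	}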

Thanks,
Juri


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 13/25] sched/rt: Add HCBS related checks and operations for rt tasks
  2025-07-31 10:55 ` [RFC PATCH v2 13/25] sched/rt: Add HCBS related checks and operations for rt tasks Yuri Andriaccio
@ 2025-08-14 14:01   ` Juri Lelli
  0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 14:01 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Yuri Andriaccio

Hi!

On 31/07/25 12:55, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
> 
> Add checks on whether a task belongs to the root cgroup or an rt-cgroup, since
> HCBS reuses the rt class's scheduler, and operate accordingly where needed.
> 
> Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
> Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
> Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
> Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
> Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
> ---
>  kernel/sched/core.c     |   3 +
>  kernel/sched/deadline.c |  16 ++++-
>  kernel/sched/rt.c       | 147 +++++++++++++++++++++++++++++++++++++---
>  kernel/sched/sched.h    |   6 +-
>  kernel/sched/syscalls.c |  13 ++++
>  5 files changed, 171 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3a69cb906c3..6173684a02b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2196,6 +2196,9 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	struct task_struct *donor = rq->donor;
>  
> +	if (is_dl_group(rt_rq_of_se(&p->rt)) && task_has_rt_policy(p))
> +		resched_curr(rq);

Why this unconditional resched for tasks in groups?

...

> @@ -715,6 +744,14 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>  	check_schedstat_required();
>  	update_stats_wait_start_rt(rt_rq_of_se(rt_se), rt_se);
>  
> +	/* Task arriving in an idle group of tasks. */
> +	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) &&
> +	    is_dl_group(rt_rq) && rt_rq->rt_nr_running == 0) {
> +		struct sched_dl_entity *dl_se = dl_group_of(rt_rq);
> +
> +		dl_server_start(dl_se);
> +	}
> +
>  	enqueue_rt_entity(rt_se, flags);

Is it OK to start the server before the first task is enqueued in an
idle group?
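
I.e., would an ordering like the following work instead (untested sketch)?

	enqueue_rt_entity(rt_se, flags);

	/* Task arrived in a previously idle group: start its server. */
	if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) &&
	    is_dl_group(rt_rq) && rt_rq->rt_nr_running == 1)
		dl_server_start(dl_group_of(rt_rq));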

...

> @@ -891,6 +936,34 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	struct task_struct *donor = rq->donor;
>  
> +	if (!rt_group_sched_enabled())
> +		goto no_group_sched;
> +

I think the below deserves a comment detailing the rules of preemption
(inside/outside groups, etc.).

> +	if (is_dl_group(rt_rq_of_se(&p->rt)) &&
> +	    is_dl_group(rt_rq_of_se(&rq->curr->rt))) {
> +		struct sched_dl_entity *dl_se, *curr_dl_se;
> +
> +		dl_se = dl_group_of(rt_rq_of_se(&p->rt));
> +		curr_dl_se = dl_group_of(rt_rq_of_se(&rq->curr->rt));
> +
> +		if (dl_entity_preempt(dl_se, curr_dl_se)) {
> +			resched_curr(rq);
> +			return;
> +		} else if (!dl_entity_preempt(curr_dl_se, dl_se)) {

Isn't this condition implied by the above?

...

> +#ifdef CONFIG_RT_GROUP_SCHED
> +static int group_push_rt_task(struct rt_rq *rt_rq)
> +{
> +	struct rq *rq = rq_of_rt_rq(rt_rq);
> +
> +	if (is_dl_group(rt_rq))
> +		return 0;
> +
> +	return push_rt_task(rq, false);
> +}
> +
> +static void group_push_rt_tasks(struct rt_rq *rt_rq)
> +{
> +	while (group_push_rt_task(rt_rq))
> +		;
> +}
> +#else
> +static void group_push_rt_tasks(struct rt_rq *rt_rq)
> +{
> +	push_rt_tasks(rq_of_rt_rq(rt_rq));
> +}
> +#endif
> +

Hummm, maybe too much for a single patch? I am a little lost at this
point. :))

Thanks,
Juri


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 18/25] sched/rt: Zero rt-cgroups default bandwidth
  2025-07-31 10:55 ` [RFC PATCH v2 18/25] sched/rt: Zero rt-cgroups default bandwidth Yuri Andriaccio
@ 2025-08-14 14:13   ` Juri Lelli
  0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 14:13 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Yuri Andriaccio

Hi!

On 31/07/25 12:55, Yuri Andriaccio wrote:
> Set the default rt-cgroups runtime to zero, otherwise a cgroup-v1 kernel will
> not be able to start SCHED_DEADLINE tasks. The bandwidth for rt-cgroups must

Well, we disabled v1 support at this point with the previous patch,
didn't we? :)

> then be manually assigned after the kernel boots.
> 
> Allow zeroing the runtime of the root control group. This runtime only affects
> the available bandwidth of the rt-cgroup hierarchy but not the SCHED_FIFO /
> SCHED_RR tasks on the global runqueue.
> 
> Notes:
> Disabling the root control group bandwidth should not cause any side effect, as
> SCHED_FIFO / SCHED_RR tasks do not depend on it since the introduction of
> fair_servers.

I believe this deserves proper documentation.

Thanks,
Juri


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 19/25] sched/deadline: Allow deeper hierarchies of RT cgroups
  2025-07-31 10:55 ` [RFC PATCH v2 19/25] sched/deadline: Allow deeper hierarchies of RT cgroups Yuri Andriaccio
@ 2025-08-14 14:29   ` Juri Lelli
  0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2025-08-14 14:29 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Yuri Andriaccio

Hi!

On 31/07/25 12:55, Yuri Andriaccio wrote:

...

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 7b131630743..e263abcdc04 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -406,11 +406,42 @@ int dl_check_tg(unsigned long total)
>  	return 1;
>  }
>  
> -void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
> +static inline bool is_active_sched_group(struct task_group *tg)
>  {
> +	struct task_group *child;
> +	bool is_active = 1;
> +
> +	// if there are no children, this is a leaf group, thus it is active
> +	list_for_each_entry_rcu(child, &tg->children, siblings) {
> +		if (child->dl_bandwidth.dl_runtime > 0) {
> +			is_active = 0;
> +		}
> +	}
> +	return is_active;
> +}
> +
> +static inline bool sched_group_has_active_siblings(struct task_group *tg)
> +{
> +	struct task_group *child;
> +	bool has_active_siblings = 0;
> +
> +	// if there are no children, this is a leaf group, thus it is active
> +	list_for_each_entry_rcu(child, &tg->parent->children, siblings) {
> +		if (child != tg && child->dl_bandwidth.dl_runtime > 0) {
> +			has_active_siblings = 1;
> +		}
> +	}
> +	return has_active_siblings;
> +}

...

> @@ -2216,7 +2228,14 @@ int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
>  	if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0)
>  		return 0;
>  
> -	return 1;
> +	/* If one of the children has runtime > 0, cannot attach RT tasks! */
> +	list_for_each_entry_rcu(child, &tg->children, siblings) {
> +		if (child->dl_bandwidth.dl_runtime) {
> +			can_attach = 0;
> +		}
> +	}

Can we maybe reuse some of the methods above?
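
For instance (sketch, assuming is_active_sched_group() can be made visible
here):

	/* If one of the children has runtime > 0, cannot attach RT tasks! */
	if (!is_active_sched_group(tg))
		return 0;

	return 1;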

Thanks,
Juri


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2025-08-14 14:30 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-31 10:55 [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 01/25] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 02/25] sched/deadline: Do not access dl_se->rq directly Yuri Andriaccio
2025-08-14  8:30   ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 03/25] sched/deadline: Distinct between dl_rq and my_q Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 04/25] sched/rt: Pass an rt_rq instead of an rq where needed Yuri Andriaccio
2025-08-14  8:46   ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 05/25] sched/rt: Move some functions from rt.c to sched.h Yuri Andriaccio
2025-08-14  8:50   ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 06/25] sched/rt: Disable RT_GROUP_SCHED Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 07/25] sched/rt: Introduce HCBS specific structs in task_group Yuri Andriaccio
2025-08-14 12:51   ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 08/25] sched/deadline: Account rt-cgroups bandwidth in deadline tasks schedulability tests Yuri Andriaccio
2025-08-14 13:03   ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 09/25] sched/deadline: Account rt-cgroups bandwidth in sched_dl_global_validate Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 10/25] sched/core: Initialize root_task_group Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 11/25] sched/deadline: Add dl_init_tg Yuri Andriaccio
2025-08-14 13:10   ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 12/25] sched/rt: Add {alloc/free}_rt_sched_group and dl_server specific functions Yuri Andriaccio
2025-08-14 13:42   ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 13/25] sched/rt: Add HCBS related checks and operations for rt tasks Yuri Andriaccio
2025-08-14 14:01   ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 14/25] sched/rt: Update rt-cgroup schedulability checks Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 15/25] sched/rt: Remove old RT_GROUP_SCHED data structures Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 16/25] sched/core: Cgroup v2 support Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 17/25] sched/rt: Remove support for cgroups-v1 Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 18/25] sched/rt: Zero rt-cgroups default bandwidth Yuri Andriaccio
2025-08-14 14:13   ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 19/25] sched/deadline: Allow deeper hierarchies of RT cgroups Yuri Andriaccio
2025-08-14 14:29   ` Juri Lelli
2025-07-31 10:55 ` [RFC PATCH v2 20/25] sched/rt: Add rt-cgroup migration Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 21/25] sched/rt: add HCBS migration related checks and function calls Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 22/25] sched/deadline: Make rt-cgroup's servers pull tasks on timer replenishment Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 23/25] sched/deadline: Fix HCBS migrations on server stop Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 24/25] sched/core: Execute enqueued balance callbacks when changing allowed CPUs Yuri Andriaccio
2025-07-31 10:55 ` [RFC PATCH v2 25/25] sched/core: Execute enqueued balance callbacks when migrating tasks between cgroups Yuri Andriaccio
2025-08-13 14:06 ` [RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server Juri Lelli
2025-08-13 14:22   ` Yuri Andriaccio

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).