* Re: [RFC PATCH v5 13/29] sched/rt: Implement dl-server operations for rt-cgroups
[not found] ` <20260430213835.62217-14-yurand2000@gmail.com>
@ 2026-05-05 13:04 ` Peter Zijlstra
0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-05-05 13:04 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
On Thu, Apr 30, 2026 at 11:38:17PM +0200, Yuri Andriaccio wrote:
> +static struct sched_rt_entity *pick_next_rt_entity(struct rt_rq *rt_rq);
> +static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first);
> +
> static struct task_struct *rt_server_pick(struct sched_dl_entity *dl_se, struct rq_flags *rf)
> {
> - return NULL;
> + struct rt_rq *rt_rq = &dl_se->my_q->rt;
> + struct rq *rq = rq_of_rt_rq(rt_rq);
> + struct task_struct *p;
> +
> + if (!sched_rt_runnable(dl_se->my_q))
> + return NULL;
> +
> + p = rt_task_of(pick_next_rt_entity(rt_rq));
> + set_next_task_rt(rq, p, true);
> +
> + return p;
> }
set_next_task_rt() should not be needed at this point. There is only a
single ->pick_next_task() implementation left, and that will soon go
away too. All ->pick_task() methods are idempotent.
[ https://lore.kernel.org/r/20260317104343.225156112@infradead.org ]
Notably, see core.c:__pick_next_task(); it will take care of
set_next_task() through put_prev_set_next_task() after calling
class->pick_task().
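With that, the pick could reduce to something like the following sketch
(kernel-style pseudocode against this series, not compile-tested):

```c
static struct task_struct *rt_server_pick(struct sched_dl_entity *dl_se, struct rq_flags *rf)
{
	struct rt_rq *rt_rq = &dl_se->my_q->rt;

	if (!sched_rt_runnable(dl_se->my_q))
		return NULL;

	/*
	 * Idempotent pick: no rq state changes here; core.c's
	 * __pick_next_task() does put_prev_set_next_task() after
	 * calling class->pick_task().
	 */
	return rt_task_of(pick_next_rt_entity(rt_rq));
}
```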
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 14/29] sched/rt: Update task event callbacks for HCBS scheduling
[not found] ` <20260430213835.62217-15-yurand2000@gmail.com>
@ 2026-05-05 13:16 ` Peter Zijlstra
0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-05-05 13:16 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
On Thu, Apr 30, 2026 at 11:38:18PM +0200, Yuri Andriaccio wrote:
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index defb812b0e48..67fbf4bbe461 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -975,7 +975,58 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
> {
> struct task_struct *donor = rq->donor;
> + struct sched_dl_entity *woken_dl_se = NULL;
> + struct sched_dl_entity *donor_dl_se = NULL;
> +
> + if (!rt_group_sched_enabled())
> + goto no_group_sched;
>
> + /*
> + * Preemption checks are different if the waking task and the current task
> + * are running on the global runqueue or in a cgroup. The following rules
> + * apply:
> + * - dl-tasks (and equally dl_servers) always preempt FIFO/RR tasks.
> + * - if curr is a FIFO/RR task inside a cgroup (i.e. run by a
> + * dl_server), or curr is a DEADLINE task and waking is a FIFO/RR task
> + * on the root cgroup, do nothing.
> + * - if waking is inside a cgroup but curr is a FIFO/RR task in the root
> + * cgroup, always reschedule.
> + * - if they are both on the global runqueue, run the standard code.
> + * - if they are both in the same cgroup, check for tasks priorities.
> + * - if they are both in a cgroup, but not the same one, check whether the
> + * woken task's dl_server preempts the current's dl_server.
> + * - if curr is a DEADLINE task and waking is in a cgroup, check whether
> + * the woken task's server preempts curr.
> + */
> + if (is_dl_group(rt_rq_of_se(&p->rt)))
> + woken_dl_se = dl_group_of(rt_rq_of_se(&p->rt));
> + if (is_dl_group(rt_rq_of_se(&donor->rt)))
> + donor_dl_se = dl_group_of(rt_rq_of_se(&donor->rt));
> + else if (task_has_dl_policy(donor))
> + donor_dl_se = &donor->dl;
> +
> + if (woken_dl_se != NULL && donor_dl_se != NULL) {
> + if (woken_dl_se == donor_dl_se) {
> + if (p->prio < donor->prio)
> + resched_curr(rq);
> +
> + return;
This is effectively the traditional test; why not goto no_group_sched at
this point and share that code rather than duplicating it?
> + }
> +
> + if (dl_entity_preempt(woken_dl_se, donor_dl_se))
> + resched_curr(rq);
> +
> + return;
> +
> + } else if (woken_dl_se != NULL) {
> + resched_curr(rq);
> + return;
> +
> + } else if (donor_dl_se != NULL) {
> + return;
> + }
> +
> +no_group_sched:
> /*
> * XXX If we're preempted by DL, queue a push?
> */
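For illustration, the decision table documented in the comment above can
be modeled as a small standalone function (the types and names are
hypothetical, not the kernel's; dl_entity_preempt() is stood in by a
plain earlier-deadline-wins check):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Standalone model of the rules above. Each side is either a FIFO/RR
 * task inside a cgroup (identified by a non-zero dl_server id) or a
 * task on the global runqueue (server == 0), where the donor may be a
 * SCHED_DEADLINE task. Lower prio value means more urgent, as in the
 * kernel.
 */
struct side {
	int server;	/* dl_server id; 0 = global runqueue */
	bool is_dl;	/* donor is a SCHED_DEADLINE task (global only) */
	int prio;	/* FIFO/RR priority, lower = more urgent */
};

/* Hypothetical stand-in for dl_entity_preempt(): earlier deadline wins. */
static bool server_preempts(int a_deadline, int b_deadline)
{
	return a_deadline < b_deadline;
}

/* Return true when wakeup_preempt_rt() should call resched_curr(). */
static bool should_resched(struct side woken, int woken_deadline,
			   struct side donor, int donor_deadline)
{
	bool woken_grouped = woken.server != 0;
	bool donor_grouped = donor.server != 0 || donor.is_dl;

	if (woken_grouped && donor_grouped) {
		/* Same cgroup: this is exactly the traditional prio test. */
		if (woken.server == donor.server)
			return woken.prio < donor.prio;
		/* Different servers (or donor is a DL task): EDF between them. */
		return server_preempts(woken_deadline, donor_deadline);
	}
	if (woken_grouped)	/* woken's dl-server vs plain FIFO/RR: preempt */
		return true;
	if (donor_grouped)	/* FIFO/RR on root vs DL/server donor: do nothing */
		return false;
	return woken.prio < donor.prio;	/* both global: standard path */
}
```

Note that the same-server branch is exactly the traditional priority
test, which is what makes sharing the no_group_sched path attractive.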
* Re: [RFC PATCH v5 15/29] sched/rt: Update rt-cgroup schedulability checks
[not found] ` <20260430213835.62217-16-yurand2000@gmail.com>
@ 2026-05-05 14:36 ` Peter Zijlstra
0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-05-05 14:36 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
On Thu, Apr 30, 2026 at 11:38:19PM +0200, Yuri Andriaccio wrote:
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -343,7 +343,39 @@ void cancel_inactive_timer(struct sched_dl_entity *dl_se)
> cancel_dl_timer(dl_se, &dl_se->inactive_timer);
> }
>
> +/*
> + * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
> + * sched_domains_mutex.
Please try very hard to express locking constraints in code, rather than
comments. Compilers are very bad at verifying comments ;-)
> + */
> +u64 dl_cookie;
> +
> #ifdef CONFIG_RT_GROUP_SCHED
> +int dl_check_tg(unsigned long total)
> +{
> + int which_cpu;
> + int cap;
> + struct dl_bw *dl_b;
> + u64 gen = ++dl_cookie;
This probably wants to be something like:

	lockdep_assert_held(&sched_domains_mutex);

or something along those lines?
And if it really is sched_domains_mutex _AND_ sched_rt_handler()::mutex,
it might make sense to pull that mutex out of that function to give it
global visibility so we can test for it here.
For bonus points, you'll use __guarded_by() from the context analysis
bits; you'll need to add:

	CONTEXT_ANALYSIS_deadline.o := y

to kernel/sched/Makefile and build the tree with clang-22 or later
(although we'll be raising this to -23 soonish).
> +
> + for_each_possible_cpu(which_cpu) {
> + guard(rcu_sched)();
> +
> + if (!dl_bw_visited(which_cpu, gen)) {
> + cap = dl_bw_capacity(which_cpu);
> + dl_b = dl_bw_of(which_cpu);
> +
> + guard(raw_spinlock_irqsave)(&dl_b->lock);
> +
> + if (dl_b->bw != -1 &&
> + cap_scale(dl_b->bw, cap) < dl_b->total_bw + cap_scale(total, cap))
> + return 0;
> + }
> +
> + }
> +
> + return 1;
> +}
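The per-CPU admission arithmetic above can be modeled standalone
(SCHED_CAPACITY_SHIFT and the -1 "no limit" sentinel follow the kernel's
conventions; the function name and flat parameters are assumptions for
illustration):

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10	/* capacity 1024 == a full CPU */

/* cap_scale() as in the kernel: scale a bandwidth by a CPU capacity. */
static uint64_t cap_scale(uint64_t bw, unsigned long cap)
{
	return (bw * cap) >> SCHED_CAPACITY_SHIFT;
}

/*
 * Standalone model of the per-CPU test in dl_check_tg(): admitting an
 * extra group bandwidth `total` must keep the allocated bandwidth under
 * the capacity-scaled dl_bw limit; -1 means "no limit". Returns 1 to
 * admit, 0 to reject, mirroring the patch's convention.
 */
static int dl_bw_admit(int64_t bw_limit, uint64_t total_bw,
		       unsigned long cap, uint64_t total)
{
	if (bw_limit != -1 &&
	    cap_scale((uint64_t)bw_limit, cap) < total_bw + cap_scale(total, cap))
		return 0;
	return 1;
}
```

With cap = 1024 (full capacity), a limit of 1000 admits 400 extra units
on top of 500 already allocated but rejects 600; halving the capacity
halves the budget.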
* Re: [RFC PATCH v5 18/29] sched/core: Cgroup v2 support
[not found] ` <20260430213835.62217-19-yurand2000@gmail.com>
@ 2026-05-05 14:59 ` Peter Zijlstra
2026-05-06 19:58 ` luca abeni
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2026-05-05 14:59 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
On Thu, Apr 30, 2026 at 11:38:22PM +0200, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
>
> Make rt_runtime_us and rt_period_us virtual files accessible also to the
> cgroup v2 controller, effectively enabling the RT_GROUP_SCHED mechanism for
> cgroups v2.
Can we have a blurb about why only strict periodic servers; e.g. why no
sporadic? and such...
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> ---
> kernel/sched/core.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0c7032d254ba..3ffe3ac5071d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10245,6 +10245,18 @@ static struct cftype cpu_files[] = {
> .write = cpu_uclamp_max_write,
> },
> #endif /* CONFIG_UCLAMP_TASK_GROUP */
> +#ifdef CONFIG_RT_GROUP_SCHED
> + {
> + .name = "rt_runtime_us",
> + .read_s64 = cpu_rt_runtime_read,
> + .write_s64 = cpu_rt_runtime_write,
> + },
> + {
> + .name = "rt_period_us",
> + .read_u64 = cpu_rt_period_read_uint,
> + .write_u64 = cpu_rt_period_write_uint,
> + },
> +#endif /* CONFIG_RT_GROUP_SCHED */
> { } /* terminate */
> };
>
> --
> 2.53.0
>
* Re: [RFC PATCH v5 19/29] sched/rt: Remove support for cgroups-v1
[not found] ` <20260430213835.62217-20-yurand2000@gmail.com>
@ 2026-05-05 15:01 ` Peter Zijlstra
2026-05-07 15:35 ` Juri Lelli
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2026-05-05 15:01 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
On Thu, Apr 30, 2026 at 11:38:23PM +0200, Yuri Andriaccio wrote:
> Disable control files for cgroups-v1, and allow only cgroups-v2.
> This should simplify maintaining the code, since cgroups-v1 are deprecated.
So while I love seeing all this code go away, I very much doubt we can
pull this off. People might actually be using this.
I think at best we can hide the whole cgroup-v1 thing behind a CONFIG
and eventually remove it once no distro is left using it, or something
like that :/
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
[not found] ` <20260430213835.62217-21-yurand2000@gmail.com>
@ 2026-05-05 15:15 ` Peter Zijlstra
2026-05-05 19:56 ` Tejun Heo
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2026-05-05 15:15 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio, tj, hannes, mkoutny,
cgroups
On Thu, Apr 30, 2026 at 11:38:24PM +0200, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
>
> Allow for cgroup hierarchies with more than two levels.
>
> Introduce the concept of live and active groups:
> - A group is live if it is a leaf group or if all its children have zero
> runtime.
> - A live group with non-zero runtime can be used to schedule tasks.
> - An active cgroup is a live group with running tasks.
> - A non-live group cannot be used to run tasks, but it is only used for
> bandwidth accounting, i.e. the sum of its children bandwidth must be
> less than or equal to the bandwidth of the parent. This change allows
> using cgroups for bandwidth management for different users.
> - While the root cgroup specifies the total allocatable bandwidth of rt
> cgroups, a further accounting is performed to keep track of the live
> bandwidth, i.e. the sum of the bandwidth of live groups. The hierarchy
> invariant states that the live bandwidth must always be less than or
> equal to the total allocatable bw.
>
> Add is_live_sched_group() and sched_group_has_live_siblings() in
> deadline.c. These utility functions are used by dl_init_tg to perform
> updates only when necessary:
> - Only live groups may update the active dl bandwidth of dl entities
> (call to dl_rq_change_utilization), while non-live groups must not use
> servers, and thus must not change the active dl bandwidth.
> - The total bandwidth accounting must be changed to follow the
> live/non-live rules:
> - When disabling (runtime zero) the last child of a group, the parent
> becomes a live group, and so the parent's bw must be accounted back.
> - When enabling (runtime non-zero) the first child, the parent becomes a
> non-live group, and so the parent's bandwidth must be removed.
>
> Update tg_set_rt_bandwidth() to change the runtime of a group to a
> non-zero value only if its parent is inactive, thus forcing it to become
> non-live if it was precedently (it would've already been non-live if a
> sibling cgroup was live). An exception is made for groups which have the
> root cgroup as parent.
>
> Update sched_rt_can_attach() to allow attaching only on live groups.
>
> Update dl_init_tg() to take a task_group pointer and a cpu's id rather
> than passing directly the pointer to the cpu's deadline server. The
> task_group pointer is necessary to check and update the live bandwidth
> accounting.
>
> Co-developed-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
This probably wants to have the cgroup folks on Cc (added now) to make
sure the semantics are in line with cgroup-v2 expectations.
* Re: [RFC PATCH v5 22/29] sched/rt: Add rt-cgroup migration functions
[not found] ` <20260430213835.62217-23-yurand2000@gmail.com>
@ 2026-05-05 15:20 ` Peter Zijlstra
2026-05-05 15:24 ` Peter Zijlstra
1 sibling, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-05-05 15:20 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
On Thu, Apr 30, 2026 at 11:38:26PM +0200, Yuri Andriaccio wrote:
> +static int group_find_lowest_rt_rq(struct task_struct *task, struct rt_rq *task_rt_rq)
> +{
> + struct sched_domain *sd;
> + struct cpumask lowest_mask;
> + struct sched_dl_entity *dl_se;
> + struct rt_rq *rt_rq;
> + int prio, lowest_prio;
> + int cpu, this_cpu = smp_processor_id();
> +
> + if (task->nr_cpus_allowed == 1)
> + return -1; /* No other targets possible */
> +
> + lowest_prio = task->prio - 1;
> + cpumask_clear(&lowest_mask);
> + for_each_cpu_and(cpu, cpu_online_mask, task->cpus_ptr) {
> + dl_se = task_rt_rq->tg->dl_se[cpu];
> + rt_rq = &dl_se->my_q->rt;
> + prio = rt_rq->highest_prio.curr;
> +
> + /*
> + * If we're on asym system ensure we consider the different capacities
> + * of the CPUs when searching for the lowest_mask.
> + */
> + if (dl_se->dl_throttled || !rt_task_fits_capacity(task, cpu))
> + continue;
> +
> + if (prio >= lowest_prio) {
> + if (prio > lowest_prio) {
> + cpumask_clear(&lowest_mask);
> + lowest_prio = prio;
> + }
> +
> + cpumask_set_cpu(cpu, &lowest_mask);
> + }
> + }
> +
> + if (cpumask_empty(&lowest_mask))
> + return -1;
> +
> + /*
> + * At this point we have built a mask of CPUs representing the
> + * lowest priority tasks in the system. Now we want to elect
> + * the best one based on our affinity and topology.
> + *
> + * We prioritize the last CPU that the task executed on since
> + * it is most likely cache-hot in that location.
> + */
> + cpu = task_cpu(task);
> + if (cpumask_test_cpu(cpu, &lowest_mask))
> + return cpu;
> +
> + /*
> + * Otherwise, we consult the sched_domains span maps to figure
> + * out which CPU is logically closest to our hot cache data.
> + */
> + if (!cpumask_test_cpu(this_cpu, &lowest_mask))
> + this_cpu = -1; /* Skip this_cpu opt if not among lowest */
> +
> + scoped_guard(rcu) {
> + for_each_domain(cpu, sd) {
> + if (sd->flags & SD_WAKE_AFFINE) {
> + int best_cpu;
> +
> + /*
> + * "this_cpu" is cheaper to preempt than a
> + * remote processor.
> + */
> + if (this_cpu != -1 &&
> + cpumask_test_cpu(this_cpu, sched_domain_span(sd)))
> + return this_cpu;
> +
> + best_cpu = cpumask_any_and_distribute(&lowest_mask,
> + sched_domain_span(sd));
> + if (best_cpu < nr_cpu_ids)
> + return best_cpu;
> + }
> + }
> + }
I appreciate you trying to save on indent, but this does violate
coding-style; please indent as normal.
> +
> + /*
> + * And finally, if there were no matches within the domains
> + * just give the caller *something* to work with from the compatible
> + * locations.
> + */
> + if (this_cpu != -1)
> + return this_cpu;
> +
> + cpu = cpumask_any_distribute(&lowest_mask);
> + if (cpu < nr_cpu_ids)
> + return cpu;
> +
> + return -1;
> +}
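The mask-building half of this function can be modeled standalone
(plain arrays stand in for cpumasks and for the per-CPU rt_rq state;
purely illustrative):

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU	4

/*
 * Standalone model of the mask-building loop in
 * group_find_lowest_rt_rq(): rq_prio[cpu] plays the role of
 * rt_rq->highest_prio.curr (higher number = less urgent) and
 * throttled[cpu] the role of dl_se->dl_throttled. Fills mask[] with
 * the CPUs whose group runqueue runs the least urgent work the waking
 * task (priority task_prio) can target, and returns how many there are.
 */
static int build_lowest_mask(const int rq_prio[NCPU],
			     const bool throttled[NCPU],
			     int task_prio, bool mask[NCPU])
{
	int lowest_prio = task_prio - 1;
	int count = 0;

	for (int cpu = 0; cpu < NCPU; cpu++)
		mask[cpu] = false;

	for (int cpu = 0; cpu < NCPU; cpu++) {
		if (throttled[cpu])
			continue;

		if (rq_prio[cpu] >= lowest_prio) {
			if (rq_prio[cpu] > lowest_prio) {
				/* Found a less urgent level: restart the mask. */
				for (int i = 0; i < NCPU; i++)
					mask[i] = false;
				count = 0;
				lowest_prio = rq_prio[cpu];
			}
			mask[cpu] = true;
			count++;
		}
	}
	return count;
}
```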
* Re: [RFC PATCH v5 22/29] sched/rt: Add rt-cgroup migration functions
[not found] ` <20260430213835.62217-23-yurand2000@gmail.com>
2026-05-05 15:20 ` [RFC PATCH v5 22/29] sched/rt: Add rt-cgroup migration functions Peter Zijlstra
@ 2026-05-05 15:24 ` Peter Zijlstra
1 sibling, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-05-05 15:24 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
On Thu, Apr 30, 2026 at 11:38:26PM +0200, Yuri Andriaccio wrote:
> From: luca abeni <luca.abeni@santannapisa.it>
>
> Add migration related functions:
>
> - group_find_lowest_rt_rq
> - group_find_lock_lowest_rt_rq
> Find (and lock) the lowest priority non-root runqueue where to migrate
> a given task.
>
> - group_pull_rt_task
> Try to pull a task onto the given non-root runqueue.
>
> - group_push_rt_task
> - group_push_rt_tasks
> Try to push tasks from the given non-root runqueue.
>
> - group_pull_rt_task_callback
> - group_push_rt_tasks_callback
> - rt_queue_push_from_group
> - rt_queue_pull_to_group
> Deferred execution of push and pull functions at balancing points.
>
> Update struct rq to include fields for deferred balancing of cgroup runqueues.
>
> ---
>
> The functions are only implemented here, to be hooked up later in the patchset.
These functions duplicate a ton of existing logic; is there really no
way to share?
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-05 15:15 ` [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups Peter Zijlstra
@ 2026-05-05 19:56 ` Tejun Heo
2026-05-07 10:53 ` Peter Zijlstra
2026-05-07 14:30 ` luca abeni
0 siblings, 2 replies; 29+ messages in thread
From: Tejun Heo @ 2026-05-05 19:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hello,
Some high level comments:
- Please align it with existing cgroup2 interface files. See cpu.max. This
can be e.g. cpu.rt.max with about the same semantics.
- cgroup2 enforces that internal cgroups w/ controllers enabled cannot have
threads in them. No need to enforce that separately.
- However, the cpu controller is a threaded controller which means that it
can have threaded sub-hierarchy where the no-internal-process rule doesn't
apply. This was created explicitly for cpu controller. The proposed change
blocks it effectively forcing cpu controller into regular domain
controller behavior subject to no-internal-process rule. Note these are
enforced at controller granularity and this means that users who use the
threaded mode will be forced to pick between the two.
- This has the same problem with cgroup1's rt cgroup sched support where
there is no way to have a permissive default configuration, which means
that users who don't really care about distributing rt shares
hierarchically would get blocked from running rt processes by default,
which basically forces distros to disable rt cgroup sched support. This is
not new but it'd be a shame to put in all the work and the end result is
that most people don't even have access to the feature.
Here's my suggestion if there is desire for this to become something most
people have easy access to:
- Don't make it impossible to use in conjunction with other resource control
mechanisms especially not CPU controller itself. Don't force people to
choose between threaded mode and rt control. Allow them to co-exist in a
reasonable manner.
- The same in the wider scope. Don't let it get in the way of people who
don't care about it. Compromising on interface / failure mode is better
than people not being able to use it in most cases.
Thanks.
--
tejun
* Re: [RFC PATCH v5 18/29] sched/core: Cgroup v2 support
2026-05-05 14:59 ` [RFC PATCH v5 18/29] sched/core: Cgroup v2 support Peter Zijlstra
@ 2026-05-06 19:58 ` luca abeni
2026-05-07 7:01 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: luca abeni @ 2026-05-06 19:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Yuri Andriaccio
Hi Peter,
On Tue, 5 May 2026 16:59:22 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Apr 30, 2026 at 11:38:22PM +0200, Yuri Andriaccio wrote:
> > From: luca abeni <luca.abeni@santannapisa.it>
> >
> > Make rt_runtime_us and rt_period_us virtual files accessible also
> > to the cgroup v2 controller, effectively enabling the
> > RT_GROUP_SCHED mechanism for cgroups v2.
>
> > Can we have a blurb about why only strict periodic servers; e.g. why no
> > sporadic? and such...
Maybe I am misunderstanding your question, anyway: the file is called
"rt_runtime_us", but the scheduling algorithm used to schedule the
cgroup is SCHED_DEADLINE.
So, we do not use a strictly periodic server, but a CBS, that can also
support sporadic / non-periodic activations.
Luca
>
> > Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
> > Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> > ---
> > kernel/sched/core.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0c7032d254ba..3ffe3ac5071d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -10245,6 +10245,18 @@ static struct cftype cpu_files[] = {
> > .write = cpu_uclamp_max_write,
> > },
> > #endif /* CONFIG_UCLAMP_TASK_GROUP */
> > +#ifdef CONFIG_RT_GROUP_SCHED
> > + {
> > + .name = "rt_runtime_us",
> > + .read_s64 = cpu_rt_runtime_read,
> > + .write_s64 = cpu_rt_runtime_write,
> > + },
> > + {
> > + .name = "rt_period_us",
> > + .read_u64 = cpu_rt_period_read_uint,
> > + .write_u64 = cpu_rt_period_write_uint,
> > + },
> > +#endif /* CONFIG_RT_GROUP_SCHED */
> > { } /* terminate */
> > };
> >
> > --
> > 2.53.0
> >
* Re: [RFC PATCH v5 18/29] sched/core: Cgroup v2 support
2026-05-06 19:58 ` luca abeni
@ 2026-05-07 7:01 ` Peter Zijlstra
2026-05-07 13:30 ` luca abeni
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2026-05-07 7:01 UTC (permalink / raw)
To: luca abeni
Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Yuri Andriaccio
On Wed, May 06, 2026 at 09:58:02PM +0200, luca abeni wrote:
> Hi Peter,
>
> On Tue, 5 May 2026 16:59:22 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Thu, Apr 30, 2026 at 11:38:22PM +0200, Yuri Andriaccio wrote:
> > > From: luca abeni <luca.abeni@santannapisa.it>
> > >
> > > Make rt_runtime_us and rt_period_us virtual files accessible also
> > > to the cgroup v2 controller, effectively enabling the
> > > RT_GROUP_SCHED mechanism for cgroups v2.
> >
> > > Can we have a blurb about why only strict periodic servers; e.g. why no
> > > sporadic? and such...
>
> Maybe I am misunderstanding your question, anyway: the file is called
> "rt_runtime_us", but the scheduling algorithm used to schedule the
> cgroup is SCHED_DEADLINE.
> So, we do not use a strictly periodic server, but a CBS, that can also
> support sporadic / non-periodic activations.
The interface only exposes runtime and period; as such we can only
configure strict periodic servers (with implicit deadline). And I'm
thinking this makes sense, esp. to start off with, but I also think it
makes sense to explicitly call that out.
State that this does not allow configuring sporadic servers, and
hand-wave a reason for why not.
Or, if we struggle to justify it, perhaps add deadline, dunno.
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-05 19:56 ` Tejun Heo
@ 2026-05-07 10:53 ` Peter Zijlstra
2026-05-07 15:03 ` Juri Lelli
` (3 more replies)
2026-05-07 14:30 ` luca abeni
1 sibling, 4 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-05-07 10:53 UTC (permalink / raw)
To: Tejun Heo
Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
> Hello,
>
> Some high level comments:
>
> - Please align it with existing cgroup2 interface files. See cpu.max. This
> can be e.g. cpu.rt.max with about the same semantics.
>
> - cgroup2 enforces that internal cgroups w/ controllers enabled cannot have
> threads in them. No need to enforce that separately.
Looking at cpu_period_quota_parse() this thing takes two u64 values for:
{runtime, period} but allows runtime to be the string "max".
I think we'd want an optional extension to that and allow 3 values for:
{runtime, period, deadline}, where if the deadline is not given, it will
be the same as period.
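A hypothetical sketch of such a parser (the function name and exact
tokenization are assumptions; the real change would extend cgroup's
cpu_period_quota_parse()):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RUNTIME_MAX	(~0ULL)	/* stand-in for the "max" sentinel */

/*
 * Hypothetical parser for a "$RUNTIME $PERIOD [$DEADLINE]" line, in the
 * style of cpu.max: $RUNTIME may be the literal "max", and a missing
 * $DEADLINE defaults to $PERIOD (implicit deadline). Constrained
 * deadlines (deadline < period) are allowed; deadline > period is
 * rejected. Returns 0 on success, -1 on malformed input.
 */
static int rt_max_parse(const char *buf, unsigned long long *runtime,
			unsigned long long *period,
			unsigned long long *deadline)
{
	char tok[21];
	unsigned long long p, d;
	int n = sscanf(buf, "%20s %llu %llu", tok, &p, &d);

	if (n < 2)
		return -1;

	if (strcmp(tok, "max") == 0)
		*runtime = RUNTIME_MAX;
	else if (sscanf(tok, "%llu", runtime) != 1)
		return -1;

	*period = p;
	*deadline = (n == 3) ? d : p;	/* default: deadline == period */

	if (*deadline > *period)
		return -1;
	return 0;
}
```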
In previous versions there was also an option to specify a cpumask;
getting rid of that is one of the reasons I suggested making this thing
a cgroup-v2 thing, then we can use the cpuset controller's effective
mask.
> - However, the cpu controller is a threaded controller which means that it
> can have threaded sub-hierarchy where the no-internal-process rule doesn't
> apply. This was created explicitly for cpu controller. The proposed change
> blocks it effectively forcing cpu controller into regular domain
> controller behavior subject to no-internal-process rule. Note these are
> enforced at controller granularity and this means that users who use the
> threaded mode will be forced to pick between the two.
Right... this then means we need two controls, one to do hierarchical
bandwidth distribution, and one to assign bandwidth to the internal
group -- which is then subject to its own bandwidth distribution
constraint.
This might be a little confusing, but there is no way around that
AFAICT.
> - This has the same problem with cgroup1's rt cgroup sched support where
> there is no way to have a permissive default configuration, which means
> that users who don't really care about distributing rt shares
> hierarchically would get blocked from running rt processes by default,
> which basically forces distros to disable rt cgroup sched support. This is
> not new but it'd be a shame to put in all the work and the end result is
> that most people don't even have access to the feature.
Right, but cgroup-v2 allows enabling/disabling specific controllers for
a (sub)-hierarchy, right? So if the controller is not enabled (by
default), it will fall back to putting the tasks in whatever parent does
have it on, and by default the root group would have and would accept
tasks.
Additionally, I think we want a flag to allow non-priv tasks to use RT
inside the controller -- after all, these tasks would be subject to
strict bandwidth controls and cannot burn the system like unbounded/root
FIFO tasks can.
Does that all sound workable?
* Re: [RFC PATCH v5 18/29] sched/core: Cgroup v2 support
2026-05-07 7:01 ` Peter Zijlstra
@ 2026-05-07 13:30 ` luca abeni
2026-05-07 14:16 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: luca abeni @ 2026-05-07 13:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Yuri Andriaccio
On Thu, 7 May 2026 09:01:03 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, May 06, 2026 at 09:58:02PM +0200, luca abeni wrote:
> > Hi Peter,
> >
> > On Tue, 5 May 2026 16:59:22 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > On Thu, Apr 30, 2026 at 11:38:22PM +0200, Yuri Andriaccio wrote:
> > > > From: luca abeni <luca.abeni@santannapisa.it>
> > > >
> > > > Make rt_runtime_us and rt_period_us virtual files accessible
> > > > also to the cgroup v2 controller, effectively enabling the
> > > > RT_GROUP_SCHED mechanism for cgroups v2.
> > >
> > > Can we have a blurb about why only strict periodic servers; e.g.
> > > why no sporadic? and such...
> >
> > Maybe I am misunderstanding your question, anyway: the file is
> > called "rt_runtime_us", but the scheduling algorithm used to
> > schedule the cgroup is SCHED_DEADLINE.
> > So, we do not use a strictly periodic server, but a CBS, that can
> > also support sporadic / non-periodic activations.
>
> > The interface only exposes runtime and period; as such we can only
> configure strict periodic servers (with implicit deadline). And I'm
> thinking this makes sense, esp. to start off with, but I also think it
> makes sense to explicitly call that out.
Ah, I understand now: you are thinking about SCHED_DEADLINE with
deadline<period, right?
(sorry, I originally misunderstood and I was thinking about sporadic
activation patterns, which are already supported)
Yes, I think we can easily add the possibility to also set the
"deadline" parameter (with default "deadline=period")
Thanks,
Luca
>
> State that this does not allow configuring sporadic servers, and
> hand-wave a reason for why not.
>
> Or, if we struggle to justify it, perhaps add deadline, dunno.
* Re: [RFC PATCH v5 18/29] sched/core: Cgroup v2 support
2026-05-07 13:30 ` luca abeni
@ 2026-05-07 14:16 ` Peter Zijlstra
0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-05-07 14:16 UTC (permalink / raw)
To: luca abeni
Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Yuri Andriaccio
On Thu, May 07, 2026 at 03:30:55PM +0200, luca abeni wrote:
> > The interface only exposes runtime and period; as such we can only
> > configure strict periodic servers (with implicit deadline). And I'm
> > thinking this makes sense, esp. to start off with, but I also think it
> > makes sense to explicitly call that out.
>
> Ah, I understand now: you are thinking about SCHED_DEADLINE with
> deadline<period, right?
Yes indeed! I don't expect it will be the most popular choice, but I
don't see a good reason not to allow it. Also, Tommaso loves his +1 :-)
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-05 19:56 ` Tejun Heo
2026-05-07 10:53 ` Peter Zijlstra
@ 2026-05-07 14:30 ` luca abeni
2026-05-11 18:28 ` Tejun Heo
1 sibling, 1 reply; 29+ messages in thread
From: luca abeni @ 2026-05-07 14:30 UTC (permalink / raw)
To: Tejun Heo
Cc: Peter Zijlstra, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hi Tejun,
first of all, thanks for your comments! I think this is the kind of
discussion that we need to have...
Right now we have something that works "well enough" for real-time, but
we want to make it useful in general, so that distributions will not
disable it by default.
I need to better study your suggestions (I do not know cgroup v2
much...), but I have some questions to better understand possible
solutions:
On Tue, 5 May 2026 09:56:58 -1000
Tejun Heo <tj@kernel.org> wrote:
[...]
> - cgroup2 enforces that internal cgroups w/ controllers enabled
> cannot have threads in them. No need to enforce that separately.
>
> - However, the cpu controller is a threaded controller which means
> that it can have threaded sub-hierarchy where the no-internal-process
> rule doesn't apply. This was created explicitly for cpu controller.
> The proposed change blocks it effectively forcing cpu controller into
> regular domain controller behavior subject to no-internal-process
> rule. Note these are enforced at controller granularity and this
> means that users who use the threaded mode will be forced to pick
> between the two.
Just to better understand: would it make sense to allow non-{FIFO,RT}
tasks to be in non-leaf cgroups (as allowed by the threaded CPU
controller), while enforcing that FIFO/RR tasks can only be in leaf
cgroups? Or would this be a hack that compromises the rt-CPU controller
usefulness?
> - This has the same problem with cgroup1's rt cgroup sched support
> where there is no way to have a permissive default configuration,
> which means that users who don't really care about distributing rt
> shares hierarchically would get blocked from running rt processes by
> default, which basically forces distros to disable rt cgroup sched
> support. This is not new but it'd be a shame to put in all the work
> and the end result is that most people don't even have access to the
> feature.
Yes, we have a bad default here.
Would a default like "allow running FIFO/RR tasks without runtime
enforcement" (this is what happens to FIFO/RR tasks running in the root
control group) be acceptable?
Thanks,
Luca
>
> Here's my suggestion if there is desire for this to become something
> most people have easy access to:
>
> - Don't make it impossible to use in conjunction with other resource
> control mechanisms especially not CPU controller itself. Don't force
> people to choose between threaded mode and rt control. Allow them to
> co-exist in a reasonable manner.
>
> - The same in the wider scope. Don't let it get in the way of people
> who don't care about it. Compromising on interface / failure mode is
> better than people not being able to use it in most cases.
>
> Thanks.
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 10:53 ` Peter Zijlstra
@ 2026-05-07 15:03 ` Juri Lelli
2026-05-07 15:05 ` Peter Zijlstra
2026-05-07 16:39 ` luca abeni
2026-05-07 16:44 ` luca abeni
` (2 subsequent siblings)
3 siblings, 2 replies; 29+ messages in thread
From: Juri Lelli @ 2026-05-07 15:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
On 07/05/26 12:53, Peter Zijlstra wrote:
> On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
...
> > - However, the cpu controller is a threaded controller which means that it
> > can have threaded sub-hierarchy where the no-internal-process rule doesn't
> > apply. This was created explicitly for cpu controller. The proposed change
> > blocks it effectively forcing cpu controller into regular domain
> > controller behavior subject to no-internal-process rule. Note these are
> > enforced at controller granularity and this means that users who use the
> > threaded mode will be forced to pick between the two.
>
> Right... this then means we need two controls, one to do hierarchical
> bandwidth distribution, and one to assign bandwidth to the internal
> group -- which is then subject to its own bandwidth distribution
> constraint.
>
> This might be a little confusing, but there is no way around that
> AFAICT.
Just to check if I'm following, you are thinking something like below?
groupA/
cpu.rt.max = "50 50 100" <- 0.5 from root
cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at this
level
+ threadA <
+ threadB <
+- group1/
cpu.rt.max = "30 30 100" <- 0.3 from groupA
+ threadC
And we still keep it flat, so 2 dl-entities (per CPU), one handles
threads at groupA level and the other threads inside group1?
Thanks,
Juri
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 15:03 ` Juri Lelli
@ 2026-05-07 15:05 ` Peter Zijlstra
2026-05-07 16:39 ` luca abeni
1 sibling, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-05-07 15:05 UTC (permalink / raw)
To: Juri Lelli
Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
On Thu, May 07, 2026 at 05:03:41PM +0200, Juri Lelli wrote:
> On 07/05/26 12:53, Peter Zijlstra wrote:
> > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
>
> ...
>
> > > - However, the cpu controller is a threaded controller which means that it
> > > can have threaded sub-hierarchy where the no-internal-process rule doesn't
> > > apply. This was created explicitly for cpu controller. The proposed change
> > > blocks it effectively forcing cpu controller into regular domain
> > > controller behavior subject to no-internal-process rule. Note these are
> > > enforced at controller granularity and this means that users who use the
> > > threaded mode will be forced to pick between the two.
> >
> > Right... this then means we need two controls, one to do hierarchical
> > bandwidth distribution, and one to assign bandwidth to the internal
> > group -- which is then subject to its own bandwidth distribution
> > constraint.
> >
> > This might be a little confusing, but there is no way around that
> > AFAICT.
>
> Just to check if I'm following, you are thinking something like below?
>
> groupA/
> cpu.rt.max = "50 50 100" <- 0.5 from root
> cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at this
> level
> + threadA <
> + threadB <
> +- group1/
> cpu.rt.max = "30 30 100" <- 0.3 from groupA
> + threadC
>
> And we still keep it flat, so 2 dl-entities (per CPU), one handles
> threads at groupA level and the other threads inside group1?
Exactly!
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 19/29] sched/rt: Remove support for cgroups-v1
2026-05-05 15:01 ` [RFC PATCH v5 19/29] sched/rt: Remove support for cgroups-v1 Peter Zijlstra
@ 2026-05-07 15:35 ` Juri Lelli
0 siblings, 0 replies; 29+ messages in thread
From: Juri Lelli @ 2026-05-07 15:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Yuri Andriaccio, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Luca Abeni, Yuri Andriaccio
On 05/05/26 17:01, Peter Zijlstra wrote:
> On Thu, Apr 30, 2026 at 11:38:23PM +0200, Yuri Andriaccio wrote:
> > Disable control files for cgroups-v1, and allow only cgroups-v2.
> > This should simplify maintaining the code, since cgroups-v1 are deprecated.
>
> So while I love seeing all this code go away; I very much doubt we can
> pull this off. People might actually be using this.
Quite a bold move indeed. :)
> I think at best we can hide the whole cgroup-v1 thing behind a CONFIG
> and eventually remove once no distro is left using it or something like
> that :/
This however means we will essentially need to maintain 2 versions of
rt.c until v1 is gone? AFAIK v1 rt group implementation is quite a
substantial amount of code. :/
I certainly see your point, just thinking out loud about what options we
realistically have. Once HCBS is eventually merged, can we consider
v1 RT GROUP feature terminally broken and just point people at v2 if
they report an issue with v1?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 15:03 ` Juri Lelli
2026-05-07 15:05 ` Peter Zijlstra
@ 2026-05-07 16:39 ` luca abeni
2026-05-11 9:29 ` Juri Lelli
1 sibling, 1 reply; 29+ messages in thread
From: luca abeni @ 2026-05-07 16:39 UTC (permalink / raw)
To: Juri Lelli
Cc: Peter Zijlstra, Tejun Heo, Yuri Andriaccio, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hi,
On Thu, 7 May 2026 17:03:41 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
> On 07/05/26 12:53, Peter Zijlstra wrote:
> > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
>
> ...
>
> > > - However, the cpu controller is a threaded controller which
> > > means that it can have threaded sub-hierarchy where the
> > > no-internal-process rule doesn't apply. This was created
> > > explicitly for cpu controller. The proposed change blocks it
> > > effectively forcing cpu controller into regular domain controller
> > > behavior subject to no-internal-process rule. Note these are
> > > enforced at controller granularity and this means that users who
> > > use the threaded mode will be forced to pick between the two.
> >
> > Right... this then means we need two controls, one to do
> > hierarchical bandwidth distribution, and one to assign bandwidth to
> > the internal group -- which is then subject to its own bandwidth
> > distribution constraint.
> >
> > This might be a little confusing, but there is no way around that
> > AFAICT.
>
> Just to check if I'm following, you are thinking something like below?
>
> groupA/
> cpu.rt.max = "50 50 100" <- 0.5 from root
> cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at
> this level
> + threadA <
> + threadB <
> +- group1/
> cpu.rt.max = "30 30 100" <- 0.3 from groupA
> + threadC
>
> And we still keep it flat, so 2 dl-entities (per CPU), one handles
> threads at groupA level and the other threads inside group1?
An alternative idea I was thinking about: we create 2 dl entities (one
for "groupA" and one for "group1"); we set cpu.rt.max for groupA, and
we subtract group1's utilization from it (so, if groupA's cpu.rt.max is
"50 100" and group1's cpu.rt.max is "30 100", groupA is served by a dl
entity (50-30,100)=(20,100) while group1 is served by a dl entity
(30,100)).
Basically, with this idea the "internal" reservation is automatically
computed based on rt.max and on the children cgroups. A possible issue
is that if the children consume all the groupA's utilization the groupA
RT tasks remain with 0 runtime (and never execute).
Luca
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 10:53 ` Peter Zijlstra
2026-05-07 15:03 ` Juri Lelli
@ 2026-05-07 16:44 ` luca abeni
2026-05-11 9:40 ` luca abeni
2026-05-11 17:37 ` Tejun Heo
3 siblings, 0 replies; 29+ messages in thread
From: luca abeni @ 2026-05-07 16:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hi,
On Thu, 7 May 2026 12:53:31 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > - This has the same problem with cgroup1's rt cgroup sched support
> > where there is no way to have a permissive default configuration,
> > which means that users who don't really care about distributing rt
> > shares hierarchically would get blocked from running rt processes
> > by default, which basically forces distros to disable rt cgroup
> > sched support. This is not new but it'd be a shame to put in all
> > the work and the end result is that most people don't even have
> > access to the feature.
>
> Right, but cgroup-v2 allows enabling/disabling specific controllers
> for a (sub)-hierarchy, right? So if the controller is not enabled (by
> default), it will fall back to putting the tasks in whatever parent
> does have it on, and by default the root group would have and would
> accept tasks.
If I understand well, this is similar to what I was thinking about:
having a default that allows creating FIFO/RR tasks (and execute them
without runtime control - so, without being served by a dl server)
> Additionally, I think we want a flag to allow non-priv tasks to use RT
> inside the controller -- after all, these tasks would be subject to
> strict bandwidth controls and cannot burn the system like
> unbounded/root FIFO tasks can.
This is something Yuri and I wanted to propose as a follow-up patch,
once there is an agreement on the patchset (should be a pretty simple
change :)
Luca
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 16:39 ` luca abeni
@ 2026-05-11 9:29 ` Juri Lelli
2026-05-11 17:52 ` Tejun Heo
0 siblings, 1 reply; 29+ messages in thread
From: Juri Lelli @ 2026-05-11 9:29 UTC (permalink / raw)
To: luca abeni
Cc: Peter Zijlstra, Tejun Heo, Yuri Andriaccio, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
On 07/05/26 18:39, luca abeni wrote:
> Hi,
>
> On Thu, 7 May 2026 17:03:41 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
>
> > On 07/05/26 12:53, Peter Zijlstra wrote:
> > > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
> >
> > ...
> >
> > > > - However, the cpu controller is a threaded controller which
> > > > means that it can have threaded sub-hierarchy where the
> > > > no-internal-process rule doesn't apply. This was created
> > > > explicitly for cpu controller. The proposed change blocks it
> > > > effectively forcing cpu controller into regular domain controller
> > > > behavior subject to no-internal-process rule. Note these are
> > > > enforced at controller granularity and this means that users who
> > > > use the threaded mode will be forced to pick between the two.
> > >
> > > Right... this then means we need two controls, one to do
> > > hierarchical bandwidth distribution, and one to assign bandwidth to
> > > the internal group -- which is then subject to its own bandwidth
> > > distribution constraint.
> > >
> > > This might be a little confusing, but there is no way around that
> > > AFAICT.
> >
> > Just to check if I'm following, you are thinking something like below?
> >
> > groupA/
> > cpu.rt.max = "50 50 100" <- 0.5 from root
> > cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at
> > this level
> > + threadA <
> > + threadB <
> > +- group1/
> > cpu.rt.max = "30 30 100" <- 0.3 from groupA
> > + threadC
> >
> > And we still keep it flat, so 2 dl-entities (per CPU), one handles
> > threads at groupA level and the other threads inside group1?
>
> An alternative idea I was thinking about: we create 2 dl entities (one
> for "groupA" and one for "group1"); we set cpu.rt.max for groupA, and
> we subtract group1's utilization from it (so, if groupA's cpu.rt.max is
> "50 100" and group1's cpu.rt.max is "30 100", groupA is served by a dl
> entity (50-30,100)=(20,100) while group1 is served by a dl entity
> (30,100)).
>
> Basically, with this idea the "internal" reservation is automatically
> computed based on rt.max and on the children cgroups. A possible issue
> is that if the children consume all the groupA's utilization the groupA
> RT tasks remain with 0 runtime (and never execute).
While I like the automatic approach, I also fear that it might be more
difficult to maintain/use from a systemd admin perspective, e.g. I
cannot make a subgroup reservation bigger because there are threads
running in the parent group which consume all the remaining (internal)
bandwidth. If we make it explicit it seems easier to see where bandwidth
is allocated at all levels.
Peter? Tejun? What do we want to do with this interface?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 10:53 ` Peter Zijlstra
2026-05-07 15:03 ` Juri Lelli
2026-05-07 16:44 ` luca abeni
@ 2026-05-11 9:40 ` luca abeni
2026-05-11 18:15 ` Tejun Heo
2026-05-11 17:37 ` Tejun Heo
3 siblings, 1 reply; 29+ messages in thread
From: luca abeni @ 2026-05-11 9:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hi all,
On Thu, 7 May 2026 12:53:31 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > - This has the same problem with cgroup1's rt cgroup sched support
> > where there is no way to have a permissive default configuration,
> > which means that users who don't really care about distributing rt
> > shares hierarchically would get blocked from running rt processes
> > by default, which basically forces distros to disable rt cgroup
> > sched support. This is not new but it'd be a shame to put in all
> > the work and the end result is that most people don't even have
> > access to the feature.
>
> Right, but cgroup-v2 allows enabling/disabling specific controllers
> for a (sub)-hierarchy, right? So if the controller is not enabled (by
> default), it will fall back to putting the tasks in whatever parent
> does have it on, and by default the root group would have and would
> accept tasks.
We are discussing this issue with Yuri, and we have a doubt: if we
disable the RT-CPU controller for a cgroup, would it be possible to
enable it for its children?
(In other words: if we want the RT-CPU controller to be enabled for
some "leaf" cgroups, we need to enable it for their parents, right?)
Thanks,
Luca
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 10:53 ` Peter Zijlstra
` (2 preceding siblings ...)
2026-05-11 9:40 ` luca abeni
@ 2026-05-11 17:37 ` Tejun Heo
3 siblings, 0 replies; 29+ messages in thread
From: Tejun Heo @ 2026-05-11 17:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Yuri Andriaccio, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Luca Abeni, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hello, Peter.
On Thu, May 07, 2026 at 12:53:31PM +0200, Peter Zijlstra wrote:
...
> Looking at cpu_period_quota_parse() this thing takes two u64 values for:
> {runtime, period} but allows runtime to be the string "max".
>
> I think we'd want an optional extension to that and allow 3 values for:
> {runtime, period, deadline}, where if the deadline is not given, it will
> be the same as period.
Yeah, I don't know what's needed here but extending the interface as
necessary is completely fine.
> Right... this then means we need two controls, one to do hierarchical
> bandwidth distribution, and one to assign bandwidth to the internal
> group -- which is then subject to its own bandwidth distribution
> constraint.
>
> This might be a little confusing, but there is no way around that
> AFAICT.
Separating out the rt as a separate controller is one way and if the
configuration wants to stick to strict allocation model where nothing is
available by default unless explicitly allocated, this would be the only
way. Interface-wise, I think this is going to be fine but I suspect this
likely would complicate the internal implementation quite a bit as now rt can't
piggyback on existing sched core cgroup infra - no task_group or
synchronization built around them - and has to build everything on its own.
It's not the end of the world but not ideal either.
> > - This has the same problem with cgroup1's rt cgroup sched support where
> > there is no way to have a permissive default configuration, which means
> > that users who don't really care about distributing rt shares
> > hierarchically would get blocked from running rt processes by default,
> > which basically forces distros to disable rt cgroup sched support. This is
> > not new but it'd be a shame to put in all the work and the end result is
> > that most people don't even have access to the feature.
>
> Right, but cgroup-v2 allows enabling/disabling specific controllers for
> a (sub)-hierarchy, right? So if the controller is not enabled (by
> default), it will fall back to putting the tasks in whatever parent does
> have it on, and by default the root group would have and would accept
> tasks.
>
> Additionally, I think we want a flag to allow non-priv tasks to use RT
> inside the controller -- after all, these tasks would be subject to
> strict bandwidth controls and cannot burn the system like unbounded/root
> FIFO tasks can.
>
> Does that all sound workable?
Yeah, if rt becomes its own controller, I don't see any fundamental
roadblocks. It'd involve a bunch of churn which may add to maintenance
overhead but it should work. An alternative would be coming up with some way
to express the default no-enforcement state through the config knobs. I'm
sure this would be doable too and if folks can figure out a reasonable
interface, it should be able to obtain basically the same functionality with
a lot less code.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-11 9:29 ` Juri Lelli
@ 2026-05-11 17:52 ` Tejun Heo
0 siblings, 0 replies; 29+ messages in thread
From: Tejun Heo @ 2026-05-11 17:52 UTC (permalink / raw)
To: Juri Lelli
Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hello,
On Mon, May 11, 2026 at 11:29:47AM +0200, Juri Lelli wrote:
...
> While I like the automatic approach, I also fear that it might be more
> difficult to maintain/use from a systemd admin perspective, e.g. I
> cannot make a subgroup reservation bigger because there are threads
> running in the parent group which consume all the remaining (internal)
> bandwidth. If we make it explicit it seems easier to see where bandwidth
> is allocated at all levels.
>
> Peter? Tejun? What do we want to do with this interface?
blkcg on cgroup1 did something similar for a while. It had a separate subdir
for knobs that apply to "internal threads". Effectively, this becomes
creating a separate controller group for every cgroup as a sibling to its
children. It does work obviously but it is pretty ugly and unintuitive, both
in interface and implementation, and I'm skeptical this was actually useful
in any meaningful way. Nobody complained when we ripped it out.
If rt were to become its own cgroup controller, maybe one can just side-step
this by not supporting threaded mode at least at the beginning. If people
ask for it, hopefully we'll be able to develop better understanding of their
usecases and drive design that way. In practice, I don't think threaded mode
gets used all that much because usually only application processes
themselves know about their own threads, are not in the business of creating
their own cgroups (delegation to each application isn't common), and have
other ways of controlling their own threads. So, there's some chance that
this may not actually come up.
If rt stays as a part of cpu controller, my preference would be keeping the
config implicit for threaded mode at least at the beginning. ie. Don't get
in the way of people using threaded mode by blocking it but having some
reasonable and clear default (e.g. internal tasks have priority as suggested
or internal tasks get whatever is left over which may make more sense in the
allocation model) may be sufficient. If not, like in the other case, we can
make specific design decisions based on concrete use cases later.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-11 9:40 ` luca abeni
@ 2026-05-11 18:15 ` Tejun Heo
0 siblings, 0 replies; 29+ messages in thread
From: Tejun Heo @ 2026-05-11 18:15 UTC (permalink / raw)
To: luca abeni
Cc: Peter Zijlstra, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
On Mon, May 11, 2026 at 11:40:04AM +0200, luca abeni wrote:
> We are discussing this issue with Yuri, and we have a doubt: if we
> disable the RT-CPU controller for a cgroup, would it be possible to
> enable it for its children?
> (In other words: if we want the RT-CPU controller to be enabled for
> some "leaf" cgroups, we need to enable it for their parents, right?)
Yeah, a cgroup has a controller available to it iff its parent enables that
controller, so all ancestors would have to enable it.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-07 14:30 ` luca abeni
@ 2026-05-11 18:28 ` Tejun Heo
2026-05-12 17:38 ` Yuri Andriaccio
0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2026-05-11 18:28 UTC (permalink / raw)
To: luca abeni
Cc: Peter Zijlstra, Yuri Andriaccio, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Yuri Andriaccio,
hannes, mkoutny, cgroups
Hello,
On Thu, May 07, 2026 at 04:30:58PM +0200, luca abeni wrote:
...
> Just to better understand: would it make sense to allow non-{FIFO,RT}
> tasks to be in non-leaf cgroups (as allowed by the threaded CPU
> controller), while enforcing that FIFO/RR tasks can only be in leaf
> cgroups? Or would this be a hack that compromises the rt-CPU controller
> usefulness?
Code-wise, sure, but I don't think an interface like that would be a good
one. From user's pov, this amounts to adding restrictions on both whether a
controller can be enabled and whether tasks can be moved into some cgroups.
UNIX error reporting being what it is, this would come down to getting
-EINVAL or -EBUSY or whatever out of those operations. I don't think it's a
good idea to add subtle failure modes to these already pretty complex (but
currently w/ clearly-defined shared rules) operations. To users, this would
look like random arbitrary failures that are nearly impossible to decode
without tracing code.
If you want to enforce no-internal-threads, separating it out to its own
controller that doesn't support threaded mode would be the right direction.
Note that the only hard requirement here is that you don't want to get in
the way for people who are NOT interested in threaded rt control. If you
block enabling CPU control for e.g. cpu.max or block thread migration into a
cgroup, you'd be in the way; however, if all you say is "I don't support
sub-allocation in threaded mode" and e.g. just fail writes to the knobs in
threaded cgroups, that does not get in the way. So, it's not like you *have*
to support full threaded mode. You just need to avoid hindering non-rt
operations.
> Yes, we have a bad default here.
> Would a default like "allow running FIFO/RR tasks without runtime
> enforcement" (this is what happens to FIFO/RR tasks running in the root
> control group) be acceptable?
Yes, if you can express that in a reasonable way in the config knobs, that'd
likely be an easier way. I don't know how to transition from
allowed-by-default to explicitly-allocated in such interface tho. Making
that reasonable and smooth would be the key factor in whether such approach
can be taken.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-11 18:28 ` Tejun Heo
@ 2026-05-12 17:38 ` Yuri Andriaccio
2026-05-12 18:19 ` Tejun Heo
0 siblings, 1 reply; 29+ messages in thread
From: Yuri Andriaccio @ 2026-05-12 17:38 UTC (permalink / raw)
To: Tejun Heo
Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel, hannes,
mkoutny, cgroups
Hello,
I've been thinking and experimenting with some of the ideas for the rt
controller, and I've come up with the following interface, keeping
everything in the standard cpu controller:
- cpu.rt.max <runtime_us> <period_us>
Sets the bandwidth reserved to the hierarchy that has that specific
cgroup as root, but does not set any deadline servers.
The default value for this file is '0 0'.
- cpu.rt.min <runtime_us | 'root'> <period_us>
If the runtime part is equal to 'root', the tasks are scheduled on
the root runqueue.
If the runtime is equal to zero, no FIFO/RR tasks can be scheduled.
If the runtime is > zero, FIFO/RR tasks are scheduled under
reservation/HCBS.
This file is not available in the root cgroup, as it does not make
use of dl-servers, rather only reserves the total bandwidth for the
hierarchy.
The default value for this file is 'root 0', meaning that tasks in
this cgroup are by default scheduled on the root runqueue.
Of course you can imagine that all the admission tests have been updated
accordingly, as an example a cgroups rt.max bw must be >= than the sum
of the rt.max bws of its children + its rt.min bw. I'm also skipping
some details which are only meaningful if we decide to adopt this solution.
What do you think of this interface?
Thanks,
Yuri
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-12 17:38 ` Yuri Andriaccio
@ 2026-05-12 18:19 ` Tejun Heo
2026-05-12 18:20 ` Tejun Heo
0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2026-05-12 18:19 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel, hannes,
mkoutny, cgroups
Hello,
How is a delegated subtree prevented from setting cpu.rt.min = 'root' and
escaping its ancestors' cpu.rt.max budget?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups
2026-05-12 18:19 ` Tejun Heo
@ 2026-05-12 18:20 ` Tejun Heo
0 siblings, 0 replies; 29+ messages in thread
From: Tejun Heo @ 2026-05-12 18:20 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: luca abeni, Peter Zijlstra, Yuri Andriaccio, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel, hannes,
mkoutny, cgroups
On Tue, May 12, 2026 at 08:19:02AM -1000, Tejun Heo wrote:
> How is a delegated subtree prevented from setting cpu.rt.min = 'root' and
> escaping its ancestors' cpu.rt.max budget?
Hmm.. I guess the same problem exists w/ separate rt controller too. If the
users on the system already started using rt, how do you enable the
controller from the top down with budgets already being used down in the
hierarchy?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2026-05-12 18:20 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20260430213835.62217-1-yurand2000@gmail.com>
[not found] ` <20260430213835.62217-14-yurand2000@gmail.com>
2026-05-05 13:04 ` [RFC PATCH v5 13/29] sched/rt: Implement dl-server operations for rt-cgroups Peter Zijlstra
[not found] ` <20260430213835.62217-15-yurand2000@gmail.com>
2026-05-05 13:16 ` [RFC PATCH v5 14/29] sched/rt: Update task event callbacks for HCBS scheduling Peter Zijlstra
[not found] ` <20260430213835.62217-16-yurand2000@gmail.com>
2026-05-05 14:36 ` [RFC PATCH v5 15/29] sched/rt: Update rt-cgroup schedulability checks Peter Zijlstra
[not found] ` <20260430213835.62217-19-yurand2000@gmail.com>
2026-05-05 14:59 ` [RFC PATCH v5 18/29] sched/core: Cgroup v2 support Peter Zijlstra
2026-05-06 19:58 ` luca abeni
2026-05-07 7:01 ` Peter Zijlstra
2026-05-07 13:30 ` luca abeni
2026-05-07 14:16 ` Peter Zijlstra
[not found] ` <20260430213835.62217-20-yurand2000@gmail.com>
2026-05-05 15:01 ` [RFC PATCH v5 19/29] sched/rt: Remove support for cgroups-v1 Peter Zijlstra
2026-05-07 15:35 ` Juri Lelli
[not found] ` <20260430213835.62217-21-yurand2000@gmail.com>
2026-05-05 15:15 ` [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups Peter Zijlstra
2026-05-05 19:56 ` Tejun Heo
2026-05-07 10:53 ` Peter Zijlstra
2026-05-07 15:03 ` Juri Lelli
2026-05-07 15:05 ` Peter Zijlstra
2026-05-07 16:39 ` luca abeni
2026-05-11 9:29 ` Juri Lelli
2026-05-11 17:52 ` Tejun Heo
2026-05-07 16:44 ` luca abeni
2026-05-11 9:40 ` luca abeni
2026-05-11 18:15 ` Tejun Heo
2026-05-11 17:37 ` Tejun Heo
2026-05-07 14:30 ` luca abeni
2026-05-11 18:28 ` Tejun Heo
2026-05-12 17:38 ` Yuri Andriaccio
2026-05-12 18:19 ` Tejun Heo
2026-05-12 18:20 ` Tejun Heo
[not found] ` <20260430213835.62217-23-yurand2000@gmail.com>
2026-05-05 15:20 ` [RFC PATCH v5 22/29] sched/rt: Add rt-cgroup migration functions Peter Zijlstra
2026-05-05 15:24 ` Peter Zijlstra
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox