From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yuri Andriaccio
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider
Cc: linux-kernel@vger.kernel.org, Luca Abeni, Yuri Andriaccio
Subject: [RFC PATCH v5 15/29] sched/rt: Update rt-cgroup schedulability checks
Date: Thu, 30 Apr 2026 23:38:19 +0200
Message-ID: <20260430213835.62217-16-yurand2000@gmail.com>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260430213835.62217-1-yurand2000@gmail.com>
References: <20260430213835.62217-1-yurand2000@gmail.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

From: luca abeni

Update sched_group_rt_runtime/period and sched_group_set_rt_runtime/period
to use the newly defined data structures and perform the necessary checks
to update both the runtime and period of a given group.

The 'set' functions call tg_set_rt_bandwidth(), which is also updated:
- Use the newly added HCBS dl_bandwidth structure instead of rt_bandwidth.
- Update __rt_schedulable() to check for numerical issues:
  - Reuse __checkparam_dl().
  - Add an allow_zero_runtime parameter to __checkparam_dl(), as cgroups
    may zero their runtime, while DEADLINE tasks are not allowed to do so.
  - Use an RCU lock guard instead of rcu_read_lock/unlock.
- Update tg_rt_schedulable(), used when walking the cgroup tree to check
  that all invariants are met:
  - Update most of the code to read from the newly added data structures
    (dl_bandwidth).
  - If the task group is the root group, run a total bandwidth check with
    the newly added dl_check_tg() function.
- After all checks are successful, if the changed group is not the root
  cgroup, update the assigned runtime and period of all the local deadline
  servers.
- Additionally, use mutex guards instead of manually locking/unlocking.

Add dl_check_tg(), which performs an admission control test similar to
__dl_overflow(), but this time we are updating the cgroup's total bandwidth
rather than scheduling a new DEADLINE task or updating a non-cgroup
deadline server.

Add an rcu_sched lock guard wrapping rcu_read_lock_sched/rcu_read_unlock_sched.

Finally, prevent creation of cgroup hierarchies with depth greater than
two; support for deeper hierarchies will be added in a future patch. A
depth-two hierarchy is sufficient for now for testing the patchset.

Co-developed-by: Alessio Balsini
Signed-off-by: Alessio Balsini
Co-developed-by: Andrea Parri
Signed-off-by: Andrea Parri
Co-developed-by: Yuri Andriaccio
Signed-off-by: Yuri Andriaccio
Signed-off-by: luca abeni
---
 include/linux/rcupdate.h |  1 +
 kernel/sched/core.c      |  6 ++++
 kernel/sched/deadline.c  | 43 +++++++++++++++++-----
 kernel/sched/rt.c        | 77 +++++++++++++++++++---------------------
 kernel/sched/sched.h     |  3 +-
 kernel/sched/syscalls.c  |  2 +-
 6 files changed, 82 insertions(+), 50 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 04f3f86a4145..032cfa763047 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1191,6 +1191,7 @@ extern int rcu_expedited;
 extern int rcu_normal;
 
 DEFINE_LOCK_GUARD_0(rcu, rcu_read_lock(), rcu_read_unlock())
+DEFINE_LOCK_GUARD_0(rcu_sched, rcu_read_lock_sched(), rcu_read_unlock_sched())
 DECLARE_LOCK_GUARD_0_ATTRS(rcu, __acquires_shared(RCU), __releases_shared(RCU))
 
 #endif /* __LINUX_RCUPDATE_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98a53b60e21f..0c7032d254ba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9205,6 +9205,12 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return &root_task_group.css;
 	}
 
+	/* Do not allow cpu_cgroup hierarchies with depth greater than 2. */
+#ifdef CONFIG_RT_GROUP_SCHED
+	if (parent != &root_task_group)
+		return ERR_PTR(-EINVAL);
+#endif
+
 	tg = sched_create_group(parent);
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c82810732106..74bff7fb7b92 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -343,7 +343,39 @@ void cancel_inactive_timer(struct sched_dl_entity *dl_se)
 	cancel_dl_timer(dl_se, &dl_se->inactive_timer);
 }
 
+/*
+ * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
+ * sched_domains_mutex.
+ */
+u64 dl_cookie;
+
 #ifdef CONFIG_RT_GROUP_SCHED
+int dl_check_tg(unsigned long total)
+{
+	int which_cpu;
+	int cap;
+	struct dl_bw *dl_b;
+	u64 gen = ++dl_cookie;
+
+	for_each_possible_cpu(which_cpu) {
+		guard(rcu_sched)();
+
+		if (!dl_bw_visited(which_cpu, gen)) {
+			cap = dl_bw_capacity(which_cpu);
+			dl_b = dl_bw_of(which_cpu);
+
+			guard(raw_spinlock_irqsave)(&dl_b->lock);
+
+			if (dl_b->bw != -1 &&
+			    cap_scale(dl_b->bw, cap) < dl_b->total_bw + cap_scale(total, cap))
+				return 0;
+		}
+	}
+
+	return 1;
+}
+
 void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
 {
 	struct rq *rq = container_of_const(dl_se->dl_rq, struct rq, dl);
@@ -3469,12 +3501,6 @@ DEFINE_SCHED_CLASS(dl) = {
 #endif
 };
 
-/*
- * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
- * sched_domains_mutex.
- */
-u64 dl_cookie;
-
 int sched_dl_global_validate(void)
 {
 	u64 runtime = global_rt_runtime();
@@ -3670,7 +3696,7 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
  * below 2^63 ns (we have to check both sched_deadline and
  * sched_period, as the latter can be zero).
  */
-bool __checkparam_dl(const struct sched_attr *attr)
+bool __checkparam_dl(const struct sched_attr *attr, bool allow_zero_runtime)
 {
 	u64 period, max, min;
 
@@ -3686,7 +3712,8 @@ bool __checkparam_dl(const struct sched_attr *attr)
 	 * Since we truncate DL_SCALE bits, make sure we're at least
 	 * that big.
 	 */
-	if (attr->sched_runtime < (1ULL << DL_SCALE))
+	if ((!allow_zero_runtime || attr->sched_runtime != 0) &&
+	    attr->sched_runtime < (1ULL << DL_SCALE))
 		return false;
 
 	/*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 67fbf4bbe461..c994447f5b1c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2035,11 +2035,6 @@ DEFINE_SCHED_CLASS(rt) = {
 };
 
 #ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
 static inline int tg_has_rt_tasks(struct task_group *tg)
 {
 	struct task_struct *task;
@@ -2073,8 +2068,8 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	unsigned long total, sum = 0;
 	u64 period, runtime;
 
-	period = ktime_to_ns(tg->rt_bandwidth.rt_period);
-	runtime = tg->rt_bandwidth.rt_runtime;
+	period = tg->dl_bandwidth.dl_period;
+	runtime = tg->dl_bandwidth.dl_runtime;
 
 	if (tg == d->tg) {
 		period = d->rt_period;
@@ -2090,8 +2085,7 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	/*
 	 * Ensure we don't starve existing RT tasks if runtime turns zero.
 	 */
-	if (rt_bandwidth_enabled() && !runtime &&
-	    tg->rt_bandwidth.rt_runtime && tg_has_rt_tasks(tg))
+	if (dl_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
 		return -EBUSY;
 
 	if (WARN_ON(!rt_group_sched_enabled() && tg != &root_task_group))
@@ -2105,12 +2099,17 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	if (total > to_ratio(global_rt_period(), global_rt_runtime()))
 		return -EINVAL;
 
+	if (tg == &root_task_group) {
+		if (!dl_check_tg(total))
+			return -EBUSY;
+	}
+
 	/*
 	 * The sum of our children's runtime should not exceed our own.
 	 */
 	list_for_each_entry_rcu(child, &tg->children, siblings) {
-		period = ktime_to_ns(child->rt_bandwidth.rt_period);
-		runtime = child->rt_bandwidth.rt_runtime;
+		period = child->dl_bandwidth.dl_period;
+		runtime = child->dl_bandwidth.dl_runtime;
 
 		if (child == d->tg) {
 			period = d->rt_period;
@@ -2128,24 +2127,30 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 
 static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 {
-	int ret;
-
 	struct rt_schedulable_data data = {
 		.tg = tg,
 		.rt_period = period,
 		.rt_runtime = runtime,
 	};
 
-	rcu_read_lock();
-	ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
-	rcu_read_unlock();
+	struct sched_attr attr = {
+		.sched_flags = 0,
+		.sched_runtime = runtime,
+		.sched_deadline = period,
+		.sched_period = period,
+	};
 
-	return ret;
+	if (!__checkparam_dl(&attr, true))
+		return -EINVAL;
+
+	guard(rcu)();
+	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
+	static DEFINE_MUTEX(rt_constraints_mutex);
 	int i, err = 0;
 
 	/*
@@ -2155,44 +2160,36 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 	if (tg == &root_task_group && rt_runtime == 0)
 		return -EINVAL;
 
-	/* No period doesn't make any sense. */
-	if (rt_period == 0)
-		return -EINVAL;
-
 	/*
 	 * Bound quota to defend quota against overflow during bandwidth shift.
 	 */
 	if (rt_runtime != RUNTIME_INF && rt_runtime > max_rt_runtime)
 		return -EINVAL;
 
-	mutex_lock(&rt_constraints_mutex);
+	guard(mutex)(&rt_constraints_mutex);
 	err = __rt_schedulable(tg, rt_period, rt_runtime);
 	if (err)
-		goto unlock;
+		return err;
 
-	raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock);
-	tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);
-	tg->rt_bandwidth.rt_runtime = rt_runtime;
+	guard(raw_spinlock_irq)(&tg->dl_bandwidth.dl_runtime_lock);
+	tg->dl_bandwidth.dl_period = rt_period;
+	tg->dl_bandwidth.dl_runtime = rt_runtime;
 
-	for_each_possible_cpu(i) {
-		struct rt_rq *rt_rq = tg->rt_rq[i];
+	if (tg == &root_task_group)
+		return 0;
 
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		rt_rq->rt_runtime = rt_runtime;
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
+	for_each_possible_cpu(i) {
+		dl_init_tg(tg->dl_se[i], rt_runtime, rt_period);
 	}
-	raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
-unlock:
-	mutex_unlock(&rt_constraints_mutex);
-	return err;
+
+	return 0;
 }
 
 int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
 {
 	u64 rt_runtime, rt_period;
 
-	rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);
+	rt_period = tg->dl_bandwidth.dl_period;
 	rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
@@ -2206,10 +2203,10 @@ long sched_group_rt_runtime(struct task_group *tg)
 {
 	u64 rt_runtime_us;
 
-	if (tg->rt_bandwidth.rt_runtime == RUNTIME_INF)
+	if (tg->dl_bandwidth.dl_runtime == RUNTIME_INF)
 		return -1;
 
-	rt_runtime_us = tg->rt_bandwidth.rt_runtime;
+	rt_runtime_us = tg->dl_bandwidth.dl_runtime;
 	do_div(rt_runtime_us, NSEC_PER_USEC);
 	return rt_runtime_us;
 }
@@ -2222,7 +2219,7 @@ int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us)
 		return -EINVAL;
 
 	rt_period = rt_period_us * NSEC_PER_USEC;
-	rt_runtime = tg->rt_bandwidth.rt_runtime;
+	rt_runtime = tg->dl_bandwidth.dl_runtime;
 
 	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
@@ -2231,7 +2228,7 @@ long sched_group_rt_period(struct task_group *tg)
 {
 	u64 rt_period_us;
 
-	rt_period_us = ktime_to_ns(tg->rt_bandwidth.rt_period);
+	rt_period_us = tg->dl_bandwidth.dl_period;
 	do_div(rt_period_us, NSEC_PER_USEC);
 	return rt_period_us;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fceb02a04858..78f080275bf0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -364,7 +364,7 @@ extern void sched_dl_do_global(void);
 extern int sched_dl_overflow(struct task_struct *p, int policy, const struct sched_attr *attr);
 extern void __setparam_dl(struct task_struct *p, const struct sched_attr *attr);
 extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr);
-extern bool __checkparam_dl(const struct sched_attr *attr);
+extern bool __checkparam_dl(const struct sched_attr *attr, bool allow_zero_runtime);
 extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
 extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern int dl_bw_deactivate(int cpu);
@@ -423,6 +423,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
 		    struct rq *served_rq,
 		    dl_server_pick_f pick_task);
 extern void sched_init_dl_servers(void);
+extern int dl_check_tg(unsigned long total);
 extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
 
 extern void fair_server_init(struct rq *rq);
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 15653840c812..d30aee2e90c4 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -528,7 +528,7 @@ int __sched_setscheduler(struct task_struct *p,
 	 */
 	if (attr->sched_priority > MAX_RT_PRIO-1)
 		return -EINVAL;
-	if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
+	if ((dl_policy(policy) && !__checkparam_dl(attr, false)) ||
 	    (rt_policy(policy) != (attr->sched_priority != 0)))
 		return -EINVAL;
 
-- 
2.53.0