From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-wr1-f50.google.com (mail-wr1-f50.google.com [209.85.221.50]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3DE3E3803F9 for ; Mon, 8 Jun 2026 12:16:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.221.50 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780920970; cv=none; b=fhBhrsPu2SN2chuQeuuaMiVV3bX2avabEa4/RHpaY+DWqyF/20SJwp7dmTX/fdTS6RzGkaNct2ZPhq6qbw8c8+EgXeZbaCL7k1GyfCTeU0jNwhBIuyCRwXjosQ1fMRVvqvfkYv/JmeYo340JJnj4mxCb6FzfXZ62eLgzkHG6+qk= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780920970; c=relaxed/simple; bh=x+kQMSJL0HaKNgdoHL5nXTs5650NkXHU/v/ZcKQFyWw=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=mJ55558CfW4cvQf+EWu0hwkLB3U04kZtMzvFJ+tZgu2EBAIITWpkh3db6OXsRs1BR4JxPhwhZa9Gl/H31jPZrWyqftHTHd2P9JtbF15N+bM7jCN2VV2lrhAZSZYPaDcWFDL0hyFehl46zIeshBWLKug20XRRpDH+5iv196a6EY8= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=LeNc0ZWT; arc=none smtp.client-ip=209.85.221.50 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="LeNc0ZWT" Received: by mail-wr1-f50.google.com with SMTP id ffacd0b85a97d-45ef29c5561so2178892f8f.0 for ; Mon, 08 Jun 2026 05:16:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1780920964; x=1781525764; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=Zu6ynVRY/qI7FjTBOrlhoMJpQptnT/GTnRCT/yRZp9g=; b=LeNc0ZWTCoCJUdXWFq3yaaluHbrRsri5KVfWHV0W2jtXZsjQrgx4fDaHJhYRs9nVuL xPi10nyU4VkrDu8ISKGHIPrS40kg3g5k9Z98MBPpt4ZYwuPinXVqcDYhSqgI/HaL+pdQ Y//R+w0XvpcIF+ShdOXVBCergyhVA0txhlieJIbhU7sy1haaULzKIZMLAYQg5zE0FRGA udJR1zImr+1tHuRugY+OTKTH/cTArS4fRcmt2RVrxGmqys3Q5j6rTN3SgHU41U9hzIfT 7B6aQ9tLL8QTF5ULPq9kSo5yCngySg/BW4jPm68eDyynVlIw4oHnnqEupGBxG/UgPwo/ CNog== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1780920964; x=1781525764; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=Zu6ynVRY/qI7FjTBOrlhoMJpQptnT/GTnRCT/yRZp9g=; b=SRwAn5Taar+Z0g2nYi3n+4RbFj5capKZBQ93a/YGLQ54DXHPLDX1sMjgQS7b/+6iRM 8rlQfckdhFK1gkfBfePCUMBFUfyeEHbaEMWLJ0GsV8A6SRyqx2ibQXm7X1YX8yQWHSfT dKkPabfW6t29FquhPUEeYexcj6XWDZ5YFBIh0fEawKxzmZ9ZwhJ7ZUmYoQ4EYRB2YITh Ud59jAAj+rbfwoDtob1IZWfwvTG7S3iUrXea2GN/rkkkEwi/jrW8QPncgOmBpUR34xoo eLVXIP1u8F+ZwIdMtbPtneo0bbt+q8YGS9S+aH2xGP0nsquizsEFhs7TnA+jjsRCoVIE 5zeg== X-Gm-Message-State: AOJu0YyrLPGegxWkhbQNNXnaIxkhimdwlUBorrtqwmPpA66QJgC66s4g Pcn7ILQipI/aLqk4LTAOZLGstFb0CAkWXUYKjLFp2RoBbxVx9dJArtoR X-Gm-Gg: Acq92OEwx3fnOyDEpMQyqQ0SX611sQZ8jggByxx+iN62Vp6Ij1pRO7v2lsdiPuRVGx+ l1N+mdNGZ7m2JX/0Xq8IPi0NXJz0hi2R/4zm/8EiOTslLuOlF/l1chyI2wIjc2+k/9cmQPb5LrI PEQydAIZ1MSdAafJHNJUj8KhQ66bEbpTYSYVmSgd44EX5Obci2T6g6DAB/quSB0BmGOgSvM9Qjt DhLk86FZpUep3aIDWPdBBNPFk3RElJTOsBizMMuXz2QCXwPQCv4K8s/wuwfh7rzPARx2pBGOJA0 YyD7v6eaVgkn0HKM5J+DE4/BgxMxBNj290p0rp5jArGADQLodF3Fb6KJTh+u1S/2f0CCxEfHAjd m2yfULblaUu3qezs1F1/APSeyjRP3zCxNSozuhQZBRMx3CpedgnsdawgzzFKtkx0byZ9sdnaXl8 ubzoZ26Jc6edkGmv6FxNQNzQQieGZyQpg= X-Received: by 2002:adf:d02f:0:b0:460:f36:79b0 with SMTP id ffacd0b85a97d-460304fda0amr17777567f8f.19.1780920964257; Mon, 08 Jun 2026 05:16:04 -0700 (PDT) Received: from victus-lab ([193.205.81.5]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-4601f2ec711sm50644906f8f.12.2026.06.08.05.16.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 08 Jun 2026 05:16:03 -0700 (PDT) From: Yuri Andriaccio To: Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tejun Heo , Johannes Weiner , =?UTF-8?q?Michal=20Koutn=C3=BD?= Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Luca Abeni , Yuri Andriaccio Subject: [RFC PATCH v6 17/25] sched/rt: Update rt-cgroup schedulability checks Date: Mon, 8 Jun 2026 14:15:36 +0200 Message-ID: <20260608121546.69910-18-yurand2000@gmail.com> X-Mailer: git-send-email 2.54.0 In-Reply-To: <20260608121546.69910-1-yurand2000@gmail.com> References: <20260608121546.69910-1-yurand2000@gmail.com> Precedence: bulk X-Mailing-List: cgroups@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit From: luca abeni Introduce cgroup-v2 control files: - cpu.rt.max: Get/set the bandwidth of the given cgroup, or inherith from parent. - cpu.rt.internal: Get the actual remaning bandwidth for the group, removing the bw of the group's children. Introduce a number of functions to update the cgroup settings across the whole hierarchy: - tg_subtree_has_rt_tasks() Checks if the active context rooted at tg is running rt workload. Child groups which do not share the same active context are ignored. - tg_compute_children_bw() Computes the total bandwidth of the active context rooted at tg minux the root of the context itself. - tg_rt_schedulable() Runs admission tests for the current cgroup tree and the given bandwidth update. - tg_update_active_context() Updates the active context of a given subtree with a new one. - tg_rt_bandwidth() / tg_rt_internal_bandwidth() Read the max (internal) bandwidth set to the cgroup. - tg_set_rt_bandwidth() Set the bandwidth of the group. Update sched_rt_can_attach to run only tasks in the root cgroup or HCBS cgroups which have non-zero runtime. Update and reuse __checkparam_dl to check for numerical issues regarding the dl_server's parameters. Add from_ratio function to convert from period and bw to runtime, inverse of the to_ratio function. Add dl_check_tg(), which performs an admission control test similar to __dl_overflow, but this time we are updating the cgroup's total bandwidth rather than scheduling a new DEADLINE task or updating a non-cgroup deadline server. Add rcu_sched lock guard for rcu_read_{lock/unlock}_sched. Add sched_domains lock guard for sched_domains_mutex_{lock/unlock}. Add lock/unlock methods for sched_rt_handler_mutex and its lock guard. Add asserts for held sched_domains_mutex and sched_rt_handler_mutex. Co-developed-by: Alessio Balsini Signed-off-by: Alessio Balsini Co-developed-by: Andrea Parri Signed-off-by: Andrea Parri Co-developed-by: luca abeni Signed-off-by: luca abeni Signed-off-by: Yuri Andriaccio --- include/linux/rcupdate.h | 1 + include/linux/sched.h | 2 + kernel/sched/core.c | 55 ++++++ kernel/sched/deadline.c | 60 ++++-- kernel/sched/rt.c | 393 +++++++++++++++++++++++++++++++-------- kernel/sched/sched.h | 18 +- kernel/sched/syscalls.c | 2 +- 7 files changed, 445 insertions(+), 86 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index bfa765132de8..70432ca3dbb9 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -1179,6 +1179,7 @@ extern int rcu_expedited; extern int rcu_normal; DEFINE_LOCK_GUARD_0(rcu, rcu_read_lock(), rcu_read_unlock()) +DEFINE_LOCK_GUARD_0(rcu_sched, rcu_read_lock_sched(), rcu_read_unlock_sched()) DECLARE_LOCK_GUARD_0_ATTRS(rcu, __acquires_shared(RCU), __releases_shared(RCU)) #endif /* __LINUX_RCUPDATE_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index b20451fcda55..0021069581c2 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2522,4 +2522,6 @@ extern void migrate_enable(void); DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable()) +DEFINE_LOCK_GUARD_0(sched_domains, sched_domains_mutex_lock(), sched_domains_mutex_unlock()) + #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a8a81c69b3d3..1ad1efe1dca7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4815,6 +4815,14 @@ u64 to_ratio(u64 period, u64 runtime) return div64_u64(runtime << BW_SHIFT, period); } +u64 from_ratio(u64 period, u64 bw) +{ + if (bw == BW_UNIT) + return RUNTIME_INF; + + return (bw * period) >> BW_SHIFT; +} + /* * wake_up_new_task - wake up a newly created task for the first time. * @@ -10415,6 +10423,41 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of, } #endif /* CONFIG_CFS_BANDWIDTH */ +#ifdef CONFIG_RT_GROUP_SCHED +static int cpu_rt_max_show(struct seq_file *sf, void *v) +{ + struct task_group *tg = css_tg(seq_css(sf)); + long period_us, runtime_us; + + tg_rt_bandwidth(tg, &period_us, &runtime_us); + cpu_period_quota_print(sf, period_us, runtime_us); + return 0; +} + +static int cpu_rt_internal_show(struct seq_file *sf, void *v) +{ + struct task_group *tg = css_tg(seq_css(sf)); + long period_us, runtime_us; + + tg_rt_internal_bandwidth(tg, &period_us, &runtime_us); + cpu_period_quota_print(sf, period_us, runtime_us); + return 0; +} + +static ssize_t cpu_rt_max_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct task_group *tg = css_tg(of_css(of)); + u64 period_us, runtime_us; + int ret; + + ret = cpu_period_quota_parse(buf, &period_us, &runtime_us); + if (!ret) + ret = tg_set_rt_bandwidth(tg, period_us, runtime_us); + return ret ?: nbytes; +} +#endif /* CONFIG_RT_GROUP_SCHED */ + static struct cftype cpu_files[] = { #ifdef CONFIG_GROUP_SCHED_WEIGHT { @@ -10450,6 +10493,18 @@ static struct cftype cpu_files[] = { .write_u64 = cpu_burst_write_u64, }, #endif /* CONFIG_CFS_BANDWIDTH */ +#ifdef CONFIG_RT_GROUP_SCHED + { + .name = "rt.max", + .seq_show = cpu_rt_max_show, + .write = cpu_rt_max_write, + }, + { + .name = "rt.internal", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = cpu_rt_internal_show, + }, +#endif /* CONFIG_RT_GROUP_SCHED */ #ifdef CONFIG_UCLAMP_TASK_GROUP { .name = "uclamp.min", diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index a63253ec6441..b7102f643171 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -346,10 +346,45 @@ void cancel_inactive_timer(struct sched_dl_entity *dl_se) cancel_dl_timer(dl_se, &dl_se->inactive_timer); } +/* + * Used for dl_bw check and update, used under sched_rt_handler()::mutex and + * sched_domains_mutex. + */ +u64 dl_cookie; + #ifdef CONFIG_RT_GROUP_SCHED +int dl_check_tg(unsigned long total) +{ + int which_cpu; + int cap; + struct dl_bw *dl_b; + u64 gen = ++dl_cookie; + + lockdep_assert_held(&sched_domains_mutex); + lockdep_assert_held(&sched_rt_handler_mutex); + + for_each_possible_cpu(which_cpu) { + guard(rcu_sched)(); + + if (!dl_bw_visited(which_cpu, gen)) { + cap = dl_bw_capacity(which_cpu); + dl_b = dl_bw_of(which_cpu); + + guard(raw_spinlock_irqsave)(&dl_b->lock); + + if (dl_b->bw != -1 && + cap_scale(dl_b->bw, cap) < dl_b->total_bw + cap_scale(total, cap)) + return 0; + } + + } + + return 1; +} + void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period) { - struct rq *rq = container_of_const(dl_se->dl_rq, struct rq, dl); + struct rq *rq = rq_of_dl_se(dl_se); int is_active; u64 new_bw; @@ -3497,12 +3532,6 @@ DEFINE_SCHED_CLASS(dl) = { #endif }; -/* - * Used for dl_bw check and update, used under sched_rt_handler()::mutex and - * sched_domains_mutex. - */ -u64 dl_cookie; - int sched_dl_global_validate(void) { u64 runtime = global_rt_runtime(); @@ -3514,6 +3543,9 @@ int sched_dl_global_validate(void) int cpu, cap, cpus, ret = 0; unsigned long flags; + lockdep_assert_held(&sched_domains_mutex); + lockdep_assert_held(&sched_rt_handler_mutex); + /* * Here we want to check the bandwidth not being set to some * value smaller than the currently allocated bandwidth in @@ -3566,6 +3598,9 @@ void sched_dl_do_global(void) int cpu; unsigned long flags; + lockdep_assert_held(&sched_domains_mutex); + lockdep_assert_held(&sched_rt_handler_mutex); + if (global_rt_runtime() != RUNTIME_INF) new_bw = to_ratio(global_rt_period(), global_rt_runtime()); @@ -3711,7 +3746,7 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr, unsigned int * below 2^63 ns (we have to check both sched_deadline and * sched_period, as the latter can be zero). */ -bool __checkparam_dl(const struct sched_attr *attr) +bool __checkparam_dl(const struct sched_attr *attr, bool allow_zero_runtime) { u64 period, max, min; @@ -3720,14 +3755,16 @@ bool __checkparam_dl(const struct sched_attr *attr) return true; /* deadline != 0 */ - if (attr->sched_deadline == 0) + if ((!allow_zero_runtime || attr->sched_runtime != 0) && + attr->sched_deadline == 0) return false; /* * Since we truncate DL_SCALE bits, make sure we're at least * that big. */ - if (attr->sched_runtime < (1ULL << DL_SCALE)) + if ((!allow_zero_runtime || attr->sched_runtime != 0) && + attr->sched_runtime < (1ULL << DL_SCALE)) return false; /* @@ -3750,7 +3787,8 @@ bool __checkparam_dl(const struct sched_attr *attr) max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC; min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC; - if (period < min || period > max) + if ((!allow_zero_runtime || period != 0) && + (period < min || period > max)) return false; return true; diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 4f1e7af2e88d..a32b1f68e645 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1,4 +1,3 @@ -#pragma GCC diagnostic ignored "-Wunused-function" // SPDX-License-Identifier: GPL-2.0 /* * Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR @@ -2111,9 +2110,6 @@ DEFINE_SCHED_CLASS(rt) = { }; #ifdef CONFIG_RT_GROUP_SCHED -/* - * Ensure that the real time constraints are schedulable. - */ static inline int tg_has_rt_tasks(struct task_group *tg) { struct task_struct *task; @@ -2134,38 +2130,114 @@ static inline int tg_has_rt_tasks(struct task_group *tg) return ret; } -struct rt_schedulable_data { +static int __tg_subtree_has_rt_tasks(struct task_group *tg, void *data) { + struct task_group *ctx = data; + + if (dl_bandwidth_read(tg)->active_context == ctx && tg_has_rt_tasks(tg)) + return 1; + else + return 0; +} + +static int tg_subtree_has_rt_tasks(struct task_group *tg) { + lockdep_assert(rcu_read_lock_held()); + return walk_tg_tree_from(tg, __tg_subtree_has_rt_tasks, tg_nop, + dl_bandwidth_read(tg)->active_context); +} + +struct tg_update_data { struct task_group *tg; u64 rt_period; u64 rt_runtime; }; -static int tg_rt_schedulable(struct task_group *tg, void *data) +struct tg_compute_children_bw_data { + struct tg_update_data update; + struct task_group *active_context; + u64 bw_sum; +}; + +static int __tg_compute_children_bw(struct task_group *tg, void *data) { + struct tg_compute_children_bw_data *d = data; + const struct dl_bandwidth *dl_b = dl_bandwidth_read(tg); + u64 period, runtime; + + /* Skip the current task group from the sum. */ + if (tg == d->active_context) + return 0; + + period = dl_b->dl_period; + runtime = dl_b->dl_runtime; + if (tg == d->update.tg) { + period = d->update.rt_period; + runtime = d->update.rt_runtime; + } + + if (runtime == RUNTIME_INF || + dl_bandwidth_read(tg->parent)->active_context != d->active_context) + return 0; + + d->bw_sum += to_ratio(period, runtime); + return 0; +} + +static unsigned long tg_compute_children_bw(struct task_group *tg, + struct tg_update_data *data) +{ + struct tg_compute_children_bw_data sum_data = { + .active_context = tg, + .bw_sum = 0, + .update = (struct tg_update_data) { + .tg = data->tg, + .rt_period = data->rt_period, + .rt_runtime = data->rt_runtime, + } + }; + + lockdep_assert(rcu_read_lock_held()); + walk_tg_tree_from(tg, __tg_compute_children_bw, tg_nop, &sum_data); + return sum_data.bw_sum; +} + +struct rt_schedulable_data { + struct tg_update_data update; + u64 rt_runtime_remainder; +}; + +static int __tg_rt_schedulable(struct task_group *tg, void *data) { struct rt_schedulable_data *d = data; - struct task_group *child; + const struct dl_bandwidth *dl_b; u64 total, sum = 0; u64 period, runtime; - period = ktime_to_ns(tg->rt_bandwidth.rt_period); - runtime = tg->rt_bandwidth.rt_runtime; + dl_b = dl_bandwidth_read(tg); + period = dl_b->dl_period; + runtime = dl_b->dl_runtime; - if (tg == d->tg) { - period = d->rt_period; - runtime = d->rt_runtime; + if (tg == d->update.tg) { + period = d->update.rt_period; + runtime = d->update.rt_runtime; } + /* + * "max" groups are always schedulable, as they defer their access + * control to their first non-max parent. + */ + if (runtime == RUNTIME_INF) + return 0; + /* * Cannot have more runtime than the period. */ - if (runtime > period && runtime != RUNTIME_INF) + if (runtime > period) return -EINVAL; /* * Ensure we don't starve existing RT tasks if runtime turns zero. */ - if (rt_bandwidth_enabled() && !runtime && - tg->rt_bandwidth.rt_runtime && tg_has_rt_tasks(tg)) + if (dl_bandwidth_enabled() && !runtime && tg != &root_task_group && + tg_subtree_has_rt_tasks(tg)) return -EBUSY; total = to_ratio(period, runtime); @@ -2176,58 +2248,146 @@ static int tg_rt_schedulable(struct task_group *tg, void *data) if (total > to_ratio(global_rt_period(), global_rt_runtime())) return -EINVAL; + if (tg == &root_task_group) { + if (!dl_check_tg(total)) + return -EBUSY; + } + /* - * The sum of our children's runtime should not exceed our own. + * The sum of our children's runtime, plus our own bw, should not + * exceed our own max. */ - list_for_each_entry_rcu(child, &tg->children, siblings) { - period = ktime_to_ns(child->rt_bandwidth.rt_period); - runtime = child->rt_bandwidth.rt_runtime; + sum = tg_compute_children_bw(tg, &d->update); + if (sum > total) + return -EINVAL; - if (child == d->tg) { - period = d->rt_period; - runtime = d->rt_runtime; - } + /* + * Compute remaining runtime + */ + if (tg == d->update.tg) + d->rt_runtime_remainder = from_ratio(period, total - sum); + + return 0; +} - sum += to_ratio(period, runtime); +static int tg_rt_schedulable(struct tg_update_data *data, u64 *remainder_runtime) +{ + int err; + struct rt_schedulable_data d = { + .update = (struct tg_update_data) { + .tg = data->tg, + .rt_period = data->rt_period, + .rt_runtime = data->rt_runtime, + }, + .rt_runtime_remainder = 0, + }; + + /* + * Walk the cgroup tree and check schedulability constraints. + */ + lockdep_assert(rcu_read_lock_held()); + err = walk_tg_tree(__tg_rt_schedulable, tg_nop, &d); + if (err) + return err; + + *remainder_runtime = d.rt_runtime_remainder; + return 0; +} + +struct tg_update_active_context_data { + struct task_group *new_active_context; + struct task_group *old_active_context; +}; + +static int __tg_update_active_context(struct task_group *tg, void *data) { + struct tg_update_active_context_data *d = data; + + if (dl_bandwidth_read(tg)->active_context == d->old_active_context) { + guard(raw_spinlock_irq)(dl_bw_lock_of_tg(tg)); + dl_bandwidth_write(tg)->active_context = d->new_active_context; } - if (sum > total) - return -EINVAL; + return 0; +} + +static void tg_update_active_context(struct task_group *tg, + struct task_group *old_context, + struct task_group *new_context) +{ + struct tg_update_active_context_data data = { + .new_active_context = new_context, + .old_active_context = old_context, + }; + lockdep_assert(rcu_read_lock_held()); + walk_tg_tree_from(tg, __tg_update_active_context, tg_nop, &data); +} + +int tg_rt_bandwidth(struct task_group *tg, + long *rt_period_us, long *rt_runtime_us) +{ + const struct dl_bandwidth *dl_b; + + guard(raw_spinlock_irq)(dl_bw_lock_of_tg(tg)); + dl_b = dl_bandwidth_read(tg); + + *rt_runtime_us = -1; + if (dl_b->dl_runtime != RUNTIME_INF) { + *rt_runtime_us = dl_b->dl_runtime; + do_div(*rt_runtime_us, NSEC_PER_USEC); + } + + *rt_period_us = dl_b->dl_period; + do_div(*rt_period_us, NSEC_PER_USEC); return 0; } -static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime) +int tg_rt_internal_bandwidth(struct task_group *tg, + long *rt_period_us, long *rt_runtime_us) { - int ret; + const struct dl_bandwidth *dl_b; - struct rt_schedulable_data data = { - .tg = tg, - .rt_period = period, - .rt_runtime = runtime, - }; + guard(raw_spinlock_irq)(dl_bw_lock_of_tg(tg)); + dl_b = dl_bandwidth_read(tg); - rcu_read_lock(); - ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data); - rcu_read_unlock(); + *rt_runtime_us = dl_b->dl_internal_runtime; + do_div(*rt_runtime_us, NSEC_PER_USEC); - return ret; + *rt_period_us = dl_b->dl_period; + do_div(*rt_period_us, NSEC_PER_USEC); + + return 0; } -static int tg_set_rt_bandwidth(struct task_group *tg, - u64 rt_period, u64 rt_runtime) +int tg_set_rt_bandwidth(struct task_group *tg, + u64 rt_period_us, u64 rt_runtime_us) { - int i, err = 0; + struct tg_update_data update; + struct task_group *parent_ctx; + struct dl_bandwidth *dl_b; + u64 rt_period, rt_runtime, old_rt_runtime; + u64 rt_actual_runtime = 0; + u64 bw, children_bw; + struct sched_attr attr; + int err, i; - /* - * Disallowing the root group RT runtime is BAD, it would disallow the - * kernel creating (and or operating) RT threads. - */ - if (tg == &root_task_group && rt_runtime == 0) + if (rt_runtime_us == RUNTIME_INF) + rt_runtime = RUNTIME_INF; + else if ((u64)rt_runtime_us > U64_MAX / NSEC_PER_USEC) return -EINVAL; + else + rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC; - /* No period doesn't make any sense. */ - if (rt_period == 0) + if ((u64)rt_period_us > U64_MAX / NSEC_PER_USEC) + return -EINVAL; + else + rt_period = (u64)rt_period_us * NSEC_PER_USEC; + + /* + * The root_task_group bandwidth settings are only used to reserve bw + * for HCBS cgroups; runtime == "max" has no meaning there. + */ + if (rt_runtime == RUNTIME_INF && tg == &root_task_group) return -EINVAL; /* @@ -2236,34 +2396,119 @@ static int tg_set_rt_bandwidth(struct task_group *tg, if (rt_runtime != RUNTIME_INF && rt_runtime > max_rt_runtime) return -EINVAL; - mutex_lock(&rt_constraints_mutex); - err = __rt_schedulable(tg, rt_period, rt_runtime); + /* + * Check if the runtime and period min and max values are admissible. + */ + attr = (struct sched_attr){ + .sched_flags = 0, + .sched_runtime = rt_runtime, + .sched_deadline = rt_period, + .sched_period = rt_period, + }; + + if (rt_runtime != RUNTIME_INF && !__checkparam_dl(&attr, true)) + return -EINVAL; + + update = (struct tg_update_data) { + .tg = tg, + .rt_period = rt_period, + .rt_runtime = rt_runtime, + }; + + guard(mutex)(&rt_constraints_mutex); + old_rt_runtime = dl_bandwidth_read(tg)->dl_runtime; + + /* + * Disallow changing from/to "max" and a HCBS reservation if the group + * and all of its "max" children have active tasks. + */ + guard(sched_rt_handler)(); + guard(sched_domains)(); + guard(rcu)(); + if (((rt_runtime == RUNTIME_INF && old_rt_runtime != RUNTIME_INF) || + (rt_runtime != RUNTIME_INF && old_rt_runtime == RUNTIME_INF)) && + tg_subtree_has_rt_tasks(tg)) + return -EINVAL; + + err = tg_rt_schedulable(&update, &rt_actual_runtime); if (err) - goto unlock; + return err; - raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock); - tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period); - tg->rt_bandwidth.rt_runtime = rt_runtime; + scoped_guard(raw_spinlock_irq, dl_bw_lock_of_tg(tg)) { + dl_b = dl_bandwidth_write(tg); + dl_b->dl_period = rt_period; + dl_b->dl_runtime = rt_runtime; + dl_b->dl_internal_runtime = rt_actual_runtime; + } + + if (tg == &root_task_group) + return 0; + parent_ctx = dl_bandwidth_read(tg->parent)->active_context; + + /* + * If changing from/to "max" and a HCBS reservation, must update the + * active_context of self and all of its subtree. + */ + if ((rt_runtime == RUNTIME_INF && old_rt_runtime != RUNTIME_INF) || + (rt_runtime != RUNTIME_INF && old_rt_runtime == RUNTIME_INF)) + { + if (rt_runtime == RUNTIME_INF) + tg_update_active_context(tg, dl_b->active_context, parent_ctx); + else + tg_update_active_context(tg, dl_b->active_context, tg); + + } + + WARN_ON(rt_runtime == RUNTIME_INF && rt_actual_runtime != 0); for_each_possible_cpu(i) { - struct rt_rq *rt_rq = tg->rt_rq[i]; + dl_init_tg(tg->dl_se[i], rt_actual_runtime, rt_period); + } + + /* + * Update the dl_servers of the parent's active context + */ + if (parent_ctx == &root_task_group) + return 0; + + scoped_guard(raw_spinlock_irq, dl_bw_lock_of_tg(parent_ctx)) { + dl_b = dl_bandwidth_write(parent_ctx); - raw_spin_lock(&rt_rq->rt_runtime_lock); - rt_rq->rt_runtime = rt_runtime; - raw_spin_unlock(&rt_rq->rt_runtime_lock); + bw = to_ratio(dl_b->dl_period, dl_b->dl_runtime); + children_bw = tg_compute_children_bw(parent_ctx, &update); + + rt_period = dl_b->dl_period; + rt_actual_runtime = from_ratio(rt_period, bw - children_bw); + dl_b->dl_internal_runtime = rt_actual_runtime; } - raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock); -unlock: - mutex_unlock(&rt_constraints_mutex); - return err; + for_each_possible_cpu(i) { + dl_init_tg(parent_ctx->dl_se[i], rt_actual_runtime, rt_period); + } + + return 0; } int sched_rt_can_attach(struct task_group *tg) { + struct task_group *ctx; + + /* If rt group sched is disabled, tasks are always run in the root rq */ + if (!rt_group_sched_enabled()) + return 1; + + /* Can always run on the root task group */ + scoped_guard(raw_spinlock_irqsave, dl_bw_lock_of_tg(tg)) { + ctx = dl_bandwidth_read(tg)->active_context; + if (ctx == &root_task_group) + return 1; + } + /* Don't accept real-time tasks when there is no way for them to run */ - if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0) - return 0; + scoped_guard(raw_spinlock_irqsave, dl_bw_lock_of_tg(ctx)) { + if (dl_bandwidth_read(ctx)->dl_runtime == 0) + return 0; + } return 1; } @@ -2279,24 +2524,26 @@ static int sched_rt_global_validate(void) NSEC_PER_USEC > max_rt_runtime))) return -EINVAL; -#ifdef CONFIG_RT_GROUP_SCHED - if (!rt_group_sched_enabled()) - return 0; - - scoped_guard(mutex, &rt_constraints_mutex) - return __rt_schedulable(NULL, 0, 0); -#endif return 0; } +DEFINE_MUTEX(sched_rt_handler_mutex); + +void sched_rt_handler_mutex_lock() { + mutex_lock(&sched_rt_handler_mutex); +} + +void sched_rt_handler_mutex_unlock() { + mutex_unlock(&sched_rt_handler_mutex); +} + static int sched_rt_handler(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { int old_period, old_runtime; - static DEFINE_MUTEX(mutex); int ret; - mutex_lock(&mutex); + sched_rt_handler_mutex_lock(); sched_domains_mutex_lock(); old_period = sysctl_sched_rt_period; old_runtime = sysctl_sched_rt_runtime; @@ -2320,7 +2567,7 @@ static int sched_rt_handler(const struct ctl_table *table, int write, void *buff sysctl_sched_rt_runtime = old_runtime; } sched_domains_mutex_unlock(); - mutex_unlock(&mutex); + sched_rt_handler_mutex_unlock(); /* * After changing maximum available bandwidth for DEADLINE, we need to diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index efe52e162ba5..394f40dc26db 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -366,7 +366,7 @@ extern void sched_dl_do_global(void); extern int sched_dl_overflow(struct task_struct *p, int policy, const struct sched_attr *attr); extern void __setparam_dl(struct task_struct *p, const struct sched_attr *attr); extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr, unsigned int flags); -extern bool __checkparam_dl(const struct sched_attr *attr); +extern bool __checkparam_dl(const struct sched_attr *attr, bool allow_zero_runtime); extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr); extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial); extern int dl_bw_deactivate(int cpu); @@ -425,6 +425,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq, struct rq *served_rq, dl_server_pick_f pick_task); extern void sched_init_dl_servers(void); +extern int dl_check_tg(unsigned long total); extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period); extern void fair_server_init(struct rq *rq); @@ -607,6 +608,12 @@ extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b); extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq); extern bool cfs_task_bw_constrained(struct task_struct *p); +extern int tg_rt_bandwidth(struct task_group *tg, + long *rt_period_us, long *rt_runtime_us); +extern int tg_rt_internal_bandwidth(struct task_group *tg, + long *rt_period_us, long *rt_runtime_us); +extern int tg_set_rt_bandwidth(struct task_group *tg, + u64 rt_period_us, u64 rt_runtime_us); extern int sched_rt_can_attach(struct task_group *tg); extern struct task_group *sched_create_group(struct task_group *parent); @@ -2045,6 +2052,14 @@ DEFINE_LOCK_GUARD_1(raw_spin_rq_lock_irq, struct rq, raw_spin_rq_lock_irq(_T->lock), raw_spin_rq_unlock_irq(_T->lock)) +extern struct mutex sched_rt_handler_mutex; +extern void sched_rt_handler_mutex_lock(void); +extern void sched_rt_handler_mutex_unlock(void); + +DEFINE_LOCK_GUARD_0(sched_rt_handler, + sched_rt_handler_mutex_lock(), + sched_rt_handler_mutex_unlock()) + #ifdef CONFIG_NUMA enum numa_topology_type { @@ -2938,6 +2953,7 @@ extern void init_cfs_throttle_work(struct task_struct *p); #define MAX_BW ((1ULL << MAX_BW_BITS) - 1) extern u64 to_ratio(u64 period, u64 runtime); +extern u64 from_ratio(u64 period, u64 bw); extern void init_entity_runnable_average(struct sched_entity *se); extern void post_init_entity_util_avg(struct task_struct *p); diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index 773f744c0460..e5b8d2f42ea8 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -528,7 +528,7 @@ int __sched_setscheduler(struct task_struct *p, */ if (attr->sched_priority > MAX_RT_PRIO-1) return -EINVAL; - if ((dl_policy(policy) && !__checkparam_dl(attr)) || + if ((dl_policy(policy) && !__checkparam_dl(attr, false)) || (rt_policy(policy) != (attr->sched_priority != 0))) return -EINVAL; -- 2.54.0