From: Ridong Chen <ridong.chen@linux.dev>
To: "Waiman Long" <longman@redhat.com>, "Tejun Heo" <tj@kernel.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Koutný" <mkoutny@suse.com>,
"Shuah Khan" <shuah@kernel.org>,
"Juri Lelli" <juri.lelli@redhat.com>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-kselftest@vger.kernel.org,
Aaron Tomlin <atomlin@atomlin.com>,
Guopeng Zhang <guopeng.zhang@linux.dev>
Subject: Re: [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination cpusets for cpuset_*attach()
Date: Wed, 1 Jul 2026 10:51:01 +0800 [thread overview]
Message-ID: <e4608f5d-d569-4796-b0f6-55956a6b341c@linux.dev> (raw)
In-Reply-To: <20260630033344.352702-11-longman@redhat.com>
On 6/30/2026 11:33 AM, Waiman Long wrote:
> The only case where the cgroup_taskset structure requires task migration
> to multiple cpusets is when enabling a cpuset controller in cgroup v2
> where the newly created child cpusets inherits the same effective CPUs
> and memory nodes from the parent. In that case, task migration can happen
> directly with no update to tasks' CPU and memory nodes assignment and no
> further work needed from the cpuset side except updating nr_deadline_tasks
> when DL tasks are involved and setting old_mems_allowed in the child
> cpusets.
>
> Do that by tracking all the destination cpusets with a new dst_cs_head
> singly linked list. The reset_migrate_dl_data() function is integrated
> into clear_attach_data() so that it can be used for both source and
> destination cpusets.
>
> It is assumed that a given cpuset cannot be both a source and a
> destination cpuset. If such condition happens or when there are multiple
> destination cpusets with CPU or memory nodes changes, the current code
> will not handle it correctly. So it will print a warning and fail the
> attach operation in these unexpected cases as we will have to enhance the
> code to support this if such use cases are valid and not coding errors.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset-internal.h | 1 +
> kernel/cgroup/cpuset.c | 115 ++++++++++++++++++++------------
> 2 files changed, 72 insertions(+), 44 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index e7d010661fd3..d1161b0a3d85 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -149,6 +149,7 @@ struct cpuset {
> * For linking impacted cpusets during an attach operation.
> */
> struct llist_node attach_node;
> + bool attach_source;
>
> /* partition root state */
> int partition_root_state;
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b201f4ba18b6..1591d6dca66a 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -366,10 +366,12 @@ static struct {
> bool cpus_updated;
> bool mems_updated;
> bool task_work_queued;
> + bool many_dest_cs; /* Have many destination cpusets */
> struct cpuset *old_cs; /* Source cpuset */
> nodemask_t nodemask_to;
> } attach_ctx;
> static LLIST_HEAD(src_cs_head);
> +static LLIST_HEAD(dst_cs_head);
>
This looks a lot like the 'struct list_head mg_src_preload_node' and
'struct list_head mg_dst_preload_node' in struct css_set. Is there a
better way to reuse those instead of adding a separate tracking list here?
TJ, Michal, do you have any opinions on this?
> /*
> * Wait if task attach is in progress until it is done and then acquire
> @@ -3044,8 +3046,23 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
> if (!oldcs)
> return 0;
>
> - if (!llist_on_list(&oldcs->attach_node))
> + /*
> + * The same cpuset cannot be both a source and a destination.
> + * The current code does not support that, print a warning and
> + * fail the attach if so.
> + */
> + if (WARN_ON_ONCE((!oldcs->attach_source &&
> + llist_on_list(&oldcs->attach_node)) ||
> + cs->attach_source))
> + return -EINVAL;
> +
> + if (!llist_on_list(&oldcs->attach_node)) {
> llist_add(&oldcs->attach_node, &src_cs_head);
> + oldcs->attach_source = true;
> + }
> +
> + if (!llist_on_list(&cs->attach_node))
> + llist_add(&cs->attach_node, &dst_cs_head);
>
> cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
> mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> @@ -3075,35 +3092,31 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
> return 0;
> }
>
> -static int cpuset_reserve_dl_bw(struct cpuset *cs)
> +static int cpuset_reserve_dl_bw(void)
> {
> + struct cpuset *cs;
> int cpu, ret;
>
> - if (!cs->sum_migrate_dl_bw)
> - return 0;
> + llist_for_each_entry(cs, dst_cs_head.first, attach_node) {
> + if (!cs->sum_migrate_dl_bw)
> + continue;
>
> - cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> - if (unlikely(cpu >= nr_cpu_ids))
> - return -EINVAL;
> + cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> + if (unlikely(cpu >= nr_cpu_ids))
> + return -EINVAL;
>
> - ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> - if (ret)
> - return ret;
> + ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> + if (ret)
> + return ret;
>
> - cs->dl_bw_cpu = cpu;
> + cs->dl_bw_cpu = cpu;
> + }
> return 0;
> }
>
> -static void reset_migrate_dl_data(struct cpuset *cs)
> -{
> - cs->nr_migrate_dl_tasks = 0;
> - cs->sum_migrate_dl_bw = 0;
> - cs->dl_bw_cpu = -1;
> -}
> -
> /*
> * Clear and optionally apply (@cancel is false) the attach related data in the
> - * source cpusets.
> + * source or destination cpuset.
> */
> static void clear_attach_data(struct llist_head *head, bool cancel)
> {
> @@ -3115,8 +3128,13 @@ static void clear_attach_data(struct llist_head *head, bool cancel)
> if (cs->nr_migrate_dl_tasks) {
> if (!cancel)
> atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
> + else if (cs->dl_bw_cpu >= 0) /* && cacnel */
> + dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
> cs->nr_migrate_dl_tasks = 0;
> + cs->sum_migrate_dl_bw = 0;
> + cs->dl_bw_cpu = -1;
> }
> + cs->attach_source = false;
> }
> }
>
> @@ -3137,6 +3155,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> mutex_lock(&cpuset_mutex);
> attach_ctx.cpus_updated = false;
> attach_ctx.mems_updated = false;
> + attach_ctx.many_dest_cs = false;
>
> /* Check to see if task is allowed in the cpuset */
> ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
> @@ -3161,9 +3180,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> * selected as attach_ctx.old_cs.
> */
> cgroup_taskset_for_each(task, css, tset) {
> + struct cpuset *new_cs = css_cs(css);
> struct cpuset *new_oldcs = task_cs(task);
>
> - if (new_oldcs != oldcs) {
> + if ((new_oldcs != oldcs) || (new_cs != cs)) {
> + if (new_cs != cs)
> + attach_ctx.many_dest_cs = true;
> + cs = new_cs;
> oldcs = new_oldcs;
> ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
> if (ret)
> @@ -3197,12 +3220,28 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> }
> }
>
> - ret = cpuset_reserve_dl_bw(cs);
> + /*
> + * The only case where there are multiple destination cpusets for
> + * task migration is when enabling a v2 cpuset controllers where
> + * tasks will be migrated to multiple child cpusets from a parent
> + * cpuset with the same effective CPUs and memory nodes. IOW,
> + * both attach_cpus_updated and attach_mems_updated should be false.
> + * If not, it is a condition that the current code cannot handled.
> + * Print a warning and abort the attach operation as further code
> + * change will be needed.
> + */
> + if (WARN_ON_ONCE(attach_ctx.many_dest_cs && (!cpuset_v2() ||
> + attach_ctx.cpus_updated || attach_ctx.mems_updated))) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + ret = cpuset_reserve_dl_bw();
>
> out_unlock:
> if (ret) {
> - reset_migrate_dl_data(cs);
> clear_attach_data(&src_cs_head, true);
> + clear_attach_data(&dst_cs_head, true);
> } else {
> attach_ctx.in_progress++;
> }
> @@ -3213,22 +3252,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>
> static void cpuset_cancel_attach(struct cgroup_taskset *tset)
> {
> - struct cgroup_subsys_state *css;
> - struct cpuset *cs;
> -
> - cgroup_taskset_first(tset, &css);
> - cs = css_cs(css);
> -
> mutex_lock(&cpuset_mutex);
> dec_attach_in_progress_locked();
> clear_attach_data(&src_cs_head, true);
> -
> - if (cs->dl_bw_cpu >= 0)
> - dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
> -
> - if (cs->nr_migrate_dl_tasks)
> - reset_migrate_dl_data(cs);
> -
> + clear_attach_data(&dst_cs_head, true);
> mutex_unlock(&cpuset_mutex);
> }
>
> @@ -3312,25 +3339,25 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> * In the default hierarchy, enabling cpuset in the child cgroups
> * will trigger a cpuset_attach() call with no change in effective cpus
> * and mems. In that case, we can optimize out by skipping the task
> - * iteration and update.
> + * iteration and updatebut the destination cpuset list is iterated to
> + * set old_mems_allowed.
> */
> - if (cpuset_v2() && !attach_ctx.cpus_updated && !attach_ctx.mems_updated)
> + if (cpuset_v2() && !attach_ctx.cpus_updated && !attach_ctx.mems_updated) {
> + llist_for_each_entry(cs, dst_cs_head.first, attach_node)
> + cs->old_mems_allowed = attach_ctx.nodemask_to;
> goto out;
> + }
>
> + /* Task iteration shouldn't happen with attach_ctx.many_dest_cs set */
> cgroup_taskset_for_each(task, css, tset)
> cpuset_attach_task(cs, task);
>
> -out:
> if (attach_ctx.task_work_queued)
> schedule_flush_migrate_mm();
> cs->old_mems_allowed = attach_ctx.nodemask_to;
> -
> - if (cs->nr_migrate_dl_tasks) {
> - atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
> - reset_migrate_dl_data(cs);
> - }
> -
> +out:
> clear_attach_data(&src_cs_head, false);
> + clear_attach_data(&dst_cs_head, false);
> dec_attach_in_progress_locked();
>
> mutex_unlock(&cpuset_mutex);
--
Best regards
Ridong
next prev parent reply other threads:[~2026-07-01 2:51 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t Waiman Long
2026-06-30 14:01 ` Juri Lelli
2026-06-30 17:56 ` Waiman Long
2026-07-01 9:00 ` Juri Lelli
2026-07-01 1:19 ` Ridong Chen
2026-06-30 3:33 ` [PATCH-next v9 02/11] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change Waiman Long
2026-07-01 1:41 ` Ridong Chen
2026-07-01 20:19 ` Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 04/11] cgroup/cpuset: Put all task attach related variables into attach_ctx Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 05/11] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 06/11] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 07/11] cgroup/cpuset: Make attach_ctx.old_cs track task group leader Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
2026-07-01 2:14 ` Ridong Chen
2026-07-01 20:30 ` Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach() Waiman Long
2026-07-01 2:35 ` Ridong Chen
2026-07-01 20:44 ` Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination " Waiman Long
2026-07-01 2:51 ` Ridong Chen [this message]
2026-07-01 21:16 ` Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 11/11] selftests/cgroup: Add test for cpuset affinity on controller disable Waiman Long
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e4608f5d-d569-4796-b0f6-55956a6b341c@linux.dev \
--to=ridong.chen@linux.dev \
--cc=atomlin@atomlin.com \
--cc=cgroups@vger.kernel.org \
--cc=guopeng.zhang@linux.dev \
--cc=hannes@cmpxchg.org \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=longman@redhat.com \
--cc=mkoutny@suse.com \
--cc=shuah@kernel.org \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox