Linux Kernel Selftest development
 help / color / mirror / Atom feed
From: Ridong Chen <ridong.chen@linux.dev>
To: "Waiman Long" <longman@redhat.com>, "Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Shuah Khan" <shuah@kernel.org>,
	"Juri Lelli" <juri.lelli@redhat.com>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org,
	Aaron Tomlin <atomlin@atomlin.com>,
	Guopeng Zhang <guopeng.zhang@linux.dev>
Subject: Re: [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination cpusets for cpuset_*attach()
Date: Wed, 1 Jul 2026 10:51:01 +0800	[thread overview]
Message-ID: <e4608f5d-d569-4796-b0f6-55956a6b341c@linux.dev> (raw)
In-Reply-To: <20260630033344.352702-11-longman@redhat.com>



On 6/30/2026 11:33 AM, Waiman Long wrote:
> The only case where the cgroup_taskset structure requires task migration
> to multiple cpusets is when enabling a cpuset controller in cgroup v2
> where the newly created child cpusets inherits the same effective CPUs
> and memory nodes from the parent. In that case, task migration can happen
> directly with no update to tasks' CPU and memory nodes assignment and no
> further work needed from the cpuset side except updating nr_deadline_tasks
> when DL tasks are involved and setting old_mems_allowed in the child
> cpusets.
> 
> Do that by tracking all the destination cpusets with a new dst_cs_head
> singly linked list. The reset_migrate_dl_data() function is integrated
> into clear_attach_data() so that it can be used for both source and
> destination cpusets.
> 
> It is assumed that a given cpuset cannot be both a source and a
> destination cpuset. If such condition happens or when there are multiple
> destination cpusets with CPU or memory nodes changes, the current code
> will not handle it correctly. So it will print a warning and fail the
> attach operation in these unexpected cases as we will have to enhance the
> code to support this if such use cases are valid and not coding errors.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>   kernel/cgroup/cpuset-internal.h |   1 +
>   kernel/cgroup/cpuset.c          | 115 ++++++++++++++++++++------------
>   2 files changed, 72 insertions(+), 44 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index e7d010661fd3..d1161b0a3d85 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -149,6 +149,7 @@ struct cpuset {
>   	 * For linking impacted cpusets during an attach operation.
>   	 */
>   	struct llist_node attach_node;
> +	bool attach_source;
>   
>   	/* partition root state */
>   	int partition_root_state;
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b201f4ba18b6..1591d6dca66a 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -366,10 +366,12 @@ static struct {
>   	bool cpus_updated;
>   	bool mems_updated;
>   	bool task_work_queued;
> +	bool many_dest_cs;	/* Have many destination cpusets */
>   	struct cpuset *old_cs;	/* Source cpuset */
>   	nodemask_t nodemask_to;
>   } attach_ctx;
>   static LLIST_HEAD(src_cs_head);
> +static LLIST_HEAD(dst_cs_head);
>   

This looks a lot like the 'struct list_head mg_src_preload_node' and
'struct list_head mg_dst_preload_node' in struct css_set. Is there a
better way to reuse those instead of adding a separate tracking list here?

TJ, Michal, do you have any opinions on this?

>   /*
>    * Wait if task attach is in progress until it is done and then acquire
> @@ -3044,8 +3046,23 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
>   	if (!oldcs)
>   		return 0;
>   
> -	if (!llist_on_list(&oldcs->attach_node))
> +	/*
> +	 * The same cpuset cannot be both a source and a destination.
> +	 * The current code does not support that, print a warning and
> +	 * fail the attach if so.
> +	 */
> +	if (WARN_ON_ONCE((!oldcs->attach_source &&
> +			  llist_on_list(&oldcs->attach_node)) ||
> +			  cs->attach_source))
> +		return -EINVAL;
> +
> +	if (!llist_on_list(&oldcs->attach_node)) {
>   		llist_add(&oldcs->attach_node, &src_cs_head);
> +		oldcs->attach_source = true;
> +	}
> +
> +	if (!llist_on_list(&cs->attach_node))
> +		llist_add(&cs->attach_node, &dst_cs_head);
>   
>   	cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
>   	mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> @@ -3075,35 +3092,31 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
>   	return 0;
>   }
>   
> -static int cpuset_reserve_dl_bw(struct cpuset *cs)
> +static int cpuset_reserve_dl_bw(void)
>   {
> +	struct cpuset *cs;
>   	int cpu, ret;
>   
> -	if (!cs->sum_migrate_dl_bw)
> -		return 0;
> +	llist_for_each_entry(cs, dst_cs_head.first, attach_node) {
> +		if (!cs->sum_migrate_dl_bw)
> +			continue;
>   
> -	cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> -	if (unlikely(cpu >= nr_cpu_ids))
> -		return -EINVAL;
> +		cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> +		if (unlikely(cpu >= nr_cpu_ids))
> +			return -EINVAL;
>   
> -	ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> -	if (ret)
> -		return ret;
> +		ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> +		if (ret)
> +			return ret;
>   
> -	cs->dl_bw_cpu = cpu;
> +		cs->dl_bw_cpu = cpu;
> +	}
>   	return 0;
>   }
>   
> -static void reset_migrate_dl_data(struct cpuset *cs)
> -{
> -	cs->nr_migrate_dl_tasks = 0;
> -	cs->sum_migrate_dl_bw = 0;
> -	cs->dl_bw_cpu = -1;
> -}
> -
>   /*
>    * Clear and optionally apply (@cancel is false) the attach related data in the
> - * source cpusets.
> + * source or destination cpuset.
>    */
>   static void clear_attach_data(struct llist_head *head, bool cancel)
>   {
> @@ -3115,8 +3128,13 @@ static void clear_attach_data(struct llist_head *head, bool cancel)
>   		if (cs->nr_migrate_dl_tasks) {
>   			if (!cancel)
>   				atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
> +			else if (cs->dl_bw_cpu >= 0) /* && cacnel */
> +				dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
>   			cs->nr_migrate_dl_tasks = 0;
> +			cs->sum_migrate_dl_bw = 0;
> +			cs->dl_bw_cpu = -1;
>   		}
> +		cs->attach_source = false;
>   	}
>   }
>   
> @@ -3137,6 +3155,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   	mutex_lock(&cpuset_mutex);
>   	attach_ctx.cpus_updated = false;
>   	attach_ctx.mems_updated = false;
> +	attach_ctx.many_dest_cs = false;
>   
>   	/* Check to see if task is allowed in the cpuset */
>   	ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
> @@ -3161,9 +3180,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   	 * selected as attach_ctx.old_cs.
>   	 */
>   	cgroup_taskset_for_each(task, css, tset) {
> +		struct cpuset *new_cs = css_cs(css);
>   		struct cpuset *new_oldcs = task_cs(task);
>   
> -		if (new_oldcs != oldcs) {
> +		if ((new_oldcs != oldcs) || (new_cs != cs)) {
> +			if (new_cs != cs)
> +				attach_ctx.many_dest_cs = true;
> +			cs = new_cs;
>   			oldcs = new_oldcs;
>   			ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
>   			if (ret)
> @@ -3197,12 +3220,28 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   		}
>   	}
>   
> -	ret = cpuset_reserve_dl_bw(cs);
> +	/*
> +	 * The only case where there are multiple destination cpusets for
> +	 * task migration is when enabling a v2 cpuset controllers where
> +	 * tasks will be migrated to multiple child cpusets from a parent
> +	 * cpuset with the same effective CPUs and memory nodes. IOW,
> +	 * both attach_cpus_updated and attach_mems_updated should be false.
> +	 * If not, it is a condition that the current code cannot handled.
> +	 * Print a warning and abort the attach operation as further code
> +	 * change will be needed.
> +	 */
> +	if (WARN_ON_ONCE(attach_ctx.many_dest_cs && (!cpuset_v2() ||
> +			 attach_ctx.cpus_updated || attach_ctx.mems_updated))) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = cpuset_reserve_dl_bw();
>   
>   out_unlock:
>   	if (ret) {
> -		reset_migrate_dl_data(cs);
>   		clear_attach_data(&src_cs_head, true);
> +		clear_attach_data(&dst_cs_head, true);
>   	} else {
>   		attach_ctx.in_progress++;
>   	}
> @@ -3213,22 +3252,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   
>   static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>   {
> -	struct cgroup_subsys_state *css;
> -	struct cpuset *cs;
> -
> -	cgroup_taskset_first(tset, &css);
> -	cs = css_cs(css);
> -
>   	mutex_lock(&cpuset_mutex);
>   	dec_attach_in_progress_locked();
>   	clear_attach_data(&src_cs_head, true);
> -
> -	if (cs->dl_bw_cpu >= 0)
> -		dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
> -
> -	if (cs->nr_migrate_dl_tasks)
> -		reset_migrate_dl_data(cs);
> -
> +	clear_attach_data(&dst_cs_head, true);
>   	mutex_unlock(&cpuset_mutex);
>   }
>   
> @@ -3312,25 +3339,25 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>   	 * In the default hierarchy, enabling cpuset in the child cgroups
>   	 * will trigger a cpuset_attach() call with no change in effective cpus
>   	 * and mems. In that case, we can optimize out by skipping the task
> -	 * iteration and update.
> +	 * iteration and updatebut the destination cpuset list is iterated to
> +	 * set old_mems_allowed.
>   	 */
> -	if (cpuset_v2() && !attach_ctx.cpus_updated && !attach_ctx.mems_updated)
> +	if (cpuset_v2() && !attach_ctx.cpus_updated && !attach_ctx.mems_updated) {
> +		llist_for_each_entry(cs, dst_cs_head.first, attach_node)
> +			cs->old_mems_allowed = attach_ctx.nodemask_to;
>   		goto out;
> +	}
>   
> +	/* Task iteration shouldn't happen with attach_ctx.many_dest_cs set */
>   	cgroup_taskset_for_each(task, css, tset)
>   		cpuset_attach_task(cs, task);
>   
> -out:
>   	if (attach_ctx.task_work_queued)
>   		schedule_flush_migrate_mm();
>   	cs->old_mems_allowed = attach_ctx.nodemask_to;
> -
> -	if (cs->nr_migrate_dl_tasks) {
> -		atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
> -		reset_migrate_dl_data(cs);
> -	}
> -
> +out:
>   	clear_attach_data(&src_cs_head, false);
> +	clear_attach_data(&dst_cs_head, false);
>   	dec_attach_in_progress_locked();
>   
>   	mutex_unlock(&cpuset_mutex);

-- 
Best regards
Ridong


  reply	other threads:[~2026-07-01  2:51 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-30  3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t Waiman Long
2026-06-30 14:01   ` Juri Lelli
2026-06-30 17:56     ` Waiman Long
2026-07-01  9:00       ` Juri Lelli
2026-07-01  1:19   ` Ridong Chen
2026-06-30  3:33 ` [PATCH-next v9 02/11] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change Waiman Long
2026-07-01  1:41   ` Ridong Chen
2026-07-01 20:19     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 04/11] cgroup/cpuset: Put all task attach related variables into attach_ctx Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 05/11] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 06/11] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 07/11] cgroup/cpuset: Make attach_ctx.old_cs track task group leader Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
2026-07-01  2:14   ` Ridong Chen
2026-07-01 20:30     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach() Waiman Long
2026-07-01  2:35   ` Ridong Chen
2026-07-01 20:44     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination " Waiman Long
2026-07-01  2:51   ` Ridong Chen [this message]
2026-07-01 21:16     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 11/11] selftests/cgroup: Add test for cpuset affinity on controller disable Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e4608f5d-d569-4796-b0f6-55956a6b341c@linux.dev \
    --to=ridong.chen@linux.dev \
    --cc=atomlin@atomlin.com \
    --cc=cgroups@vger.kernel.org \
    --cc=guopeng.zhang@linux.dev \
    --cc=hannes@cmpxchg.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=shuah@kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox