All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ridong Chen <ridong.chen@linux.dev>
To: "Waiman Long" <longman@redhat.com>, "Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Shuah Khan" <shuah@kernel.org>,
	"Juri Lelli" <juri.lelli@redhat.com>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org,
	Aaron Tomlin <atomlin@atomlin.com>,
	Guopeng Zhang <guopeng.zhang@linux.dev>
Subject: Re: [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination cpusets for cpuset_*attach()
Date: Wed, 1 Jul 2026 10:51:01 +0800	[thread overview]
Message-ID: <e4608f5d-d569-4796-b0f6-55956a6b341c@linux.dev> (raw)
In-Reply-To: <20260630033344.352702-11-longman@redhat.com>



On 6/30/2026 11:33 AM, Waiman Long wrote:
> The only case where the cgroup_taskset structure requires task migration
> to multiple cpusets is when enabling a cpuset controller in cgroup v2
> where the newly created child cpusets inherits the same effective CPUs
> and memory nodes from the parent. In that case, task migration can happen
> directly with no update to tasks' CPU and memory nodes assignment and no
> further work needed from the cpuset side except updating nr_deadline_tasks
> when DL tasks are involved and setting old_mems_allowed in the child
> cpusets.
> 
> Do that by tracking all the destination cpusets with a new dst_cs_head
> singly linked list. The reset_migrate_dl_data() function is integrated
> into clear_attach_data() so that it can be used for both source and
> destination cpusets.
> 
> It is assumed that a given cpuset cannot be both a source and a
> destination cpuset. If such condition happens or when there are multiple
> destination cpusets with CPU or memory nodes changes, the current code
> will not handle it correctly. So it will print a warning and fail the
> attach operation in these unexpected cases as we will have to enhance the
> code to support this if such use cases are valid and not coding errors.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>   kernel/cgroup/cpuset-internal.h |   1 +
>   kernel/cgroup/cpuset.c          | 115 ++++++++++++++++++++------------
>   2 files changed, 72 insertions(+), 44 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index e7d010661fd3..d1161b0a3d85 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -149,6 +149,7 @@ struct cpuset {
>   	 * For linking impacted cpusets during an attach operation.
>   	 */
>   	struct llist_node attach_node;
> +	bool attach_source;
>   
>   	/* partition root state */
>   	int partition_root_state;
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b201f4ba18b6..1591d6dca66a 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -366,10 +366,12 @@ static struct {
>   	bool cpus_updated;
>   	bool mems_updated;
>   	bool task_work_queued;
> +	bool many_dest_cs;	/* Have many destination cpusets */
>   	struct cpuset *old_cs;	/* Source cpuset */
>   	nodemask_t nodemask_to;
>   } attach_ctx;
>   static LLIST_HEAD(src_cs_head);
> +static LLIST_HEAD(dst_cs_head);
>   

This looks a lot like the 'struct list_head mg_src_preload_node' and
'struct list_head mg_dst_preload_node' in struct css_set. Is there a
better way to reuse those instead of adding a separate tracking list here?

TJ, Michal, do you have any opinions on this?

>   /*
>    * Wait if task attach is in progress until it is done and then acquire
> @@ -3044,8 +3046,23 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
>   	if (!oldcs)
>   		return 0;
>   
> -	if (!llist_on_list(&oldcs->attach_node))
> +	/*
> +	 * The same cpuset cannot be both a source and a destination.
> +	 * The current code does not support that, print a warning and
> +	 * fail the attach if so.
> +	 */
> +	if (WARN_ON_ONCE((!oldcs->attach_source &&
> +			  llist_on_list(&oldcs->attach_node)) ||
> +			  cs->attach_source))
> +		return -EINVAL;
> +
> +	if (!llist_on_list(&oldcs->attach_node)) {
>   		llist_add(&oldcs->attach_node, &src_cs_head);
> +		oldcs->attach_source = true;
> +	}
> +
> +	if (!llist_on_list(&cs->attach_node))
> +		llist_add(&cs->attach_node, &dst_cs_head);
>   
>   	cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
>   	mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> @@ -3075,35 +3092,31 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
>   	return 0;
>   }
>   
> -static int cpuset_reserve_dl_bw(struct cpuset *cs)
> +static int cpuset_reserve_dl_bw(void)
>   {
> +	struct cpuset *cs;
>   	int cpu, ret;
>   
> -	if (!cs->sum_migrate_dl_bw)
> -		return 0;
> +	llist_for_each_entry(cs, dst_cs_head.first, attach_node) {
> +		if (!cs->sum_migrate_dl_bw)
> +			continue;
>   
> -	cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> -	if (unlikely(cpu >= nr_cpu_ids))
> -		return -EINVAL;
> +		cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> +		if (unlikely(cpu >= nr_cpu_ids))
> +			return -EINVAL;
>   
> -	ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> -	if (ret)
> -		return ret;
> +		ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> +		if (ret)
> +			return ret;
>   
> -	cs->dl_bw_cpu = cpu;
> +		cs->dl_bw_cpu = cpu;
> +	}
>   	return 0;
>   }
>   
> -static void reset_migrate_dl_data(struct cpuset *cs)
> -{
> -	cs->nr_migrate_dl_tasks = 0;
> -	cs->sum_migrate_dl_bw = 0;
> -	cs->dl_bw_cpu = -1;
> -}
> -
>   /*
>    * Clear and optionally apply (@cancel is false) the attach related data in the
> - * source cpusets.
> + * source or destination cpuset.
>    */
>   static void clear_attach_data(struct llist_head *head, bool cancel)
>   {
> @@ -3115,8 +3128,13 @@ static void clear_attach_data(struct llist_head *head, bool cancel)
>   		if (cs->nr_migrate_dl_tasks) {
>   			if (!cancel)
>   				atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
> +			else if (cs->dl_bw_cpu >= 0) /* && cacnel */
> +				dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
>   			cs->nr_migrate_dl_tasks = 0;
> +			cs->sum_migrate_dl_bw = 0;
> +			cs->dl_bw_cpu = -1;
>   		}
> +		cs->attach_source = false;
>   	}
>   }
>   
> @@ -3137,6 +3155,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   	mutex_lock(&cpuset_mutex);
>   	attach_ctx.cpus_updated = false;
>   	attach_ctx.mems_updated = false;
> +	attach_ctx.many_dest_cs = false;
>   
>   	/* Check to see if task is allowed in the cpuset */
>   	ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
> @@ -3161,9 +3180,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   	 * selected as attach_ctx.old_cs.
>   	 */
>   	cgroup_taskset_for_each(task, css, tset) {
> +		struct cpuset *new_cs = css_cs(css);
>   		struct cpuset *new_oldcs = task_cs(task);
>   
> -		if (new_oldcs != oldcs) {
> +		if ((new_oldcs != oldcs) || (new_cs != cs)) {
> +			if (new_cs != cs)
> +				attach_ctx.many_dest_cs = true;
> +			cs = new_cs;
>   			oldcs = new_oldcs;
>   			ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
>   			if (ret)
> @@ -3197,12 +3220,28 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   		}
>   	}
>   
> -	ret = cpuset_reserve_dl_bw(cs);
> +	/*
> +	 * The only case where there are multiple destination cpusets for
> +	 * task migration is when enabling a v2 cpuset controllers where
> +	 * tasks will be migrated to multiple child cpusets from a parent
> +	 * cpuset with the same effective CPUs and memory nodes. IOW,
> +	 * both attach_cpus_updated and attach_mems_updated should be false.
> +	 * If not, it is a condition that the current code cannot handled.
> +	 * Print a warning and abort the attach operation as further code
> +	 * change will be needed.
> +	 */
> +	if (WARN_ON_ONCE(attach_ctx.many_dest_cs && (!cpuset_v2() ||
> +			 attach_ctx.cpus_updated || attach_ctx.mems_updated))) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = cpuset_reserve_dl_bw();
>   
>   out_unlock:
>   	if (ret) {
> -		reset_migrate_dl_data(cs);
>   		clear_attach_data(&src_cs_head, true);
> +		clear_attach_data(&dst_cs_head, true);
>   	} else {
>   		attach_ctx.in_progress++;
>   	}
> @@ -3213,22 +3252,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   
>   static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>   {
> -	struct cgroup_subsys_state *css;
> -	struct cpuset *cs;
> -
> -	cgroup_taskset_first(tset, &css);
> -	cs = css_cs(css);
> -
>   	mutex_lock(&cpuset_mutex);
>   	dec_attach_in_progress_locked();
>   	clear_attach_data(&src_cs_head, true);
> -
> -	if (cs->dl_bw_cpu >= 0)
> -		dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
> -
> -	if (cs->nr_migrate_dl_tasks)
> -		reset_migrate_dl_data(cs);
> -
> +	clear_attach_data(&dst_cs_head, true);
>   	mutex_unlock(&cpuset_mutex);
>   }
>   
> @@ -3312,25 +3339,25 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>   	 * In the default hierarchy, enabling cpuset in the child cgroups
>   	 * will trigger a cpuset_attach() call with no change in effective cpus
>   	 * and mems. In that case, we can optimize out by skipping the task
> -	 * iteration and update.
> +	 * iteration and updatebut the destination cpuset list is iterated to
> +	 * set old_mems_allowed.
>   	 */
> -	if (cpuset_v2() && !attach_ctx.cpus_updated && !attach_ctx.mems_updated)
> +	if (cpuset_v2() && !attach_ctx.cpus_updated && !attach_ctx.mems_updated) {
> +		llist_for_each_entry(cs, dst_cs_head.first, attach_node)
> +			cs->old_mems_allowed = attach_ctx.nodemask_to;
>   		goto out;
> +	}
>   
> +	/* Task iteration shouldn't happen with attach_ctx.many_dest_cs set */
>   	cgroup_taskset_for_each(task, css, tset)
>   		cpuset_attach_task(cs, task);
>   
> -out:
>   	if (attach_ctx.task_work_queued)
>   		schedule_flush_migrate_mm();
>   	cs->old_mems_allowed = attach_ctx.nodemask_to;
> -
> -	if (cs->nr_migrate_dl_tasks) {
> -		atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
> -		reset_migrate_dl_data(cs);
> -	}
> -
> +out:
>   	clear_attach_data(&src_cs_head, false);
> +	clear_attach_data(&dst_cs_head, false);
>   	dec_attach_in_progress_locked();
>   
>   	mutex_unlock(&cpuset_mutex);

-- 
Best regards
Ridong


  reply	other threads:[~2026-07-01  2:51 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-30  3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t Waiman Long
2026-06-30 14:01   ` Juri Lelli
2026-06-30 17:56     ` Waiman Long
2026-07-01  9:00       ` Juri Lelli
2026-07-01  1:19   ` Ridong Chen
2026-06-30  3:33 ` [PATCH-next v9 02/11] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change Waiman Long
2026-07-01  1:41   ` Ridong Chen
2026-07-01 20:19     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 04/11] cgroup/cpuset: Put all task attach related variables into attach_ctx Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 05/11] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 06/11] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 07/11] cgroup/cpuset: Make attach_ctx.old_cs track task group leader Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
2026-07-01  2:14   ` Ridong Chen
2026-07-01 20:30     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach() Waiman Long
2026-07-01  2:35   ` Ridong Chen
2026-07-01 20:44     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination " Waiman Long
2026-07-01  2:51   ` Ridong Chen [this message]
2026-07-01 21:16     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 11/11] selftests/cgroup: Add test for cpuset affinity on controller disable Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e4608f5d-d569-4796-b0f6-55956a6b341c@linux.dev \
    --to=ridong.chen@linux.dev \
    --cc=atomlin@atomlin.com \
    --cc=cgroups@vger.kernel.org \
    --cc=guopeng.zhang@linux.dev \
    --cc=hannes@cmpxchg.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=shuah@kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.