From: Ridong Chen <ridong.chen@linux.dev>
To: "Waiman Long" <longman@redhat.com>, "Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Li Zefan" <lizefan@huawei.com>,
	"Farhad Alemi" <farhad.alemi@berkeley.edu>,
	"Andrew Morton" <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	Aaron Tomlin <atomlin@atomlin.com>,
	Guopeng Zhang <guopeng.zhang@linux.dev>,
	Gregory Price <gourry@gourry.net>,
	David Hildenbrand <david@kernel.org>
Subject: Re: [PATCH v7 7/9] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task()
Date: Mon, 22 Jun 2026 10:48:26 +0800
Message-ID: <e8fb50d3-0831-4caf-b4e4-2af94ac86263@linux.dev>
In-Reply-To: <20260621032816.1806773-8-longman@redhat.com>



On 6/21/2026 11:28 AM, Waiman Long wrote:
> The cpuset_attach_task() was introduced in commit 42a11bf5c543
> ("cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly")
> to enable the CLONE_INTO_CGROUP flag of clone(2) to behave more like
> moving a task from one cpuset into another one. That commits didn't
> move the mpol_rebind_mm() and cpuset_migrate_mm() calls for group leader
> into cpuset_attach_task().
> 
> When the CLONE_INTO_CGROUP flag is used without CLONE_THREAD, the new
> task is its own group leader. So it is still not equivalent to moving
> task between cpusets in this case. Make CLONE_INTO_CGROUP behaves
> more close to cpuset_attach() by moving the mpol_rebind_mm() and
> cpuset_migrate_mm() calls inside cpuset_attach_task(). As a result,
> the following static variables will have to be updated in cpuset_fork().
>   - cpuset_attach_old_cs
>   - attach_cpus_updated
>   - attach_mems_updated
>   - queue_task_work
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>   kernel/cgroup/cpuset.c | 105 ++++++++++++++++++++++++-----------------
>   1 file changed, 62 insertions(+), 43 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 0375dae26d0b..511afb077e2d 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2981,8 +2981,13 @@ static int update_prstate(struct cpuset *cs, int new_prs)
>   /*
>    * cpuset_can_attach() and cpuset_attach() specific internal data
>    * Protected by cpuset_mutex
> + *
> + * The attach_cpus_updated/attach_mems_updated flags are set in either
> + * cpuset_attach() or cpuset_fork() and used in cpuset_attach_task().
>    */
>   static struct cpuset *cpuset_attach_old_cs;
> +static bool attach_cpus_updated;
> +static bool attach_mems_updated;
>   
>   /*
>    * Check to see if a cpuset can accept a new task
> @@ -3157,9 +3162,12 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>    */
>   static cpumask_var_t cpus_attach;
>   static nodemask_t cpuset_attach_nodemask_to;
> +static bool queue_task_work;
>   

These standalone state variables keep multiplying, and they are getting
harder to maintain. Could we group them into a struct and manage them
together instead of adding more globals?

Something like:

```
struct cpuset_attach_ctx {
	struct cpuset     *old_cs;
	struct llist_head src_cs, dst_cs;
	bool              cpus_updated, mems_updated, queue_work;
	nodemask_t        nodemask_to;
};
```
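
To make the idea more concrete, here is a rough userspace sketch of how such
a context could be reset in one place at the start of an attach or fork
operation. The stand-in types, the single `attach_ctx` instance and the
`attach_ctx_init()` helper are all hypothetical and only for illustration;
they are not part of the patch or of the current cpuset code.

```c
#include <stdbool.h>
#include <string.h>

/* Userspace stand-ins for kernel types (hypothetical, illustration only). */
struct cpuset { int id; };
struct llist_head { void *first; };
typedef struct { unsigned long bits; } nodemask_t;

/* Same layout as the struct suggested above. */
struct cpuset_attach_ctx {
	struct cpuset     *old_cs;
	struct llist_head src_cs, dst_cs;
	bool              cpus_updated, mems_updated, queue_work;
	nodemask_t        nodemask_to;
};

/* One instance replacing the separate statics; assumed to stay under cpuset_mutex. */
static struct cpuset_attach_ctx attach_ctx;

/*
 * Hypothetical helper: clear all per-operation state, then record the
 * source cpuset and the two "updated" flags. queue_work always starts
 * out false, so a stale value from a previous operation cannot leak in.
 */
static void attach_ctx_init(struct cpuset_attach_ctx *ctx,
			    struct cpuset *old_cs,
			    bool cpus_updated, bool mems_updated)
{
	memset(ctx, 0, sizeof(*ctx));
	ctx->old_cs = old_cs;
	ctx->cpus_updated = cpus_updated;
	ctx->mems_updated = mems_updated;
}
```

With something along these lines, cpuset_attach() would pass the computed
cpumask/nodemask comparison results, while the CLONE_INTO_CGROUP path in
cpuset_fork() would simply pass true for both flags, and neither caller
would need to remember to clear queue_work by hand.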

>   static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
>   {
> +	struct mm_struct *mm;
> +
>   	lockdep_assert_cpuset_lock_held();
>   
>   	if (cs != &top_cpuset)
> @@ -3173,28 +3181,60 @@ static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
>   	 */
>   	WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
>   
> +	if (cpuset_v2() && !attach_mems_updated)
> +		return;
> +
>   	cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
>   	cpuset1_update_task_spread_flags(cs, task);
> +
> +	if ((task != task->group_leader) ||
> +	    (!is_memory_migrate(cs) && !attach_mems_updated))
> +		return;
> +
> +	/*
> +	 * Change mm for threadgroup leader. This is expensive and may
> +	 * sleep and should be moved outside migration path proper.
> +	 */
> +	mm = get_task_mm(task);
> +	if (mm) {
> +		struct cpuset *oldcs = cpuset_attach_old_cs;
> +
> +		mpol_rebind_mm(mm, &cs->effective_mems);
> +
> +		/*
> +		 * old_mems_allowed is the same with mems_allowed
> +		 * here, except if this task is being moved
> +		 * automatically due to hotplug.  In that case
> +		 * @mems_allowed has been updated and is empty, so
> +		 * @old_mems_allowed is the right nodesets that we
> +		 * migrate mm from.
> +		 */
> +		if (is_memory_migrate(cs)) {
> +			cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
> +					  &cpuset_attach_nodemask_to);
> +			queue_task_work = true;
> +		} else {
> +			mmput(mm);
> +		}
> +	}
>   }
>   
>   static void cpuset_attach(struct cgroup_taskset *tset)
>   {
>   	struct task_struct *task;
> -	struct task_struct *leader;
>   	struct cgroup_subsys_state *css;
>   	struct cpuset *cs;
>   	struct cpuset *oldcs = cpuset_attach_old_cs;
> -	bool cpus_updated, mems_updated;
> -	bool queue_task_work = false;
>   
>   	cgroup_taskset_first(tset, &css);
>   	cs = css_cs(css);
>   
>   	lockdep_assert_cpus_held();	/* see cgroup_attach_lock() */
>   	mutex_lock(&cpuset_mutex);
> -	cpus_updated = !cpumask_equal(cs->effective_cpus,
> -				      oldcs->effective_cpus);
> -	mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> +	queue_task_work = false;
> +
> +	attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
> +	attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
>   	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
>   
>   	/*
> @@ -3203,44 +3243,12 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>   	 * and mems. In that case, we can optimize out by skipping the task
>   	 * iteration and update.
>   	 */
> -	if (cpuset_v2() && !cpus_updated && !mems_updated)
> +	if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated)
>   		goto out;
>   
>   	cgroup_taskset_for_each(task, css, tset)
>   		cpuset_attach_task(cs, task);
>   
> -	/*
> -	 * Change mm for all threadgroup leaders. This is expensive and may
> -	 * sleep and should be moved outside migration path proper. Skip it
> -	 * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
> -	 * not set.
> -	 */
> -	if (!is_memory_migrate(cs) && !mems_updated)
> -		goto out;
> -
> -	cgroup_taskset_for_each_leader(leader, css, tset) {
> -		struct mm_struct *mm = get_task_mm(leader);
> -
> -		if (mm) {
> -			mpol_rebind_mm(mm, &cs->effective_mems);
> -
> -			/*
> -			 * old_mems_allowed is the same with mems_allowed
> -			 * here, except if this task is being moved
> -			 * automatically due to hotplug.  In that case
> -			 * @mems_allowed has been updated and is empty, so
> -			 * @old_mems_allowed is the right nodesets that we
> -			 * migrate mm from.
> -			 */
> -			if (is_memory_migrate(cs)) {
> -				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
> -						  &cpuset_attach_nodemask_to);
> -				queue_task_work = true;
> -			} else
> -				mmput(mm);
> -		}
> -	}
> -
>   out:
>   	if (queue_task_work)
>   		schedule_flush_migrate_mm();
> @@ -3689,15 +3697,14 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
>    */
>   static void cpuset_fork(struct task_struct *task)
>   {
> -	struct cpuset *cs;
> -	bool same_cs;
> +	struct cpuset *cs, *oldcs;
>   
>   	rcu_read_lock();
>   	cs = task_cs(task);
> -	same_cs = (cs == task_cs(current));
> +	oldcs = task_cs(current);
>   	rcu_read_unlock();
>   
> -	if (same_cs) {
> +	if (cs == oldcs) {
>   		if (cs == &top_cpuset)
>   			return;
>   
> @@ -3709,7 +3716,19 @@ static void cpuset_fork(struct task_struct *task)
>   	/* CLONE_INTO_CGROUP */
>   	mutex_lock(&cpuset_mutex);
>   	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
> +	cs->old_mems_allowed = cpuset_attach_nodemask_to;
> +
> +	/*
> +	 * Assume CPUs and memory nodes are updated
> +	 * A CLONE_INTO_CGROUP operation should have taken the cgroup mutex
> +	 * and so there shouldn't be a competing cpuset_attach() operation.
> +	 */
> +	attach_cpus_updated = attach_mems_updated = true;
> +	queue_task_work = false;
> +	cpuset_attach_old_cs = oldcs;
>   	cpuset_attach_task(cs, task);
> +	if (queue_task_work)
> +		schedule_flush_migrate_mm();
>   
>   	dec_attach_in_progress_locked(cs);
>   	mutex_unlock(&cpuset_mutex);

-- 
Best regards
Ridong


