Linux Kernel Selftest development
 help / color / mirror / Atom feed
From: Ridong Chen <ridong.chen@linux.dev>
To: "Waiman Long" <longman@redhat.com>, "Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Shuah Khan" <shuah@kernel.org>,
	"Juri Lelli" <juri.lelli@redhat.com>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org,
	Aaron Tomlin <atomlin@atomlin.com>,
	Guopeng Zhang <guopeng.zhang@linux.dev>
Subject: Re: [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change
Date: Wed, 1 Jul 2026 09:41:22 +0800	[thread overview]
Message-ID: <e28a38fa-5998-4e73-886c-2cbbd26ae98f@linux.dev> (raw)
In-Reply-To: <20260630033344.352702-4-longman@redhat.com>



On 6/30/2026 11:33 AM, Waiman Long wrote:
> Commit e44193d39e8d ("cpuset: let hotplug propagation work wait for
> task attaching") was introduced to let hotplug operation to wait
> until the completion of task attach operation. However, it is still
> possible that the states of the source or destination cpuset can
> be changed between the cpuset_can_attach() call and the subsequent
> cpuset_attach()/cpuset_cacnel_attach() call.
> 
> As a result, data gathered during cpuset_can_attach() cannot be reliably
> used in the subsequent cpuset_attach()/cpuset_cacnel_attach()
> call at all. Make the task attach operation more robust
> and allow the sharing of data between cpuset_can_attach() and
> cpuset_attach()/cpuset_cacnel_attach() by making cpuset_write_resmask()
> and cpuset_partition_write() wait for the completion of task attach
> as well.
> 

Nit.

s/cpuset_cacnel_attach/cpuset_cancel_attach/

> Ideally, an ongoing task attach operation should block any cpuset write
> operation that can change its internal state until the operation is
> completed. However, the attach_in_progress flag is currently per cpuset
> and only the destination cpuset will have this flag set. The flag is not
> set in the source cpuset where the tasks will be moved from. Even if we
> extend the scope to include the source cpuset, it will not block cpuset
> operation that changes the state of one of its ancestor cpuset which may
> indirectly impact the state of the source or destination cpuset. It may
> be too costly to set the flag for the whole subtree, it is far easier
> to just make the flag global and block all the cpuset write operation
> whenever a task attach operation is in progress.
> 
> Make that change by creating a new cpuset attach context (attach_ctx)
> structure to hold the global in_progress flag and use it for blocking
> cpuset write operation if a cpuset attach operation is in progress. Also
> add a new wait_attach_done_lock() helper to do the waiting for an
> ongoing attach operation and acquire the cpuset_mutex.
> 
> The comments about validate_change() are no longer valid as it won't
> be called at all if an attach operation is in progress. So the comments
> can be removed.
> 
> The per-cpuset attach_in_progress flag is also currently used in
> partition_is_populated() and cpuset_is_populated() to determine if
> an empty cpuset will have incoming task. This check will no longer be
> needed as this function will not be called when there is a task attach
> in progress. So the flag check is now removed.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>   kernel/cgroup/cpuset-internal.h | 11 +---
>   kernel/cgroup/cpuset.c          | 90 +++++++++++++++++++--------------
>   2 files changed, 53 insertions(+), 48 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index 140700e5e236..df662c7fd1a4 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -145,12 +145,6 @@ struct cpuset {
>   	 */
>   	nodemask_t old_mems_allowed;
>   
> -	/*
> -	 * Tasks are being attached to this cpuset.  Used to prevent
> -	 * zeroing cpus/mems_allowed between ->can_attach() and ->attach().
> -	 */
> -	int attach_in_progress;
> -
>   	/* partition root state */
>   	int partition_root_state;
>   
> @@ -269,10 +263,7 @@ static inline int nr_cpusets(void)
>   static inline bool cpuset_is_populated(struct cpuset *cs)
>   {
>   	lockdep_assert_cpuset_lock_held();
> -
> -	/* Cpusets in the process of attaching should be considered as populated */
> -	return cgroup_is_populated(cs->css.cgroup) ||
> -		cs->attach_in_progress;
> +	return cgroup_is_populated(cs->css.cgroup);
>   }
>   
>   /**
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 431bf210aa52..1a78d0590737 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -356,6 +356,33 @@ static struct workqueue_struct *cpuset_migrate_mm_wq;
>   
>   static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
>   
> +/*
> + * Cpuset task attach context
> + * Protected by cpuset_mutex
> + */
> +static struct {
> +	int in_progress;
> +} attach_ctx;
> +
> +/*
> + * Wait if task attach is in progress until it is done and then acquire
> + * cpuset_mutex before returning.
> + */
> +static void wait_attach_done_lock(void)
> +	    __acquires(&cpuset_mutex)
> +{
> +	for (;;) {
> +		mutex_lock(&cpuset_mutex);
> +		if (!attach_ctx.in_progress)
> +			return;
> +
> +		mutex_unlock(&cpuset_mutex);
> +
> +		/* Wait until attach operation is done to prevent racing */
> +		wait_event(cpuset_attach_wq, attach_ctx.in_progress == 0);
> +	}
> +}
> +
>   static inline void check_insane_mems_config(nodemask_t *nodes)
>   {
>   	if (!cpusets_insane_config() &&
> @@ -368,22 +395,22 @@ static inline void check_insane_mems_config(nodemask_t *nodes)
>   }
>   
>   /*
> - * decrease cs->attach_in_progress.
> - * wake_up cpuset_attach_wq if cs->attach_in_progress==0.
> + * decrease attach_ctx.in_progress.
> + * wake_up cpuset_attach_wq if attach_ctx.in_progress==0.
>    */
> -static inline void dec_attach_in_progress_locked(struct cpuset *cs)
> +static inline void dec_attach_in_progress_locked(void)
>   {
>   	lockdep_assert_cpuset_lock_held();
>   
> -	cs->attach_in_progress--;
> -	if (!cs->attach_in_progress)
> +	attach_ctx.in_progress--;
> +	if (!attach_ctx.in_progress)
>   		wake_up(&cpuset_attach_wq);
>   }
>   
> -static inline void dec_attach_in_progress(struct cpuset *cs)
> +static inline void dec_attach_in_progress(void)
>   {
>   	mutex_lock(&cpuset_mutex);
> -	dec_attach_in_progress_locked(cs);
> +	dec_attach_in_progress_locked();
>   	mutex_unlock(&cpuset_mutex);
>   }
>   
> @@ -432,8 +459,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
>   	 * nr_populated_domain_children may include populated
>   	 * csets from descendants that are partitions.
>   	 */
> -	if (cgroup_has_tasks(cs->css.cgroup) ||
> -	    cs->attach_in_progress)
> +	if (cgroup_has_tasks(cs->css.cgroup))
>   		return true;
>   
>   	rcu_read_lock();
> @@ -3091,11 +3117,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   	cs->dl_bw_cpu = cpu;
>   
>   out_success:
> -	/*
> -	 * Mark attach is in progress.  This makes validate_change() fail
> -	 * changes which zero cpus/mems_allowed.
> -	 */
> -	cs->attach_in_progress++;
> +	attach_ctx.in_progress++;
>   
>   out_unlock:
>   	if (ret)
> @@ -3113,7 +3135,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>   	cs = css_cs(css);
>   
>   	mutex_lock(&cpuset_mutex);
> -	dec_attach_in_progress_locked(cs);
> +	dec_attach_in_progress_locked();
>   
>   	if (cs->dl_bw_cpu >= 0)
>   		dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
> @@ -3226,7 +3248,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>   		reset_migrate_dl_data(cs);
>   	}
>   
> -	dec_attach_in_progress_locked(cs);
> +	dec_attach_in_progress_locked();
>   
>   	mutex_unlock(&cpuset_mutex);
>   }
> @@ -3246,7 +3268,12 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   		return -EACCES;
>   
>   	buf = strstrip(buf);
> -	cpuset_full_lock();
> +
> +	/* cpuset_mutex acquired in wait_attach_done_lock() */
> +	mutex_lock(&cpuset_top_mutex);
> +	cpus_read_lock();
> +	wait_attach_done_lock();
> +

Would it be cleaner to just pass this into cpuset_full_lock() as a flag?

void cpuset_full_lock(bool wait_attach)
{
       mutex_lock(&cpuset_top_mutex);
       cpus_read_lock();
       if (wait_attach)
               wait_attach_done_lock();
       else
               mutex_lock(&cpuset_mutex);
}
Then the two write paths become a single cpuset_full_lock(true). The
downside is the other 6 callers would need cpuset_full_lock(false). Not 
sure it's worth it — what do you think?

>   	if (!is_cpuset_online(cs))
>   		goto out_unlock;
>   
> @@ -3377,7 +3404,10 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf,
>   	else
>   		return -EINVAL;
>   
> -	cpuset_full_lock();
> +	mutex_lock(&cpuset_top_mutex);
> +	cpus_read_lock();
> +	wait_attach_done_lock();
> +
>   	if (is_cpuset_online(cs))
>   		retval = update_prstate(cs, val);
>   	cpuset_update_sd_hk_unlock();
> @@ -3616,11 +3646,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
>   	if (ret)
>   		goto out_unlock;
>   
> -	/*
> -	 * Mark attach is in progress.  This makes validate_change() fail
> -	 * changes which zero cpus/mems_allowed.
> -	 */
> -	cs->attach_in_progress++;
> +	attach_ctx.in_progress++;
>   out_unlock:
>   	mutex_unlock(&cpuset_mutex);
>   	return ret;
> @@ -3638,7 +3664,7 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
>   	if (same_cs)
>   		return;
>   
> -	dec_attach_in_progress(cs);
> +	dec_attach_in_progress();
>   }
>   
>   /*
> @@ -3670,7 +3696,7 @@ static void cpuset_fork(struct task_struct *task)
>   	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
>   	cpuset_attach_task(cs, task);
>   
> -	dec_attach_in_progress_locked(cs);
> +	dec_attach_in_progress_locked();
>   	mutex_unlock(&cpuset_mutex);
>   }
>   
> @@ -3774,20 +3800,8 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>   	bool remote;
>   	int partcmd = -1;
>   	struct cpuset *parent;
> -retry:
> -	wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
> -
> -	mutex_lock(&cpuset_mutex);
> -
> -	/*
> -	 * We have raced with task attaching. We wait until attaching
> -	 * is finished, so we won't attach a task to an empty cpuset.
> -	 */
> -	if (cs->attach_in_progress) {
> -		mutex_unlock(&cpuset_mutex);
> -		goto retry;
> -	}
>   
> +	wait_attach_done_lock();
>   	parent = parent_cs(cs);
>   	compute_effective_cpumask(&new_cpus, cs, parent);
>   	compute_effective_nodemask(&new_mems, cs, parent);

Overall this looks good to me.

Reviewed-by: Ridong Chen <ridong.chen@linux.dev>

-- 
Best regards
Ridong


  reply	other threads:[~2026-07-01  1:41 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-30  3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t Waiman Long
2026-06-30 14:01   ` Juri Lelli
2026-06-30 17:56     ` Waiman Long
2026-07-01  9:00       ` Juri Lelli
2026-07-01  1:19   ` Ridong Chen
2026-06-30  3:33 ` [PATCH-next v9 02/11] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change Waiman Long
2026-07-01  1:41   ` Ridong Chen [this message]
2026-07-01 20:19     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 04/11] cgroup/cpuset: Put all task attach related variables into attach_ctx Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 05/11] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 06/11] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 07/11] cgroup/cpuset: Make attach_ctx.old_cs track task group leader Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
2026-07-01  2:14   ` Ridong Chen
2026-07-01 20:30     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach() Waiman Long
2026-07-01  2:35   ` Ridong Chen
2026-07-01 20:44     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination " Waiman Long
2026-07-01  2:51   ` Ridong Chen
2026-07-01 21:16     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 11/11] selftests/cgroup: Add test for cpuset affinity on controller disable Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e28a38fa-5998-4e73-886c-2cbbd26ae98f@linux.dev \
    --to=ridong.chen@linux.dev \
    --cc=atomlin@atomlin.com \
    --cc=cgroups@vger.kernel.org \
    --cc=guopeng.zhang@linux.dev \
    --cc=hannes@cmpxchg.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=shuah@kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox