All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ridong Chen <ridong.chen@linux.dev>
To: "Waiman Long" <longman@redhat.com>, "Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Shuah Khan" <shuah@kernel.org>,
	"Juri Lelli" <juri.lelli@redhat.com>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org,
	Aaron Tomlin <atomlin@atomlin.com>,
	Guopeng Zhang <guopeng.zhang@linux.dev>
Subject: Re: [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change
Date: Wed, 1 Jul 2026 09:41:22 +0800	[thread overview]
Message-ID: <e28a38fa-5998-4e73-886c-2cbbd26ae98f@linux.dev> (raw)
In-Reply-To: <20260630033344.352702-4-longman@redhat.com>



On 6/30/2026 11:33 AM, Waiman Long wrote:
> Commit e44193d39e8d ("cpuset: let hotplug propagation work wait for
> task attaching") was introduced to let hotplug operation to wait
> until the completion of task attach operation. However, it is still
> possible that the states of the source or destination cpuset can
> be changed between the cpuset_can_attach() call and the subsequent
> cpuset_attach()/cpuset_cacnel_attach() call.
> 
> As a result, data gathered during cpuset_can_attach() cannot be reliably
> used in the subsequent cpuset_attach()/cpuset_cacnel_attach()
> call at all. Make the task attach operation more robust
> and allow the sharing of data between cpuset_can_attach() and
> cpuset_attach()/cpuset_cacnel_attach() by making cpuset_write_resmask()
> and cpuset_partition_write() wait for the completion of task attach
> as well.
> 

Nit.

s/cpuset_cacnel_attach/cpuset_cancel_attach/

> Ideally, an ongoing task attach operation should block any cpuset write
> operation that can change its internal state until the operation is
> completed. However, the attach_in_progress flag is currently per cpuset
> and only the destination cpuset will have this flag set. The flag is not
> set in the source cpuset where the tasks will be moved from. Even if we
> extend the scope to include the source cpuset, it will not block cpuset
> operation that changes the state of one of its ancestor cpuset which may
> indirectly impact the state of the source or destination cpuset. It may
> be too costly to set the flag for the whole subtree, it is far easier
> to just make the flag global and block all the cpuset write operation
> whenever a task attach operation is in progress.
> 
> Make that change by creating a new cpuset attach context (attach_ctx)
> structure to hold the global in_progress flag and use it for blocking
> cpuset write operation if a cpuset attach operation is in progress. Also
> add a new wait_attach_done_lock() helper to do the waiting for an
> ongoing attach operation and acquire the cpuset_mutex.
> 
> The comments about validate_change() are no longer valid as it won't
> be called at all if an attach operation is in progress. So the comments
> can be removed.
> 
> The per-cpuset attach_in_progress flag is also currently used in
> partition_is_populated() and cpuset_is_populated() to determine if
> an empty cpuset will have incoming task. This check will no longer be
> needed as this function will not be called when there is a task attach
> in progress. So the flag check is now removed.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>   kernel/cgroup/cpuset-internal.h | 11 +---
>   kernel/cgroup/cpuset.c          | 90 +++++++++++++++++++--------------
>   2 files changed, 53 insertions(+), 48 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index 140700e5e236..df662c7fd1a4 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -145,12 +145,6 @@ struct cpuset {
>   	 */
>   	nodemask_t old_mems_allowed;
>   
> -	/*
> -	 * Tasks are being attached to this cpuset.  Used to prevent
> -	 * zeroing cpus/mems_allowed between ->can_attach() and ->attach().
> -	 */
> -	int attach_in_progress;
> -
>   	/* partition root state */
>   	int partition_root_state;
>   
> @@ -269,10 +263,7 @@ static inline int nr_cpusets(void)
>   static inline bool cpuset_is_populated(struct cpuset *cs)
>   {
>   	lockdep_assert_cpuset_lock_held();
> -
> -	/* Cpusets in the process of attaching should be considered as populated */
> -	return cgroup_is_populated(cs->css.cgroup) ||
> -		cs->attach_in_progress;
> +	return cgroup_is_populated(cs->css.cgroup);
>   }
>   
>   /**
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 431bf210aa52..1a78d0590737 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -356,6 +356,33 @@ static struct workqueue_struct *cpuset_migrate_mm_wq;
>   
>   static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
>   
> +/*
> + * Cpuset task attach context
> + * Protected by cpuset_mutex
> + */
> +static struct {
> +	int in_progress;
> +} attach_ctx;
> +
> +/*
> + * Wait if task attach is in progress until it is done and then acquire
> + * cpuset_mutex before returning.
> + */
> +static void wait_attach_done_lock(void)
> +	    __acquires(&cpuset_mutex)
> +{
> +	for (;;) {
> +		mutex_lock(&cpuset_mutex);
> +		if (!attach_ctx.in_progress)
> +			return;
> +
> +		mutex_unlock(&cpuset_mutex);
> +
> +		/* Wait until attach operation is done to prevent racing */
> +		wait_event(cpuset_attach_wq, attach_ctx.in_progress == 0);
> +	}
> +}
> +
>   static inline void check_insane_mems_config(nodemask_t *nodes)
>   {
>   	if (!cpusets_insane_config() &&
> @@ -368,22 +395,22 @@ static inline void check_insane_mems_config(nodemask_t *nodes)
>   }
>   
>   /*
> - * decrease cs->attach_in_progress.
> - * wake_up cpuset_attach_wq if cs->attach_in_progress==0.
> + * decrease attach_ctx.in_progress.
> + * wake_up cpuset_attach_wq if attach_ctx.in_progress==0.
>    */
> -static inline void dec_attach_in_progress_locked(struct cpuset *cs)
> +static inline void dec_attach_in_progress_locked(void)
>   {
>   	lockdep_assert_cpuset_lock_held();
>   
> -	cs->attach_in_progress--;
> -	if (!cs->attach_in_progress)
> +	attach_ctx.in_progress--;
> +	if (!attach_ctx.in_progress)
>   		wake_up(&cpuset_attach_wq);
>   }
>   
> -static inline void dec_attach_in_progress(struct cpuset *cs)
> +static inline void dec_attach_in_progress(void)
>   {
>   	mutex_lock(&cpuset_mutex);
> -	dec_attach_in_progress_locked(cs);
> +	dec_attach_in_progress_locked();
>   	mutex_unlock(&cpuset_mutex);
>   }
>   
> @@ -432,8 +459,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
>   	 * nr_populated_domain_children may include populated
>   	 * csets from descendants that are partitions.
>   	 */
> -	if (cgroup_has_tasks(cs->css.cgroup) ||
> -	    cs->attach_in_progress)
> +	if (cgroup_has_tasks(cs->css.cgroup))
>   		return true;
>   
>   	rcu_read_lock();
> @@ -3091,11 +3117,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   	cs->dl_bw_cpu = cpu;
>   
>   out_success:
> -	/*
> -	 * Mark attach is in progress.  This makes validate_change() fail
> -	 * changes which zero cpus/mems_allowed.
> -	 */
> -	cs->attach_in_progress++;
> +	attach_ctx.in_progress++;
>   
>   out_unlock:
>   	if (ret)
> @@ -3113,7 +3135,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>   	cs = css_cs(css);
>   
>   	mutex_lock(&cpuset_mutex);
> -	dec_attach_in_progress_locked(cs);
> +	dec_attach_in_progress_locked();
>   
>   	if (cs->dl_bw_cpu >= 0)
>   		dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
> @@ -3226,7 +3248,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>   		reset_migrate_dl_data(cs);
>   	}
>   
> -	dec_attach_in_progress_locked(cs);
> +	dec_attach_in_progress_locked();
>   
>   	mutex_unlock(&cpuset_mutex);
>   }
> @@ -3246,7 +3268,12 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   		return -EACCES;
>   
>   	buf = strstrip(buf);
> -	cpuset_full_lock();
> +
> +	/* cpuset_mutex acquired in wait_attach_done_lock() */
> +	mutex_lock(&cpuset_top_mutex);
> +	cpus_read_lock();
> +	wait_attach_done_lock();
> +

Would it be cleaner to just pass this into cpuset_full_lock() as a flag?

void cpuset_full_lock(bool wait_attach)
{
       mutex_lock(&cpuset_top_mutex);
       cpus_read_lock();
       if (wait_attach)
               wait_attach_done_lock();
       else
               mutex_lock(&cpuset_mutex);
}
Then the two write paths become a single cpuset_full_lock(true). The
downside is the other 6 callers would need cpuset_full_lock(false). Not 
sure it's worth it — what do you think?

>   	if (!is_cpuset_online(cs))
>   		goto out_unlock;
>   
> @@ -3377,7 +3404,10 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf,
>   	else
>   		return -EINVAL;
>   
> -	cpuset_full_lock();
> +	mutex_lock(&cpuset_top_mutex);
> +	cpus_read_lock();
> +	wait_attach_done_lock();
> +
>   	if (is_cpuset_online(cs))
>   		retval = update_prstate(cs, val);
>   	cpuset_update_sd_hk_unlock();
> @@ -3616,11 +3646,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
>   	if (ret)
>   		goto out_unlock;
>   
> -	/*
> -	 * Mark attach is in progress.  This makes validate_change() fail
> -	 * changes which zero cpus/mems_allowed.
> -	 */
> -	cs->attach_in_progress++;
> +	attach_ctx.in_progress++;
>   out_unlock:
>   	mutex_unlock(&cpuset_mutex);
>   	return ret;
> @@ -3638,7 +3664,7 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
>   	if (same_cs)
>   		return;
>   
> -	dec_attach_in_progress(cs);
> +	dec_attach_in_progress();
>   }
>   
>   /*
> @@ -3670,7 +3696,7 @@ static void cpuset_fork(struct task_struct *task)
>   	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
>   	cpuset_attach_task(cs, task);
>   
> -	dec_attach_in_progress_locked(cs);
> +	dec_attach_in_progress_locked();
>   	mutex_unlock(&cpuset_mutex);
>   }
>   
> @@ -3774,20 +3800,8 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>   	bool remote;
>   	int partcmd = -1;
>   	struct cpuset *parent;
> -retry:
> -	wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
> -
> -	mutex_lock(&cpuset_mutex);
> -
> -	/*
> -	 * We have raced with task attaching. We wait until attaching
> -	 * is finished, so we won't attach a task to an empty cpuset.
> -	 */
> -	if (cs->attach_in_progress) {
> -		mutex_unlock(&cpuset_mutex);
> -		goto retry;
> -	}
>   
> +	wait_attach_done_lock();
>   	parent = parent_cs(cs);
>   	compute_effective_cpumask(&new_cpus, cs, parent);
>   	compute_effective_nodemask(&new_mems, cs, parent);

Overall this looks good to me.

Reviewed-by: Ridong Chen <ridong.chen@linux.dev>

-- 
Best regards
Ridong


  reply	other threads:[~2026-07-01  1:41 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-30  3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t Waiman Long
2026-06-30 14:01   ` Juri Lelli
2026-06-30 17:56     ` Waiman Long
2026-07-01  9:00       ` Juri Lelli
2026-07-01  1:19   ` Ridong Chen
2026-06-30  3:33 ` [PATCH-next v9 02/11] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change Waiman Long
2026-07-01  1:41   ` Ridong Chen [this message]
2026-07-01 20:19     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 04/11] cgroup/cpuset: Put all task attach related variables into attach_ctx Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 05/11] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 06/11] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 07/11] cgroup/cpuset: Make attach_ctx.old_cs track task group leader Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
2026-07-01  2:14   ` Ridong Chen
2026-07-01 20:30     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach() Waiman Long
2026-07-01  2:35   ` Ridong Chen
2026-07-01 20:44     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination " Waiman Long
2026-07-01  2:51   ` Ridong Chen
2026-07-01 21:16     ` Waiman Long
2026-06-30  3:33 ` [PATCH-next v9 11/11] selftests/cgroup: Add test for cpuset affinity on controller disable Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e28a38fa-5998-4e73-886c-2cbbd26ae98f@linux.dev \
    --to=ridong.chen@linux.dev \
    --cc=atomlin@atomlin.com \
    --cc=cgroups@vger.kernel.org \
    --cc=guopeng.zhang@linux.dev \
    --cc=hannes@cmpxchg.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=shuah@kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.