From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-174.mta1.migadu.com (out-174.mta1.migadu.com [95.215.58.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9FF0F3C061C for ; Mon, 29 Jun 2026 07:15:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.174 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782717343; cv=none; b=Ni6Hc8lT91fQ3GetQU7mvHKREpzO/XpiiF3xQaUbk66zBy78bVukLKGI1llIVBmhOqsWa9mU8cwpv/IdP4UHrS1k6wohOz48eO9YpISEP9SHGE5Izcw12AkCFkKS+w1Jtr/zN29RAOISZVkDhce3ELlO2gQOwbm4FO78S8L0/0o= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782717343; c=relaxed/simple; bh=MYaqEaysKP46BpS2btGZP46vesfc6sQogd1OFwtbopM=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=iTOKkD64jvu7LC183LD5yIg93bjH1gcfN/IDc1r1ygDB6OFYRb9hWSFE37+5D8e2S9uOg4eSzYpaGPpGs3xKBFkcz3sQ7iW802i4XxmjLbR2Zg5e3MHbWHvgVYmFdj0Okk9uYdhPrjhnkR9xsfbUx3pFsGZGKtnxJTOiCOmL66g= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=aByoCzyh; arc=none smtp.client-ip=95.215.58.174 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="aByoCzyh" Message-ID: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1782717329; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=P/Spr3uHJDXm8nUfAmFnApQAMVX8PvaQ+FFwn3/Z1Sc=; b=aByoCzyhYyplBlrBYJzeK1dw/7QbQI+nIbiAYJQwXfj7C3snAHQ0PdmkM4WllqSJV4mi0K Xp4N3IMs64BUQdLgiULOgSe/3ky1FjgSDSjod0IvQDjoYA/Mh4vWcY4vMUpM91vjCfdOZt TvZ3d5o9LrSd1T3I04EoLzIRNHiUza0= Date: Mon, 29 Jun 2026 15:14:51 +0800 Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: [PATCH v8 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change To: Waiman Long , Tejun Heo , Johannes Weiner , =?UTF-8?Q?Michal_Koutn=C3=BD?= , Farhad Alemi , Andrew Morton , Shuah Khan Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, Aaron Tomlin , Guopeng Zhang , Gregory Price , David Hildenbrand References: <20260626181923.133658-1-longman@redhat.com> <20260626181923.133658-4-longman@redhat.com> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Ridong Chen In-Reply-To: <20260626181923.133658-4-longman@redhat.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Migadu-Flow: FLOW_OUT On 6/27/2026 2:19 AM, Waiman Long wrote: > Commit e44193d39e8d ("cpuset: let hotplug propagation work wait for > task attaching") was introduced to let hotplug operation to wait > until the completion of task attach operation. However, it is still > possible that the states of the source or destination cpuset can > be changed between the cpuset_can_attach() call and the subsequent > cpuset_attach()/cpuset_cacnel_attach() call. > > As a result, data gathered during cpuset_can_attach() cannot be reliably > used in the subsequent cpuset_attach()/cpuset_cacnel_attach() > call at all. Make the task attach operation more robust > and allow the sharing of data between cpuset_can_attach() and > cpuset_attach()/cpuset_cacnel_attach() by making cpuset_write_resmask() > and cpuset_partition_write() wait for the completion of task attach > as well. > > Ideally, an ongoing task attach operation should block any cpuset write > operation that can change its internal state until the operation is > completed. However, the attach_in_progress flag is currently per cpuset > and only the destination cpuset will have this flag set. The flag is not > set in the source cpuset where the tasks will be moved from. Even if we > extend the scope to include the source cpuset, it will not block cpuset > operation that changes the state of one of its ancestor cpuset which may > indirectly impact the state of the source or destination cpuset. It may > be too costly to set the flag for the whole subtree, it is far easier > to just make the flag global and block all the cpuset write operation > whenever a task attach operation is in progress. Make that change by > creating a new cpuset attach context (attach_ctx) structure to hold the > global in_progress flag and use it for blocking cpuset write operation > if a cpuset attach operation is in progress. > > The comments about validate_change() are no longer valid as it won't > be called at all if an attach operation is in progress. So the comments > can be removed. > > The per-cpuset attach_in_progress flag is also currently used in > partition_is_populated() and cpuset_is_populated() to determine if > an empty cpuset will have incoming task. This check will no longer be > needed as this function will not be called when there is a task attach > in progress. So the flag check is now removed. > > Signed-off-by: Waiman Long > --- > kernel/cgroup/cpuset-internal.h | 11 +----- > kernel/cgroup/cpuset.c | 68 +++++++++++++++++++++------------ > 2 files changed, 44 insertions(+), 35 deletions(-) > > diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h > index f7aaf01f7cd5..817b86ba7019 100644 > --- a/kernel/cgroup/cpuset-internal.h > +++ b/kernel/cgroup/cpuset-internal.h > @@ -145,12 +145,6 @@ struct cpuset { > */ > nodemask_t old_mems_allowed; > > - /* > - * Tasks are being attached to this cpuset. Used to prevent > - * zeroing cpus/mems_allowed between ->can_attach() and ->attach(). > - */ > - int attach_in_progress; > - > /* partition root state */ > int partition_root_state; > > @@ -269,10 +263,7 @@ static inline int nr_cpusets(void) > static inline bool cpuset_is_populated(struct cpuset *cs) > { > lockdep_assert_cpuset_lock_held(); > - > - /* Cpusets in the process of attaching should be considered as populated */ > - return cgroup_is_populated(cs->css.cgroup) || > - cs->attach_in_progress; > + return cgroup_is_populated(cs->css.cgroup); > } > > /** > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index d108c2083e86..dec9785d0271 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -356,6 +356,14 @@ static struct workqueue_struct *cpuset_migrate_mm_wq; > > static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq); > > +/* > + * Cpuset task attach context > + * Protected by cpuset_mutex > + */ > +static struct { > + int in_progress; > +} attach_ctx; > + > static inline void check_insane_mems_config(nodemask_t *nodes) > { > if (!cpusets_insane_config() && > @@ -368,22 +376,22 @@ static inline void check_insane_mems_config(nodemask_t *nodes) > } > > /* > - * decrease cs->attach_in_progress. > - * wake_up cpuset_attach_wq if cs->attach_in_progress==0. > + * decrease attach_ctx.in_progress. > + * wake_up cpuset_attach_wq if attach_ctx.in_progress==0. > */ > -static inline void dec_attach_in_progress_locked(struct cpuset *cs) > +static inline void dec_attach_in_progress_locked(void) > { > lockdep_assert_cpuset_lock_held(); > > - cs->attach_in_progress--; > - if (!cs->attach_in_progress) > + attach_ctx.in_progress--; > + if (!attach_ctx.in_progress) > wake_up(&cpuset_attach_wq); > } > > -static inline void dec_attach_in_progress(struct cpuset *cs) > +static inline void dec_attach_in_progress(void) > { > mutex_lock(&cpuset_mutex); > - dec_attach_in_progress_locked(cs); > + dec_attach_in_progress_locked(); > mutex_unlock(&cpuset_mutex); > } > > @@ -432,8 +440,7 @@ static inline bool partition_is_populated(struct cpuset *cs, > * nr_populated_domain_children may include populated > * csets from descendants that are partitions. > */ > - if (cgroup_has_tasks(cs->css.cgroup) || > - cs->attach_in_progress) > + if (cgroup_has_tasks(cs->css.cgroup)) > return true; > > rcu_read_lock(); > @@ -3091,11 +3098,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset) > cs->dl_bw_cpu = cpu; > > out_success: > - /* > - * Mark attach is in progress. This makes validate_change() fail > - * changes which zero cpus/mems_allowed. > - */ > - cs->attach_in_progress++; > + attach_ctx.in_progress++; > > out_unlock: > if (ret) > @@ -3113,7 +3116,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset) > cs = css_cs(css); > > mutex_lock(&cpuset_mutex); > - dec_attach_in_progress_locked(cs); > + dec_attach_in_progress_locked(); > > if (cs->dl_bw_cpu >= 0) > dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw); > @@ -3226,7 +3229,7 @@ static void cpuset_attach(struct cgroup_taskset *tset) > reset_migrate_dl_data(cs); > } > > - dec_attach_in_progress_locked(cs); > + dec_attach_in_progress_locked(); > > mutex_unlock(&cpuset_mutex); > } > @@ -3246,10 +3249,19 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of, > return -EACCES; > > buf = strstrip(buf); > +retry: > + wait_event(cpuset_attach_wq, attach_ctx.in_progress == 0); > + > cpuset_full_lock(); > if (!is_cpuset_online(cs)) > goto out_unlock; > > + /* Don't race with task attach */ > + if (attach_ctx.in_progress) { > + cpuset_full_unlock(); > + goto retry; > + } > + > trialcs = dup_or_alloc_cpuset(cs); > if (!trialcs) { > retval = -ENOMEM; > @@ -3377,7 +3389,17 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf, > else > return -EINVAL; > > +retry: > + wait_event(cpuset_attach_wq, attach_ctx.in_progress == 0); > + > cpuset_full_lock(); > + > + /* Don't race with task attach */ > + if (attach_ctx.in_progress) { > + cpuset_full_unlock(); > + goto retry; > + } > + Would it make sense to add a helper like wait_attach_done_locked()? > if (is_cpuset_online(cs)) > retval = update_prstate(cs, val); > cpuset_update_sd_hk_unlock(); > @@ -3616,11 +3638,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset) > if (ret) > goto out_unlock; > > - /* > - * Mark attach is in progress. This makes validate_change() fail > - * changes which zero cpus/mems_allowed. > - */ > - cs->attach_in_progress++; > + attach_ctx.in_progress++; > out_unlock: > mutex_unlock(&cpuset_mutex); > return ret; > @@ -3638,7 +3656,7 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset) > if (same_cs) > return; > > - dec_attach_in_progress(cs); > + dec_attach_in_progress(); > } > > /* > @@ -3670,7 +3688,7 @@ static void cpuset_fork(struct task_struct *task) > guarantee_online_mems(cs, &cpuset_attach_nodemask_to); > cpuset_attach_task(cs, task); > > - dec_attach_in_progress_locked(cs); > + dec_attach_in_progress_locked(); > mutex_unlock(&cpuset_mutex); > } > > @@ -3775,7 +3793,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) > int partcmd = -1; > struct cpuset *parent; > retry: > - wait_event(cpuset_attach_wq, cs->attach_in_progress == 0); > + wait_event(cpuset_attach_wq, attach_ctx.in_progress == 0); > > mutex_lock(&cpuset_mutex); > > @@ -3783,7 +3801,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) > * We have raced with task attaching. We wait until attaching > * is finished, so we won't attach a task to an empty cpuset. > */ > - if (cs->attach_in_progress) { > + if (attach_ctx.in_progress) { > mutex_unlock(&cpuset_mutex); > goto retry; > } -- Best regards Ridong