Linux cgroups development
From: Waiman Long <longman@redhat.com>
To: "Ridong Chen" <ridong.chen@linux.dev>,
	"Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Farhad Alemi" <farhad.alemi@berkeley.edu>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Shuah Khan" <shuah@kernel.org>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org,
	Aaron Tomlin <atomlin@atomlin.com>,
	Guopeng Zhang <guopeng.zhang@linux.dev>,
	Gregory Price <gourry@gourry.net>,
	David Hildenbrand <david@kernel.org>
Subject: Re: [PATCH v8 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change
Date: Mon, 29 Jun 2026 17:53:33 -0400
Message-ID: <6b9c7f81-b77a-4ab6-9e35-ece3bf4ad475@redhat.com>
In-Reply-To: <e856149c-e4cf-430f-80e0-a6f402faec99@linux.dev>

On 6/29/26 3:14 AM, Ridong Chen wrote:
>
>
> On 6/27/2026 2:19 AM, Waiman Long wrote:
>> Commit e44193d39e8d ("cpuset: let hotplug propagation work wait for
>> task attaching") was introduced to make hotplug operations wait
>> until the task attach operation completes. However, it is still
>> possible for the state of the source or destination cpuset to
>> change between the cpuset_can_attach() call and the subsequent
>> cpuset_attach()/cpuset_cancel_attach() call.
>>
>> As a result, data gathered during cpuset_can_attach() cannot be reliably
>> used in the subsequent cpuset_attach()/cpuset_cancel_attach()
>> call. Make the task attach operation more robust
>> and allow data to be shared between cpuset_can_attach() and
>> cpuset_attach()/cpuset_cancel_attach() by making cpuset_write_resmask()
>> and cpuset_partition_write() wait for the completion of task attach
>> as well.
>>
>> Ideally, an ongoing task attach operation should block any cpuset write
>> operation that can change its internal state until the operation is
>> completed. However, the attach_in_progress flag is currently per cpuset
>> and only the destination cpuset has this flag set. The flag is not
>> set in the source cpuset that the tasks are moved from. Even if we
>> extend the scope to include the source cpuset, it will not block a cpuset
>> operation that changes the state of one of its ancestor cpusets, which may
>> indirectly impact the state of the source or destination cpuset. As it may
>> be too costly to set the flag for the whole subtree, it is far easier
>> to just make the flag global and block all cpuset write operations
>> whenever a task attach operation is in progress. Make that change by
>> creating a new cpuset attach context (attach_ctx) structure to hold the
>> global in_progress flag and use it to block cpuset write operations
>> while a cpuset attach operation is in progress.
>>
>> The comments about validate_change() are no longer valid as it won't
>> be called at all if an attach operation is in progress. So the comments
>> can be removed.
>>
>> The per-cpuset attach_in_progress flag is also currently used in
>> partition_is_populated() and cpuset_is_populated() to determine if
>> an empty cpuset will have incoming tasks. This check is no longer
>> needed as these functions will not be called when there is a task attach
>> in progress. So the flag check is now removed.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset-internal.h | 11 +-----
>>   kernel/cgroup/cpuset.c          | 68 +++++++++++++++++++++------------
>>   2 files changed, 44 insertions(+), 35 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
>> index f7aaf01f7cd5..817b86ba7019 100644
>> --- a/kernel/cgroup/cpuset-internal.h
>> +++ b/kernel/cgroup/cpuset-internal.h
>> @@ -145,12 +145,6 @@ struct cpuset {
>>        */
>>       nodemask_t old_mems_allowed;
>>   -    /*
>> -     * Tasks are being attached to this cpuset.  Used to prevent
>> -     * zeroing cpus/mems_allowed between ->can_attach() and ->attach().
>> -     */
>> -    int attach_in_progress;
>> -
>>       /* partition root state */
>>       int partition_root_state;
>>   @@ -269,10 +263,7 @@ static inline int nr_cpusets(void)
>>   static inline bool cpuset_is_populated(struct cpuset *cs)
>>   {
>>       lockdep_assert_cpuset_lock_held();
>> -
>> -    /* Cpusets in the process of attaching should be considered as populated */
>> -    return cgroup_is_populated(cs->css.cgroup) ||
>> -        cs->attach_in_progress;
>> +    return cgroup_is_populated(cs->css.cgroup);
>>   }
>>     /**
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index d108c2083e86..dec9785d0271 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -356,6 +356,14 @@ static struct workqueue_struct *cpuset_migrate_mm_wq;
>>     static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
>>   +/*
>> + * Cpuset task attach context
>> + * Protected by cpuset_mutex
>> + */
>> +static struct {
>> +    int in_progress;
>> +} attach_ctx;
>> +
>>   static inline void check_insane_mems_config(nodemask_t *nodes)
>>   {
>>       if (!cpusets_insane_config() &&
>> @@ -368,22 +376,22 @@ static inline void check_insane_mems_config(nodemask_t *nodes)
>>   }
>>     /*
>> - * decrease cs->attach_in_progress.
>> - * wake_up cpuset_attach_wq if cs->attach_in_progress==0.
>> + * decrease attach_ctx.in_progress.
>> + * wake_up cpuset_attach_wq if attach_ctx.in_progress==0.
>>    */
>> -static inline void dec_attach_in_progress_locked(struct cpuset *cs)
>> +static inline void dec_attach_in_progress_locked(void)
>>   {
>>       lockdep_assert_cpuset_lock_held();
>>   -    cs->attach_in_progress--;
>> -    if (!cs->attach_in_progress)
>> +    attach_ctx.in_progress--;
>> +    if (!attach_ctx.in_progress)
>>           wake_up(&cpuset_attach_wq);
>>   }
>>   -static inline void dec_attach_in_progress(struct cpuset *cs)
>> +static inline void dec_attach_in_progress(void)
>>   {
>>       mutex_lock(&cpuset_mutex);
>> -    dec_attach_in_progress_locked(cs);
>> +    dec_attach_in_progress_locked();
>>       mutex_unlock(&cpuset_mutex);
>>   }
>>   @@ -432,8 +440,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
>>        * nr_populated_domain_children may include populated
>>        * csets from descendants that are partitions.
>>        */
>> -    if (cgroup_has_tasks(cs->css.cgroup) ||
>> -        cs->attach_in_progress)
>> +    if (cgroup_has_tasks(cs->css.cgroup))
>>           return true;
>>         rcu_read_lock();
>> @@ -3091,11 +3098,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>>       cs->dl_bw_cpu = cpu;
>>     out_success:
>> -    /*
>> -     * Mark attach is in progress.  This makes validate_change() fail
>> -     * changes which zero cpus/mems_allowed.
>> -     */
>> -    cs->attach_in_progress++;
>> +    attach_ctx.in_progress++;
>>     out_unlock:
>>       if (ret)
>> @@ -3113,7 +3116,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>>       cs = css_cs(css);
>>         mutex_lock(&cpuset_mutex);
>> -    dec_attach_in_progress_locked(cs);
>> +    dec_attach_in_progress_locked();
>>         if (cs->dl_bw_cpu >= 0)
>>           dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
>> @@ -3226,7 +3229,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>>           reset_migrate_dl_data(cs);
>>       }
>>   -    dec_attach_in_progress_locked(cs);
>> +    dec_attach_in_progress_locked();
>>         mutex_unlock(&cpuset_mutex);
>>   }
>> @@ -3246,10 +3249,19 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>>           return -EACCES;
>>         buf = strstrip(buf);
>> +retry:
>> +    wait_event(cpuset_attach_wq, attach_ctx.in_progress == 0);
>> +
>>       cpuset_full_lock();
>>       if (!is_cpuset_online(cs))
>>           goto out_unlock;
>>   +    /* Don't race with task attach */
>> +    if (attach_ctx.in_progress) {
>> +        cpuset_full_unlock();
>> +        goto retry;
>> +    }
>> +
>>       trialcs = dup_or_alloc_cpuset(cs);
>>       if (!trialcs) {
>>           retval = -ENOMEM;
>> @@ -3377,7 +3389,17 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf,
>>       else
>>           return -EINVAL;
>>   +retry:
>> +    wait_event(cpuset_attach_wq, attach_ctx.in_progress == 0);
>> +
>>       cpuset_full_lock();
>> +
>> +    /* Don't race with task attach */
>> +    if (attach_ctx.in_progress) {
>> +        cpuset_full_unlock();
>> +        goto retry;
>> +    }
>> +
>
> Would it make sense to add a helper like wait_attach_done_locked()?

I guess we can add a helper to do that.

Thanks for the suggestions.

Cheers,
Longman
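
For readers following the thread, the wait-and-retry pattern repeated in the
cpuset_write_resmask() and cpuset_partition_write() hunks, and the helper
suggested in the review, can be illustrated with a small userspace analogue.
This is only a sketch under stated assumptions, not the kernel code: a pthread
mutex stands in for cpuset_mutex, a condition variable stands in for
cpuset_attach_wq, and the names attach_begin(), attach_end(),
wait_attach_done_and_lock() and write_unlock() are hypothetical. The thread
only proposes the name wait_attach_done_locked() and does not specify its
shape.

```c
#include <assert.h>
#include <pthread.h>

/* Userspace stand-ins (assumptions): ctx_lock ~ cpuset_mutex,
 * ctx_idle ~ cpuset_attach_wq, in_progress ~ attach_ctx.in_progress. */
static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ctx_idle = PTHREAD_COND_INITIALIZER;
static int in_progress;

/* Analogue of the increment done at the end of cpuset_can_attach(). */
static void attach_begin(void)
{
	pthread_mutex_lock(&ctx_lock);
	in_progress++;
	pthread_mutex_unlock(&ctx_lock);
}

/* Analogue of dec_attach_in_progress(): wake waiters once the count hits 0. */
static void attach_end(void)
{
	pthread_mutex_lock(&ctx_lock);
	assert(in_progress > 0);
	if (--in_progress == 0)
		pthread_cond_broadcast(&ctx_idle);
	pthread_mutex_unlock(&ctx_lock);
}

/* Hypothetical helper: returns with ctx_lock held and no attach in progress.
 * pthread_cond_wait() drops and retakes the lock atomically, so the explicit
 * "unlock and goto retry" step from the patch collapses into this loop. */
static void wait_attach_done_and_lock(void)
{
	pthread_mutex_lock(&ctx_lock);
	while (in_progress)
		pthread_cond_wait(&ctx_idle, &ctx_lock);
}

/* Release the lock taken by wait_attach_done_and_lock(). */
static void write_unlock(void)
{
	pthread_mutex_unlock(&ctx_lock);
}
```

A writer would call wait_attach_done_and_lock(), modify the shared state, and
then call write_unlock(). In the kernel version the wait happens outside the
lock via wait_event(), which is why the patch rechecks the counter after
cpuset_full_lock() and retries if an attach started in between.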




Thread overview: 15+ messages
2026-06-26 18:19 [PATCH v8 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-26 18:19 ` [PATCH v8 01/11] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed Waiman Long
2026-06-26 18:19 ` [PATCH v8 02/11] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
2026-06-26 18:19 ` [PATCH v8 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change Waiman Long
2026-06-29  7:14   ` Ridong Chen
2026-06-29 21:53     ` Waiman Long [this message]
2026-06-26 18:19 ` [PATCH v8 04/11] cgroup/cpuset: Put all task attach related variables into attach_ctx Waiman Long
2026-06-29  7:16   ` Ridong Chen
2026-06-26 18:19 ` [PATCH v8 05/11] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
2026-06-26 18:19 ` [PATCH v8 06/11] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
2026-06-26 18:19 ` [PATCH v8 07/11] cgroup/cpuset: Make attach_ctx.old_cs track task group leader Waiman Long
2026-06-26 18:19 ` [PATCH v8 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
2026-06-26 18:19 ` [PATCH v8 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach() Waiman Long
2026-06-26 18:19 ` [PATCH v8 10/11] cgroup/cpuset: Support multiple destination " Waiman Long
2026-06-26 18:19 ` [PATCH v8 11/11] selftests/cgroup: Add test for cpuset affinity on controller disable Waiman Long
