* [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
@ 2026-06-30 3:33 ` Waiman Long
2026-06-30 14:01 ` Juri Lelli
2026-07-01 1:19 ` Ridong Chen
2026-06-30 3:33 ` [PATCH-next v9 02/11] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
` (9 subsequent siblings)
10 siblings, 2 replies; 24+ messages in thread
From: Waiman Long @ 2026-06-30 3:33 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang, Waiman Long
The nr_deadline_tasks variable in the cpuset structure was introduced by
commit 6c24849f5515 ("sched/cpuset: Keep track of SCHED_DEADLINE task
in cpusets"). It is reported by sashiko [1] that nr_deadline_tasks
can currently be modified by inc_dl_tasks_cs() under rq->lock and
by cpuset_attach() under cpuset_mutex. So if both updates happen
simultaneously, the nr_deadline_tasks variable can be corrupted leading
to incorrect operations down the road.
Fix that by changing its type to atomic_t so that nr_deadline_tasks is
always updated atomically.
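As an illustration only (a user-space model, not kernel code), the sketch
below replays one possible interleaving of two unsynchronized
read-modify-write sequences on a plain int, then applies the same updates
with C11 atomic_fetch_add(). The function names and the single-threaded
replay of the interleaving are assumptions made for the example; the kernel
side uses atomic_t with atomic_inc()/atomic_add().

```c
#include <stdatomic.h>

/*
 * Hypothetical model of the pre-patch race: two contexts each do a plain
 * load / modify / store on the same int and their steps interleave.
 */
static int plain_rmw_interleaved(int start, int inc_by, int add_by)
{
	int counter = start;

	int seen_by_inc = counter;	/* inc_dl_tasks_cs()-like path loads */
	int seen_by_attach = counter;	/* cpuset_attach()-like path loads */

	counter = seen_by_inc + inc_by;		/* first store */
	counter = seen_by_attach + add_by;	/* second store overwrites it */

	return counter;
}

/*
 * Same two updates, but each one is a single indivisible read-modify-write,
 * so there is no gap between load and store for the other update to fall
 * into.
 */
static int atomic_rmw(int start, int inc_by, int add_by)
{
	atomic_int counter;

	atomic_init(&counter, start);
	atomic_fetch_add(&counter, inc_by);
	atomic_fetch_add(&counter, add_by);

	return atomic_load(&counter);
}
```

With start = 5, inc_by = 1 and add_by = 2, the plain version ends at 7 (one
update is lost) while the atomic version ends at 8.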
[1] https://sashiko.dev/#/patchset/20260626181923.133658-1-longman%40redhat.comk
Fixes: 6c24849f5515 ("sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets")
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset-internal.h | 2 +-
kernel/cgroup/cpuset.c | 10 +++++-----
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index f7aaf01f7cd5..140700e5e236 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -165,7 +165,7 @@ struct cpuset {
* number of SCHED_DEADLINE tasks attached to this cpuset, so that we
* know when to rebuild associated root domain bandwidth information.
*/
- int nr_deadline_tasks;
+ atomic_t nr_deadline_tasks;
int nr_migrate_dl_tasks;
/* DL bandwidth that needs destination reservation for this attach. */
u64 sum_migrate_dl_bw;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 49d8564d1a48..c22e55d798cf 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -222,14 +222,14 @@ void inc_dl_tasks_cs(struct task_struct *p)
{
struct cpuset *cs = task_cs(p);
- cs->nr_deadline_tasks++;
+ atomic_inc(&cs->nr_deadline_tasks);
}
void dec_dl_tasks_cs(struct task_struct *p)
{
struct cpuset *cs = task_cs(p);
- cs->nr_deadline_tasks--;
+ atomic_dec(&cs->nr_deadline_tasks);
}
static inline bool is_partition_valid(const struct cpuset *cs)
@@ -918,7 +918,7 @@ static void dl_update_tasks_root_domain(struct cpuset *cs)
struct css_task_iter it;
struct task_struct *task;
- if (cs->nr_deadline_tasks == 0)
+ if (atomic_read(&cs->nr_deadline_tasks) == 0)
return;
css_task_iter_start(&cs->css, 0, &it);
@@ -3215,8 +3215,8 @@ static void cpuset_attach(struct cgroup_taskset *tset)
cs->old_mems_allowed = cpuset_attach_nodemask_to;
if (cs->nr_migrate_dl_tasks) {
- cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
- oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
+ atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
+ atomic_sub(cs->nr_migrate_dl_tasks, &oldcs->nr_deadline_tasks);
reset_migrate_dl_data(cs);
}
--
2.54.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t
2026-06-30 3:33 ` [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t Waiman Long
@ 2026-06-30 14:01 ` Juri Lelli
2026-06-30 17:56 ` Waiman Long
2026-07-01 1:19 ` Ridong Chen
1 sibling, 1 reply; 24+ messages in thread
From: Juri Lelli @ 2026-06-30 14:01 UTC (permalink / raw)
To: Waiman Long
Cc: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang
Hi Waiman,
On 29/06/26 23:33, Waiman Long wrote:
> The nr_deadline_tasks variable in the cpuset structure was introduced by
> commit 6c24849f5515 ("sched/cpuset: Keep track of SCHED_DEADLINE task
> in cpusets"). It is reported by sashiko [1] that nr_deadline_tasks
> can currently be modified by inc_dl_tasks_cs() under rq->lock and
> by cpuset_attach() under cpuset_mutex. So if both updates happen
> simultaneously, the nr_deadline_tasks variable can be corrupted leading
> to incorrect operations down the road.
>
> Fix that by changing its type to atomic_t so that nr_deadline_tasks are
> always atomically updated.
>
> [1] https://sashiko.dev/#/patchset/20260626181923.133658-1-longman%40redhat.comk
>
> Fixes: 6c24849f5515 ("sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets")
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
Looks like Sashiko is not yet completely happy with this:
https://sashiko.dev/#/patchset/20260630033344.352702-1-longman%40redhat.com
I actually wondered the same and couldn't convince myself we don't
actually have that problem with the window between sched_setscheduler()
and cpuset_attach(). If the issue is confirmed, not sure if
wait_attach_done_lock() could help here as well? It's kind of a big lock
for the scheduler, but maybe it would only affect DEADLINE tasks and only
while migrations are ongoing.
Thanks,
Juri
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t
2026-06-30 14:01 ` Juri Lelli
@ 2026-06-30 17:56 ` Waiman Long
2026-07-01 9:00 ` Juri Lelli
0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-06-30 17:56 UTC (permalink / raw)
To: Juri Lelli
Cc: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang
On 6/30/26 10:01 AM, Juri Lelli wrote:
> Hi Waiman,
>
> On 29/06/26 23:33, Waiman Long wrote:
>> The nr_deadline_tasks variable in the cpuset structure was introduced by
>> commit 6c24849f5515 ("sched/cpuset: Keep track of SCHED_DEADLINE task
>> in cpusets"). It is reported by sashiko [1] that nr_deadline_tasks
>> can currently be modified by inc_dl_tasks_cs() under rq->lock and
>> by cpuset_attach() under cpuset_mutex. So if both updates happen
>> simultaneously, the nr_deadline_tasks variable can be corrupted leading
>> to incorrect operations down the road.
>>
>> Fix that by changing its type to atomic_t so that nr_deadline_tasks are
>> always atomically updated.
>>
>> [1] https://sashiko.dev/#/patchset/20260626181923.133658-1-longman%40redhat.comk
>>
>> Fixes: 6c24849f5515 ("sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets")
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
> Looks like Sashiko is yet not completely happy with this:
>
> https://sashiko.dev/#/patchset/20260630033344.352702-1-longman%40redhat.com
>
> I actually wondered the same and couldn't convince myself we don't
> actually have that problem with the window between sched_setscheduler()
> and cpuset_attach(). If issue is confirmed, not sure if
> wait_attach_done_lock() could help here as well? It's kind of a big lock for the
> scheduler, but maybe only affecting DEADLINE tasks and if migrations
> are ongoing.
Yes, I am aware of that. This patch can only partially close the race
window. It doesn't completely eliminate it.
My current thought is for inc_dl_tasks_cs() to check if the in_progress
flag is set. If so, it sets another flag for cpuset_attach() to double
check the DL data for consistency. It will be a rather complicated
solution in order to eliminate the race window. So I am postponing it to
a later time when I have more free time to think about it.
Cheers,
Longman
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t
2026-06-30 17:56 ` Waiman Long
@ 2026-07-01 9:00 ` Juri Lelli
0 siblings, 0 replies; 24+ messages in thread
From: Juri Lelli @ 2026-07-01 9:00 UTC (permalink / raw)
To: Waiman Long
Cc: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang
On 30/06/26 13:56, Waiman Long wrote:
>
> On 6/30/26 10:01 AM, Juri Lelli wrote:
> > Hi Waiman,
> >
> > On 29/06/26 23:33, Waiman Long wrote:
> > > The nr_deadline_tasks variable in the cpuset structure was introduced by
> > > commit 6c24849f5515 ("sched/cpuset: Keep track of SCHED_DEADLINE task
> > > in cpusets"). It is reported by sashiko [1] that nr_deadline_tasks
> > > can currently be modified by inc_dl_tasks_cs() under rq->lock and
> > > by cpuset_attach() under cpuset_mutex. So if both updates happen
> > > simultaneously, the nr_deadline_tasks variable can be corrupted leading
> > > to incorrect operations down the road.
> > >
> > > Fix that by changing its type to atomic_t so that nr_deadline_tasks are
> > > always atomically updated.
> > >
> > > [1] https://sashiko.dev/#/patchset/20260626181923.133658-1-longman%40redhat.comk
> > >
> > > Fixes: 6c24849f5515 ("sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets")
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > ---
> > Looks like Sashiko is yet not completely happy with this:
> >
> > https://sashiko.dev/#/patchset/20260630033344.352702-1-longman%40redhat.com
> >
> > I actually wondered the same and couldn't convince myself we don't
> > actually have that problem with the window between sched_setscheduler()
> > and cpuset_attach(). If issue is confirmed, not sure if
> > wait_attach_done_lock() could help here as well? It's kind of a big lock for the
> > scheduler, but maybe only affecting DEADLINE tasks and if migrations
> > are ongoing.
>
> Yes, I am aware of that. This patch can only partially close the race
> window. It doesn't completely eliminate it.
>
> My current thought is for inc_dl_tasks_cs() to check if the in_progress flag
> is set. If so, it sets another flag for cpuset_attach() to double check the
> DL data for consistency. It will be a rather complicated solution in order
> to eliminate the race window. So I am postponing it to a later time when I
> have more free time to think about it.
Ah yes of course, makes sense.
Thanks!
Juri
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t
2026-06-30 3:33 ` [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t Waiman Long
2026-06-30 14:01 ` Juri Lelli
@ 2026-07-01 1:19 ` Ridong Chen
1 sibling, 0 replies; 24+ messages in thread
From: Ridong Chen @ 2026-07-01 1:19 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang
On 6/30/2026 11:33 AM, Waiman Long wrote:
> The nr_deadline_tasks variable in the cpuset structure was introduced by
> commit 6c24849f5515 ("sched/cpuset: Keep track of SCHED_DEADLINE task
> in cpusets"). It is reported by sashiko [1] that nr_deadline_tasks
> can currently be modified by inc_dl_tasks_cs() under rq->lock and
> by cpuset_attach() under cpuset_mutex. So if both updates happen
> simultaneously, the nr_deadline_tasks variable can be corrupted leading
> to incorrect operations down the road.
>
> Fix that by changing its type to atomic_t so that nr_deadline_tasks are
> always atomically updated.
>
> [1] https://sashiko.dev/#/patchset/20260626181923.133658-1-longman%40redhat.comk
>
Nit.
The link you provided has an extra 'k' at the end. Please remove it.
https://sashiko.dev/#/patchset/20260626181923.133658-1-longman%40redhat.com
> Fixes: 6c24849f5515 ("sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets")
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset-internal.h | 2 +-
> kernel/cgroup/cpuset.c | 10 +++++-----
> 2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index f7aaf01f7cd5..140700e5e236 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -165,7 +165,7 @@ struct cpuset {
> * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
> * know when to rebuild associated root domain bandwidth information.
> */
> - int nr_deadline_tasks;
> + atomic_t nr_deadline_tasks;
> int nr_migrate_dl_tasks;
> /* DL bandwidth that needs destination reservation for this attach. */
> u64 sum_migrate_dl_bw;
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 49d8564d1a48..c22e55d798cf 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -222,14 +222,14 @@ void inc_dl_tasks_cs(struct task_struct *p)
> {
> struct cpuset *cs = task_cs(p);
>
> - cs->nr_deadline_tasks++;
> + atomic_inc(&cs->nr_deadline_tasks);
> }
>
> void dec_dl_tasks_cs(struct task_struct *p)
> {
> struct cpuset *cs = task_cs(p);
>
> - cs->nr_deadline_tasks--;
> + atomic_dec(&cs->nr_deadline_tasks);
> }
>
> static inline bool is_partition_valid(const struct cpuset *cs)
> @@ -918,7 +918,7 @@ static void dl_update_tasks_root_domain(struct cpuset *cs)
> struct css_task_iter it;
> struct task_struct *task;
>
> - if (cs->nr_deadline_tasks == 0)
> + if (atomic_read(&cs->nr_deadline_tasks) == 0)
> return;
>
> css_task_iter_start(&cs->css, 0, &it);
> @@ -3215,8 +3215,8 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> cs->old_mems_allowed = cpuset_attach_nodemask_to;
>
> if (cs->nr_migrate_dl_tasks) {
> - cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
> - oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
> + atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
> + atomic_sub(cs->nr_migrate_dl_tasks, &oldcs->nr_deadline_tasks);
> reset_migrate_dl_data(cs);
> }
>
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
--
Best regards
Ridong
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH-next v9 02/11] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach()
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t Waiman Long
@ 2026-06-30 3:33 ` Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change Waiman Long
` (8 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-06-30 3:33 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang, Waiman Long, Gregory Price
Whenever the memory node mask is changed, there are four places where the
node mask has to be updated or used:
1) task's node mask via cpuset_change_task_nodemask()
2) memory policy binding via mpol_rebind_mm()
3) if memory migration is enabled, migrate from old_mems_allowed to
the new node mask via cpuset_migrate_mm().
4) setting old_mems_allowed
These memory actions are done in cpuset_update_tasks_nodemask() and
cpuset_attach(). However there are inconsistencies in what node masks
are being used in these 2 functions.
In cpuset_update_tasks_nodemask(),
- cpuset_change_task_nodemask(): guarantee_online_mems()
- mpol_rebind_mm(): mems_allowed
- cpuset_migrate_mm(): guarantee_online_mems()
- old_mems_allowed: guarantee_online_mems()
In cpuset_attach(),
- cpuset_change_task_nodemask(): guarantee_online_mems()
- mpol_rebind_mm(): effective_mems
- cpuset_migrate_mm(): effective_mems
- old_mems_allowed: effective_mems
These inconsistencies date back a long time and it is hard to say what
the correct values should be.
The guarantee_online_mems() function returns a node mask from current or
an ancestor cpuset that is a subset of node_states[N_MEMORY]. Nodes in
node_states[N_MEMORY] are all online, i.e. in node_states[N_ONLINE].
However, a node in node_states[N_ONLINE] may not have memory. So
node_states[N_MEMORY] should be a subset of node_states[N_ONLINE].
The guarantee_online_mems() function should mostly be useful for v1
where mems_allowed is the same as effective_mems. With v2, the memory
nodes in effective_mems should be a subset of node_states[N_MEMORY]
except when a memory hot-unplug operation is in progress: a memory
node has been removed from node_states[N_MEMORY] but the removal is not
yet reflected in effective_mems because cpuset_handle_hotplug() has not
yet been called from cpuset_track_online_nodes().
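The fallback behaviour described above can be pictured with a small
user-space model. This is only a sketch under stated assumptions: node masks
are modelled as 64-bit bitmaps, and struct cs_model and online_mems_model()
are hypothetical names, not the kernel's struct cpuset or
guarantee_online_mems().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpuset: one bit per memory node. */
struct cs_model {
	uint64_t effective_mems;
	struct cs_model *parent;	/* NULL for the top cpuset */
};

/*
 * Walk up the hierarchy until a cpuset has at least one node that is
 * also set in n_memory (the model of node_states[N_MEMORY]), then
 * return the intersection. The top cpuset is assumed to always have
 * some node with memory, so the walk stops there at the latest.
 */
static uint64_t online_mems_model(const struct cs_model *cs, uint64_t n_memory)
{
	assert(cs != NULL);

	while (cs->parent && !(cs->effective_mems & n_memory))
		cs = cs->parent;

	return cs->effective_mems & n_memory;
}
```

In this model, a child with effective_mems = 0x4 under a top cpuset with 0x7
yields 0x4 while node 2 has memory. If node 2 is hot-unplugged so that
n_memory becomes 0x3 before effective_mems is refreshed, the walk falls back
to the top cpuset and yields 0x3.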
Let's use the following setup for both of them and make them consistent.
- cpuset_change_task_nodemask(): guarantee_online_mems()
- mpol_rebind_mm(): effective_mems
- cpuset_migrate_mm(): guarantee_online_mems()
- old_mems_allowed: guarantee_online_mems()
So for v2, it is effectively all effective_mems most of the time. For
v1, mpol_rebind_mm() uses mems_allowed which may differ from what
guarantee_online_mems() returns, but it conforms to what the cpuset v1
documentation says with respect to setting memory policy.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
Reviewed-by: Gregory Price <gourry@gourry.net>
---
kernel/cgroup/cpuset.c | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c22e55d798cf..431bf210aa52 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -489,7 +489,10 @@ static void guarantee_active_cpus(struct task_struct *tsk,
* Return in *pmask the portion of a cpusets's mems_allowed that
* are online, with memory. If none are online with memory, walk
* up the cpuset hierarchy until we find one that does have some
- * online mems. The top cpuset always has some mems online.
+ * online mems. The top cpuset always has some mems online. With v2,
+ * effective_mems should always contain online memory nodes except
+ * during the transition period where a memory node hotunplug operation
+ * is in progress.
*
* One way or another, we guarantee to return some non-empty subset
* of node_states[N_MEMORY].
@@ -2633,6 +2636,14 @@ static void *cpuset_being_rebound;
* Iterate through each task of @cs updating its mems_allowed to the
* effective cpuset's. As this function is called with cpuset_mutex held,
* cpuset membership stays stable.
+ *
+ * - cpuset_change_task_nodemask(): guarantee_online_mems()
+ * - mpol_rebind_mm(): effective_mems
+ * - cpuset_migrate_mm(): guarantee_online_mems()
+ * - old_mems_allowed: guarantee_online_mems()
+ *
+ * For v2, guarantee_online_mems() should return a node mask that is the same
+ * as the effective_mems of current cpuset.
*/
void cpuset_update_tasks_nodemask(struct cpuset *cs)
{
@@ -2641,7 +2652,6 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
struct task_struct *task;
cpuset_being_rebound = cs; /* causes mpol_dup() rebind */
-
guarantee_online_mems(cs, &newmems);
/*
@@ -3159,19 +3169,16 @@ static void cpuset_attach(struct cgroup_taskset *tset)
cpus_updated = !cpumask_equal(cs->effective_cpus,
oldcs->effective_cpus);
mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+ guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
/*
* In the default hierarchy, enabling cpuset in the child cgroups
- * will trigger a number of cpuset_attach() calls with no change
- * in effective cpus and mems. In that case, we can optimize out
- * by skipping the task iteration and update.
+ * will trigger a cpuset_attach() call with no change in effective cpus
+ * and mems. In that case, we can optimize out by skipping the task
+ * iteration and update.
*/
- if (cpuset_v2() && !cpus_updated && !mems_updated) {
- cpuset_attach_nodemask_to = cs->effective_mems;
+ if (cpuset_v2() && !cpus_updated && !mems_updated)
goto out;
- }
-
- guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
cgroup_taskset_for_each(task, css, tset)
cpuset_attach_task(cs, task);
@@ -3182,7 +3189,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* if there is no change in effective_mems and CS_MEMORY_MIGRATE is
* not set.
*/
- cpuset_attach_nodemask_to = cs->effective_mems;
if (!is_memory_migrate(cs) && !mems_updated)
goto out;
@@ -3190,7 +3196,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
struct mm_struct *mm = get_task_mm(leader);
if (mm) {
- mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
+ mpol_rebind_mm(mm, &cs->effective_mems);
/*
* old_mems_allowed is the same with mems_allowed
--
2.54.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 02/11] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
@ 2026-06-30 3:33 ` Waiman Long
2026-07-01 1:41 ` Ridong Chen
2026-06-30 3:33 ` [PATCH-next v9 04/11] cgroup/cpuset: Put all task attach related variables into attach_ctx Waiman Long
` (7 subsequent siblings)
10 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-06-30 3:33 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang, Waiman Long
Commit e44193d39e8d ("cpuset: let hotplug propagation work wait for
task attaching") was introduced to make hotplug operations wait
until the task attach operation completes. However, it is still
possible that the states of the source or destination cpuset can
be changed between the cpuset_can_attach() call and the subsequent
cpuset_attach()/cpuset_cacnel_attach() call.
As a result, data gathered during cpuset_can_attach() cannot be reliably
used in the subsequent cpuset_attach()/cpuset_cacnel_attach()
call at all. Make the task attach operation more robust
and allow the sharing of data between cpuset_can_attach() and
cpuset_attach()/cpuset_cacnel_attach() by making cpuset_write_resmask()
and cpuset_partition_write() wait for the completion of task attach
as well.
Ideally, an ongoing task attach operation should block any cpuset write
operation that can change its internal state until the operation is
completed. However, the attach_in_progress flag is currently per cpuset
and only the destination cpuset will have this flag set. The flag is not
set in the source cpuset where the tasks will be moved from. Even if we
extend the scope to include the source cpuset, it will not block cpuset
operations that change the state of one of its ancestor cpusets, which may
indirectly impact the state of the source or destination cpuset. It may
be too costly to set the flag for the whole subtree. It is far easier
to just make the flag global and block all cpuset write operations
whenever a task attach operation is in progress.
Make that change by creating a new cpuset attach context (attach_ctx)
structure to hold the global in_progress flag and use it for blocking
cpuset write operations while a cpuset attach operation is in progress. Also
add a new wait_attach_done_lock() helper that waits for any ongoing
attach operation to finish and then acquires the cpuset_mutex.
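For readers less familiar with the pattern, here is a rough user-space
analogue using a pthread mutex and condition variable. It is a sketch, not
the kernel implementation: all names ending in _model, plus attach_begin()
and attach_end(), are hypothetical, and pthread_cond_wait() drops and
re-takes the mutex on its own, so the explicit unlock/wait/retry loop of the
kernel helper collapses into a single while loop here.

```c
#include <pthread.h>

static pthread_mutex_t cpuset_mutex_model = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t attach_done_model = PTHREAD_COND_INITIALIZER;
static int in_progress;	/* protected by cpuset_mutex_model */

/* Analogue of the can_attach() side: mark one attach as in flight. */
static void attach_begin(void)
{
	pthread_mutex_lock(&cpuset_mutex_model);
	in_progress++;
	pthread_mutex_unlock(&cpuset_mutex_model);
}

/* Analogue of attach()/cancel_attach(): wake waiters on the last one. */
static void attach_end(void)
{
	pthread_mutex_lock(&cpuset_mutex_model);
	if (--in_progress == 0)
		pthread_cond_broadcast(&attach_done_model);
	pthread_mutex_unlock(&cpuset_mutex_model);
}

/* Returns with the mutex held and no attach operation in flight. */
static void wait_attach_done_lock_model(void)
{
	pthread_mutex_lock(&cpuset_mutex_model);
	while (in_progress)
		pthread_cond_wait(&attach_done_model, &cpuset_mutex_model);
}
```

A writer would call wait_attach_done_lock_model(), change state, and then
unlock cpuset_mutex_model. Because in_progress is only rechecked with the
mutex held, a new attach cannot start between the check and the state change.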
The comments about validate_change() are no longer valid as it won't
be called at all if an attach operation is in progress. So the comments
can be removed.
The per-cpuset attach_in_progress flag is also currently used in
partition_is_populated() and cpuset_is_populated() to determine if
an empty cpuset will have incoming tasks. This check will no longer be
needed as these functions will not be called when there is a task attach
in progress. So the flag check is now removed.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset-internal.h | 11 +---
kernel/cgroup/cpuset.c | 90 +++++++++++++++++++--------------
2 files changed, 53 insertions(+), 48 deletions(-)
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index 140700e5e236..df662c7fd1a4 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -145,12 +145,6 @@ struct cpuset {
*/
nodemask_t old_mems_allowed;
- /*
- * Tasks are being attached to this cpuset. Used to prevent
- * zeroing cpus/mems_allowed between ->can_attach() and ->attach().
- */
- int attach_in_progress;
-
/* partition root state */
int partition_root_state;
@@ -269,10 +263,7 @@ static inline int nr_cpusets(void)
static inline bool cpuset_is_populated(struct cpuset *cs)
{
lockdep_assert_cpuset_lock_held();
-
- /* Cpusets in the process of attaching should be considered as populated */
- return cgroup_is_populated(cs->css.cgroup) ||
- cs->attach_in_progress;
+ return cgroup_is_populated(cs->css.cgroup);
}
/**
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 431bf210aa52..1a78d0590737 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -356,6 +356,33 @@ static struct workqueue_struct *cpuset_migrate_mm_wq;
static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
+/*
+ * Cpuset task attach context
+ * Protected by cpuset_mutex
+ */
+static struct {
+ int in_progress;
+} attach_ctx;
+
+/*
+ * Wait if task attach is in progress until it is done and then acquire
+ * cpuset_mutex before returning.
+ */
+static void wait_attach_done_lock(void)
+ __acquires(&cpuset_mutex)
+{
+ for (;;) {
+ mutex_lock(&cpuset_mutex);
+ if (!attach_ctx.in_progress)
+ return;
+
+ mutex_unlock(&cpuset_mutex);
+
+ /* Wait until attach operation is done to prevent racing */
+ wait_event(cpuset_attach_wq, attach_ctx.in_progress == 0);
+ }
+}
+
static inline void check_insane_mems_config(nodemask_t *nodes)
{
if (!cpusets_insane_config() &&
@@ -368,22 +395,22 @@ static inline void check_insane_mems_config(nodemask_t *nodes)
}
/*
- * decrease cs->attach_in_progress.
- * wake_up cpuset_attach_wq if cs->attach_in_progress==0.
+ * decrease attach_ctx.in_progress.
+ * wake_up cpuset_attach_wq if attach_ctx.in_progress==0.
*/
-static inline void dec_attach_in_progress_locked(struct cpuset *cs)
+static inline void dec_attach_in_progress_locked(void)
{
lockdep_assert_cpuset_lock_held();
- cs->attach_in_progress--;
- if (!cs->attach_in_progress)
+ attach_ctx.in_progress--;
+ if (!attach_ctx.in_progress)
wake_up(&cpuset_attach_wq);
}
-static inline void dec_attach_in_progress(struct cpuset *cs)
+static inline void dec_attach_in_progress(void)
{
mutex_lock(&cpuset_mutex);
- dec_attach_in_progress_locked(cs);
+ dec_attach_in_progress_locked();
mutex_unlock(&cpuset_mutex);
}
@@ -432,8 +459,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
* nr_populated_domain_children may include populated
* csets from descendants that are partitions.
*/
- if (cgroup_has_tasks(cs->css.cgroup) ||
- cs->attach_in_progress)
+ if (cgroup_has_tasks(cs->css.cgroup))
return true;
rcu_read_lock();
@@ -3091,11 +3117,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
cs->dl_bw_cpu = cpu;
out_success:
- /*
- * Mark attach is in progress. This makes validate_change() fail
- * changes which zero cpus/mems_allowed.
- */
- cs->attach_in_progress++;
+ attach_ctx.in_progress++;
out_unlock:
if (ret)
@@ -3113,7 +3135,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
cs = css_cs(css);
mutex_lock(&cpuset_mutex);
- dec_attach_in_progress_locked(cs);
+ dec_attach_in_progress_locked();
if (cs->dl_bw_cpu >= 0)
dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
@@ -3226,7 +3248,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
reset_migrate_dl_data(cs);
}
- dec_attach_in_progress_locked(cs);
+ dec_attach_in_progress_locked();
mutex_unlock(&cpuset_mutex);
}
@@ -3246,7 +3268,12 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
return -EACCES;
buf = strstrip(buf);
- cpuset_full_lock();
+
+ /* cpuset_mutex acquired in wait_attach_done_lock() */
+ mutex_lock(&cpuset_top_mutex);
+ cpus_read_lock();
+ wait_attach_done_lock();
+
if (!is_cpuset_online(cs))
goto out_unlock;
@@ -3377,7 +3404,10 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf,
else
return -EINVAL;
- cpuset_full_lock();
+ mutex_lock(&cpuset_top_mutex);
+ cpus_read_lock();
+ wait_attach_done_lock();
+
if (is_cpuset_online(cs))
retval = update_prstate(cs, val);
cpuset_update_sd_hk_unlock();
@@ -3616,11 +3646,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
if (ret)
goto out_unlock;
- /*
- * Mark attach is in progress. This makes validate_change() fail
- * changes which zero cpus/mems_allowed.
- */
- cs->attach_in_progress++;
+ attach_ctx.in_progress++;
out_unlock:
mutex_unlock(&cpuset_mutex);
return ret;
@@ -3638,7 +3664,7 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
if (same_cs)
return;
- dec_attach_in_progress(cs);
+ dec_attach_in_progress();
}
/*
@@ -3670,7 +3696,7 @@ static void cpuset_fork(struct task_struct *task)
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
cpuset_attach_task(cs, task);
- dec_attach_in_progress_locked(cs);
+ dec_attach_in_progress_locked();
mutex_unlock(&cpuset_mutex);
}
@@ -3774,20 +3800,8 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
bool remote;
int partcmd = -1;
struct cpuset *parent;
-retry:
- wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
-
- mutex_lock(&cpuset_mutex);
-
- /*
- * We have raced with task attaching. We wait until attaching
- * is finished, so we won't attach a task to an empty cpuset.
- */
- if (cs->attach_in_progress) {
- mutex_unlock(&cpuset_mutex);
- goto retry;
- }
+ wait_attach_done_lock();
parent = parent_cs(cs);
compute_effective_cpumask(&new_cpus, cs, parent);
compute_effective_nodemask(&new_mems, cs, parent);
--
2.54.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change
2026-06-30 3:33 ` [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change Waiman Long
@ 2026-07-01 1:41 ` Ridong Chen
2026-07-01 20:19 ` Waiman Long
0 siblings, 1 reply; 24+ messages in thread
From: Ridong Chen @ 2026-07-01 1:41 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang
On 6/30/2026 11:33 AM, Waiman Long wrote:
> Commit e44193d39e8d ("cpuset: let hotplug propagation work wait for
> task attaching") was introduced to let hotplug operation to wait
> until the completion of task attach operation. However, it is still
> possible that the states of the source or destination cpuset can
> be changed between the cpuset_can_attach() call and the subsequent
> cpuset_attach()/cpuset_cacnel_attach() call.
>
> As a result, data gathered during cpuset_can_attach() cannot be reliably
> used in the subsequent cpuset_attach()/cpuset_cacnel_attach()
> call at all. Make the task attach operation more robust
> and allow the sharing of data between cpuset_can_attach() and
> cpuset_attach()/cpuset_cacnel_attach() by making cpuset_write_resmask()
> and cpuset_partition_write() wait for the completion of task attach
> as well.
>
Nit.
s/cpuset_cacnel_attach/cpuset_cancel_attach/
> Ideally, an ongoing task attach operation should block any cpuset write
> operation that can change its internal state until the operation is
> completed. However, the attach_in_progress flag is currently per cpuset
> and only the destination cpuset will have this flag set. The flag is not
> set in the source cpuset where the tasks will be moved from. Even if we
> extend the scope to include the source cpuset, it will not block cpuset
> operation that changes the state of one of its ancestor cpuset which may
> indirectly impact the state of the source or destination cpuset. It may
> be too costly to set the flag for the whole subtree, it is far easier
> to just make the flag global and block all the cpuset write operation
> whenever a task attach operation is in progress.
>
> Make that change by creating a new cpuset attach context (attach_ctx)
> structure to hold the global in_progress flag and use it for blocking
> cpuset write operation if a cpuset attach operation is in progress. Also
> add a new wait_attach_done_lock() helper to do the waiting for an
> ongoing attach operation and acquire the cpuset_mutex.
>
> The comments about validate_change() are no longer valid as it won't
> be called at all if an attach operation is in progress. So the comments
> can be removed.
>
> The per-cpuset attach_in_progress flag is also currently used in
> partition_is_populated() and cpuset_is_populated() to determine if
> an empty cpuset will have incoming task. This check will no longer be
> needed as this function will not be called when there is a task attach
> in progress. So the flag check is now removed.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset-internal.h | 11 +---
> kernel/cgroup/cpuset.c | 90 +++++++++++++++++++--------------
> 2 files changed, 53 insertions(+), 48 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index 140700e5e236..df662c7fd1a4 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -145,12 +145,6 @@ struct cpuset {
> */
> nodemask_t old_mems_allowed;
>
> - /*
> - * Tasks are being attached to this cpuset. Used to prevent
> - * zeroing cpus/mems_allowed between ->can_attach() and ->attach().
> - */
> - int attach_in_progress;
> -
> /* partition root state */
> int partition_root_state;
>
> @@ -269,10 +263,7 @@ static inline int nr_cpusets(void)
> static inline bool cpuset_is_populated(struct cpuset *cs)
> {
> lockdep_assert_cpuset_lock_held();
> -
> - /* Cpusets in the process of attaching should be considered as populated */
> - return cgroup_is_populated(cs->css.cgroup) ||
> - cs->attach_in_progress;
> + return cgroup_is_populated(cs->css.cgroup);
> }
>
> /**
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 431bf210aa52..1a78d0590737 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -356,6 +356,33 @@ static struct workqueue_struct *cpuset_migrate_mm_wq;
>
> static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
>
> +/*
> + * Cpuset task attach context
> + * Protected by cpuset_mutex
> + */
> +static struct {
> + int in_progress;
> +} attach_ctx;
> +
> +/*
> + * Wait if task attach is in progress until it is done and then acquire
> + * cpuset_mutex before returning.
> + */
> +static void wait_attach_done_lock(void)
> + __acquires(&cpuset_mutex)
> +{
> + for (;;) {
> + mutex_lock(&cpuset_mutex);
> + if (!attach_ctx.in_progress)
> + return;
> +
> + mutex_unlock(&cpuset_mutex);
> +
> + /* Wait until attach operation is done to prevent racing */
> + wait_event(cpuset_attach_wq, attach_ctx.in_progress == 0);
> + }
> +}
> +
> static inline void check_insane_mems_config(nodemask_t *nodes)
> {
> if (!cpusets_insane_config() &&
> @@ -368,22 +395,22 @@ static inline void check_insane_mems_config(nodemask_t *nodes)
> }
>
> /*
> - * decrease cs->attach_in_progress.
> - * wake_up cpuset_attach_wq if cs->attach_in_progress==0.
> + * decrease attach_ctx.in_progress.
> + * wake_up cpuset_attach_wq if attach_ctx.in_progress==0.
> */
> -static inline void dec_attach_in_progress_locked(struct cpuset *cs)
> +static inline void dec_attach_in_progress_locked(void)
> {
> lockdep_assert_cpuset_lock_held();
>
> - cs->attach_in_progress--;
> - if (!cs->attach_in_progress)
> + attach_ctx.in_progress--;
> + if (!attach_ctx.in_progress)
> wake_up(&cpuset_attach_wq);
> }
>
> -static inline void dec_attach_in_progress(struct cpuset *cs)
> +static inline void dec_attach_in_progress(void)
> {
> mutex_lock(&cpuset_mutex);
> - dec_attach_in_progress_locked(cs);
> + dec_attach_in_progress_locked();
> mutex_unlock(&cpuset_mutex);
> }
>
> @@ -432,8 +459,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
> * nr_populated_domain_children may include populated
> * csets from descendants that are partitions.
> */
> - if (cgroup_has_tasks(cs->css.cgroup) ||
> - cs->attach_in_progress)
> + if (cgroup_has_tasks(cs->css.cgroup))
> return true;
>
> rcu_read_lock();
> @@ -3091,11 +3117,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> cs->dl_bw_cpu = cpu;
>
> out_success:
> - /*
> - * Mark attach is in progress. This makes validate_change() fail
> - * changes which zero cpus/mems_allowed.
> - */
> - cs->attach_in_progress++;
> + attach_ctx.in_progress++;
>
> out_unlock:
> if (ret)
> @@ -3113,7 +3135,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
> cs = css_cs(css);
>
> mutex_lock(&cpuset_mutex);
> - dec_attach_in_progress_locked(cs);
> + dec_attach_in_progress_locked();
>
> if (cs->dl_bw_cpu >= 0)
> dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
> @@ -3226,7 +3248,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> reset_migrate_dl_data(cs);
> }
>
> - dec_attach_in_progress_locked(cs);
> + dec_attach_in_progress_locked();
>
> mutex_unlock(&cpuset_mutex);
> }
> @@ -3246,7 +3268,12 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> return -EACCES;
>
> buf = strstrip(buf);
> - cpuset_full_lock();
> +
> + /* cpuset_mutex acquired in wait_attach_done_lock() */
> + mutex_lock(&cpuset_top_mutex);
> + cpus_read_lock();
> + wait_attach_done_lock();
> +
Would it be cleaner to just pass this into cpuset_full_lock() as a flag?
void cpuset_full_lock(bool wait_attach)
{
mutex_lock(&cpuset_top_mutex);
cpus_read_lock();
if (wait_attach)
wait_attach_done_lock();
else
mutex_lock(&cpuset_mutex);
}
Then the two write paths become a single cpuset_full_lock(true). The
downside is the other 6 callers would need cpuset_full_lock(false). Not
sure it's worth it; what do you think?
> if (!is_cpuset_online(cs))
> goto out_unlock;
>
> @@ -3377,7 +3404,10 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf,
> else
> return -EINVAL;
>
> - cpuset_full_lock();
> + mutex_lock(&cpuset_top_mutex);
> + cpus_read_lock();
> + wait_attach_done_lock();
> +
> if (is_cpuset_online(cs))
> retval = update_prstate(cs, val);
> cpuset_update_sd_hk_unlock();
> @@ -3616,11 +3646,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
> if (ret)
> goto out_unlock;
>
> - /*
> - * Mark attach is in progress. This makes validate_change() fail
> - * changes which zero cpus/mems_allowed.
> - */
> - cs->attach_in_progress++;
> + attach_ctx.in_progress++;
> out_unlock:
> mutex_unlock(&cpuset_mutex);
> return ret;
> @@ -3638,7 +3664,7 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
> if (same_cs)
> return;
>
> - dec_attach_in_progress(cs);
> + dec_attach_in_progress();
> }
>
> /*
> @@ -3670,7 +3696,7 @@ static void cpuset_fork(struct task_struct *task)
> guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
> cpuset_attach_task(cs, task);
>
> - dec_attach_in_progress_locked(cs);
> + dec_attach_in_progress_locked();
> mutex_unlock(&cpuset_mutex);
> }
>
> @@ -3774,20 +3800,8 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
> bool remote;
> int partcmd = -1;
> struct cpuset *parent;
> -retry:
> - wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
> -
> - mutex_lock(&cpuset_mutex);
> -
> - /*
> - * We have raced with task attaching. We wait until attaching
> - * is finished, so we won't attach a task to an empty cpuset.
> - */
> - if (cs->attach_in_progress) {
> - mutex_unlock(&cpuset_mutex);
> - goto retry;
> - }
>
> + wait_attach_done_lock();
> parent = parent_cs(cs);
> compute_effective_cpumask(&new_cpus, cs, parent);
> compute_effective_nodemask(&new_mems, cs, parent);
Overall this looks good to me.
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
--
Best regards
Ridong
* Re: [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change
2026-07-01 1:41 ` Ridong Chen
@ 2026-07-01 20:19 ` Waiman Long
0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-07-01 20:19 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang
On 6/30/26 9:41 PM, Ridong Chen wrote:
>> @@ -3246,7 +3268,12 @@ ssize_t cpuset_write_resmask(struct
>> kernfs_open_file *of,
>> return -EACCES;
>> buf = strstrip(buf);
>> - cpuset_full_lock();
>> +
>> + /* cpuset_mutex acquired in wait_attach_done_lock() */
>> + mutex_lock(&cpuset_top_mutex);
>> + cpus_read_lock();
>> + wait_attach_done_lock();
>> +
>
> Would it be cleaner to just pass this into cpuset_full_lock() as a flag?
>
> void cpuset_full_lock(bool wait_attach)
> {
> mutex_lock(&cpuset_top_mutex);
> cpus_read_lock();
> if (wait_attach)
> wait_attach_done_lock();
> else
> mutex_lock(&cpuset_mutex);
> }
> Then the two write paths become a single cpuset_full_lock(true). The
> downside is the other 6 callers would need cpuset_full_lock(false).
> Not sure it's worth it; what do you think?
I would like to put wait_attach_done_lock() at places where it will wait
for the completion of the attach operation. Putting it inside
cpuset_full_lock() will make it harder to find out which operations will
need to wait for task attach. So I will keep the code as is.
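For readers following the locking discussion, the wait-then-lock behaviour
of wait_attach_done_lock() can be sketched in userspace roughly as below.
This is only an illustration under stated assumptions, not the kernel code:
toy_mutex, toy_attach_done, toy_ctx, toy_attach_begin() and toy_attach_end()
are hypothetical stand-ins for cpuset_mutex, cpuset_attach_wq, attach_ctx and
the can_attach/attach paths, and a POSIX condition variable replaces the
unlock + wait_event() + relock retry loop used in the patch.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical userspace stand-ins for cpuset_mutex / cpuset_attach_wq. */
static pthread_mutex_t toy_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t toy_attach_done = PTHREAD_COND_INITIALIZER;

/* Global attach context, protected by toy_mutex. */
static struct {
	int in_progress;
} toy_ctx;

/* Models the successful end of a can_attach()/can_fork() step. */
static void toy_attach_begin(void)
{
	pthread_mutex_lock(&toy_mutex);
	toy_ctx.in_progress++;
	pthread_mutex_unlock(&toy_mutex);
}

/* Models dec_attach_in_progress(): wake waiters when the count hits 0. */
static void toy_attach_end(void)
{
	pthread_mutex_lock(&toy_mutex);
	assert(toy_ctx.in_progress > 0);
	if (--toy_ctx.in_progress == 0)
		pthread_cond_broadcast(&toy_attach_done);
	pthread_mutex_unlock(&toy_mutex);
}

/*
 * Returns with toy_mutex held and no attach in flight. The condition
 * variable drops and retakes the mutex around each wait, and the loop
 * rechecks the counter after every wakeup.
 */
static void toy_wait_attach_done_lock(void)
{
	pthread_mutex_lock(&toy_mutex);
	while (toy_ctx.in_progress)
		pthread_cond_wait(&toy_attach_done, &toy_mutex);
}
```

A writer would call toy_wait_attach_done_lock(), make its change, and then
unlock toy_mutex. Keeping that call visible at each call site, rather than
behind a flag, is the property the reply above argues for.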
Cheers,
Longman
* [PATCH-next v9 04/11] cgroup/cpuset: Put all task attach related variables into attach_ctx
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
` (2 preceding siblings ...)
2026-06-30 3:33 ` [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change Waiman Long
@ 2026-06-30 3:33 ` Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 05/11] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
` (6 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-06-30 3:33 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang, Waiman Long
Put the task attach related cpuset_attach_old_cs and
cpuset_attach_nodemask_to static variables into the new attach_ctx
structure to improve readability and ease maintenance.
No functional change is expected.
Suggested-by: Ridong Chen <ridong.chen@linux.dev>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
---
kernel/cgroup/cpuset.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 1a78d0590737..47aa8f2fdfdc 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -362,6 +362,8 @@ static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
*/
static struct {
int in_progress;
+ struct cpuset *old_cs; /* Source cpuset */
+ nodemask_t nodemask_to;
} attach_ctx;
/*
@@ -3015,8 +3017,6 @@ static int update_prstate(struct cpuset *cs, int new_prs)
return 0;
}
-static struct cpuset *cpuset_attach_old_cs;
-
/*
* Check to see if a cpuset can accept a new task
* For v1, cpus_allowed and mems_allowed can't be empty.
@@ -3048,8 +3048,8 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
int cpu, ret;
/* used later by cpuset_attach() */
- cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
- oldcs = cpuset_attach_old_cs;
+ attach_ctx.old_cs = task_cs(cgroup_taskset_first(tset, &css));
+ oldcs = attach_ctx.old_cs;
cs = css_cs(css);
mutex_lock(&cpuset_mutex);
@@ -3152,7 +3152,6 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
* allocate from cpuset_init().
*/
static cpumask_var_t cpus_attach;
-static nodemask_t cpuset_attach_nodemask_to;
static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
{
@@ -3169,7 +3168,7 @@ static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
*/
WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
- cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
+ cpuset_change_task_nodemask(task, &attach_ctx.nodemask_to);
cpuset1_update_task_spread_flags(cs, task);
}
@@ -3179,7 +3178,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
struct task_struct *leader;
struct cgroup_subsys_state *css;
struct cpuset *cs;
- struct cpuset *oldcs = cpuset_attach_old_cs;
+ struct cpuset *oldcs = attach_ctx.old_cs;
bool cpus_updated, mems_updated;
bool queue_task_work = false;
@@ -3191,7 +3190,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
cpus_updated = !cpumask_equal(cs->effective_cpus,
oldcs->effective_cpus);
mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
- guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+ guarantee_online_mems(cs, &attach_ctx.nodemask_to);
/*
* In the default hierarchy, enabling cpuset in the child cgroups
@@ -3230,7 +3229,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
*/
if (is_memory_migrate(cs)) {
cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
- &cpuset_attach_nodemask_to);
+ &attach_ctx.nodemask_to);
queue_task_work = true;
} else
mmput(mm);
@@ -3240,7 +3239,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
out:
if (queue_task_work)
schedule_flush_migrate_mm();
- cs->old_mems_allowed = cpuset_attach_nodemask_to;
+ cs->old_mems_allowed = attach_ctx.nodemask_to;
if (cs->nr_migrate_dl_tasks) {
atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
@@ -3693,7 +3692,7 @@ static void cpuset_fork(struct task_struct *task)
/* CLONE_INTO_CGROUP */
mutex_lock(&cpuset_mutex);
- guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+ guarantee_online_mems(cs, &attach_ctx.nodemask_to);
cpuset_attach_task(cs, task);
dec_attach_in_progress_locked();
--
2.54.0
* [PATCH-next v9 05/11] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
` (3 preceding siblings ...)
2026-06-30 3:33 ` [PATCH-next v9 04/11] cgroup/cpuset: Put all task attach related variables into attach_ctx Waiman Long
@ 2026-06-30 3:33 ` Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 06/11] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
` (5 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-06-30 3:33 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang, Waiman Long, Gregory Price
Extract the DL bandwidth allocation code in cpuset_can_attach() into a
new cpuset_reserve_dl_bw() helper to simplify the code.
No functional change is expected.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
Reviewed-by: Gregory Price <gourry@gourry.net>
---
kernel/cgroup/cpuset.c | 42 ++++++++++++++++++++++++------------------
1 file changed, 24 insertions(+), 18 deletions(-)
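The shape of the new helper (early returns instead of a goto chain) can be
sketched outside the kernel as below. This is a toy model only: the 64-bit
masks, toy_active_mask, toy_dl_bw_free and toy_dl_bw_alloc() are hypothetical
stand-ins for cpumasks, cpu_active_mask and dl_bw_alloc(), and the sketch
picks the lowest common CPU where the kernel uses cpumask_any_and().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct toy_cpuset {
	uint64_t effective_cpus;	/* bit n set => CPU n usable */
	uint64_t sum_migrate_dl_bw;	/* bandwidth of incoming DL tasks */
	int dl_bw_cpu;			/* CPU holding the reservation, or -1 */
};

/* Hypothetical global state: CPUs 0-3 active, 100 units of free bandwidth. */
static uint64_t toy_active_mask = 0xf;
static uint64_t toy_dl_bw_free = 100;

/* Toy allocator: fails with -EBUSY when the request does not fit. */
static int toy_dl_bw_alloc(int cpu, uint64_t bw)
{
	(void)cpu;
	if (bw > toy_dl_bw_free)
		return -EBUSY;
	toy_dl_bw_free -= bw;
	return 0;
}

/* Reserve bandwidth for the pending migration; 0 on success. */
static int toy_reserve_dl_bw(struct toy_cpuset *cs)
{
	uint64_t candidates = toy_active_mask & cs->effective_cpus;
	int cpu, ret;

	if (!cs->sum_migrate_dl_bw)
		return 0;		/* nothing to reserve */

	if (!candidates)
		return -EINVAL;		/* no active CPU in this cpuset */

	/* Find the lowest set bit; candidates is known to be non-zero. */
	for (cpu = 0; !(candidates & (1ULL << cpu)); cpu++)
		;

	ret = toy_dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
	if (ret)
		return ret;

	cs->dl_bw_cpu = cpu;		/* remembered for a later free */
	return 0;
}
```

With the helper returning a plain error code, the caller only has to test
ret once before deciding whether to reset the migration data or mark the
attach as in progress.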
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 47aa8f2fdfdc..c3c354ab61db 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3031,6 +3031,25 @@ static int cpuset_can_attach_check(struct cpuset *cs)
return 0;
}
+static int cpuset_reserve_dl_bw(struct cpuset *cs)
+{
+ int cpu, ret;
+
+ if (!cs->sum_migrate_dl_bw)
+ return 0;
+
+ cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+ if (unlikely(cpu >= nr_cpu_ids))
+ return -EINVAL;
+
+ ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+ if (ret)
+ return ret;
+
+ cs->dl_bw_cpu = cpu;
+ return 0;
+}
+
static void reset_migrate_dl_data(struct cpuset *cs)
{
cs->nr_migrate_dl_tasks = 0;
@@ -3045,7 +3064,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
struct cpuset *cs, *oldcs;
struct task_struct *task;
bool setsched_check;
- int cpu, ret;
+ int ret;
/* used later by cpuset_attach() */
attach_ctx.old_cs = task_cs(cgroup_taskset_first(tset, &css));
@@ -3101,27 +3120,14 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
}
}
- if (!cs->sum_migrate_dl_bw)
- goto out_success;
-
- cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
- if (unlikely(cpu >= nr_cpu_ids)) {
- ret = -EINVAL;
- goto out_unlock;
- }
-
- ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
- if (ret)
- goto out_unlock;
-
- cs->dl_bw_cpu = cpu;
-
-out_success:
- attach_ctx.in_progress++;
+ ret = cpuset_reserve_dl_bw(cs);
out_unlock:
if (ret)
reset_migrate_dl_data(cs);
+ else
+ attach_ctx.in_progress++;
+
mutex_unlock(&cpuset_mutex);
return ret;
}
--
2.54.0
* [PATCH-next v9 06/11] cgroup/cpuset: Expand the scope of cpuset_can_attach_check()
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
` (4 preceding siblings ...)
2026-06-30 3:33 ` [PATCH-next v9 05/11] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
@ 2026-06-30 3:33 ` Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 07/11] cgroup/cpuset: Make attach_ctx.old_cs track task group leader Waiman Long
` (4 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-06-30 3:33 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang, Waiman Long
Expand the scope of cpuset_can_attach_check() by moving the computation
of the setsched flag into it via the new @oldcs and @psetsched
arguments. As cpuset_can_attach_check() is also called from
cpuset_can_fork(), that caller passes NULL for both new arguments.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
---
kernel/cgroup/cpuset.c | 52 ++++++++++++++++++++++++------------------
1 file changed, 30 insertions(+), 22 deletions(-)
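The decision that moves into cpuset_can_attach_check() can be read as a
small pure function. The sketch below is an illustration only: struct
toy_attach_state and toy_need_setsched_check() are hypothetical names, and
the boolean fields flatten what the kernel computes from cs and oldcs
(cpuset_v2(), is_in_v2_mode(), cpumask_equal(), nodes_equal() and
cpumask_empty()).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flattened inputs; the kernel derives these from cs/oldcs. */
struct toy_attach_state {
	bool cgroup_v2;		/* stands in for cpuset_v2() */
	bool v2_mode;		/* stands in for is_in_v2_mode() */
	bool cpus_equal;	/* effective_cpus same in source and target */
	bool mems_equal;	/* effective_mems same in source and target */
	bool old_cpus_empty;	/* source cpuset has no effective CPU left */
};

/* True when security_task_setscheduler() should be consulted per task. */
static bool toy_need_setsched_check(const struct toy_attach_state *s)
{
	/* v2 skips the check only when neither CPUs nor memory nodes change. */
	bool check = !s->cgroup_v2 || !s->cpus_equal || !s->mems_equal;

	/* Outside v2 mode, tasks must be able to leave a cpuset with no CPU. */
	if (!s->v2_mode && s->old_cpus_empty)
		check = false;

	return check;
}
```

The two separate v2 inputs are deliberate in this sketch, since the patch
tests cpuset_v2() in one condition and is_in_v2_mode() in the other.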
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c3c354ab61db..4a3e2972884c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3022,12 +3022,39 @@ static int update_prstate(struct cpuset *cs, int new_prs)
* For v1, cpus_allowed and mems_allowed can't be empty.
* For v2, effective_cpus can't be empty.
* Note that in v1, effective_cpus = cpus_allowed.
+ *
+ * Also set the boolean flag passed in by @psetsched depending on if
+ * security_task_setscheduler() call is needed and @oldcs is not NULL.
*/
-static int cpuset_can_attach_check(struct cpuset *cs)
+static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
+ bool *psetsched)
{
if (cpumask_empty(cs->effective_cpus) ||
(!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
return -ENOSPC;
+
+ if (!oldcs)
+ return 0;
+
+ /*
+ * Skip rights over task setsched check in v2 when nothing changes,
+ * migration permission derives from hierarchy ownership in
+ * cgroup_procs_write_permission()).
+ */
+ *psetsched = !cpuset_v2() ||
+ !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
+ !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+ /*
+ * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
+ * brings the last online CPU offline as users are not allowed to empty
+ * cpuset.cpus when there are active tasks inside. When that happens,
+ * we should allow tasks to migrate out without security check to make
+ * sure they will be able to run after migration.
+ */
+ if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus))
+ *psetsched = false;
+
return 0;
}
@@ -3074,29 +3101,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
mutex_lock(&cpuset_mutex);
/* Check to see if task is allowed in the cpuset */
- ret = cpuset_can_attach_check(cs);
+ ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
if (ret)
goto out_unlock;
- /*
- * Skip rights over task setsched check in v2 when nothing changes,
- * migration permission derives from hierarchy ownership in
- * cgroup_procs_write_permission()).
- */
- setsched_check = !cpuset_v2() ||
- !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
- !nodes_equal(cs->effective_mems, oldcs->effective_mems);
-
- /*
- * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
- * brings the last online CPU offline as users are not allowed to empty
- * cpuset.cpus when there are active tasks inside. When that happens,
- * we should allow tasks to migrate out without security check to make
- * sure they will be able to run after migration.
- */
- if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus))
- setsched_check = false;
-
cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task);
if (ret)
@@ -3639,7 +3647,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
mutex_lock(&cpuset_mutex);
/* Check to see if task is allowed in the cpuset */
- ret = cpuset_can_attach_check(cs);
+ ret = cpuset_can_attach_check(cs, NULL, NULL);
if (ret)
goto out_unlock;
--
2.54.0
* [PATCH-next v9 07/11] cgroup/cpuset: Make attach_ctx.old_cs track task group leader
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
` (5 preceding siblings ...)
2026-06-30 3:33 ` [PATCH-next v9 06/11] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
@ 2026-06-30 3:33 ` Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
` (3 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-06-30 3:33 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang, Waiman Long
There are two possible ways that migration of tasks from multiple source
cpusets to a target cpuset can happen. Either a multithread application
with threads in different cpusets is wholly migrated to a new cpuset
or disabling of v2 cpuset controller will move all the tasks in child
cpusets to the parent cpuset.
In the former case, it is the mm setting of the group leader that
really matters. So attach_ctx.old_cs should track the oldcs of the
group leader. In the latter case, effective_mems of child cpusets
must always be a subset of the parent. So no real page migration
will be necessary no matter which child cpuset is selected as
attach_ctx.old_cs.
IOW, attach_ctx.old_cs should be updated to match the latest task
group leader in cpuset_can_attach(), but fall back to that of the first
task if there is no group leader in the taskset.
Suggested-by: Ridong Chen <ridong.chen@linux.dev>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
---
kernel/cgroup/cpuset.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
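The selection rule described above (latest group leader wins, first task as
fallback) can be sketched as a loop over a flat array. This is a toy model
under stated assumptions: struct toy_task, cs_id and toy_pick_old_cs() are
hypothetical, and an array walk stands in for cgroup_taskset_for_each().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task {
	bool is_group_leader;	/* models task == task->group_leader */
	int cs_id;		/* hypothetical id of the task's source cpuset */
};

/*
 * Return the source cpuset id to remember for the attach: the one of the
 * last group leader seen, else the one of the first task, else -1 when
 * the set is empty.
 */
static int toy_pick_old_cs(const struct toy_task *tasks, size_t n)
{
	int old_cs;
	size_t i;

	if (n == 0)
		return -1;

	old_cs = tasks[0].cs_id;	/* fallback: first task in the set */
	for (i = 0; i < n; i++) {
		if (tasks[i].is_group_leader)
			old_cs = tasks[i].cs_id;
	}
	return old_cs;
}
```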
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4a3e2972884c..55cd580373b7 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3105,11 +3105,32 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
if (ret)
goto out_unlock;
+ /*
+ * The attach_ctx.old_cs is used mainly by cpuset_migrate_mm() to get
+ * the old_mems_allowed value. There are two ways that many-to-one
+ * cpuset migration can happen:
+ * 1) A multithread application with threads in different cpusets is
+ * wholly migrated to a new cpuset.
+ * 2) Disabling v2 cpuset controller will move all the tasks in child
+ * cpusets to the parent cpuset.
+ *
+ * In the former case, it is the mm setting of the group leader that
+ * really matters. So attach_ctx.old_cs should track the oldcs of the
+ * group leader. It falls back to the oldcs of the first task if there
+ * is no group leader in the taskset. In the latter case, effective_mems
+ * of child cpusets must always be a subset of the parent. So no real
+ * page migration will be necessary no matter which child cpuset is
+ * selected as attach_ctx.old_cs.
+ */
cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task);
if (ret)
goto out_unlock;
+ /* Update attach_ctx.old_cs to the latest group leader */
+ if (task == task->group_leader)
+ attach_ctx.old_cs = task_cs(task);
+
if (setsched_check) {
ret = security_task_setscheduler(task);
if (ret)
--
2.54.0
* [PATCH-next v9 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task()
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
` (6 preceding siblings ...)
2026-06-30 3:33 ` [PATCH-next v9 07/11] cgroup/cpuset: Make attach_ctx.old_cs track task group leader Waiman Long
@ 2026-06-30 3:33 ` Waiman Long
2026-07-01 2:14 ` Ridong Chen
2026-06-30 3:33 ` [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach() Waiman Long
` (2 subsequent siblings)
10 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-06-30 3:33 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang, Waiman Long
cpuset_attach_task() was introduced in commit 42a11bf5c543
("cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly")
to make the CLONE_INTO_CGROUP flag of clone(2) behave more like
moving a task from one cpuset into another one. That commit didn't
move the mpol_rebind_mm() and cpuset_migrate_mm() calls for the group
leader into cpuset_attach_task().
When the CLONE_INTO_CGROUP flag is used without CLONE_THREAD, the new
task is its own group leader. So it is still not equivalent to moving
a task between cpusets in this case. Make CLONE_INTO_CGROUP behave
more like cpuset_attach() by moving the mpol_rebind_mm() and
cpuset_migrate_mm() calls inside cpuset_attach_task().
Also move the stack local cpus_updated, mems_updated and queue_task_work
flags into attach_ctx so that these flags can be accessed inside and
outside of cpuset_attach_task(). The cpuset_fork() function is updated
to set up these flags and do memory migration if necessary.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 104 ++++++++++++++++++++++++-----------------
1 file changed, 60 insertions(+), 44 deletions(-)
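After this change, cpuset_attach_task() becomes a ladder of early returns.
One way to read the diff below is as a function from four conditions to the
set of steps performed for a task. The sketch is an illustration only:
the STEP_* flags and toy_attach_task_steps() are hypothetical names, and it
assumes get_task_mm() returned a non-NULL mm.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the per-task steps in cpuset_attach_task(). */
enum {
	STEP_SET_CPUS	= 1 << 0,	/* set_cpus_allowed_ptr() */
	STEP_NODEMASK	= 1 << 1,	/* nodemask + spread flags update */
	STEP_REBIND_MM	= 1 << 2,	/* mpol_rebind_mm() */
	STEP_MIGRATE_MM	= 1 << 3,	/* cpuset_migrate_mm() */
};

/* Return the bitmask of steps taken for one task, per the patched logic. */
static unsigned int toy_attach_task_steps(bool cgroup_v2, bool mems_updated,
					  bool is_group_leader,
					  bool memory_migrate)
{
	unsigned int steps = STEP_SET_CPUS;

	/* v2 with unchanged memory nodes: only the CPU mask is updated. */
	if (cgroup_v2 && !mems_updated)
		return steps;
	steps |= STEP_NODEMASK;

	/* mm work is done once per thread group, and only when needed. */
	if (!is_group_leader || (!memory_migrate && !mems_updated))
		return steps;
	steps |= STEP_REBIND_MM;

	if (memory_migrate)
		steps |= STEP_MIGRATE_MM;
	return steps;
}
```

For the CLONE_INTO_CGROUP path, cpuset_fork() sets mems_updated to true, so
under this model a new group leader always reaches the mm rebind step.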
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 55cd580373b7..0b9df38e9a63 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -362,6 +362,9 @@ static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
*/
static struct {
int in_progress;
+ bool cpus_updated;
+ bool mems_updated;
+ bool task_work_queued;
struct cpuset *old_cs; /* Source cpuset */
nodemask_t nodemask_to;
} attach_ctx;
@@ -3190,6 +3193,8 @@ static cpumask_var_t cpus_attach;
static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
{
+ struct mm_struct *mm;
+
lockdep_assert_cpuset_lock_held();
if (cs != &top_cpuset)
@@ -3203,28 +3208,60 @@ static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
*/
WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+ if (cpuset_v2() && !attach_ctx.mems_updated)
+ return;
+
cpuset_change_task_nodemask(task, &attach_ctx.nodemask_to);
cpuset1_update_task_spread_flags(cs, task);
+
+ if ((task != task->group_leader) ||
+ (!is_memory_migrate(cs) && !attach_ctx.mems_updated))
+ return;
+
+ /*
+ * Change mm for threadgroup leader. This is expensive and may
+ * sleep and should be moved outside migration path proper.
+ */
+ mm = get_task_mm(task);
+ if (mm) {
+ struct cpuset *oldcs = attach_ctx.old_cs;
+
+ mpol_rebind_mm(mm, &cs->effective_mems);
+
+ /*
+ * old_mems_allowed is the same with mems_allowed
+ * here, except if this task is being moved
+ * automatically due to hotplug. In that case
+ * @mems_allowed has been updated and is empty, so
+ * @old_mems_allowed is the right nodesets that we
+ * migrate mm from.
+ */
+ if (is_memory_migrate(cs)) {
+ cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
+ &attach_ctx.nodemask_to);
+ attach_ctx.task_work_queued = true;
+ } else {
+ mmput(mm);
+ }
+ }
}
static void cpuset_attach(struct cgroup_taskset *tset)
{
struct task_struct *task;
- struct task_struct *leader;
struct cgroup_subsys_state *css;
struct cpuset *cs;
struct cpuset *oldcs = attach_ctx.old_cs;
- bool cpus_updated, mems_updated;
- bool queue_task_work = false;
cgroup_taskset_first(tset, &css);
cs = css_cs(css);
lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
mutex_lock(&cpuset_mutex);
- cpus_updated = !cpumask_equal(cs->effective_cpus,
- oldcs->effective_cpus);
- mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+ attach_ctx.task_work_queued = false;
+
+ attach_ctx.cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
+ attach_ctx.mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
guarantee_online_mems(cs, &attach_ctx.nodemask_to);
/*
@@ -3233,46 +3270,14 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* and mems. In that case, we can optimize out by skipping the task
* iteration and update.
*/
- if (cpuset_v2() && !cpus_updated && !mems_updated)
+ if (cpuset_v2() && !attach_ctx.cpus_updated && !attach_ctx.mems_updated)
goto out;
cgroup_taskset_for_each(task, css, tset)
cpuset_attach_task(cs, task);
- /*
- * Change mm for all threadgroup leaders. This is expensive and may
- * sleep and should be moved outside migration path proper. Skip it
- * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
- * not set.
- */
- if (!is_memory_migrate(cs) && !mems_updated)
- goto out;
-
- cgroup_taskset_for_each_leader(leader, css, tset) {
- struct mm_struct *mm = get_task_mm(leader);
-
- if (mm) {
- mpol_rebind_mm(mm, &cs->effective_mems);
-
- /*
- * old_mems_allowed is the same with mems_allowed
- * here, except if this task is being moved
- * automatically due to hotplug. In that case
- * @mems_allowed has been updated and is empty, so
- * @old_mems_allowed is the right nodesets that we
- * migrate mm from.
- */
- if (is_memory_migrate(cs)) {
- cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
- &attach_ctx.nodemask_to);
- queue_task_work = true;
- } else
- mmput(mm);
- }
- }
-
out:
- if (queue_task_work)
+ if (attach_ctx.task_work_queued)
schedule_flush_migrate_mm();
cs->old_mems_allowed = attach_ctx.nodemask_to;
@@ -3708,15 +3713,14 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
*/
static void cpuset_fork(struct task_struct *task)
{
- struct cpuset *cs;
- bool same_cs;
+ struct cpuset *cs, *oldcs;
rcu_read_lock();
cs = task_cs(task);
- same_cs = (cs == task_cs(current));
+ oldcs = task_cs(current);
rcu_read_unlock();
- if (same_cs) {
+ if (cs == oldcs) {
if (cs == &top_cpuset)
return;
@@ -3728,7 +3732,19 @@ static void cpuset_fork(struct task_struct *task)
/* CLONE_INTO_CGROUP */
mutex_lock(&cpuset_mutex);
guarantee_online_mems(cs, &attach_ctx.nodemask_to);
+ cs->old_mems_allowed = attach_ctx.nodemask_to;
+
+ /*
+ * Assume CPUs and memory nodes are updated
+ * A CLONE_INTO_CGROUP operation should have taken the cgroup mutex
+ * and so there shouldn't be a competing cpuset_attach() operation.
+ */
+ attach_ctx.cpus_updated = attach_ctx.mems_updated = true;
+ attach_ctx.task_work_queued = false;
+ attach_ctx.old_cs = oldcs;
cpuset_attach_task(cs, task);
+ if (attach_ctx.task_work_queued)
+ schedule_flush_migrate_mm();
dec_attach_in_progress_locked();
mutex_unlock(&cpuset_mutex);
--
2.54.0
* Re: [PATCH-next v9 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task()
2026-06-30 3:33 ` [PATCH-next v9 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
@ 2026-07-01 2:14 ` Ridong Chen
2026-07-01 20:30 ` Waiman Long
0 siblings, 1 reply; 24+ messages in thread
From: Ridong Chen @ 2026-07-01 2:14 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang
On 6/30/2026 11:33 AM, Waiman Long wrote:
> The cpuset_attach_task() was introduced in commit 42a11bf5c543
> ("cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly")
> to enable the CLONE_INTO_CGROUP flag of clone(2) to behave more like
> moving a task from one cpuset into another one. That commit didn't
> move the mpol_rebind_mm() and cpuset_migrate_mm() calls for the group
> leader into cpuset_attach_task().
>
> When the CLONE_INTO_CGROUP flag is used without CLONE_THREAD, the new
> task is its own group leader. So it is still not equivalent to moving
> a task between cpusets in this case. Make CLONE_INTO_CGROUP behave
> more like cpuset_attach() by moving the mpol_rebind_mm() and
> cpuset_migrate_mm() calls inside cpuset_attach_task().
>
> Also move the stack local cpus_updated, mems_updated and queue_task_work
> flags into attach_ctx so that these flags can be accessed inside and
> outside of cpuset_attach_task(). The cpuset_fork() function is updated
> to set up these flags and do memory migration if necessary.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 104 ++++++++++++++++++++++++-----------------
> 1 file changed, 60 insertions(+), 44 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 55cd580373b7..0b9df38e9a63 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -362,6 +362,9 @@ static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
> */
> static struct {
> int in_progress;
> + bool cpus_updated;
> + bool mems_updated;
> + bool task_work_queued;
> struct cpuset *old_cs; /* Source cpuset */
> nodemask_t nodemask_to;
> } attach_ctx;
> @@ -3190,6 +3193,8 @@ static cpumask_var_t cpus_attach;
>
> static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
> {
> + struct mm_struct *mm;
> +
> lockdep_assert_cpuset_lock_held();
>
> if (cs != &top_cpuset)
> @@ -3203,28 +3208,60 @@ static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
> */
> WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
>
> + if (cpuset_v2() && !attach_ctx.mems_updated)
> + return;
> +
> cpuset_change_task_nodemask(task, &attach_ctx.nodemask_to);
> cpuset1_update_task_spread_flags(cs, task);
> +
> + if ((task != task->group_leader) ||
> + (!is_memory_migrate(cs) && !attach_ctx.mems_updated))
> + return;
> +
Nit.
IIUC, the !is_memory_migrate(cs) check may be unnecessary. Previously,
placing this condition outside could prevent an unnecessary loop, but in
its current position, it appears redundant.
if (task != task->group_leader ||
!attach_ctx.mems_updated)
> + /*
> + * Change mm for threadgroup leader. This is expensive and may
> + * sleep and should be moved outside migration path proper.
> + */
> + mm = get_task_mm(task);
> + if (mm) {
> + struct cpuset *oldcs = attach_ctx.old_cs;
> +
> + mpol_rebind_mm(mm, &cs->effective_mems);
> +
> + /*
> + * old_mems_allowed is the same with mems_allowed
> + * here, except if this task is being moved
> + * automatically due to hotplug. In that case
> + * @mems_allowed has been updated and is empty, so
> + * @old_mems_allowed is the right nodesets that we
> + * migrate mm from.
> + */
> + if (is_memory_migrate(cs)) {
> + cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
> + &attach_ctx.nodemask_to);
> + attach_ctx.task_work_queued = true;
> + } else {
> + mmput(mm);
> + }
> + }
> }
>
> static void cpuset_attach(struct cgroup_taskset *tset)
> {
> struct task_struct *task;
> - struct task_struct *leader;
> struct cgroup_subsys_state *css;
> struct cpuset *cs;
> struct cpuset *oldcs = attach_ctx.old_cs;
> - bool cpus_updated, mems_updated;
> - bool queue_task_work = false;
>
> cgroup_taskset_first(tset, &css);
> cs = css_cs(css);
>
> lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
> mutex_lock(&cpuset_mutex);
> - cpus_updated = !cpumask_equal(cs->effective_cpus,
> - oldcs->effective_cpus);
> - mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> + attach_ctx.task_work_queued = false;
> +
> + attach_ctx.cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
> + attach_ctx.mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> guarantee_online_mems(cs, &attach_ctx.nodemask_to);
>
> /*
> @@ -3233,46 +3270,14 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> * and mems. In that case, we can optimize out by skipping the task
> * iteration and update.
> */
> - if (cpuset_v2() && !cpus_updated && !mems_updated)
> + if (cpuset_v2() && !attach_ctx.cpus_updated && !attach_ctx.mems_updated)
> goto out;
>
> cgroup_taskset_for_each(task, css, tset)
> cpuset_attach_task(cs, task);
>
> - /*
> - * Change mm for all threadgroup leaders. This is expensive and may
> - * sleep and should be moved outside migration path proper. Skip it
> - * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
> - * not set.
> - */
> - if (!is_memory_migrate(cs) && !mems_updated)
> - goto out;
> -
> - cgroup_taskset_for_each_leader(leader, css, tset) {
> - struct mm_struct *mm = get_task_mm(leader);
> -
> - if (mm) {
> - mpol_rebind_mm(mm, &cs->effective_mems);
> -
> - /*
> - * old_mems_allowed is the same with mems_allowed
> - * here, except if this task is being moved
> - * automatically due to hotplug. In that case
> - * @mems_allowed has been updated and is empty, so
> - * @old_mems_allowed is the right nodesets that we
> - * migrate mm from.
> - */
> - if (is_memory_migrate(cs)) {
> - cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
> - &attach_ctx.nodemask_to);
> - queue_task_work = true;
> - } else
> - mmput(mm);
> - }
> - }
> -
> out:
> - if (queue_task_work)
> + if (attach_ctx.task_work_queued)
> schedule_flush_migrate_mm();
> cs->old_mems_allowed = attach_ctx.nodemask_to;
>
> @@ -3708,15 +3713,14 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
> */
> static void cpuset_fork(struct task_struct *task)
> {
> - struct cpuset *cs;
> - bool same_cs;
> + struct cpuset *cs, *oldcs;
>
> rcu_read_lock();
> cs = task_cs(task);
> - same_cs = (cs == task_cs(current));
> + oldcs = task_cs(current);
> rcu_read_unlock();
>
> - if (same_cs) {
> + if (cs == oldcs) {
> if (cs == &top_cpuset)
> return;
>
> @@ -3728,7 +3732,19 @@ static void cpuset_fork(struct task_struct *task)
> /* CLONE_INTO_CGROUP */
> mutex_lock(&cpuset_mutex);
> guarantee_online_mems(cs, &attach_ctx.nodemask_to);
> + cs->old_mems_allowed = attach_ctx.nodemask_to;
> +
> + /*
> + * Assume CPUs and memory nodes are updated
> + * A CLONE_INTO_CGROUP operation should have taken the cgroup mutex
> + * and so there shouldn't be a competing cpuset_attach() operation.
> + */
> + attach_ctx.cpus_updated = attach_ctx.mems_updated = true;
> + attach_ctx.task_work_queued = false;
> + attach_ctx.old_cs = oldcs;
> cpuset_attach_task(cs, task);
> + if (attach_ctx.task_work_queued)
> + schedule_flush_migrate_mm();
>
> dec_attach_in_progress_locked();
> mutex_unlock(&cpuset_mutex);
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
--
Best regards
Ridong
* Re: [PATCH-next v9 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task()
2026-07-01 2:14 ` Ridong Chen
@ 2026-07-01 20:30 ` Waiman Long
0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-07-01 20:30 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang
On 6/30/26 10:14 PM, Ridong Chen wrote:
>> @@ -3203,28 +3208,60 @@ static void cpuset_attach_task(struct cpuset
>> *cs, struct task_struct *task)
>> */
>> WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
>> + if (cpuset_v2() && !attach_ctx.mems_updated)
>> + return;
>> +
>> cpuset_change_task_nodemask(task, &attach_ctx.nodemask_to);
>> cpuset1_update_task_spread_flags(cs, task);
>> +
>> + if ((task != task->group_leader) ||
>> + (!is_memory_migrate(cs) && !attach_ctx.mems_updated))
>> + return;
>> +
>
> Nit.
>
> IIUC, the !is_memory_migrate(cs) check may be unnecessary. Previously,
> placing this condition outside could prevent an unnecessary loop, but
> in its current position, it appears redundant.
>
> if (task != task->group_leader ||
> !attach_ctx.mems_updated)
You are right. The is_memory_migrate(cs) check is unnecessary. Will
remove it in the next version.
Cheers,
Longman
* [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach()
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
` (7 preceding siblings ...)
2026-06-30 3:33 ` [PATCH-next v9 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
@ 2026-06-30 3:33 ` Waiman Long
2026-07-01 2:35 ` Ridong Chen
2026-06-30 3:33 ` [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination " Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 11/11] selftests/cgroup: Add test for cpuset affinity on controller disable Waiman Long
10 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-06-30 3:33 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang, Waiman Long
There are 2 possible scenarios where the cgroup_taskset structure
passed into the cgroup can_attach() and attach() methods can contain
task migration data with multiple source cpusets.
- A multithreaded application with threads in different cpusets is
fully migrated into a new cpuset.
- Disabling the v2 cpuset controller moves all the tasks in child
cpusets to the parent cpuset.
The current cpuset_can_attach() and cpuset_attach() functions still
expect a task migration to go from one source cpuset to one destination
cpuset.
Fix that by tracking the set of source (old) cpusets in a singly linked
list. The list is iterated when necessary to properly update internal
data.
To ensure proper DL task accounting, nr_migrate_dl_tasks is decremented
in the source cpusets and incremented in the destination cpuset. These
values are added to the respective nr_deadline_tasks when the migration
is successful.
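
The source tracking and the two-phase DL accounting described above can be sketched in plain userspace C. This is only an illustrative model, not the kernel code: struct cs_model and the helpers track_source(), stage_dl_task(), clear_sources(), commit_attach() and cancel_attach() are hypothetical names, the counters are plain ints rather than atomic_t, and a hand-rolled next pointer stands in for the kernel llist API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cpuset; only the fields discussed here. */
struct cs_model {
	int nr_deadline_tasks;		/* committed count of DL tasks */
	int nr_migrate_dl_tasks;	/* pending delta, negative for a source */
	bool on_list;			/* models llist_on_list() */
	struct cs_model *next;		/* models attach_node */
};

/* Head of the singly linked list of source cpusets (models src_cs_head). */
static struct cs_model *src_head;

/* Record a source cpuset once, however many of its tasks are migrated. */
static void track_source(struct cs_model *src)
{
	if (src->on_list)
		return;
	src->on_list = true;
	src->next = src_head;
	src_head = src;
}

/* can_attach phase: note one DL task moving from src to dst. */
static void stage_dl_task(struct cs_model *dst, struct cs_model *src)
{
	track_source(src);
	dst->nr_migrate_dl_tasks++;
	src->nr_migrate_dl_tasks--;
}

/* Empty the list; apply the pending deltas unless the attach was cancelled. */
static void clear_sources(bool cancel)
{
	struct cs_model *cs = src_head, *next;

	src_head = NULL;
	for (; cs; cs = next) {
		next = cs->next;
		cs->next = NULL;
		cs->on_list = false;
		if (!cancel)
			cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
		cs->nr_migrate_dl_tasks = 0;
	}
}

/* attach phase: commit the destination first, then every source. */
static void commit_attach(struct cs_model *dst)
{
	dst->nr_deadline_tasks += dst->nr_migrate_dl_tasks;
	dst->nr_migrate_dl_tasks = 0;
	clear_sources(false);
}

/* cancel phase: discard all pending deltas without applying them. */
static void cancel_attach(struct cs_model *dst)
{
	dst->nr_migrate_dl_tasks = 0;
	clear_sources(true);
}
```

In this model the committed counts change only in commit_attach(), so a cancelled attach leaves every nr_deadline_tasks value as it was before staging.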
The setting of the global attach_ctx.cpus_updated and
attach_ctx.mems_updated flags is also moved from cpuset_attach() to
cpuset_can_attach(), because the correct source cpuset can no longer be
determined in cpuset_attach() and, with an earlier patch, cpuset states
will not change between cpuset_can_attach() and cpuset_attach().
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset-internal.h | 5 +++
kernel/cgroup/cpuset.c | 65 ++++++++++++++++++++++++++++-----
2 files changed, 60 insertions(+), 10 deletions(-)
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index df662c7fd1a4..e7d010661fd3 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -145,6 +145,11 @@ struct cpuset {
*/
nodemask_t old_mems_allowed;
+ /*
+ * For linking impacted cpusets during an attach operation.
+ */
+ struct llist_node attach_node;
+
/* partition root state */
int partition_root_state;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0b9df38e9a63..b201f4ba18b6 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -37,6 +37,7 @@
#include <linux/wait.h>
#include <linux/workqueue.h>
#include <linux/task_work.h>
+#include <linux/llist.h>
DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -368,6 +369,7 @@ static struct {
struct cpuset *old_cs; /* Source cpuset */
nodemask_t nodemask_to;
} attach_ctx;
+static LLIST_HEAD(src_cs_head);
/*
* Wait if task attach is in progress until it is done and then acquire
@@ -615,6 +617,7 @@ static struct cpuset *dup_or_alloc_cpuset(struct cpuset *cs)
return NULL;
trial->dl_bw_cpu = -1;
+ init_llist_node(&trial->attach_node);
/* Setup cpumask pointer array */
cpumask_var_t *pmask[4] = {
@@ -3032,6 +3035,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
bool *psetsched)
{
+ bool cpus_updated, mems_updated;
+
if (cpumask_empty(cs->effective_cpus) ||
(!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
return -ENOSPC;
@@ -3039,14 +3044,23 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
if (!oldcs)
return 0;
+ if (!llist_on_list(&oldcs->attach_node))
+ llist_add(&oldcs->attach_node, &src_cs_head);
+
+ cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
+ mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+ if (cpus_updated)
+ attach_ctx.cpus_updated = true;
+ if (mems_updated)
+ attach_ctx.mems_updated = true;
+
/*
* Skip rights over task setsched check in v2 when nothing changes,
* migration permission derives from hierarchy ownership in
* cgroup_procs_write_permission()).
*/
- *psetsched = !cpuset_v2() ||
- !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
- !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+ *psetsched = !cpuset_v2() || cpus_updated || mems_updated;
/*
* A v1 cpuset with tasks will have no CPU left only when CPU hotplug
@@ -3087,6 +3101,25 @@ static void reset_migrate_dl_data(struct cpuset *cs)
cs->dl_bw_cpu = -1;
}
+/*
+ * Clear and optionally apply (@cancel is false) the attach related data in the
+ * source cpusets.
+ */
+static void clear_attach_data(struct llist_head *head, bool cancel)
+{
+ struct cpuset *cs, *next;
+ struct llist_node *lnode = __llist_del_all(head);
+
+ llist_for_each_entry_safe(cs, next, lnode, attach_node) {
+ init_llist_node(&cs->attach_node);
+ if (cs->nr_migrate_dl_tasks) {
+ if (!cancel)
+ atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
+ cs->nr_migrate_dl_tasks = 0;
+ }
+ }
+}
+
/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
static int cpuset_can_attach(struct cgroup_taskset *tset)
{
@@ -3102,6 +3135,8 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
cs = css_cs(css);
mutex_lock(&cpuset_mutex);
+ attach_ctx.cpus_updated = false;
+ attach_ctx.mems_updated = false;
/* Check to see if task is allowed in the cpuset */
ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
@@ -3126,6 +3161,15 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* selected as attach_ctx.old_cs.
*/
cgroup_taskset_for_each(task, css, tset) {
+ struct cpuset *new_oldcs = task_cs(task);
+
+ if (new_oldcs != oldcs) {
+ oldcs = new_oldcs;
+ ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
+ if (ret)
+ goto out_unlock;
+ }
+
ret = task_can_attach(task);
if (ret)
goto out_unlock;
@@ -3147,6 +3191,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* contribute to sum_migrate_dl_bw.
*/
cs->nr_migrate_dl_tasks++;
+ oldcs->nr_migrate_dl_tasks--;
if (dl_task_needs_bw_move(task, cs->effective_cpus))
cs->sum_migrate_dl_bw += task->dl.dl_bw;
}
@@ -3155,10 +3200,12 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
ret = cpuset_reserve_dl_bw(cs);
out_unlock:
- if (ret)
+ if (ret) {
reset_migrate_dl_data(cs);
- else
+ clear_attach_data(&src_cs_head, true);
+ } else {
attach_ctx.in_progress++;
+ }
mutex_unlock(&cpuset_mutex);
return ret;
@@ -3174,6 +3221,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
mutex_lock(&cpuset_mutex);
dec_attach_in_progress_locked();
+ clear_attach_data(&src_cs_head, true);
if (cs->dl_bw_cpu >= 0)
dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
@@ -3251,7 +3299,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
struct task_struct *task;
struct cgroup_subsys_state *css;
struct cpuset *cs;
- struct cpuset *oldcs = attach_ctx.old_cs;
cgroup_taskset_first(tset, &css);
cs = css_cs(css);
@@ -3259,9 +3306,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
mutex_lock(&cpuset_mutex);
attach_ctx.task_work_queued = false;
-
- attach_ctx.cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
- attach_ctx.mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
guarantee_online_mems(cs, &attach_ctx.nodemask_to);
/*
@@ -3283,10 +3327,10 @@ static void cpuset_attach(struct cgroup_taskset *tset)
if (cs->nr_migrate_dl_tasks) {
atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
- atomic_sub(cs->nr_migrate_dl_tasks, &oldcs->nr_deadline_tasks);
reset_migrate_dl_data(cs);
}
+ clear_attach_data(&src_cs_head, false);
dec_attach_in_progress_locked();
mutex_unlock(&cpuset_mutex);
@@ -3793,6 +3837,7 @@ int __init cpuset_init(void)
cpumask_setall(top_cpuset.effective_xcpus);
cpumask_setall(top_cpuset.exclusive_cpus);
nodes_setall(top_cpuset.effective_mems);
+ init_llist_node(&top_cpuset.attach_node);
cpuset1_init(&top_cpuset);
--
2.54.0
* Re: [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach()
2026-06-30 3:33 ` [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach() Waiman Long
@ 2026-07-01 2:35 ` Ridong Chen
2026-07-01 20:44 ` Waiman Long
0 siblings, 1 reply; 24+ messages in thread
From: Ridong Chen @ 2026-07-01 2:35 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang
On 6/30/2026 11:33 AM, Waiman Long wrote:
> There are 2 possible scenarios where the cgroup_taskset structure
> passed into the cgroup can_attach() and attach() methods can contain
> task migration data with multiple source cpusets.
>
> - A multithreaded application with threads in different cpusets is
> fully migrated into a new cpuset.
> - Disabling the v2 cpuset controller moves all the tasks in child
> cpusets to the parent cpuset.
>
> The current cpuset_can_attach() and cpuset_attach() functions still
> expect a task migration to go from one source cpuset to one destination
> cpuset.
>
> Fix that by tracking the set of source (old) cpusets in a singly linked
> list. The list is iterated when necessary to properly update internal
> data.
>
> To ensure proper DL task accounting, nr_migrate_dl_tasks is decremented
> in the source cpusets and incremented in the destination cpuset. These
> values are added to the respective nr_deadline_tasks when the migration
> is successful.
>
> The setting of the global attach_ctx.cpus_updated and
> attach_ctx.mems_updated flags is also moved from cpuset_attach() to
> cpuset_can_attach(), because the correct source cpuset can no longer be
> determined in cpuset_attach() and, with an earlier patch, cpuset states
> will not change between cpuset_can_attach() and cpuset_attach().
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset-internal.h | 5 +++
> kernel/cgroup/cpuset.c | 65 ++++++++++++++++++++++++++++-----
> 2 files changed, 60 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index df662c7fd1a4..e7d010661fd3 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -145,6 +145,11 @@ struct cpuset {
> */
> nodemask_t old_mems_allowed;
>
> + /*
> + * For linking impacted cpusets during an attach operation.
> + */
> + struct llist_node attach_node;
> +
> /* partition root state */
> int partition_root_state;
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 0b9df38e9a63..b201f4ba18b6 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -37,6 +37,7 @@
> #include <linux/wait.h>
> #include <linux/workqueue.h>
> #include <linux/task_work.h>
> +#include <linux/llist.h>
>
> DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
> DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
> @@ -368,6 +369,7 @@ static struct {
> struct cpuset *old_cs; /* Source cpuset */
> nodemask_t nodemask_to;
> } attach_ctx;
> +static LLIST_HEAD(src_cs_head);
>
> /*
> * Wait if task attach is in progress until it is done and then acquire
> @@ -615,6 +617,7 @@ static struct cpuset *dup_or_alloc_cpuset(struct cpuset *cs)
> return NULL;
>
> trial->dl_bw_cpu = -1;
> + init_llist_node(&trial->attach_node);
>
> /* Setup cpumask pointer array */
> cpumask_var_t *pmask[4] = {
> @@ -3032,6 +3035,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
> static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
> bool *psetsched)
> {
> + bool cpus_updated, mems_updated;
> +
These local variables are unnecessary; we can just use
attach_ctx.cpus_updated and attach_ctx.mems_updated.
> if (cpumask_empty(cs->effective_cpus) ||
> (!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
> return -ENOSPC;
> @@ -3039,14 +3044,23 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
> if (!oldcs)
> return 0;
>
> + if (!llist_on_list(&oldcs->attach_node))
> + llist_add(&oldcs->attach_node, &src_cs_head);
> +
> + cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
> + mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> +
> + if (cpus_updated)
> + attach_ctx.cpus_updated = true;
> + if (mems_updated)
> + attach_ctx.mems_updated = true;
> +
attach_ctx.cpus_updated = !cpumask_equal(cs->effective_cpus,
oldcs->effective_cpus);
attach_ctx.mems_updated = !nodes_equal(cs->effective_mems,
oldcs->effective_mems);
> /*
> * Skip rights over task setsched check in v2 when nothing changes,
> * migration permission derives from hierarchy ownership in
> * cgroup_procs_write_permission()).
> */
> - *psetsched = !cpuset_v2() ||
> - !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
> - !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> + *psetsched = !cpuset_v2() || cpus_updated || mems_updated;
>
> /*
> * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
> @@ -3087,6 +3101,25 @@ static void reset_migrate_dl_data(struct cpuset *cs)
> cs->dl_bw_cpu = -1;
> }
>
> +/*
> + * Clear and optionally apply (@cancel is false) the attach related data in the
> + * source cpusets.
> + */
> +static void clear_attach_data(struct llist_head *head, bool cancel)
> +{
> + struct cpuset *cs, *next;
> + struct llist_node *lnode = __llist_del_all(head);
> +
> + llist_for_each_entry_safe(cs, next, lnode, attach_node) {
> + init_llist_node(&cs->attach_node);
> + if (cs->nr_migrate_dl_tasks) {
> + if (!cancel)
> + atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
> + cs->nr_migrate_dl_tasks = 0;
> + }
> + }
> +}
> +
> /* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
> static int cpuset_can_attach(struct cgroup_taskset *tset)
> {
> @@ -3102,6 +3135,8 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> cs = css_cs(css);
>
> mutex_lock(&cpuset_mutex);
> + attach_ctx.cpus_updated = false;
> + attach_ctx.mems_updated = false;
>
> /* Check to see if task is allowed in the cpuset */
> ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
> @@ -3126,6 +3161,15 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> * selected as attach_ctx.old_cs.
> */
> cgroup_taskset_for_each(task, css, tset) {
> + struct cpuset *new_oldcs = task_cs(task);
> +
> + if (new_oldcs != oldcs) {
> + oldcs = new_oldcs;
> + ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
> + if (ret)
> + goto out_unlock;
> + }
> +
> ret = task_can_attach(task);
> if (ret)
> goto out_unlock;
> @@ -3147,6 +3191,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> * contribute to sum_migrate_dl_bw.
> */
> cs->nr_migrate_dl_tasks++;
> + oldcs->nr_migrate_dl_tasks--;
> if (dl_task_needs_bw_move(task, cs->effective_cpus))
> cs->sum_migrate_dl_bw += task->dl.dl_bw;
> }
> @@ -3155,10 +3200,12 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> ret = cpuset_reserve_dl_bw(cs);
>
> out_unlock:
> - if (ret)
> + if (ret) {
> reset_migrate_dl_data(cs);
> - else
> + clear_attach_data(&src_cs_head, true);
This doesn't seem right, because clear_attach_data uses
cs->nr_migrate_dl_tasks, which has already been cleared by
reset_migrate_dl_data earlier, hasn't it?
> + } else {
> attach_ctx.in_progress++;
> + }
>
> mutex_unlock(&cpuset_mutex);
> return ret;
> @@ -3174,6 +3221,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>
> mutex_lock(&cpuset_mutex);
> dec_attach_in_progress_locked();
> + clear_attach_data(&src_cs_head, true);
>
> if (cs->dl_bw_cpu >= 0)
> dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
> @@ -3251,7 +3299,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> struct task_struct *task;
> struct cgroup_subsys_state *css;
> struct cpuset *cs;
> - struct cpuset *oldcs = attach_ctx.old_cs;
>
> cgroup_taskset_first(tset, &css);
> cs = css_cs(css);
> @@ -3259,9 +3306,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
> mutex_lock(&cpuset_mutex);
> attach_ctx.task_work_queued = false;
> -
> - attach_ctx.cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
> - attach_ctx.mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> guarantee_online_mems(cs, &attach_ctx.nodemask_to);
>
> /*
> @@ -3283,10 +3327,10 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>
> if (cs->nr_migrate_dl_tasks) {
> atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
> - atomic_sub(cs->nr_migrate_dl_tasks, &oldcs->nr_deadline_tasks);
> reset_migrate_dl_data(cs);
> }
>
> + clear_attach_data(&src_cs_head, false);
> dec_attach_in_progress_locked();
>
> mutex_unlock(&cpuset_mutex);
> @@ -3793,6 +3837,7 @@ int __init cpuset_init(void)
> cpumask_setall(top_cpuset.effective_xcpus);
> cpumask_setall(top_cpuset.exclusive_cpus);
> nodes_setall(top_cpuset.effective_mems);
> + init_llist_node(&top_cpuset.attach_node);
>
> cpuset1_init(&top_cpuset);
>
--
Best regards
Ridong
* Re: [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach()
2026-07-01 2:35 ` Ridong Chen
@ 2026-07-01 20:44 ` Waiman Long
0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-07-01 20:44 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang
On 6/30/26 10:35 PM, Ridong Chen wrote:
>
>
> On 6/30/2026 11:33 AM, Waiman Long wrote:
>> There are 2 possible scenarios where the cgroup_taskset structure
>> passed into the cgroup can_attach() and attach() methods can contain
>> task migration data with multiple source cpusets.
>>
>> - A multithreaded application with threads in different cpusets is
>> fully migrated into a new cpuset.
>> - Disabling the v2 cpuset controller moves all the tasks in child
>> cpusets to the parent cpuset.
>>
>> The current cpuset_can_attach() and cpuset_attach() functions still
>> expect a task migration to go from one source cpuset to one destination
>> cpuset.
>>
>> Fix that by tracking the set of source (old) cpusets in a singly linked
>> list. The list is iterated when necessary to properly update internal
>> data.
>>
>> To ensure proper DL task accounting, nr_migrate_dl_tasks is decremented
>> in the source cpusets and incremented in the destination cpuset. These
>> values are added to the respective nr_deadline_tasks when the migration
>> is successful.
>>
>> The setting of the global attach_ctx.cpus_updated and
>> attach_ctx.mems_updated flags is also moved from cpuset_attach() to
>> cpuset_can_attach(), because the correct source cpuset can no longer be
>> determined in cpuset_attach() and, with an earlier patch, cpuset states
>> will not change between cpuset_can_attach() and cpuset_attach().
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/cgroup/cpuset-internal.h | 5 +++
>> kernel/cgroup/cpuset.c | 65 ++++++++++++++++++++++++++++-----
>> 2 files changed, 60 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset-internal.h
>> b/kernel/cgroup/cpuset-internal.h
>> index df662c7fd1a4..e7d010661fd3 100644
>> --- a/kernel/cgroup/cpuset-internal.h
>> +++ b/kernel/cgroup/cpuset-internal.h
>> @@ -145,6 +145,11 @@ struct cpuset {
>> */
>> nodemask_t old_mems_allowed;
>> + /*
>> + * For linking impacted cpusets during an attach operation.
>> + */
>> + struct llist_node attach_node;
>> +
>> /* partition root state */
>> int partition_root_state;
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 0b9df38e9a63..b201f4ba18b6 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -37,6 +37,7 @@
>> #include <linux/wait.h>
>> #include <linux/workqueue.h>
>> #include <linux/task_work.h>
>> +#include <linux/llist.h>
>> DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
>> DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
>> @@ -368,6 +369,7 @@ static struct {
>> struct cpuset *old_cs; /* Source cpuset */
>> nodemask_t nodemask_to;
>> } attach_ctx;
>> +static LLIST_HEAD(src_cs_head);
>> /*
>> * Wait if task attach is in progress until it is done and then
>> acquire
>> @@ -615,6 +617,7 @@ static struct cpuset *dup_or_alloc_cpuset(struct
>> cpuset *cs)
>> return NULL;
>> trial->dl_bw_cpu = -1;
>> + init_llist_node(&trial->attach_node);
>> /* Setup cpumask pointer array */
>> cpumask_var_t *pmask[4] = {
>> @@ -3032,6 +3035,8 @@ static int update_prstate(struct cpuset *cs,
>> int new_prs)
>> static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset
>> *oldcs,
>> bool *psetsched)
>> {
>> + bool cpus_updated, mems_updated;
>> +
>
> These local variables are unnecessary; we can just use
> attach_ctx.cpus_updated and attach_ctx.mems_updated.
These local variables are needed because I want the setsched flag to
depend only on the current oldcs/cs pair instead of on the global
setting. So cpus_updated and mems_updated can be false even when the
ones in attach_ctx are true.
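
The distinction between the per-pair locals and the accumulated global flags can be illustrated with a small userspace sketch. It is a simplified model under stated assumptions: ctx_model, struct pair_model, begin_can_attach() and check_pair() are hypothetical names, cpumasks and nodemasks are reduced to plain integers, and the return value models only the "cpus_updated || mems_updated" part of the setsched decision (the !cpuset_v2() term is left out).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the global attach_ctx flags. */
static struct {
	bool cpus_updated;
	bool mems_updated;
} ctx_model;

/* Hypothetical stand-in for a cpuset: masks reduced to plain integers. */
struct pair_model {
	unsigned long cpus;
	unsigned long mems;
};

/* Reset the accumulated flags at the start of a can_attach pass. */
static void begin_can_attach(void)
{
	ctx_model.cpus_updated = false;
	ctx_model.mems_updated = false;
}

/*
 * Check one (dst, src) pair. The return value depends only on this pair,
 * while the global flags accumulate across all pairs checked so far.
 */
static bool check_pair(const struct pair_model *dst,
		       const struct pair_model *src)
{
	bool cpus_updated = dst->cpus != src->cpus;
	bool mems_updated = dst->mems != src->mems;

	if (cpus_updated)
		ctx_model.cpus_updated = true;
	if (mems_updated)
		ctx_model.mems_updated = true;

	return cpus_updated || mems_updated;
}
```

With two sources, one differing from the destination and one identical to it, the second call returns false even though the accumulated ctx_model.cpus_updated is already true, which is the case the reply describes.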
>
>> if (cpumask_empty(cs->effective_cpus) ||
>> (!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
>> return -ENOSPC;
>> @@ -3039,14 +3044,23 @@ static int cpuset_can_attach_check(struct
>> cpuset *cs, struct cpuset *oldcs,
>> if (!oldcs)
>> return 0;
>> + if (!llist_on_list(&oldcs->attach_node))
>> + llist_add(&oldcs->attach_node, &src_cs_head);
>> +
>> + cpus_updated = !cpumask_equal(cs->effective_cpus,
>> oldcs->effective_cpus);
>> + mems_updated = !nodes_equal(cs->effective_mems,
>> oldcs->effective_mems);
>> +
>> + if (cpus_updated)
>> + attach_ctx.cpus_updated = true;
>> + if (mems_updated)
>> + attach_ctx.mems_updated = true;
>> +
>
> attach_ctx.cpus_updated = !cpumask_equal(cs->effective_cpus,
> oldcs->effective_cpus);
> attach_ctx.mems_updated = !nodes_equal(cs->effective_mems,
> oldcs->effective_mems);
>
>> /*
>> * Skip rights over task setsched check in v2 when nothing
>> changes,
>> * migration permission derives from hierarchy ownership in
>> * cgroup_procs_write_permission()).
>> */
>> - *psetsched = !cpuset_v2() ||
>> - !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
>> - !nodes_equal(cs->effective_mems, oldcs->effective_mems);
>> + *psetsched = !cpuset_v2() || cpus_updated || mems_updated;
>> /*
>> * A v1 cpuset with tasks will have no CPU left only when CPU
>> hotplug
>> @@ -3087,6 +3101,25 @@ static void reset_migrate_dl_data(struct
>> cpuset *cs)
>> cs->dl_bw_cpu = -1;
>> }
>> +/*
>> + * Clear and optionally apply (@cancel is false) the attach related
>> data in the
>> + * source cpusets.
>> + */
>> +static void clear_attach_data(struct llist_head *head, bool cancel)
>> +{
>> + struct cpuset *cs, *next;
>> + struct llist_node *lnode = __llist_del_all(head);
>> +
>> + llist_for_each_entry_safe(cs, next, lnode, attach_node) {
>> + init_llist_node(&cs->attach_node);
>> + if (cs->nr_migrate_dl_tasks) {
>> + if (!cancel)
>> + atomic_add(cs->nr_migrate_dl_tasks,
>> &cs->nr_deadline_tasks);
>> + cs->nr_migrate_dl_tasks = 0;
>> + }
>> + }
>> +}
>> +
>> /* Called by cgroups to determine if a cpuset is usable;
>> cpuset_mutex held */
>> static int cpuset_can_attach(struct cgroup_taskset *tset)
>> {
>> @@ -3102,6 +3135,8 @@ static int cpuset_can_attach(struct
>> cgroup_taskset *tset)
>> cs = css_cs(css);
>> mutex_lock(&cpuset_mutex);
>> + attach_ctx.cpus_updated = false;
>> + attach_ctx.mems_updated = false;
>> /* Check to see if task is allowed in the cpuset */
>> ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
>> @@ -3126,6 +3161,15 @@ static int cpuset_can_attach(struct
>> cgroup_taskset *tset)
>> * selected as attach_ctx.old_cs.
>> */
>> cgroup_taskset_for_each(task, css, tset) {
>> + struct cpuset *new_oldcs = task_cs(task);
>> +
>> + if (new_oldcs != oldcs) {
>> + oldcs = new_oldcs;
>> + ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
>> + if (ret)
>> + goto out_unlock;
>> + }
>> +
>> ret = task_can_attach(task);
>> if (ret)
>> goto out_unlock;
>> @@ -3147,6 +3191,7 @@ static int cpuset_can_attach(struct
>> cgroup_taskset *tset)
>> * contribute to sum_migrate_dl_bw.
>> */
>> cs->nr_migrate_dl_tasks++;
>> + oldcs->nr_migrate_dl_tasks--;
>> if (dl_task_needs_bw_move(task, cs->effective_cpus))
>> cs->sum_migrate_dl_bw += task->dl.dl_bw;
>> }
>> @@ -3155,10 +3200,12 @@ static int cpuset_can_attach(struct
>> cgroup_taskset *tset)
>> ret = cpuset_reserve_dl_bw(cs);
>> out_unlock:
>> - if (ret)
>> + if (ret) {
>> reset_migrate_dl_data(cs);
>> - else
>> + clear_attach_data(&src_cs_head, true);
>
> This doesn't seem right, because clear_attach_data uses
> cs->nr_migrate_dl_tasks, which has already been cleared by
> reset_migrate_dl_data earlier, hasn't it?
At this point, reset_migrate_dl_data() is for the destination cpuset
while clear_attach_data() is for the source cpusets only. In the next
patch, reset_migrate_dl_data() will be replaced by another call to
clear_attach_data().
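
As a rough userspace sketch of the staged-delta idea (struct cs_sketch, stage_dl_move() and clear_staged() are hypothetical simplifications, not the kernel functions): each cpuset carries its own pending count, positive on a destination and negative on a source, and the walk over a list either applies or discards it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct cpuset. */
struct cs_sketch {
	int nr_deadline_tasks;   /* committed count (an atomic_t in the kernel) */
	int nr_migrate_dl_tasks; /* staged delta: > 0 on a destination, < 0 on a source */
	struct cs_sketch *next;  /* singly linked list of cpusets touched by the attach */
};

/* Stage the move of one DL task from src to dst; nothing is committed yet. */
static void stage_dl_move(struct cs_sketch *src, struct cs_sketch *dst)
{
	assert(src != dst);
	dst->nr_migrate_dl_tasks++;
	src->nr_migrate_dl_tasks--;
}

/* Walk one list: apply the staged delta unless cancelling, then clear it. */
static void clear_staged(struct cs_sketch *head, int cancel)
{
	for (struct cs_sketch *cs = head; cs; cs = cs->next) {
		if (!cancel)
			cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
		cs->nr_migrate_dl_tasks = 0;
	}
}
```

Because the staged counts live in each cpuset, clearing the destination's count does not touch the counts held by the sources, which is why the two calls do not interfere.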
Cheers,
Longman
>
>
>> + } else {
>> attach_ctx.in_progress++;
>> + }
>> mutex_unlock(&cpuset_mutex);
>> return ret;
>> @@ -3174,6 +3221,7 @@ static void cpuset_cancel_attach(struct
>> cgroup_taskset *tset)
>> mutex_lock(&cpuset_mutex);
>> dec_attach_in_progress_locked();
>> + clear_attach_data(&src_cs_head, true);
>> if (cs->dl_bw_cpu >= 0)
>> dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
>> @@ -3251,7 +3299,6 @@ static void cpuset_attach(struct cgroup_taskset
>> *tset)
>> struct task_struct *task;
>> struct cgroup_subsys_state *css;
>> struct cpuset *cs;
>> - struct cpuset *oldcs = attach_ctx.old_cs;
>> cgroup_taskset_first(tset, &css);
>> cs = css_cs(css);
>> @@ -3259,9 +3306,6 @@ static void cpuset_attach(struct cgroup_taskset
>> *tset)
>> lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
>> mutex_lock(&cpuset_mutex);
>> attach_ctx.task_work_queued = false;
>> -
>> - attach_ctx.cpus_updated = !cpumask_equal(cs->effective_cpus,
>> oldcs->effective_cpus);
>> - attach_ctx.mems_updated = !nodes_equal(cs->effective_mems,
>> oldcs->effective_mems);
>> guarantee_online_mems(cs, &attach_ctx.nodemask_to);
>> /*
>> @@ -3283,10 +3327,10 @@ static void cpuset_attach(struct
>> cgroup_taskset *tset)
>> if (cs->nr_migrate_dl_tasks) {
>> atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
>> - atomic_sub(cs->nr_migrate_dl_tasks, &oldcs->nr_deadline_tasks);
>> reset_migrate_dl_data(cs);
>> }
>> + clear_attach_data(&src_cs_head, false);
>> dec_attach_in_progress_locked();
>> mutex_unlock(&cpuset_mutex);
>> @@ -3793,6 +3837,7 @@ int __init cpuset_init(void)
>> cpumask_setall(top_cpuset.effective_xcpus);
>> cpumask_setall(top_cpuset.exclusive_cpus);
>> nodes_setall(top_cpuset.effective_mems);
>> + init_llist_node(&top_cpuset.attach_node);
>> cpuset1_init(&top_cpuset);
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination cpusets for cpuset_*attach()
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
` (8 preceding siblings ...)
2026-06-30 3:33 ` [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach() Waiman Long
@ 2026-06-30 3:33 ` Waiman Long
2026-07-01 2:51 ` Ridong Chen
2026-06-30 3:33 ` [PATCH-next v9 11/11] selftests/cgroup: Add test for cpuset affinity on controller disable Waiman Long
10 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-06-30 3:33 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang, Waiman Long
The only case where the cgroup_taskset structure requires task migration
to multiple cpusets is when enabling a cpuset controller in cgroup v2
where the newly created child cpusets inherit the same effective CPUs
and memory nodes from the parent. In that case, task migration can happen
directly with no update to tasks' CPU and memory nodes assignment and no
further work needed from the cpuset side except updating nr_deadline_tasks
when DL tasks are involved and setting old_mems_allowed in the child
cpusets.
Do that by tracking all the destination cpusets with a new dst_cs_head
singly linked list. The reset_migrate_dl_data() function is integrated
into clear_attach_data() so that it can be used for both source and
destination cpusets.
It is assumed that a given cpuset cannot be both a source and a
destination cpuset. If such a condition happens or when there are multiple
destination cpusets with CPU or memory node changes, the current code
will not handle it correctly. So it will print a warning and fail the
attach operation in these unexpected cases as we will have to enhance the
code to support this if such use cases are valid and not coding errors.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset-internal.h | 1 +
kernel/cgroup/cpuset.c | 115 ++++++++++++++++++++------------
2 files changed, 72 insertions(+), 44 deletions(-)
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index e7d010661fd3..d1161b0a3d85 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -149,6 +149,7 @@ struct cpuset {
* For linking impacted cpusets during an attach operation.
*/
struct llist_node attach_node;
+ bool attach_source;
/* partition root state */
int partition_root_state;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b201f4ba18b6..1591d6dca66a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -366,10 +366,12 @@ static struct {
bool cpus_updated;
bool mems_updated;
bool task_work_queued;
+ bool many_dest_cs; /* Have many destination cpusets */
struct cpuset *old_cs; /* Source cpuset */
nodemask_t nodemask_to;
} attach_ctx;
static LLIST_HEAD(src_cs_head);
+static LLIST_HEAD(dst_cs_head);
/*
* Wait if task attach is in progress until it is done and then acquire
@@ -3044,8 +3046,23 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
if (!oldcs)
return 0;
- if (!llist_on_list(&oldcs->attach_node))
+ /*
+ * The same cpuset cannot be both a source and a destination.
+ * The current code does not support that, print a warning and
+ * fail the attach if so.
+ */
+ if (WARN_ON_ONCE((!oldcs->attach_source &&
+ llist_on_list(&oldcs->attach_node)) ||
+ cs->attach_source))
+ return -EINVAL;
+
+ if (!llist_on_list(&oldcs->attach_node)) {
llist_add(&oldcs->attach_node, &src_cs_head);
+ oldcs->attach_source = true;
+ }
+
+ if (!llist_on_list(&cs->attach_node))
+ llist_add(&cs->attach_node, &dst_cs_head);
cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
@@ -3075,35 +3092,31 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
return 0;
}
-static int cpuset_reserve_dl_bw(struct cpuset *cs)
+static int cpuset_reserve_dl_bw(void)
{
+ struct cpuset *cs;
int cpu, ret;
- if (!cs->sum_migrate_dl_bw)
- return 0;
+ llist_for_each_entry(cs, dst_cs_head.first, attach_node) {
+ if (!cs->sum_migrate_dl_bw)
+ continue;
- cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
- if (unlikely(cpu >= nr_cpu_ids))
- return -EINVAL;
+ cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+ if (unlikely(cpu >= nr_cpu_ids))
+ return -EINVAL;
- ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
- if (ret)
- return ret;
+ ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+ if (ret)
+ return ret;
- cs->dl_bw_cpu = cpu;
+ cs->dl_bw_cpu = cpu;
+ }
return 0;
}
-static void reset_migrate_dl_data(struct cpuset *cs)
-{
- cs->nr_migrate_dl_tasks = 0;
- cs->sum_migrate_dl_bw = 0;
- cs->dl_bw_cpu = -1;
-}
-
/*
* Clear and optionally apply (@cancel is false) the attach related data in the
- * source cpusets.
+ * source or destination cpuset.
*/
static void clear_attach_data(struct llist_head *head, bool cancel)
{
@@ -3115,8 +3128,13 @@ static void clear_attach_data(struct llist_head *head, bool cancel)
if (cs->nr_migrate_dl_tasks) {
if (!cancel)
atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
+ else if (cs->dl_bw_cpu >= 0) /* && cancel */
+ dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
cs->nr_migrate_dl_tasks = 0;
+ cs->sum_migrate_dl_bw = 0;
+ cs->dl_bw_cpu = -1;
}
+ cs->attach_source = false;
}
}
@@ -3137,6 +3155,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
mutex_lock(&cpuset_mutex);
attach_ctx.cpus_updated = false;
attach_ctx.mems_updated = false;
+ attach_ctx.many_dest_cs = false;
/* Check to see if task is allowed in the cpuset */
ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
@@ -3161,9 +3180,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* selected as attach_ctx.old_cs.
*/
cgroup_taskset_for_each(task, css, tset) {
+ struct cpuset *new_cs = css_cs(css);
struct cpuset *new_oldcs = task_cs(task);
- if (new_oldcs != oldcs) {
+ if ((new_oldcs != oldcs) || (new_cs != cs)) {
+ if (new_cs != cs)
+ attach_ctx.many_dest_cs = true;
+ cs = new_cs;
oldcs = new_oldcs;
ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
if (ret)
@@ -3197,12 +3220,28 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
}
}
- ret = cpuset_reserve_dl_bw(cs);
+ /*
+ * The only case where there are multiple destination cpusets for
+ * task migration is when enabling a v2 cpuset controller where
+ * tasks will be migrated to multiple child cpusets from a parent
+ * cpuset with the same effective CPUs and memory nodes. IOW, both
+ * attach_ctx.cpus_updated and attach_ctx.mems_updated should be false.
+ * If not, it is a condition that the current code cannot handle.
+ * Print a warning and abort the attach operation as further code
+ * change will be needed.
+ */
+ if (WARN_ON_ONCE(attach_ctx.many_dest_cs && (!cpuset_v2() ||
+ attach_ctx.cpus_updated || attach_ctx.mems_updated))) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ ret = cpuset_reserve_dl_bw();
out_unlock:
if (ret) {
- reset_migrate_dl_data(cs);
clear_attach_data(&src_cs_head, true);
+ clear_attach_data(&dst_cs_head, true);
} else {
attach_ctx.in_progress++;
}
@@ -3213,22 +3252,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
static void cpuset_cancel_attach(struct cgroup_taskset *tset)
{
- struct cgroup_subsys_state *css;
- struct cpuset *cs;
-
- cgroup_taskset_first(tset, &css);
- cs = css_cs(css);
-
mutex_lock(&cpuset_mutex);
dec_attach_in_progress_locked();
clear_attach_data(&src_cs_head, true);
-
- if (cs->dl_bw_cpu >= 0)
- dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
-
- if (cs->nr_migrate_dl_tasks)
- reset_migrate_dl_data(cs);
-
+ clear_attach_data(&dst_cs_head, true);
mutex_unlock(&cpuset_mutex);
}
@@ -3312,25 +3339,25 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* In the default hierarchy, enabling cpuset in the child cgroups
* will trigger a cpuset_attach() call with no change in effective cpus
* and mems. In that case, we can optimize out by skipping the task
- * iteration and update.
+ * iteration and update, but the destination cpuset list is iterated to
+ * set old_mems_allowed.
*/
- if (cpuset_v2() && !attach_ctx.cpus_updated && !attach_ctx.mems_updated)
+ if (cpuset_v2() && !attach_ctx.cpus_updated && !attach_ctx.mems_updated) {
+ llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+ cs->old_mems_allowed = attach_ctx.nodemask_to;
goto out;
+ }
+ /* Task iteration shouldn't happen with attach_ctx.many_dest_cs set */
cgroup_taskset_for_each(task, css, tset)
cpuset_attach_task(cs, task);
-out:
if (attach_ctx.task_work_queued)
schedule_flush_migrate_mm();
cs->old_mems_allowed = attach_ctx.nodemask_to;
-
- if (cs->nr_migrate_dl_tasks) {
- atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
- reset_migrate_dl_data(cs);
- }
-
+out:
clear_attach_data(&src_cs_head, false);
+ clear_attach_data(&dst_cs_head, false);
dec_attach_in_progress_locked();
mutex_unlock(&cpuset_mutex);
--
2.54.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination cpusets for cpuset_*attach()
2026-06-30 3:33 ` [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination " Waiman Long
@ 2026-07-01 2:51 ` Ridong Chen
2026-07-01 21:16 ` Waiman Long
0 siblings, 1 reply; 24+ messages in thread
From: Ridong Chen @ 2026-07-01 2:51 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang
On 6/30/2026 11:33 AM, Waiman Long wrote:
> The only case where the cgroup_taskset structure requires task migration
> to multiple cpusets is when enabling a cpuset controller in cgroup v2
> where the newly created child cpusets inherit the same effective CPUs
> and memory nodes from the parent. In that case, task migration can happen
> directly with no update to tasks' CPU and memory nodes assignment and no
> further work needed from the cpuset side except updating nr_deadline_tasks
> when DL tasks are involved and setting old_mems_allowed in the child
> cpusets.
>
> Do that by tracking all the destination cpusets with a new dst_cs_head
> singly linked list. The reset_migrate_dl_data() function is integrated
> into clear_attach_data() so that it can be used for both source and
> destination cpusets.
>
> It is assumed that a given cpuset cannot be both a source and a
> destination cpuset. If such a condition happens or when there are multiple
> destination cpusets with CPU or memory node changes, the current code
> will not handle it correctly. So it will print a warning and fail the
> attach operation in these unexpected cases as we will have to enhance the
> code to support this if such use cases are valid and not coding errors.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset-internal.h | 1 +
> kernel/cgroup/cpuset.c | 115 ++++++++++++++++++++------------
> 2 files changed, 72 insertions(+), 44 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index e7d010661fd3..d1161b0a3d85 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -149,6 +149,7 @@ struct cpuset {
> * For linking impacted cpusets during an attach operation.
> */
> struct llist_node attach_node;
> + bool attach_source;
>
> /* partition root state */
> int partition_root_state;
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b201f4ba18b6..1591d6dca66a 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -366,10 +366,12 @@ static struct {
> bool cpus_updated;
> bool mems_updated;
> bool task_work_queued;
> + bool many_dest_cs; /* Have many destination cpusets */
> struct cpuset *old_cs; /* Source cpuset */
> nodemask_t nodemask_to;
> } attach_ctx;
> static LLIST_HEAD(src_cs_head);
> +static LLIST_HEAD(dst_cs_head);
>
This looks a lot like the 'struct list_head mg_src_preload_node' and
'struct list_head mg_dst_preload_node' in struct css_set. Is there a
better way to reuse those instead of adding a separate tracking list here?
TJ, Michal, do you have any opinions on this?
> /*
> * Wait if task attach is in progress until it is done and then acquire
> @@ -3044,8 +3046,23 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
> if (!oldcs)
> return 0;
>
> - if (!llist_on_list(&oldcs->attach_node))
> + /*
> + * The same cpuset cannot be both a source and a destination.
> + * The current code does not support that, print a warning and
> + * fail the attach if so.
> + */
> + if (WARN_ON_ONCE((!oldcs->attach_source &&
> + llist_on_list(&oldcs->attach_node)) ||
> + cs->attach_source))
> + return -EINVAL;
> +
> + if (!llist_on_list(&oldcs->attach_node)) {
> llist_add(&oldcs->attach_node, &src_cs_head);
> + oldcs->attach_source = true;
> + }
> +
> + if (!llist_on_list(&cs->attach_node))
> + llist_add(&cs->attach_node, &dst_cs_head);
>
> cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
> mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> @@ -3075,35 +3092,31 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
> return 0;
> }
>
> -static int cpuset_reserve_dl_bw(struct cpuset *cs)
> +static int cpuset_reserve_dl_bw(void)
> {
> + struct cpuset *cs;
> int cpu, ret;
>
> - if (!cs->sum_migrate_dl_bw)
> - return 0;
> + llist_for_each_entry(cs, dst_cs_head.first, attach_node) {
> + if (!cs->sum_migrate_dl_bw)
> + continue;
>
> - cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> - if (unlikely(cpu >= nr_cpu_ids))
> - return -EINVAL;
> + cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> + if (unlikely(cpu >= nr_cpu_ids))
> + return -EINVAL;
>
> - ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> - if (ret)
> - return ret;
> + ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> + if (ret)
> + return ret;
>
> - cs->dl_bw_cpu = cpu;
> + cs->dl_bw_cpu = cpu;
> + }
> return 0;
> }
>
> -static void reset_migrate_dl_data(struct cpuset *cs)
> -{
> - cs->nr_migrate_dl_tasks = 0;
> - cs->sum_migrate_dl_bw = 0;
> - cs->dl_bw_cpu = -1;
> -}
> -
> /*
> * Clear and optionally apply (@cancel is false) the attach related data in the
> - * source cpusets.
> + * source or destination cpuset.
> */
> static void clear_attach_data(struct llist_head *head, bool cancel)
> {
> @@ -3115,8 +3128,13 @@ static void clear_attach_data(struct llist_head *head, bool cancel)
> if (cs->nr_migrate_dl_tasks) {
> if (!cancel)
> atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
> + else if (cs->dl_bw_cpu >= 0) /* && cancel */
> + dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
> cs->nr_migrate_dl_tasks = 0;
> + cs->sum_migrate_dl_bw = 0;
> + cs->dl_bw_cpu = -1;
> }
> + cs->attach_source = false;
> }
> }
>
> @@ -3137,6 +3155,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> mutex_lock(&cpuset_mutex);
> attach_ctx.cpus_updated = false;
> attach_ctx.mems_updated = false;
> + attach_ctx.many_dest_cs = false;
>
> /* Check to see if task is allowed in the cpuset */
> ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
> @@ -3161,9 +3180,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> * selected as attach_ctx.old_cs.
> */
> cgroup_taskset_for_each(task, css, tset) {
> + struct cpuset *new_cs = css_cs(css);
> struct cpuset *new_oldcs = task_cs(task);
>
> - if (new_oldcs != oldcs) {
> + if ((new_oldcs != oldcs) || (new_cs != cs)) {
> + if (new_cs != cs)
> + attach_ctx.many_dest_cs = true;
> + cs = new_cs;
> oldcs = new_oldcs;
> ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
> if (ret)
> @@ -3197,12 +3220,28 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> }
> }
>
> - ret = cpuset_reserve_dl_bw(cs);
> + /*
> + * The only case where there are multiple destination cpusets for
> + * task migration is when enabling a v2 cpuset controller where
> + * tasks will be migrated to multiple child cpusets from a parent
> + * cpuset with the same effective CPUs and memory nodes. IOW, both
> + * attach_ctx.cpus_updated and attach_ctx.mems_updated should be false.
> + * If not, it is a condition that the current code cannot handle.
> + * Print a warning and abort the attach operation as further code
> + * change will be needed.
> + */
> + if (WARN_ON_ONCE(attach_ctx.many_dest_cs && (!cpuset_v2() ||
> + attach_ctx.cpus_updated || attach_ctx.mems_updated))) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + ret = cpuset_reserve_dl_bw();
>
> out_unlock:
> if (ret) {
> - reset_migrate_dl_data(cs);
> clear_attach_data(&src_cs_head, true);
> + clear_attach_data(&dst_cs_head, true);
> } else {
> attach_ctx.in_progress++;
> }
> @@ -3213,22 +3252,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>
> static void cpuset_cancel_attach(struct cgroup_taskset *tset)
> {
> - struct cgroup_subsys_state *css;
> - struct cpuset *cs;
> -
> - cgroup_taskset_first(tset, &css);
> - cs = css_cs(css);
> -
> mutex_lock(&cpuset_mutex);
> dec_attach_in_progress_locked();
> clear_attach_data(&src_cs_head, true);
> -
> - if (cs->dl_bw_cpu >= 0)
> - dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
> -
> - if (cs->nr_migrate_dl_tasks)
> - reset_migrate_dl_data(cs);
> -
> + clear_attach_data(&dst_cs_head, true);
> mutex_unlock(&cpuset_mutex);
> }
>
> @@ -3312,25 +3339,25 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> * In the default hierarchy, enabling cpuset in the child cgroups
> * will trigger a cpuset_attach() call with no change in effective cpus
> * and mems. In that case, we can optimize out by skipping the task
> - * iteration and update.
> + * iteration and update, but the destination cpuset list is iterated to
> + * set old_mems_allowed.
> */
> - if (cpuset_v2() && !attach_ctx.cpus_updated && !attach_ctx.mems_updated)
> + if (cpuset_v2() && !attach_ctx.cpus_updated && !attach_ctx.mems_updated) {
> + llist_for_each_entry(cs, dst_cs_head.first, attach_node)
> + cs->old_mems_allowed = attach_ctx.nodemask_to;
> goto out;
> + }
>
> + /* Task iteration shouldn't happen with attach_ctx.many_dest_cs set */
> cgroup_taskset_for_each(task, css, tset)
> cpuset_attach_task(cs, task);
>
> -out:
> if (attach_ctx.task_work_queued)
> schedule_flush_migrate_mm();
> cs->old_mems_allowed = attach_ctx.nodemask_to;
> -
> - if (cs->nr_migrate_dl_tasks) {
> - atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
> - reset_migrate_dl_data(cs);
> - }
> -
> +out:
> clear_attach_data(&src_cs_head, false);
> + clear_attach_data(&dst_cs_head, false);
> dec_attach_in_progress_locked();
>
> mutex_unlock(&cpuset_mutex);
--
Best regards
Ridong
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination cpusets for cpuset_*attach()
2026-07-01 2:51 ` Ridong Chen
@ 2026-07-01 21:16 ` Waiman Long
0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-07-01 21:16 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang
On 6/30/26 10:51 PM, Ridong Chen wrote:
>
>
> On 6/30/2026 11:33 AM, Waiman Long wrote:
>> The only case where the cgroup_taskset structure requires task migration
>> to multiple cpusets is when enabling a cpuset controller in cgroup v2
>> where the newly created child cpusets inherit the same effective CPUs
>> and memory nodes from the parent. In that case, task migration can
>> happen
>> directly with no update to tasks' CPU and memory nodes assignment and no
>> further work needed from the cpuset side except updating
>> nr_deadline_tasks
>> when DL tasks are involved and setting old_mems_allowed in the child
>> cpusets.
>>
>> Do that by tracking all the destination cpusets with a new dst_cs_head
>> singly linked list. The reset_migrate_dl_data() function is integrated
>> into clear_attach_data() so that it can be used for both source and
>> destination cpusets.
>>
>> It is assumed that a given cpuset cannot be both a source and a
>> destination cpuset. If such a condition happens or when there are multiple
>> destination cpusets with CPU or memory node changes, the current code
>> will not handle it correctly. So it will print a warning and fail the
>> attach operation in these unexpected cases as we will have to enhance
>> the
>> code to support this if such use cases are valid and not coding errors.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/cgroup/cpuset-internal.h | 1 +
>> kernel/cgroup/cpuset.c | 115 ++++++++++++++++++++------------
>> 2 files changed, 72 insertions(+), 44 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset-internal.h
>> b/kernel/cgroup/cpuset-internal.h
>> index e7d010661fd3..d1161b0a3d85 100644
>> --- a/kernel/cgroup/cpuset-internal.h
>> +++ b/kernel/cgroup/cpuset-internal.h
>> @@ -149,6 +149,7 @@ struct cpuset {
>> * For linking impacted cpusets during an attach operation.
>> */
>> struct llist_node attach_node;
>> + bool attach_source;
>> /* partition root state */
>> int partition_root_state;
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index b201f4ba18b6..1591d6dca66a 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -366,10 +366,12 @@ static struct {
>> bool cpus_updated;
>> bool mems_updated;
>> bool task_work_queued;
>> + bool many_dest_cs; /* Have many destination cpusets */
>> struct cpuset *old_cs; /* Source cpuset */
>> nodemask_t nodemask_to;
>> } attach_ctx;
>> static LLIST_HEAD(src_cs_head);
>> +static LLIST_HEAD(dst_cs_head);
>
> This looks a lot like the 'struct list_head mg_src_preload_node' and
> 'struct list_head mg_dst_preload_node' in struct css_set. Is there a
> better way to reuse those instead of adding a separate tracking list
> here?
The cgroup_mgctx is a cgroup internal data structure which is not
exposed to individual controllers. Sharing it will have some risks if it
is accidentally modified.
Conversion of css_set iteration to cpuset iteration is a bit more
complicated as 2 or more css_sets may point to the same cpuset. So we
still have to track if a cpuset has been visited before.
It is doable, but I doubt it is worth the effort.
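
A rough userspace sketch of the visited-tracking point follows. The struct and function names are hypothetical, and the on_list flag merely stands in for something like llist_on_list() on the cpuset's attach_node. When several css_sets reference the same cpuset, a per-cpuset marker is still needed so that each cpuset is collected only once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cpuset with its list linkage. */
struct cpuset_sketch {
	bool on_list;               /* "already visited" marker */
	struct cpuset_sketch *next;
};

/* Hypothetical stand-in for a css_set; several may share one cpuset. */
struct css_set_sketch {
	struct cpuset_sketch *cs;
};

/*
 * Push each distinct cpuset referenced by sets[0..n-1] onto *head exactly
 * once and return how many distinct cpusets were collected.
 */
static size_t collect_unique(struct css_set_sketch *sets, size_t n,
			     struct cpuset_sketch **head)
{
	size_t count = 0;

	assert(head != NULL);

	for (size_t i = 0; i < n; i++) {
		struct cpuset_sketch *cs = sets[i].cs;

		if (cs->on_list)	/* reached earlier via another css_set */
			continue;
		cs->on_list = true;
		cs->next = *head;
		*head = cs;
		count++;
	}
	return count;
}
```

So reusing the css_set lists would save the list heads but not the per-cpuset bookkeeping.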
Cheers,
Longman
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH-next v9 11/11] selftests/cgroup: Add test for cpuset affinity on controller disable
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
` (9 preceding siblings ...)
2026-06-30 3:33 ` [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination " Waiman Long
@ 2026-06-30 3:33 ` Waiman Long
10 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-06-30 3:33 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Shuah Khan, Juri Lelli
Cc: cgroups, linux-kernel, linux-kselftest, Aaron Tomlin,
Guopeng Zhang, Waiman Long
From: Michal Koutný <mkoutny@suse.com>
Add a new selftest that exposes a bug in cpuset_attach() where thread
CPU affinity is not properly updated when the cpuset controller is
disabled in a threaded cgroup hierarchy.
The test creates a threaded cgroup hierarchy with two child cgroups
(A and B) having different cpuset.cpus constraints:
- Parent: cpuset.cpus=0-1
- Child A: cpuset.cpus=0-1
- Child B: cpuset.cpus=1 (restricted to CPU 1 only)
A multithreaded process is created with threads placed in different
cgroups. When the cpuset controller is disabled on the parent, thread
affinities should be updated to match the parent's cpuset.
Expected behavior:
- thread_a affinity: {0-1} before and after (unchanged)
- thread_b affinity: {1} before, {0-1} after (expanded)
Current buggy behavior:
- thread_b affinity remains {1} after controller disable
Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Waiman Long <longman@redhat.com>
---
tools/testing/selftests/cgroup/test_cpuset.c | 243 +++++++++++++++++++
1 file changed, 243 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_cpuset.c b/tools/testing/selftests/cgroup/test_cpuset.c
index c5cf8b56ceb8..8b4c4a9dd78b 100644
--- a/tools/testing/selftests/cgroup/test_cpuset.c
+++ b/tools/testing/selftests/cgroup/test_cpuset.c
@@ -1,7 +1,13 @@
// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <assert.h>
#include <linux/limits.h>
+#include <pthread.h>
+#include <sched.h>
#include <signal.h>
+#include <sys/syscall.h>
+#include <unistd.h>
#include "kselftest.h"
#include "cgroup_util.h"
@@ -232,6 +238,242 @@ static int test_cpuset_perms_subtree(const char *root)
return ret;
}
+static int get_cpu_affinity(cpu_set_t *mask)
+{
+ CPU_ZERO(mask);
+ return sched_getaffinity(0, sizeof(*mask), mask);
+}
+
+static int cpu_set_equal(cpu_set_t *dst, unsigned long mask)
+{
+ cpu_set_t expected;
+
+ CPU_ZERO(&expected);
+ assert(8 * sizeof(mask) <= CPU_SETSIZE);
+
+ for (size_t cpu = 0; cpu < 8 * sizeof(mask); ++cpu)
+ if ((1UL << cpu) & mask)
+ CPU_SET(cpu, &expected);
+
+ return CPU_EQUAL(&expected, dst);
+}
+
+enum test_phase {
+ AFFINITY_SETUP,
+ AFFINITY_THREAD_A_READY,
+ AFFINITY_THREADS_READY,
+ AFFINITY_CONTROLLER_DISABLED,
+ AFFINITY_COMPLETE,
+ AFFINITY_ERROR
+};
+
+struct thread_args {
+ const char *cgroup;
+ cpu_set_t *affinity_before;
+ cpu_set_t *affinity_after;
+ enum test_phase ready_phase;
+};
+
+static pthread_mutex_t test_mutex = PTHREAD_MUTEX_INITIALIZER;
+static pthread_cond_t test_cond = PTHREAD_COND_INITIALIZER;
+static enum test_phase test_phase;
+
+static void *affinity_thread_fn(void *arg)
+{
+ struct thread_args *args = (struct thread_args *)arg;
+
+ if (cg_enter_current_thread(args->cgroup))
+ goto fail;
+
+ if (get_cpu_affinity(args->affinity_before) != 0)
+ goto fail;
+
+ pthread_mutex_lock(&test_mutex);
+ if (test_phase < args->ready_phase)
+ test_phase = args->ready_phase;
+ pthread_cond_broadcast(&test_cond);
+
+ while (test_phase < AFFINITY_CONTROLLER_DISABLED)
+ pthread_cond_wait(&test_cond, &test_mutex);
+ pthread_mutex_unlock(&test_mutex);
+
+ if (get_cpu_affinity(args->affinity_after) != 0)
+ goto fail;
+
+
+ return NULL;
+
+fail:
+ pthread_mutex_lock(&test_mutex);
+ test_phase = AFFINITY_ERROR;
+ pthread_cond_broadcast(&test_cond);
+ pthread_mutex_unlock(&test_mutex);
+ return NULL;
+}
+
+/*
+ * Test that disabling cpuset controller properly updates thread affinity.
+ *
+ * This test exposes a bug in cpuset_attach() where threads in child cgroups
+ * don't get their affinity updated when the cpuset controller is disabled.
+ *
+ * Setup:
+ * - Create parent cgroup with cpuset.cpus=0-1
+ * - Create child A with cpuset.cpus=0-1
+ * - Create child B with cpuset.cpus=1
+ * - Place multithreaded process: group leader + thread_a in A, thread_b in B
+ * - Disable cpuset controller on parent
+ *
+ * Expected: thread_b's affinity should expand from {1} to {0-1}
+ * Buggy: thread_b's affinity remains {1}
+ */
+static int test_cpuset_affinity_on_controller_disable(const char *root)
+{
+ char *parent = NULL, *child_a = NULL, *child_b = NULL;
+ pthread_t thread_a, thread_b;
+ int thread_a_created = 0, thread_b_created = 0;
+ cpu_set_t affinity_a_before, affinity_a_after;
+ cpu_set_t affinity_b_before, affinity_b_after;
+ int ret = KSFT_FAIL;
+
+ parent = cg_name(root, "cpuset_affinity_test");
+ if (!parent)
+ goto cleanup;
+ if (cg_create(parent))
+ goto cleanup;
+ if (cg_write(parent, "cgroup.type", "threaded"))
+ goto cleanup;
+
+ child_a = cg_name(parent, "A");
+ if (!child_a)
+ goto cleanup;
+ if (cg_create(child_a))
+ goto cleanup;
+ if (cg_write(child_a, "cgroup.type", "threaded"))
+ goto cleanup;
+
+ child_b = cg_name(parent, "B");
+ if (!child_b)
+ goto cleanup;
+ if (cg_create(child_b))
+ goto cleanup;
+ if (cg_write(child_b, "cgroup.type", "threaded"))
+ goto cleanup;
+
+ /* Now enable cpuset controller in parent */
+ if (cg_write(parent, "cgroup.subtree_control", "+cpuset")) {
+ ret = KSFT_SKIP;
+ goto cleanup;
+ }
+
+ /* Set CPU affinity constraints */
+ if (cg_write(parent, "cpuset.cpus", "0-1"))
+ goto cleanup;
+ if (cg_write(child_a, "cpuset.cpus", "0-1"))
+ goto cleanup;
+ if (cg_write(child_b, "cpuset.cpus", "1"))
+ goto cleanup;
+
+ /* Move group leader (main thread) to child A */
+ if (cg_enter_current(child_a))
+ goto cleanup;
+
+ /* Create threads - they will move themselves to their respective cgroups */
+ test_phase = AFFINITY_SETUP;
+
+ struct thread_args args_a = {
+ .cgroup = child_a,
+ .affinity_before = &affinity_a_before,
+ .affinity_after = &affinity_a_after,
+ .ready_phase = AFFINITY_THREAD_A_READY,
+ };
+ if (pthread_create(&thread_a, NULL, affinity_thread_fn, &args_a))
+ goto cleanup;
+ thread_a_created = 1;
+
+ struct thread_args args_b = {
+ .cgroup = child_b,
+ .affinity_before = &affinity_b_before,
+ .affinity_after = &affinity_b_after,
+ .ready_phase = AFFINITY_THREADS_READY,
+ };
+ if (pthread_create(&thread_b, NULL, affinity_thread_fn, &args_b))
+ goto cleanup_threads;
+ thread_b_created = 1;
+
+ pthread_mutex_lock(&test_mutex);
+ while (test_phase < AFFINITY_THREADS_READY)
+ pthread_cond_wait(&test_cond, &test_mutex);
+
+ /* If a thread failed during setup, bail out */
+ if (test_phase == AFFINITY_ERROR) {
+ pthread_mutex_unlock(&test_mutex);
+ goto cleanup_threads;
+ }
+ pthread_mutex_unlock(&test_mutex);
+
+ if (!cpu_set_equal(&affinity_a_before, 0x3)) {
+ ksft_print_msg("FAIL: thread_a initial affinity incorrect\n");
+ goto cleanup_threads;
+ }
+
+ if (!cpu_set_equal(&affinity_b_before, 0x2)) {
+ ksft_print_msg("FAIL: thread_b initial affinity incorrect\n");
+ goto cleanup_threads;
+ }
+
+ /* Disable cpuset controller - this should trigger affinity update */
+ if (cg_write(parent, "cgroup.subtree_control", "-cpuset"))
+ goto cleanup_threads;
+
+ /* Signal threads to save their final affinity and exit */
+ pthread_mutex_lock(&test_mutex);
+ test_phase = AFFINITY_CONTROLLER_DISABLED;
+ pthread_cond_broadcast(&test_cond);
+ pthread_mutex_unlock(&test_mutex);
+
+ pthread_join(thread_a, NULL);
+ pthread_join(thread_b, NULL);
+
+ /* Verify thread affinities AFTER disabling controller */
+ if (!cpu_set_equal(&affinity_a_after, 0x3)) {
+ ksft_print_msg("FAIL: thread_a final affinity incorrect\n");
+ goto cleanup;
+ }
+
+ if (!cpu_set_equal(&affinity_b_after, 0x3)) {
+ ksft_print_msg("FAIL: thread_b affinity did not expand to {0-1}\n");
+ goto cleanup;
+ }
+
+ ret = KSFT_PASS;
+ goto cleanup;
+
+cleanup_threads:
+ pthread_mutex_lock(&test_mutex);
+ test_phase = AFFINITY_COMPLETE;
+ pthread_cond_broadcast(&test_cond);
+ pthread_mutex_unlock(&test_mutex);
+
+ if (thread_a_created)
+ pthread_join(thread_a, NULL);
+ if (thread_b_created)
+ pthread_join(thread_b, NULL);
+
+cleanup:
+ /* Move back to root before cleanup */
+ cg_enter_current(root);
+
+ cg_destroy(child_b);
+ free(child_b);
+ cg_destroy(child_a);
+ free(child_a);
+ cg_destroy(parent);
+ free(parent);
+
+ return ret;
+}
+
#define T(x) { x, #x }
struct cpuset_test {
@@ -241,6 +483,7 @@ struct cpuset_test {
T(test_cpuset_perms_object_allow),
T(test_cpuset_perms_object_deny),
T(test_cpuset_perms_subtree),
+ T(test_cpuset_affinity_on_controller_disable),
};
#undef T
--
2.54.0