* [PATCH-next v5 0/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
@ 2026-06-02 2:31 Waiman Long
2026-06-02 2:31 ` [PATCH-next v5 1/6] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
` (5 more replies)
0 siblings, 6 replies; 15+ messages in thread
From: Waiman Long @ 2026-06-02 2:31 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Peter Zijlstra
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long
v5:
- Remove the WARN_ON() call as it can be triggered in a corner case.
- Instead of passing the attach_cpus_updated and attach_mems_updated
flags from cpuset_can_attach() to cpuset_attach(), re-evaluate the
flags at the beginning of cpuset_attach() based on data in the source &
destination cpusets in the singly linked lists to eliminate the
Time-of-Check to Time-of-Use (TOCTOU) race condition & simplify the
code changes.
- Add back the dropped optimization in patch 5.
v4:
- Add a new patch 1 to fix inconsistency in node mask usage in
cpuset_update_tasks_nodemask() and cpuset_attach() and adjust
the subsequent patches accordingly.
- Update patch 3 to set the update flags whenever the CPU or node
mask is updated to address issue reported by Sashiko.
- Update patch 5 to remove unneeded setting of old_mems_allowed as
well as calling schedule_flush_migrate_mm() if queue_task_work is
set.
v3:
- Rebased to the latest linux-next tree.
- Keep cpuset_attach_old_cs as suggested by Chen Ridong and replace
patch 3 by a new one to make it track task group leader.
Sashiko AI review of another cpuset patch had found that cpuset_attach()
and cpuset_can_attach() can be passed a cgroup_taskset with tasks
migrating from one source cpuset to multiple destination cpusets and
vice versa. Further testing of the cpuset code indicates that this is
indeed the case when the v2 cpuset controller is enabled or disabled.
Unfortunately, cpuset_attach() and cpuset_can_attach() still assume that
there will be one source and one destination cpuset, which may result in
incorrect behavior.
This patch series is created to fix this issue.
Patch 1 is to fix an inconsistency in the way node mask update is being
handled in cpuset_update_tasks_nodemask() and cpuset_attach() so that
they match each other.
Patches 2 and 3 are just preparatory patches to make the remaining
patches easier to review.
Patch 4 makes cpuset_attach_old_cs track the group leader for use by
cpuset_migrate_mm().
Patch 5 moves mpol_rebind_mm() and cpuset_migrate_mm() inside
cpuset_attach_task() to make the CLONE_INTO_CGROUP flag of clone(2) work
more like moving a task from one cpuset to another, while also making
it easier to support multiple source and destination cpusets.
Patch 6 makes the necessary changes to enable the support of multiple
source and destination cpusets by keeping all the source and destination
cpusets found during task iterations in two singly linked lists for
source and destination cpusets respectively.
Waiman Long (6):
cgroup/cpuset: Fix node inconsistencies between
cpuset_update_tasks_nodemask() and cpuset_attach()
cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper
cgroup/cpuset: Expand the scope of cpuset_can_attach_check()
cgroup/cpuset: Make cpuset_attach_old_cs track task group leaders
cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside
cpuset_attach_task()
cgroup/cpuset: Support multiple source/destination cpusets for
cpuset_*attach()
kernel/cgroup/cpuset-internal.h | 6 +
kernel/cgroup/cpuset.c | 411 +++++++++++++++++++++++---------
2 files changed, 299 insertions(+), 118 deletions(-)
--
2.54.0
* [PATCH-next v5 1/6] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach()
2026-06-02 2:31 [PATCH-next v5 0/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
@ 2026-06-02 2:31 ` Waiman Long
2026-06-02 13:37 ` Ridong Chen
2026-06-02 2:31 ` [PATCH-next v5 2/6] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
` (4 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: Waiman Long @ 2026-06-02 2:31 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Peter Zijlstra
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long
Whenever memory node mask is changed, there are 4 places where the node
mask has to be updated or used.
1) task's node mask via cpuset_change_task_nodemask()
2) memory policy binding via mpol_rebind_mm()
3) if memory migration is enabled, migrate from old_mems_allowed to
the new node mask via cpuset_migrate_mm().
4) setting old_mems_allowed
These memory actions are done in cpuset_update_tasks_nodemask() and
cpuset_attach(). However there are inconsistencies in what node masks
are being used in these 2 functions.
In cpuset_update_tasks_nodemask(),
- cpuset_change_task_nodemask(): guarantee_online_mems()
- mpol_rebind_mm(): mems_allowed
- cpuset_migrate_mm(): guarantee_online_mems()
- old_mems_allowed: guarantee_online_mems()
In cpuset_attach(),
- cpuset_change_task_nodemask(): guarantee_online_mems()
- mpol_rebind_mm(): effective_mems
- cpuset_migrate_mm(): effective_mems
- old_mems_allowed: effective_mems
These inconsistencies date back quite a long time and it is
hard to say what the correct values should be.
The guarantee_online_mems() function returns a node mask from current or
an ancestor cpuset that is a subset of node_states[N_MEMORY]. Nodes in
node_states[N_MEMORY] are all online, i.e. in node_states[N_ONLINE].
However, a node in node_states[N_ONLINE] may not have memory. So
node_states[N_MEMORY] should be a subset of node_states[N_ONLINE].
The guarantee_online_mems() function should only be useful for v1 where
mems_allowed is the same as effective_mems. With v2, the memory nodes
in effective_mems should always be a subset of node_states[N_MEMORY].
The only time that may not be true is when a memory hot-unplug operation
is in progress and a memory node is removed from node_states[N_MEMORY]
but not yet reflected in effective_mems as cpuset_handle_hotplug()
has not yet been called from cpuset_track_online_nodes(). When
cpuset_handle_hotplug() is called later, the memory node setting
of the relevant cpusets and tasks will be updated. So replacing the
guarantee_online_mems() call by just using cs->effective_mems should
be fine.
Let's use the following setup for both of them to make them consistent.
- cpuset_change_task_nodemask(): guarantee_online_mems()
- mpol_rebind_mm(): effective_mems
- cpuset_migrate_mm(): guarantee_online_mems()
- old_mems_allowed: guarantee_online_mems()
So for v2, it is effectively all effective_mems. For v1, mpol_rebind_mm()
uses mems_allowed (the same as effective_mems), which may differ from what
guarantee_online_mems() returns.
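As an illustration only (not part of the patch), the mask selection described
above can be modelled in userspace C. The toy_* names, the 64-bit bitmap
representation and the nodes_with_memory parameter standing in for
node_states[N_MEMORY] are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: a node mask is a 64-bit bitmap, one bit per NUMA node. */
typedef uint64_t toy_nodemask;

struct toy_cpuset {
	struct toy_cpuset *parent;	/* NULL for the root cpuset */
	toy_nodemask effective_mems;
};

/* The four places where a node mask is consumed, per the list above. */
struct toy_masks {
	toy_nodemask task_nodemask;	/* cpuset_change_task_nodemask() */
	toy_nodemask mpol_rebind;	/* mpol_rebind_mm() */
	toy_nodemask migrate_to;	/* cpuset_migrate_mm() destination */
	toy_nodemask old_mems_allowed;	/* value saved in old_mems_allowed */
};

/*
 * Model of guarantee_online_mems(): walk up the hierarchy until
 * effective_mems intersects the nodes that have memory, then return
 * the intersection.
 */
static toy_nodemask toy_guarantee_online_mems(const struct toy_cpuset *cs,
					      toy_nodemask nodes_with_memory)
{
	assert(cs != NULL);
	while (cs->parent && !(cs->effective_mems & nodes_with_memory))
		cs = cs->parent;
	return cs->effective_mems & nodes_with_memory;
}

/* Pick the masks the way both functions are meant to after this patch. */
static struct toy_masks toy_pick_masks(const struct toy_cpuset *cs, bool v2,
				       toy_nodemask nodes_with_memory)
{
	toy_nodemask newmems = v2 ? cs->effective_mems :
		toy_guarantee_online_mems(cs, nodes_with_memory);
	struct toy_masks m = {
		.task_nodemask = newmems,
		.mpol_rebind = cs->effective_mems,
		.migrate_to = newmems,
		.old_mems_allowed = newmems,
	};

	return m;
}
```

In this model the v1 and v2 results differ only when effective_mems contains
a node that currently has no memory, which matches the hot-unplug window
discussed above.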
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 32 +++++++++++++++++++++-----------
1 file changed, 21 insertions(+), 11 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6bdb68689c24..987456b6d879 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2616,6 +2616,13 @@ static void *cpuset_being_rebound;
* Iterate through each task of @cs updating its mems_allowed to the
* effective cpuset's. As this function is called with cpuset_mutex held,
* cpuset membership stays stable.
+ *
+ * - cpuset_change_task_nodemask(): guarantee_online_mems()
+ * - mpol_rebind_mm(): effective_mems
+ * - cpuset_migrate_mm(): guarantee_online_mems()
+ * - old_mems_allowed: guarantee_online_mems()
+ *
+ * For v2, guarantee_online_mems() should just return effective_mems.
*/
void cpuset_update_tasks_nodemask(struct cpuset *cs)
{
@@ -2625,7 +2632,10 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
cpuset_being_rebound = cs; /* causes mpol_dup() rebind */
- guarantee_online_mems(cs, &newmems);
+ if (cpuset_v2())
+ newmems = cs->effective_mems;
+ else
+ guarantee_online_mems(cs, &newmems);
/*
* The mpol_rebind_mm() call takes mmap_lock, which we couldn't
@@ -2650,7 +2660,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
migrate = is_memory_migrate(cs);
- mpol_rebind_mm(mm, &cs->mems_allowed);
+ mpol_rebind_mm(mm, &cs->effective_mems);
if (migrate)
cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
else
@@ -3148,17 +3158,18 @@ static void cpuset_attach(struct cgroup_taskset *tset)
/*
* In the default hierarchy, enabling cpuset in the child cgroups
- * will trigger a number of cpuset_attach() calls with no change
- * in effective cpus and mems. In that case, we can optimize out
- * by skipping the task iteration and update.
+ * will trigger a cpuset_attach() call with no change in effective cpus
+ * and mems. In that case, we can optimize out by skipping the task
+ * iteration and update.
*/
- if (cpuset_v2() && !cpus_updated && !mems_updated) {
+ if (cpuset_v2()) {
cpuset_attach_nodemask_to = cs->effective_mems;
- goto out;
+ if (!cpus_updated && !mems_updated)
+ goto out;
+ } else {
+ guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
}
- guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
-
cgroup_taskset_for_each(task, css, tset)
cpuset_attach_task(cs, task);
@@ -3168,7 +3179,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* if there is no change in effective_mems and CS_MEMORY_MIGRATE is
* not set.
*/
- cpuset_attach_nodemask_to = cs->effective_mems;
if (!is_memory_migrate(cs) && !mems_updated)
goto out;
@@ -3176,7 +3186,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
struct mm_struct *mm = get_task_mm(leader);
if (mm) {
- mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
+ mpol_rebind_mm(mm, &cs->effective_mems);
/*
* old_mems_allowed is the same with mems_allowed
--
2.54.0
* [PATCH-next v5 2/6] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper
2026-06-02 2:31 [PATCH-next v5 0/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-02 2:31 ` [PATCH-next v5 1/6] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
@ 2026-06-02 2:31 ` Waiman Long
2026-06-02 13:40 ` Ridong Chen
2026-06-02 2:32 ` [PATCH-next v5 3/6] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
` (3 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: Waiman Long @ 2026-06-02 2:31 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Peter Zijlstra
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long
Extract the DL bandwidth allocation code in cpuset_attach() to a new
cpuset_reserve_dl_bw() helper to simplify code.
No functional change is expected.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 53 ++++++++++++++++++++++++------------------
1 file changed, 30 insertions(+), 23 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 987456b6d879..5c1f3ee48d5d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2991,6 +2991,25 @@ static int cpuset_can_attach_check(struct cpuset *cs)
return 0;
}
+static int cpuset_reserve_dl_bw(struct cpuset *cs)
+{
+ int cpu, ret;
+
+ if (!cs->sum_migrate_dl_bw)
+ return 0;
+
+ cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+ if (unlikely(cpu >= nr_cpu_ids))
+ return -EINVAL;
+
+ ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+ if (ret)
+ return ret;
+
+ cs->dl_bw_cpu = cpu;
+ return 0;
+}
+
static void reset_migrate_dl_data(struct cpuset *cs)
{
cs->nr_migrate_dl_tasks = 0;
@@ -3005,7 +3024,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
struct cpuset *cs, *oldcs;
struct task_struct *task;
bool setsched_check;
- int cpu, ret;
+ int ret;
/* used later by cpuset_attach() */
cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
@@ -3061,31 +3080,19 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
}
}
- if (!cs->sum_migrate_dl_bw)
- goto out_success;
-
- cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
- if (unlikely(cpu >= nr_cpu_ids)) {
- ret = -EINVAL;
- goto out_unlock;
- }
-
- ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
- if (ret)
- goto out_unlock;
-
- cs->dl_bw_cpu = cpu;
-
-out_success:
- /*
- * Mark attach is in progress. This makes validate_change() fail
- * changes which zero cpus/mems_allowed.
- */
- cs->attach_in_progress++;
+ ret = cpuset_reserve_dl_bw(cs);
out_unlock:
- if (ret)
+ if (ret) {
reset_migrate_dl_data(cs);
+ } else {
+ /*
+ * Mark attach is in progress. This makes validate_change() fail
+ * changes which zero cpus/mems_allowed.
+ */
+ cs->attach_in_progress++;
+ }
+
mutex_unlock(&cpuset_mutex);
return ret;
}
--
2.54.0
* [PATCH-next v5 3/6] cgroup/cpuset: Expand the scope of cpuset_can_attach_check()
2026-06-02 2:31 [PATCH-next v5 0/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-02 2:31 ` [PATCH-next v5 1/6] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
2026-06-02 2:31 ` [PATCH-next v5 2/6] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
@ 2026-06-02 2:32 ` Waiman Long
2026-06-02 13:51 ` Ridong Chen
2026-06-02 2:32 ` [PATCH-next v5 4/6] cgroup/cpuset: Make cpuset_attach_old_cs track task group leaders Waiman Long
` (2 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: Waiman Long @ 2026-06-02 2:32 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Peter Zijlstra
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long
Expand the scope of cpuset_can_attach_check() by moving the setting
of the setsched flag inside cpuset_can_attach_check() with the new @oldcs
and @psetsched arguments. As cpuset_can_attach_check() is also called
from cpuset_can_fork(), set the new arguments to NULL from that caller.
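For illustration only, the setsched decision that moves into
cpuset_can_attach_check() can be modelled as a small pure function. The
sketch_* names and the bitmap masks are assumptions of this sketch, and
returning false when @oldcs is NULL is a modelling shortcut (the real helper
simply leaves *psetsched untouched in that case).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy cpuset: CPUs and memory nodes as 64-bit bitmaps. */
struct sketch_cpuset {
	uint64_t effective_cpus;
	uint64_t effective_mems;
};

/*
 * Decide whether security_task_setscheduler() needs to be checked.
 * cpuset_v2 and in_v2_mode stand in for cpuset_v2() and is_in_v2_mode().
 */
static bool sketch_need_setsched(const struct sketch_cpuset *cs,
				 const struct sketch_cpuset *oldcs,
				 bool cpuset_v2, bool in_v2_mode)
{
	bool need;

	assert(cs != NULL);

	/* Fork path: there is no source cpuset to compare against. */
	if (!oldcs)
		return false;

	/* In v2, skip the check when neither CPUs nor nodes change. */
	need = !cpuset_v2 ||
	       cs->effective_cpus != oldcs->effective_cpus ||
	       cs->effective_mems != oldcs->effective_mems;

	/* A v1 source cpuset emptied by hotplug: let tasks move out freely. */
	if (!in_v2_mode && oldcs->effective_cpus == 0)
		need = false;

	return need;
}
```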
While at it, expose the source and destination cpuset cpu/memory check
results in the new attach_cpus_updated and attach_mems_updated static
flags so that these flags can be used directly from cpuset_attach()
without the need to do the same computations again.
Two new global attach related flags are added (attach_cpus_updated &
attach_mems_updated) which are set to indicate that CPUs or memory nodes
are updated. These 2 flags are set in cpuset_can_attach() and are used
in cpuset_attach() for optimization. Since cpuset_mutex will be released
between the 2 calls, it is possible that an intervening cpuset action
may change the CPU or node mask of the relevant cpusets, so a check is
added to set these flags if the effective_cpus or effective_mems of
those cpusets is changed.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 52 ++++++++++++++++++++++++------------------
1 file changed, 30 insertions(+), 22 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5c1f3ee48d5d..5c777b1237a8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2982,12 +2982,39 @@ static struct cpuset *cpuset_attach_old_cs;
* For v1, cpus_allowed and mems_allowed can't be empty.
* For v2, effective_cpus can't be empty.
* Note that in v1, effective_cpus = cpus_allowed.
+ *
+ * Also set the boolean flag passed in by @psetsched depending on if
+ * security_task_setscheduler() call is needed and @oldcs is not NULL.
*/
-static int cpuset_can_attach_check(struct cpuset *cs)
+static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
+ bool *psetsched)
{
if (cpumask_empty(cs->effective_cpus) ||
(!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
return -ENOSPC;
+
+ if (!oldcs)
+ return 0;
+
+ /*
+ * Skip rights over task setsched check in v2 when nothing changes,
+ * migration permission derives from hierarchy ownership in
+ * cgroup_procs_write_permission()).
+ */
+ *psetsched = !cpuset_v2() ||
+ !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
+ !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+ /*
+ * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
+ * brings the last online CPU offline as users are not allowed to empty
+ * cpuset.cpus when there are active tasks inside. When that happens,
+ * we should allow tasks to migrate out without security check to make
+ * sure they will be able to run after migration.
+ */
+ if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus))
+ *psetsched = false;
+
return 0;
}
@@ -3034,29 +3061,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
mutex_lock(&cpuset_mutex);
/* Check to see if task is allowed in the cpuset */
- ret = cpuset_can_attach_check(cs);
+ ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
if (ret)
goto out_unlock;
- /*
- * Skip rights over task setsched check in v2 when nothing changes,
- * migration permission derives from hierarchy ownership in
- * cgroup_procs_write_permission()).
- */
- setsched_check = !cpuset_v2() ||
- !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
- !nodes_equal(cs->effective_mems, oldcs->effective_mems);
-
- /*
- * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
- * brings the last online CPU offline as users are not allowed to empty
- * cpuset.cpus when there are active tasks inside. When that happens,
- * we should allow tasks to migrate out without security check to make
- * sure they will be able to run after migration.
- */
- if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus))
- setsched_check = false;
-
cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task);
if (ret)
@@ -3601,7 +3609,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
mutex_lock(&cpuset_mutex);
/* Check to see if task is allowed in the cpuset */
- ret = cpuset_can_attach_check(cs);
+ ret = cpuset_can_attach_check(cs, NULL, NULL);
if (ret)
goto out_unlock;
--
2.54.0
* [PATCH-next v5 4/6] cgroup/cpuset: Make cpuset_attach_old_cs track task group leaders
2026-06-02 2:31 [PATCH-next v5 0/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
` (2 preceding siblings ...)
2026-06-02 2:32 ` [PATCH-next v5 3/6] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
@ 2026-06-02 2:32 ` Waiman Long
2026-06-02 13:58 ` Ridong Chen
2026-06-02 2:32 ` [PATCH-next v5 5/6] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
2026-06-02 2:32 ` [PATCH-next v5 6/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
5 siblings, 1 reply; 15+ messages in thread
From: Waiman Long @ 2026-06-02 2:32 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Peter Zijlstra
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long,
Ridong Chen
There are two possible ways that migration of tasks from multiple source
cpusets to a target cpuset can happen. Either a multithreaded application
with threads in different cpusets is wholly moved to a new cpuset,
or disabling the v2 cpuset controller moves all the tasks in child
cpusets to the parent cpuset.
In the former case, it is the mm setting of the group leader that really
matters. So cpuset_attach_old_cs should track the oldcs of the group
leader. In the latter case, effective_mems of child cpusets must always
be a subset of the parent. So no real page migration will be necessary
no matter which child cpuset is selected as cpuset_attach_old_cs.
IOW, cpuset_attach_old_cs should be updated to match the latest task
group leader in cpuset_can_attach(), but fall back to that of the first
task if there is no group leader in the taskset.
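A minimal sketch of that selection rule, for illustration only: the
sketch_task structure and the integer cpuset IDs are hypothetical stand-ins
for task_struct and task_cs().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy task: which cpuset it currently lives in, and whether it leads its group. */
struct sketch_task {
	int cpuset_id;
	bool is_group_leader;
};

/*
 * Return the cpuset to use as the "old" cpuset: that of the latest group
 * leader seen while iterating, or that of the first task when the taskset
 * contains no group leader at all.
 */
static int sketch_pick_old_cs(const struct sketch_task *tasks, size_t n)
{
	int old_cs;
	size_t i;

	assert(tasks != NULL && n > 0);

	old_cs = tasks[0].cpuset_id;	/* fallback: first task */
	for (i = 0; i < n; i++) {
		if (tasks[i].is_group_leader)
			old_cs = tasks[i].cpuset_id;
	}
	return old_cs;
}
```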
Suggested-by: Ridong Chen <ridong.chen@linux.dev>
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5c777b1237a8..60e8149cc907 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2975,6 +2975,10 @@ static int update_prstate(struct cpuset *cs, int new_prs)
return 0;
}
+/*
+ * cpuset_can_attach() and cpuset_attach() specific internal data
+ * Protected by cpuset_mutex
+ */
static struct cpuset *cpuset_attach_old_cs;
/*
@@ -3065,11 +3069,32 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
if (ret)
goto out_unlock;
+ /*
+ * The cpuset_attach_old_cs is used mainly by cpuset_migrate_mm() to get
+ * the old_mems_allowed value. There are two ways that many-to-one
+ * cpuset migration can happen:
+ * 1) A multithread application with threads in different cpusets is
+ * wholely migrated to a new cpuset.
+ * 2) Disabling v2 cpuset controller will move all the tasks in child
+ * cpusets to the parent cpuset.
+ *
+ * In the former case, it is the mm setting of the group leader that
+ * really matters. So cpuset_attach_old_cs should track the oldcs of the
+ * group leader. It falls back to the oldcs of the first task if there
+ * is no group leader in the taskset. In the latter case, effective_mems
+ * of child cpusets must always be a subset of the parent. So no real
+ * page migration will be necessary no matter which child cpuset is
+ * selected as cpuset_attach_old_cs.
+ */
cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task);
if (ret)
goto out_unlock;
+ /* Update cpuset_attach_old_cs to the latest group leader */
+ if (task == task->group_leader)
+ cpuset_attach_old_cs = task_cs(task);
+
if (setsched_check) {
ret = security_task_setscheduler(task);
if (ret)
--
2.54.0
* [PATCH-next v5 5/6] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task()
2026-06-02 2:31 [PATCH-next v5 0/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
` (3 preceding siblings ...)
2026-06-02 2:32 ` [PATCH-next v5 4/6] cgroup/cpuset: Make cpuset_attach_old_cs track task group leaders Waiman Long
@ 2026-06-02 2:32 ` Waiman Long
2026-06-02 2:32 ` [PATCH-next v5 6/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
5 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2026-06-02 2:32 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Peter Zijlstra
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long
The cpuset_attach_task() was introduced in commit 42a11bf5c543
("cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly")
to enable the CLONE_INTO_CGROUP flag of clone(2) to behave more like
moving a task from one cpuset into another one. That commit didn't
move the mpol_rebind_mm() and cpuset_migrate_mm() calls for group leader
into cpuset_attach_task().
When the CLONE_INTO_CGROUP flag is used without CLONE_THREAD, the new
task is its own group leader. So it is still not equivalent to moving
a task between cpusets in this case. Make CLONE_INTO_CGROUP behave
more like cpuset_attach() by moving the mpol_rebind_mm() and
cpuset_migrate_mm() calls inside cpuset_attach_task(). As a result,
the following static variables will have to be updated in cpuset_fork().
- cpuset_attach_old_cs
- attach_cpus_updated
- attach_mems_updated
- queue_task_work
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 103 ++++++++++++++++++++++++-----------------
1 file changed, 60 insertions(+), 43 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 60e8149cc907..5b5352ec0e69 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2978,8 +2978,13 @@ static int update_prstate(struct cpuset *cs, int new_prs)
/*
* cpuset_can_attach() and cpuset_attach() specific internal data
* Protected by cpuset_mutex
+ *
+ * The attach_cpus_updated/attach_mems_updated flags are set in either
+ * cpuset_attach() or cpuset_fork() and used in cpuset_attach_task().
*/
static struct cpuset *cpuset_attach_old_cs;
+static bool attach_cpus_updated;
+static bool attach_mems_updated;
/*
* Check to see if a cpuset can accept a new task
@@ -3157,9 +3162,12 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
*/
static cpumask_var_t cpus_attach;
static nodemask_t cpuset_attach_nodemask_to;
+static bool queue_task_work;
static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
{
+ struct mm_struct *mm;
+
lockdep_assert_cpuset_lock_held();
if (cs != &top_cpuset)
@@ -3173,28 +3181,60 @@ static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
*/
WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+ if (cpuset_v2() && !attach_mems_updated)
+ return;
+
cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
cpuset1_update_task_spread_flags(cs, task);
+
+ if ((task != task->group_leader) ||
+ (!is_memory_migrate(cs) && !attach_mems_updated))
+ return;
+
+ /*
+ * Change mm for threadgroup leader. This is expensive and may
+ * sleep and should be moved outside migration path proper.
+ */
+ mm = get_task_mm(task);
+ if (mm) {
+ struct cpuset *oldcs = cpuset_attach_old_cs;
+
+ mpol_rebind_mm(mm, &cs->effective_mems);
+
+ /*
+ * old_mems_allowed is the same with mems_allowed
+ * here, except if this task is being moved
+ * automatically due to hotplug. In that case
+ * @mems_allowed has been updated and is empty, so
+ * @old_mems_allowed is the right nodesets that we
+ * migrate mm from.
+ */
+ if (is_memory_migrate(cs)) {
+ cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
+ &cpuset_attach_nodemask_to);
+ queue_task_work = true;
+ } else {
+ mmput(mm);
+ }
+ }
}
static void cpuset_attach(struct cgroup_taskset *tset)
{
struct task_struct *task;
- struct task_struct *leader;
struct cgroup_subsys_state *css;
struct cpuset *cs;
struct cpuset *oldcs = cpuset_attach_old_cs;
- bool cpus_updated, mems_updated;
- bool queue_task_work = false;
cgroup_taskset_first(tset, &css);
cs = css_cs(css);
lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
mutex_lock(&cpuset_mutex);
- cpus_updated = !cpumask_equal(cs->effective_cpus,
- oldcs->effective_cpus);
- mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+ queue_task_work = false;
+
+ attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
+ attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
/*
* In the default hierarchy, enabling cpuset in the child cgroups
@@ -3204,7 +3244,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
*/
if (cpuset_v2()) {
cpuset_attach_nodemask_to = cs->effective_mems;
- if (!cpus_updated && !mems_updated)
+ if (!attach_cpus_updated && !attach_mems_updated)
goto out;
} else {
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
@@ -3213,38 +3253,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
cgroup_taskset_for_each(task, css, tset)
cpuset_attach_task(cs, task);
- /*
- * Change mm for all threadgroup leaders. This is expensive and may
- * sleep and should be moved outside migration path proper. Skip it
- * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
- * not set.
- */
- if (!is_memory_migrate(cs) && !mems_updated)
- goto out;
-
- cgroup_taskset_for_each_leader(leader, css, tset) {
- struct mm_struct *mm = get_task_mm(leader);
-
- if (mm) {
- mpol_rebind_mm(mm, &cs->effective_mems);
-
- /*
- * old_mems_allowed is the same with mems_allowed
- * here, except if this task is being moved
- * automatically due to hotplug. In that case
- * @mems_allowed has been updated and is empty, so
- * @old_mems_allowed is the right nodesets that we
- * migrate mm from.
- */
- if (is_memory_migrate(cs)) {
- cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
- &cpuset_attach_nodemask_to);
- queue_task_work = true;
- } else
- mmput(mm);
- }
- }
-
out:
if (queue_task_work)
schedule_flush_migrate_mm();
@@ -3678,15 +3686,14 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
*/
static void cpuset_fork(struct task_struct *task)
{
- struct cpuset *cs;
- bool same_cs;
+ struct cpuset *cs, *oldcs;
rcu_read_lock();
cs = task_cs(task);
- same_cs = (cs == task_cs(current));
+ oldcs = task_cs(current);
rcu_read_unlock();
- if (same_cs) {
+ if (cs == oldcs) {
if (cs == &top_cpuset)
return;
@@ -3698,7 +3705,17 @@ static void cpuset_fork(struct task_struct *task)
/* CLONE_INTO_CGROUP */
mutex_lock(&cpuset_mutex);
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+ /*
+ * Assume CPUs and memory nodes are updated
+ * A CLONE_INTO_CGROUP operation should have taken the cgroup mutex
+ * and so there shouldn't be a competing cpuset_attach() operation.
+ */
+ attach_cpus_updated = attach_mems_updated = true;
+ queue_task_work = false;
+ cpuset_attach_old_cs = oldcs;
cpuset_attach_task(cs, task);
+ if (queue_task_work)
+ schedule_flush_migrate_mm();
dec_attach_in_progress_locked(cs);
mutex_unlock(&cpuset_mutex);
--
2.54.0
* [PATCH-next v5 6/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
2026-06-02 2:31 [PATCH-next v5 0/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
` (4 preceding siblings ...)
2026-06-02 2:32 ` [PATCH-next v5 5/6] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
@ 2026-06-02 2:32 ` Waiman Long
2026-06-03 10:26 ` [PATCH] cgroup/cpuset: Support multiple source/destination cpusets using pids pattern Ridong Chen
5 siblings, 1 reply; 15+ messages in thread
From: Waiman Long @ 2026-06-02 2:32 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Peter Zijlstra
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long
With cgroup v2, the cgroup_taskset structure passed into the cgroup
can_attach() and attach() methods can contain task migration data with
multiple destination or source cpusets when the cpuset controller is
enabled or disabled respectively.
Since cpuset is threaded in both v1 and v2, another possible way to
cause many-to-one migration is to move the whole process with multiple
threads in different cpuset enabled threaded cgroups into another cpuset
enabled cgroup.
The current cpuset_can_attach() and cpuset_attach() functions still
expect task migration to be from one source cpuset to one destination
cpuset. This has been the case since cpuset was enabled for cgroup v2
in commit 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default
hierarchy").
This problem is less of an issue when enabling the cpuset controller as all
the newly created child cpusets will have exactly the same set of CPUs
and memory nodes except when deadline tasks are involved in migration
as the deadline task accounting data can be off.
It can be more problematic when the cpuset controller is disabled, as
the child cpusets' CPUs and memory nodes may differ from those of their
parent, or when a multi-threaded process is moved from different threaded
cgroups.
Fix that by tracking the set of source (old) and destination cpusets
in singly linked lists and iterating them all to properly update the
internal data. Also keep the current cs and oldcs variables up-to-date
with the css and task iterators.
To ensure proper DL tasks accounting, the nr_migrate_dl_tasks in both
the source and destination cpusets are decremented/incremented with
their values added to nr_deadline_tasks when the migration is successful.
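The tracking scheme can be sketched in userspace C as follows. This is only a
model under stated assumptions: a plain head pointer replaces the kernel's
llist, and the sketch_* names, the in_list flag and the commit helper are
hypothetical illustrations rather than the code in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy cpuset carrying its own list linkage, like attach_node in the patch. */
struct sketch_cs {
	const char *name;
	bool in_list;			/* models attach_node_in_llist */
	struct sketch_cs *next;		/* models attach_node */
	int nr_migrate_dl_tasks;	/* pending delta for this migration */
	int nr_deadline_tasks;		/* committed DL task count */
};

/* Push @cs onto the list once; repeated calls for the same cpuset are no-ops. */
static void sketch_track(struct sketch_cs **head, struct sketch_cs *cs)
{
	assert(head != NULL && cs != NULL);
	if (cs->in_list)
		return;
	cs->next = *head;
	*head = cs;
	cs->in_list = true;
}

/* Record one deadline task moving from @src to @dst. */
static void sketch_note_dl_migration(struct sketch_cs **src_head,
				     struct sketch_cs **dst_head,
				     struct sketch_cs *src,
				     struct sketch_cs *dst)
{
	sketch_track(src_head, src);
	sketch_track(dst_head, dst);
	src->nr_migrate_dl_tasks--;
	dst->nr_migrate_dl_tasks++;
}

/* On success, fold the pending deltas into the counts and empty the list. */
static void sketch_commit(struct sketch_cs **head)
{
	struct sketch_cs *cs = *head;

	while (cs) {
		struct sketch_cs *next = cs->next;

		cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
		cs->nr_migrate_dl_tasks = 0;
		cs->in_list = false;
		cs->next = NULL;
		cs = next;
	}
	*head = NULL;
}

/* Count the cpusets currently tracked on a list. */
static size_t sketch_len(const struct sketch_cs *head)
{
	size_t n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}
```

Keeping the linkage inside each cpuset means no allocation is needed on the
attach path, and the in-list flag keeps a cpuset from being queued twice when
several tasks share the same source or destination.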
Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset-internal.h | 6 +
kernel/cgroup/cpuset.c | 208 ++++++++++++++++++++++++--------
2 files changed, 164 insertions(+), 50 deletions(-)
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index f7aaf01f7cd5..4c2772a7fd5e 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -161,6 +161,12 @@ struct cpuset {
*/
bool remote_partition;
+ /*
+ * cpuset_can_attach() and cpuset_attach() specific data
+ */
+ bool attach_node_in_llist;
+ struct llist_node attach_node;
+
/*
* number of SCHED_DEADLINE tasks attached to this cpuset, so that we
* know when to rebuild associated root domain bandwidth information.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5b5352ec0e69..53a9d3cc8407 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -37,6 +37,7 @@
#include <linux/wait.h>
#include <linux/workqueue.h>
#include <linux/task_work.h>
+#include <linux/llist.h>
DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -2983,6 +2984,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
* cpuset_attach() or cpuset_fork() and used in cpuset_attach_task().
*/
static struct cpuset *cpuset_attach_old_cs;
+static LLIST_HEAD(src_cs_head);
+static LLIST_HEAD(dst_cs_head);
static bool attach_cpus_updated;
static bool attach_mems_updated;
@@ -3005,6 +3008,15 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
if (!oldcs)
return 0;
+ if (!cs->attach_node_in_llist) {
+ llist_add(&cs->attach_node, &dst_cs_head);
+ cs->attach_node_in_llist = true;
+ }
+ if (!oldcs->attach_node_in_llist) {
+ llist_add(&oldcs->attach_node, &src_cs_head);
+ oldcs->attach_node_in_llist = true;
+ }
+
/*
* Skip rights over task setsched check in v2 when nothing changes,
* migration permission derives from hierarchy ownership in
@@ -3027,33 +3039,101 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
return 0;
}
-static int cpuset_reserve_dl_bw(struct cpuset *cs)
+/*
+ * If reset_dl_bw is set, reset the previous dl_bw_alloc() call. Otherwise,
+ * update nr_deadline_tasks according to nr_migrate_dl_tasks in both source
+ * and destination cpusets.
+ */
+static void clear_attach_data(bool reset_dl_bw)
{
+ struct cpuset *cs, *next;
+
+ llist_for_each_entry_safe(cs, next, src_cs_head.first, attach_node) {
+ cs->attach_node.next = NULL;
+ cs->attach_node_in_llist = false;
+ if (cs->nr_migrate_dl_tasks && !reset_dl_bw)
+ cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+ cs->nr_migrate_dl_tasks = 0;
+ }
+
+ llist_for_each_entry_safe(cs, next, dst_cs_head.first, attach_node) {
+ cs->attach_node.next = NULL;
+ cs->attach_node_in_llist = false;
+ if (reset_dl_bw && cs->dl_bw_cpu >= 0)
+ dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
+ if (cs->nr_migrate_dl_tasks && !reset_dl_bw)
+ cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+ cs->nr_migrate_dl_tasks = 0;
+ cs->sum_migrate_dl_bw = 0;
+ cs->dl_bw_cpu = -1;
+ }
+
+ src_cs_head.first = NULL;
+ dst_cs_head.first = NULL;
+}
+
+static int cpuset_reserve_dl_bw(void)
+{
+ struct cpuset *cs;
int cpu, ret;
- if (!cs->sum_migrate_dl_bw)
- return 0;
+ llist_for_each_entry(cs, dst_cs_head.first, attach_node) {
+ if (!cs->sum_migrate_dl_bw)
+ continue;
- cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
- if (unlikely(cpu >= nr_cpu_ids))
- return -EINVAL;
+ cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+ if (unlikely(cpu >= nr_cpu_ids))
+ return -EINVAL;
- ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
- if (ret)
- return ret;
+ ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+ if (ret)
+ return ret;
- cs->dl_bw_cpu = cpu;
+ cs->dl_bw_cpu = cpu;
+ }
return 0;
}
-static void reset_migrate_dl_data(struct cpuset *cs)
+static void set_attach_in_progress(void)
{
- cs->nr_migrate_dl_tasks = 0;
- cs->sum_migrate_dl_bw = 0;
- cs->dl_bw_cpu = -1;
+ struct cpuset *cs;
+
+ /*
+ * Mark attach is in progress. This makes validate_change() fail
+ * changes which zero cpus/mems_allowed.
+ */
+ llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+ cs->attach_in_progress++;
+}
+
+static void reset_attach_in_progress(void)
+{
+ struct cpuset *cs;
+
+ llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+ dec_attach_in_progress_locked(cs);
}
-/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
+/*
+ * Called by cgroups to determine if a cpuset is usable; cpuset_mutex held.
+ *
+ * With cgroup v2, enabling of cpuset controller in a cgroup subtree can
+ * cause @tset to contain task migration data from one parent cpuset to multiple
+ * child cpusets. Not much is needed to be done here other than tracking the
+ * number of DL tasks in each cpuset as the CPUs and memory nodes of the child
+ * cpusets are exactly the same as the parent.
+ *
+ * Conversely, disabling of cpuset controller can cause @tset to contain task
+ * migration data from multiple child cpusets to one parent cpuset. Here, the
+ * CPUs and memory nodes of the child cpusets may be different from the parent,
+ * but must be a subset of its parent.
+ *
+ * Another possible many-to-one migration is the moving of the whole
+ * multithreaded process with threads in different cpusets to another cpuset.
+ *
+ * For all other use cases, @tset task migration data should be from one source
+ * cpuset to one destination cpuset.
+ */
static int cpuset_can_attach(struct cgroup_taskset *tset)
{
struct cgroup_subsys_state *css;
@@ -3092,6 +3172,16 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* selected as cpuset_attach_old_cs.
*/
cgroup_taskset_for_each(task, css, tset) {
+ struct cpuset *newcs = css_cs(css);
+ struct cpuset *new_oldcs = task_cs(task);
+
+ if ((newcs != cs) || (new_oldcs != oldcs)) {
+ cs = newcs;
+ oldcs = new_oldcs;
+ ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
+ if (ret)
+ goto out_unlock;
+ }
ret = task_can_attach(task);
if (ret)
goto out_unlock;
@@ -3113,23 +3203,19 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* contribute to sum_migrate_dl_bw.
*/
cs->nr_migrate_dl_tasks++;
+ oldcs->nr_migrate_dl_tasks--;
if (dl_task_needs_bw_move(task, cs->effective_cpus))
cs->sum_migrate_dl_bw += task->dl.dl_bw;
}
}
- ret = cpuset_reserve_dl_bw(cs);
+ ret = cpuset_reserve_dl_bw();
out_unlock:
- if (ret) {
- reset_migrate_dl_data(cs);
- } else {
- /*
- * Mark attach is in progress. This makes validate_change() fail
- * changes which zero cpus/mems_allowed.
- */
- cs->attach_in_progress++;
- }
+ if (ret)
+ clear_attach_data(true);
+ else
+ set_attach_in_progress();
mutex_unlock(&cpuset_mutex);
return ret;
@@ -3144,14 +3230,8 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
cs = css_cs(css);
mutex_lock(&cpuset_mutex);
- dec_attach_in_progress_locked(cs);
-
- if (cs->dl_bw_cpu >= 0)
- dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
-
- if (cs->nr_migrate_dl_tasks)
- reset_migrate_dl_data(cs);
-
+ reset_attach_in_progress();
+ clear_attach_data(true);
mutex_unlock(&cpuset_mutex);
}
@@ -3224,48 +3304,76 @@ static void cpuset_attach(struct cgroup_taskset *tset)
struct task_struct *task;
struct cgroup_subsys_state *css;
struct cpuset *cs;
- struct cpuset *oldcs = cpuset_attach_old_cs;
+ bool many_cs_to_one = !!src_cs_head.first->next;
cgroup_taskset_first(tset, &css);
- cs = css_cs(css);
lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
mutex_lock(&cpuset_mutex);
queue_task_work = false;
- attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
- attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+ /*
+ * attach_cpus_updated/attach_mems_updated can be set to false if
+ * source and destination masks are the same and there is only one
+ * source cpuset. IOW, a many-cs-to-one migration is always treated as
+ * updated as the tasks to old cpuset mapping is lost.
+ */
+ if (many_cs_to_one) {
+ attach_cpus_updated = true;
+ attach_mems_updated = true;
+ } else {
+ /* one_cs_to_one or one_cs_to_many */
+ struct cpuset *oldcs = cpuset_attach_old_cs;
+ attach_cpus_updated = false;
+ attach_mems_updated = false;
+ llist_for_each_entry(cs, dst_cs_head.first, attach_node) {
+ attach_cpus_updated |= !cpumask_equal(cs->effective_cpus,
+ oldcs->effective_cpus);
+ attach_mems_updated |= !nodes_equal(cs->effective_mems,
+ oldcs->effective_mems);
+ }
+ }
+
+ cs = css_cs(css);
/*
* In the default hierarchy, enabling cpuset in the child cgroups
* will trigger a cpuset_attach() call with no change in effective cpus
* and mems. In that case, we can optimize out by skipping the task
- * iteration and update.
+ * iteration and update, but the destination cpuset list is iterated to
+ * set old_mems_allowed.
*/
if (cpuset_v2()) {
cpuset_attach_nodemask_to = cs->effective_mems;
- if (!attach_cpus_updated && !attach_mems_updated)
+ if (!attach_cpus_updated && !attach_mems_updated) {
+ llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+ cs->old_mems_allowed = cs->effective_mems;
goto out;
+ }
} else {
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
}
- cgroup_taskset_for_each(task, css, tset)
+ cgroup_taskset_for_each(task, css, tset) {
+ struct cpuset *newcs = css_cs(css);
+
+ if (newcs != cs) {
+ cs->old_mems_allowed = cpuset_attach_nodemask_to;
+ cs = newcs;
+ if (cpuset_v2())
+ cpuset_attach_nodemask_to = cs->effective_mems;
+ else
+ guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+ }
cpuset_attach_task(cs, task);
+ }
-out:
if (queue_task_work)
schedule_flush_migrate_mm();
cs->old_mems_allowed = cpuset_attach_nodemask_to;
-
- if (cs->nr_migrate_dl_tasks) {
- cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
- oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
- reset_migrate_dl_data(cs);
- }
-
- dec_attach_in_progress_locked(cs);
-
+out:
+ reset_attach_in_progress();
+ clear_attach_data(false);
mutex_unlock(&cpuset_mutex);
}
--
2.54.0
* Re: [PATCH-next v5 1/6] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach()
2026-06-02 2:31 ` [PATCH-next v5 1/6] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
@ 2026-06-02 13:37 ` Ridong Chen
2026-06-02 18:43 ` Waiman Long
0 siblings, 1 reply; 15+ messages in thread
From: Ridong Chen @ 2026-06-02 13:37 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Peter Zijlstra, ridong.chen
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang
On 2026/6/2 10:31, Waiman Long wrote:
> Whenever memory node mask is changed, there are 4 places where the node
> mask has to be updated or used.
> 1) task's node mask via cpuset_change_task_nodemask()
> 2) memory policy binding via mpol_rebind_mm()
> 3) if memory migration is enabled, migrate from old_mems_allowed to
> the new node mask via cpuset_migrate_mm().
> 4) setting old_mems_allowed
>
> These memory actions are done in cpuset_update_tasks_nodemask() and
> cpuset_attach(). However there are inconsistencies in what node masks
> are being used in these 2 functions.
>
> In cpuset_update_tasks_nodemask(),
> - cpuset_change_task_nodemask(): guarantee_online_mems()
> - mpol_rebind_mm(): mems_allowed
> - cpuset_migrate_mm(): guarantee_online_mems()
> - old_mems_allowed: guarantee_online_mems()
>
> In cpuset_attach(),
> - cpuset_change_task_nodemask(): guarantee_online_mems()
> - mpol_rebind_mm(): effective_mems
> - cpuset_migrate_mm(): effective_mems
> - old_mems_allowed: effective_mems
>
> These inconsistencies dates back to quite a long time ago and it is
> hard to say what should be the correct values.
>
> The guarantee_online_mems() function returns a node mask from current or
> an ancestor cpuset that is a subset of node_states[N_MEMORY]. Nodes in
> node_states[N_MEMORY] are all online, i.e. in node_states[N_ONLINE].
> However, node in node_states[N_ONLINE] may not have memory. So
> node_states[N_MEMORY] should be a subset of node_states[N_ONLINE].
>
> The guarantee_online_mems() function should only be useful for v1 where
> mems_allowed is the same as effective_mems. With v2, the memory nodes
> in effective_mems should always be a subset of node_states[N_MEMORY].
> The only time that may not be true is when a memory hot-unplug operation
> is in progress and a memory node is removed from node_states[N_MEMORY]
> but not yet reflected in effective_mems as cpuset_handle_hotplug()
> has not yet been called from cpuset_track_online_nodes(). When
> cpuset_handle_hotplug() is called later, the memory node setting
> of the relevant cpusets and tasks will be updated. So replacing the
> guarantee_online_mems() call by just using cs->effective_mems should
> be fine.
>
I noticed this pattern in several places:
```
if (cpuset_v2())
newmems = cs->effective_mems;
else
guarantee_online_mems(cs, &newmems);
```
Would it be simpler to move the v2 logic into guarantee_online_mems?
```
static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
{
if (cpuset_v2()) {
*pmask = cs->effective_mems;
return;
}
while (!nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]))
cs = parent_cs(cs);
}
```
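A userspace sketch of that idea, for illustration only: model_nodemask,
struct model_cpuset and the is_v2 / nodes_with_memory parameters are
hypothetical stand-ins for nodemask_t, struct cpuset, cpuset_v2() and
node_states[N_MEMORY]. Unlike the kernel loop, it also stops at the root
so that the model cannot follow a NULL parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long model_nodemask;	/* one bit per node; hypothetical */

/* Hypothetical userspace stand-in for struct cpuset. */
struct model_cpuset {
	struct model_cpuset *parent;
	model_nodemask effective_mems;
};

/*
 * v2: return effective_mems unchanged.
 * v1: walk up the hierarchy until the intersection with the set of
 *     nodes that have memory is non-empty.
 */
static void model_guarantee_online_mems(const struct model_cpuset *cs,
					bool is_v2,
					model_nodemask nodes_with_memory,
					model_nodemask *pmask)
{
	if (is_v2) {
		*pmask = cs->effective_mems;
		return;
	}

	for (;;) {
		*pmask = cs->effective_mems & nodes_with_memory;
		/* Stop on a usable mask, or at the root (model-only guard). */
		if (*pmask || !cs->parent)
			return;
		cs = cs->parent;
	}
}
```

With the branch folded into the helper, callers in this model no longer
need their own v2 check before asking for the node mask.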
> Let use the following setup for both of them and make them consistent.
> - cpuset_change_task_nodemask(): guarantee_online_mems()
> - mpol_rebind_mm(): effective_mems
> - cpuset_migrate_mm(): guarantee_online_mems()
> - old_mems_allowed: guarantee_online_mems()
>
> So for v2, it is effectively all effective_mems. For v1, mpol_rebind_mm()
> uses cpus_allowed which may differ from what guarantee_online_mems()
       ^
       mems_allowed?
> returns.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 32 +++++++++++++++++++++-----------
> 1 file changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 6bdb68689c24..987456b6d879 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2616,6 +2616,13 @@ static void *cpuset_being_rebound;
> * Iterate through each task of @cs updating its mems_allowed to the
> * effective cpuset's. As this function is called with cpuset_mutex held,
> * cpuset membership stays stable.
> + *
> + * - cpuset_change_task_nodemask(): guarantee_online_mems()
> + * - mpol_rebind_mm(): effective_mems
> + * - cpuset_migrate_mm(): guarantee_online_mems()
> + * - old_mems_allowed: guarantee_online_mems()
> + *
> + * For v2, guarantee_online_mems() should just return effective_mems.
I agree, but the implementation is not as simple as what I mentioned above.
> */
> void cpuset_update_tasks_nodemask(struct cpuset *cs)
> {
> @@ -2625,7 +2632,10 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
>
> cpuset_being_rebound = cs; /* causes mpol_dup() rebind */
>
> - guarantee_online_mems(cs, &newmems);
> + if (cpuset_v2())
> + newmems = cs->effective_mems;
> + else
> + guarantee_online_mems(cs, &newmems);
>
> /*
> * The mpol_rebind_mm() call takes mmap_lock, which we couldn't
> @@ -2650,7 +2660,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
>
> migrate = is_memory_migrate(cs);
>
> - mpol_rebind_mm(mm, &cs->mems_allowed);
> + mpol_rebind_mm(mm, &cs->effective_mems);
> if (migrate)
> cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
> else
> @@ -3148,17 +3158,18 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>
> /*
> * In the default hierarchy, enabling cpuset in the child cgroups
> - * will trigger a number of cpuset_attach() calls with no change
> - * in effective cpus and mems. In that case, we can optimize out
> - * by skipping the task iteration and update.
> + * will trigger a cpuset_attach() call with no change in effective cpus
> + * and mems. In that case, we can optimize out by skipping the task
> + * iteration and update.
> */
> - if (cpuset_v2() && !cpus_updated && !mems_updated) {
> + if (cpuset_v2()) {
> cpuset_attach_nodemask_to = cs->effective_mems;
> - goto out;
> + if (!cpus_updated && !mems_updated)
> + goto out;
> + } else {
> + guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
> }
>
> - guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
> -
> cgroup_taskset_for_each(task, css, tset)
> cpuset_attach_task(cs, task);
>
> @@ -3168,7 +3179,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
> * not set.
> */
> - cpuset_attach_nodemask_to = cs->effective_mems;
> if (!is_memory_migrate(cs) && !mems_updated)
> goto out;
>
> @@ -3176,7 +3186,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> struct mm_struct *mm = get_task_mm(leader);
>
> if (mm) {
> - mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
> + mpol_rebind_mm(mm, &cs->effective_mems);
>
> /*
> * old_mems_allowed is the same with mems_allowed
--
Best regards,
Ridong
* Re: [PATCH-next v5 2/6] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper
2026-06-02 2:31 ` [PATCH-next v5 2/6] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
@ 2026-06-02 13:40 ` Ridong Chen
0 siblings, 0 replies; 15+ messages in thread
From: Ridong Chen @ 2026-06-02 13:40 UTC (permalink / raw)
To: Waiman Long, Chen Ridong, Tejun Heo, Johannes Weiner,
Michal Koutný, Peter Zijlstra
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang
On 2026/6/2 10:31, Waiman Long wrote:
> Extract the DL bandwidth allocation code in cpuset_attach() to a new
> cpuset_reserve_dl_bw() helper to simplify code.
>
> No functional change is expected.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
LGTM.
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
> ---
> kernel/cgroup/cpuset.c | 53 ++++++++++++++++++++++++------------------
> 1 file changed, 30 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 987456b6d879..5c1f3ee48d5d 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2991,6 +2991,25 @@ static int cpuset_can_attach_check(struct cpuset *cs)
> return 0;
> }
>
> +static int cpuset_reserve_dl_bw(struct cpuset *cs)
> +{
> + int cpu, ret;
> +
> + if (!cs->sum_migrate_dl_bw)
> + return 0;
> +
> + cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> + if (unlikely(cpu >= nr_cpu_ids))
> + return -EINVAL;
> +
> + ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> + if (ret)
> + return ret;
> +
> + cs->dl_bw_cpu = cpu;
> + return 0;
> +}
> +
> static void reset_migrate_dl_data(struct cpuset *cs)
> {
> cs->nr_migrate_dl_tasks = 0;
> @@ -3005,7 +3024,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> struct cpuset *cs, *oldcs;
> struct task_struct *task;
> bool setsched_check;
> - int cpu, ret;
> + int ret;
>
> /* used later by cpuset_attach() */
> cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
> @@ -3061,31 +3080,19 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> }
> }
>
> - if (!cs->sum_migrate_dl_bw)
> - goto out_success;
> -
> - cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> - if (unlikely(cpu >= nr_cpu_ids)) {
> - ret = -EINVAL;
> - goto out_unlock;
> - }
> -
> - ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> - if (ret)
> - goto out_unlock;
> -
> - cs->dl_bw_cpu = cpu;
> -
> -out_success:
> - /*
> - * Mark attach is in progress. This makes validate_change() fail
> - * changes which zero cpus/mems_allowed.
> - */
> - cs->attach_in_progress++;
> + ret = cpuset_reserve_dl_bw(cs);
>
> out_unlock:
> - if (ret)
> + if (ret) {
> reset_migrate_dl_data(cs);
> + } else {
> + /*
> + * Mark attach is in progress. This makes validate_change() fail
> + * changes which zero cpus/mems_allowed.
> + */
> + cs->attach_in_progress++;
> + }
> +
> mutex_unlock(&cpuset_mutex);
> return ret;
> }
--
Best regards,
Ridong
* Re: [PATCH-next v5 3/6] cgroup/cpuset: Expand the scope of cpuset_can_attach_check()
2026-06-02 2:32 ` [PATCH-next v5 3/6] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
@ 2026-06-02 13:51 ` Ridong Chen
0 siblings, 0 replies; 15+ messages in thread
From: Ridong Chen @ 2026-06-02 13:51 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Peter Zijlstra
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang
On 2026/6/2 10:32, Waiman Long wrote:
> Expand the scope of cpuset_can_attach_check() by including the setting
> of setsched flag inside cpuset_can_attach_check() with the new @oldcs
> and @psetsched argument. As cpuset_can_attach_check() is also called
> from cpuset_can_fork(), set the new arguments to NULL from that caller.
>
Hi Waiman,
The code change itself looks good to me. However, the commit message
has two paragraphs that don't match this patch:
> While at it, expose the source and destination cpuset cpu/memory check
> results in the new attach_cpus_updated and attach_mems_updated static
> flags so that these flags can be used directly from cpuset_attach()
> without the need to do the same computations again.
>
> Two new global attach related flags are added (attach_cpus_updated &
> attach_mems_updated) which are set to indicate that CPUs or memory nodes
> are updated. These 2 flags are set in cpuset_can_attach() and are used
> in cpuset_attach() for optimization. Since cpuset_mutex will be released
> between the 2 calls, it is possible that an intervening cpuset action
> may change the CPU or node mask of the relevant cpusets, so check is
> added to set these flags if the effective_cpus or effective_mems of
> those cpusets is changed.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
Other than that:
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
> ---
> kernel/cgroup/cpuset.c | 52 ++++++++++++++++++++++++------------------
> 1 file changed, 30 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 5c1f3ee48d5d..5c777b1237a8 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2982,12 +2982,39 @@ static struct cpuset *cpuset_attach_old_cs;
> * For v1, cpus_allowed and mems_allowed can't be empty.
> * For v2, effective_cpus can't be empty.
> * Note that in v1, effective_cpus = cpus_allowed.
> + *
> + * Also set the boolean flag passed in by @psetsched depending on if
> + * security_task_setscheduler() call is needed and @oldcs is not NULL.
> */
> -static int cpuset_can_attach_check(struct cpuset *cs)
> +static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
> + bool *psetsched)
> {
> if (cpumask_empty(cs->effective_cpus) ||
> (!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
> return -ENOSPC;
> +
> + if (!oldcs)
> + return 0;
> +
> + /*
> + * Skip rights over task setsched check in v2 when nothing changes,
> + * migration permission derives from hierarchy ownership in
> + * cgroup_procs_write_permission()).
> + */
> + *psetsched = !cpuset_v2() ||
> + !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
> + !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> +
> + /*
> + * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
> + * brings the last online CPU offline as users are not allowed to empty
> + * cpuset.cpus when there are active tasks inside. When that happens,
> + * we should allow tasks to migrate out without security check to make
> + * sure they will be able to run after migration.
> + */
> + if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus))
> + *psetsched = false;
> +
> return 0;
> }
>
> @@ -3034,29 +3061,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> mutex_lock(&cpuset_mutex);
>
> /* Check to see if task is allowed in the cpuset */
> - ret = cpuset_can_attach_check(cs);
> + ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
> if (ret)
> goto out_unlock;
>
> - /*
> - * Skip rights over task setsched check in v2 when nothing changes,
> - * migration permission derives from hierarchy ownership in
> - * cgroup_procs_write_permission()).
> - */
> - setsched_check = !cpuset_v2() ||
> - !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
> - !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> -
> - /*
> - * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
> - * brings the last online CPU offline as users are not allowed to empty
> - * cpuset.cpus when there are active tasks inside. When that happens,
> - * we should allow tasks to migrate out without security check to make
> - * sure they will be able to run after migration.
> - */
> - if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus))
> - setsched_check = false;
> -
> cgroup_taskset_for_each(task, css, tset) {
> ret = task_can_attach(task);
> if (ret)
> @@ -3601,7 +3609,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
> mutex_lock(&cpuset_mutex);
>
> /* Check to see if task is allowed in the cpuset */
> - ret = cpuset_can_attach_check(cs);
> + ret = cpuset_can_attach_check(cs, NULL, NULL);
> if (ret)
> goto out_unlock;
>
--
Best regards,
Ridong
* Re: [PATCH-next v5 4/6] cgroup/cpuset: Make cpuset_attach_old_cs track task group leaders
2026-06-02 2:32 ` [PATCH-next v5 4/6] cgroup/cpuset: Make cpuset_attach_old_cs track task group leaders Waiman Long
@ 2026-06-02 13:58 ` Ridong Chen
0 siblings, 0 replies; 15+ messages in thread
From: Ridong Chen @ 2026-06-02 13:58 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Peter Zijlstra, ridong.chen
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang
On 2026/6/2 10:32, Waiman Long wrote:
> There are two possible ways that migration of tasks from multiple source
> cpusets to a target cpuset can happen. Either a multithread application
> with threads in different cpusets is wholely moved to a new cpuset
                                       ^
                                       wholly
> or disabling of v2 cpuset controller will move all the tasks in child
> cpusets to the parent cpuset.
>
> In the former case, it is the mm setting of the group leader that really
> matters. So cpuset_attach_old_cs should track the oldcs of the thread
> leader. In the latter case, effective_mems of child cpusets must always
> be a subset of the parent. So no real page migration will be necessary
> no matter which child cpuset is selected as cpuset_attach_old_cs.
>
> IOW, cpuset_attach_old_cs should be updated to match the latest task
> group leader in cpuset_can_attach(), but fall back to that of the first
> task if there is no group leader in the taskset.
>
> Suggested-by: Ridong Chen <ridong.chen@linux.dev>
> Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
> ---
> kernel/cgroup/cpuset.c | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 5c777b1237a8..60e8149cc907 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2975,6 +2975,10 @@ static int update_prstate(struct cpuset *cs, int new_prs)
> return 0;
> }
>
> +/*
> + * cpuset_can_attach() and cpuset_attach() specific internal data
> + * Protected by cpuset_mutex
> + */
> static struct cpuset *cpuset_attach_old_cs;
>
> /*
> @@ -3065,11 +3069,32 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> if (ret)
> goto out_unlock;
>
> + /*
> + * The cpuset_attach_old_cs is used mainly by cpuset_migrate_mm() to get
> + * the old_mems_allowed value. There are two ways that many-to-one
> + * cpuset migration can happen:
> + * 1) A multithread application with threads in different cpusets is
> + * wholely migrated to a new cpuset.
> + * 2) Disabling v2 cpuset controller will move all the tasks in child
> + * cpusets to the parent cpuset.
> + *
> + * In the former case, it is the mm setting of the group leader that
> + * really matters. So cpuset_attach_old_cs should track the oldcs of the
> + * group leader. It falls back to the oldcs of the first task if there
> + * is no group leader in the taskset. In the latter case, effective_mems
> + * of child cpusets must always be a subset of the parent. So no real
> + * page migration will be necessary no matter which child cpuset is
> + * selected as cpuset_attach_old_cs.
> + */
> cgroup_taskset_for_each(task, css, tset) {
> ret = task_can_attach(task);
> if (ret)
> goto out_unlock;
>
> + /* Update cpuset_attach_old_cs to the latest group leader */
> + if (task == task->group_leader)
> + cpuset_attach_old_cs = task_cs(task);
> +
> if (setsched_check) {
> ret = security_task_setscheduler(task);
> if (ret)
--
Best regards,
Ridong
* Re: [PATCH-next v5 1/6] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach()
2026-06-02 13:37 ` Ridong Chen
@ 2026-06-02 18:43 ` Waiman Long
0 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2026-06-02 18:43 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Peter Zijlstra
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang
On 6/2/26 9:37 AM, Ridong Chen wrote:
>
> On 2026/6/2 10:31, Waiman Long wrote:
>> Whenever memory node mask is changed, there are 4 places where the node
>> mask has to be updated or used.
>> 1) task's node mask via cpuset_change_task_nodemask()
>> 2) memory policy binding via mpol_rebind_mm()
>> 3) if memory migration is enabled, migrate from old_mems_allowed to
>> the new node mask via cpuset_migrate_mm().
>> 4) setting old_mems_allowed
>>
>> These memory actions are done in cpuset_update_tasks_nodemask() and
>> cpuset_attach(). However there are inconsistencies in what node masks
>> are being used in these 2 functions.
>>
>> In cpuset_update_tasks_nodemask(),
>> - cpuset_change_task_nodemask(): guarantee_online_mems()
>> - mpol_rebind_mm(): mems_allowed
>> - cpuset_migrate_mm(): guarantee_online_mems()
>> - old_mems_allowed: guarantee_online_mems()
>>
>> In cpuset_attach(),
>> - cpuset_change_task_nodemask(): guarantee_online_mems()
>> - mpol_rebind_mm(): effective_mems
>> - cpuset_migrate_mm(): effective_mems
>> - old_mems_allowed: effective_mems
>>
>> These inconsistencies dates back to quite a long time ago and it is
>> hard to say what should be the correct values.
>>
>> The guarantee_online_mems() function returns a node mask from current or
>> an ancestor cpuset that is a subset of node_states[N_MEMORY]. Nodes in
>> node_states[N_MEMORY] are all online, i.e. in node_states[N_ONLINE].
>> However, node in node_states[N_ONLINE] may not have memory. So
>> node_states[N_MEMORY] should be a subset of node_states[N_ONLINE].
>>
>> The guarantee_online_mems() function should only be useful for v1 where
>> mems_allowed is the same as effective_mems. With v2, the memory nodes
>> in effective_mems should always be a subset of node_states[N_MEMORY].
>> The only time that may not be true is when a memory hot-unplug operation
>> is in progress and a memory node is removed from node_states[N_MEMORY]
>> but not yet reflected in effective_mems as cpuset_handle_hotplug()
>> has not yet been called from cpuset_track_online_nodes(). When
>> cpuset_handle_hotplug() is called later, the memory node setting
>> of the relevant cpusets and tasks will be updated. So replacing the
>> guarantee_online_mems() call by just using cs->effective_mems should
>> be fine.
>>
> I noticed this pattern in several places:
>
> ```
> if (cpuset_v2())
> newmems = cs->effective_mems;
> else
> guarantee_online_mems(cs, &newmems);
> ```
>
> Would it be simpler to move the v2 logic into guarantee_online_mems?
>
> ```
> static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
> {
> if (cpuset_v2()) {
> *pmask = cs->effective_mems;
> return;
> }
> while (!nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]))
> cs = parent_cs(cs);
> }
> ```
Yes, it makes sense to put it directly into guarantee_online_mems().
>> Let use the following setup for both of them and make them consistent.
>> - cpuset_change_task_nodemask(): guarantee_online_mems()
>> - mpol_rebind_mm(): effective_mems
>> - cpuset_migrate_mm(): guarantee_online_mems()
>> - old_mems_allowed: guarantee_online_mems()
>>
>> So for v2, it is effectively all effective_mems. For v1, mpol_rebind_mm()
>> uses cpus_allowed which may differ from what guarantee_online_mems()
> ^
> mems_allowed?
Thanks for catching it. Will fix that and update your email address in
the next version.
Cheers,
Longman
>> returns.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/cgroup/cpuset.c | 32 +++++++++++++++++++++-----------
>> 1 file changed, 21 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 6bdb68689c24..987456b6d879 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -2616,6 +2616,13 @@ static void *cpuset_being_rebound;
>> * Iterate through each task of @cs updating its mems_allowed to the
>> * effective cpuset's. As this function is called with cpuset_mutex held,
>> * cpuset membership stays stable.
>> + *
>> + * - cpuset_change_task_nodemask(): guarantee_online_mems()
>> + * - mpol_rebind_mm(): effective_mems
>> + * - cpuset_migrate_mm(): guarantee_online_mems()
>> + * - old_mems_allowed: guarantee_online_mems()
>> + *
>> + * For v2, guarantee_online_mems() should just return effective_mems.
> I agree, but the implementation is not as simple as what I mentioned above.
>
>> */
>> void cpuset_update_tasks_nodemask(struct cpuset *cs)
>> {
>> @@ -2625,7 +2632,10 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
>>
>> cpuset_being_rebound = cs; /* causes mpol_dup() rebind */
>>
>> - guarantee_online_mems(cs, &newmems);
>> + if (cpuset_v2())
>> + newmems = cs->effective_mems;
>> + else
>> + guarantee_online_mems(cs, &newmems);
>>
>> /*
>> * The mpol_rebind_mm() call takes mmap_lock, which we couldn't
>> @@ -2650,7 +2660,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
>>
>> migrate = is_memory_migrate(cs);
>>
>> - mpol_rebind_mm(mm, &cs->mems_allowed);
>> + mpol_rebind_mm(mm, &cs->effective_mems);
>> if (migrate)
>> cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
>> else
>> @@ -3148,17 +3158,18 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>>
>> /*
>> * In the default hierarchy, enabling cpuset in the child cgroups
>> - * will trigger a number of cpuset_attach() calls with no change
>> - * in effective cpus and mems. In that case, we can optimize out
>> - * by skipping the task iteration and update.
>> + * will trigger a cpuset_attach() call with no change in effective cpus
>> + * and mems. In that case, we can optimize out by skipping the task
>> + * iteration and update.
>> */
>> - if (cpuset_v2() && !cpus_updated && !mems_updated) {
>> + if (cpuset_v2()) {
>> cpuset_attach_nodemask_to = cs->effective_mems;
>> - goto out;
>> + if (!cpus_updated && !mems_updated)
>> + goto out;
>> + } else {
>> + guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
>> }
>>
>> - guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
>> -
>> cgroup_taskset_for_each(task, css, tset)
>> cpuset_attach_task(cs, task);
>>
>> @@ -3168,7 +3179,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>> * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
>> * not set.
>> */
>> - cpuset_attach_nodemask_to = cs->effective_mems;
>> if (!is_memory_migrate(cs) && !mems_updated)
>> goto out;
>>
>> @@ -3176,7 +3186,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>> struct mm_struct *mm = get_task_mm(leader);
>>
>> if (mm) {
>> - mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
>> + mpol_rebind_mm(mm, &cs->effective_mems);
>>
>> /*
>> * old_mems_allowed is the same with mems_allowed
* [PATCH] cgroup/cpuset: Support multiple source/destination cpusets using pids pattern
2026-06-02 2:32 ` [PATCH-next v5 6/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
@ 2026-06-03 10:26 ` Ridong Chen
2026-06-03 10:32 ` Ridong Chen
2026-06-03 18:47 ` Waiman Long
0 siblings, 2 replies; 15+ messages in thread
From: Ridong Chen @ 2026-06-03 10:26 UTC (permalink / raw)
To: Waiman Long
Cc: cgroups, Tejun Heo, Johannes Weiner, ridong.chen, linux-kernel
The current cpuset_can_attach() and cpuset_attach() functions assume that
task migration goes from one source cpuset to one destination cpuset. This
assumption is wrong in several scenarios:
- Moving a multi-threaded process with threads in different cpusets
- Disabling the cpuset controller (many children to one parent)
- Enabling the cpuset controller (one parent to many children)

Fix this by adopting the pids subsystem's per-task accounting pattern.
In cpuset_can_attach(), use task_cs(task) to get the correct source cpuset
for each task (like pids_can_attach uses task_css), adjust nr_deadline_tasks
and reserve DL bandwidth per-task, and increment attach_in_progress per-task
on the destination cpuset. In cpuset_attach(), handle destination cpuset
changes within the task iteration loop.

A shared helper cpuset_undo_attach() reverses the per-task operations for
both partial rollback in cpuset_can_attach() and full reversal in
cpuset_cancel_attach().

When multiple source cpusets are detected in can_attach(), set
attach_many_sources so that cpuset_attach() forces cpus_updated and
mems_updated to true, ensuring all tasks get properly updated regardless
of which source cpuset cpuset_attach_old_cs points to.

This eliminates the need for the nr_migrate_dl_tasks, sum_migrate_dl_bw, and
dl_bw_cpu fields in struct cpuset.

Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
Signed-off-by: Ridong Chen <ridong.chen@linux.dev>
---
kernel/cgroup/cpuset-internal.h | 8 --
kernel/cgroup/cpuset.c | 177 ++++++++++++++++----------------
2 files changed, 89 insertions(+), 96 deletions(-)
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index f7aaf01f7cd5..601e38b3c75b 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -166,14 +166,6 @@ struct cpuset {
* know when to rebuild associated root domain bandwidth information.
*/
int nr_deadline_tasks;
- int nr_migrate_dl_tasks;
- /* DL bandwidth that needs destination reservation for this attach. */
- u64 sum_migrate_dl_bw;
- /*
- * CPU used for temporary DL bandwidth allocation during attach;
- * -1 if no DL bandwidth was allocated in the current attach.
- */
- int dl_bw_cpu;
/* Invalid partition error code, not lock protected */
enum prs_errcode prs_err;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e52a5a40d607..be222eb6078c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -288,7 +288,6 @@ struct cpuset top_cpuset = {
.flags = BIT(CS_CPU_EXCLUSIVE) |
BIT(CS_MEM_EXCLUSIVE) | BIT(CS_SCHED_LOAD_BALANCE),
.partition_root_state = PRS_ROOT,
- .dl_bw_cpu = -1,
};
/**
@@ -580,8 +579,6 @@ static struct cpuset *dup_or_alloc_cpuset(struct cpuset *cs)
if (!trial)
return NULL;
- trial->dl_bw_cpu = -1;
-
/* Setup cpumask pointer array */
cpumask_var_t *pmask[4] = {
&trial->cpus_allowed,
@@ -2984,6 +2981,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
static struct cpuset *cpuset_attach_old_cs;
static bool attach_cpus_updated;
static bool attach_mems_updated;
+static bool attach_many_sources;
/*
* Check to see if a cpuset can accept a new task
@@ -3026,30 +3024,36 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
return 0;
}
-static int cpuset_reserve_dl_bw(struct cpuset *cs)
+/*
+ * Reverse per-task operations done in cpuset_can_attach().
+ * If @stop_at is non-NULL, only undo tasks before it (partial rollback).
+ * If @stop_at is NULL, undo all tasks (full reversal for cancel_attach).
+ * Must be called with cpuset_mutex held.
+ */
+static void cpuset_undo_attach(struct cgroup_taskset *tset,
+ struct task_struct *stop_at)
{
- int cpu, ret;
-
- if (!cs->sum_migrate_dl_bw)
- return 0;
-
- cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
- if (unlikely(cpu >= nr_cpu_ids))
- return -EINVAL;
+ struct cgroup_subsys_state *css;
+ struct task_struct *task;
- ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
- if (ret)
- return ret;
+ cgroup_taskset_for_each(task, css, tset) {
+ struct cpuset *cs = css_cs(css);
+ struct cpuset *oldcs = task_cs(task);
- cs->dl_bw_cpu = cpu;
- return 0;
-}
+ if (task == stop_at)
+ break;
-static void reset_migrate_dl_data(struct cpuset *cs)
-{
- cs->nr_migrate_dl_tasks = 0;
- cs->sum_migrate_dl_bw = 0;
- cs->dl_bw_cpu = -1;
+ if (dl_task(task)) {
+ cs->nr_deadline_tasks--;
+ oldcs->nr_deadline_tasks++;
+ if (dl_task_needs_bw_move(task, cs->effective_cpus)) {
+ int cpu = cpumask_any_and(cpu_active_mask,
+ cs->effective_cpus);
+ dl_bw_free(cpu, task->dl.dl_bw);
+ }
+ }
+ dec_attach_in_progress_locked(cs);
+ }
}
/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
@@ -3061,96 +3065,79 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
bool setsched_check;
int ret;
- /* used later by cpuset_attach() */
cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
oldcs = cpuset_attach_old_cs;
cs = css_cs(css);
mutex_lock(&cpuset_mutex);
+ attach_many_sources = false;
- /* Check to see if task is allowed in the cpuset */
ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
if (ret)
goto out_unlock;
- /*
- * The cpuset_attach_old_cs is used mainly by cpuset_migrate_mm() to get
- * the old_mems_allowed value. There are two ways that many-to-one
- * cpuset migration can happen:
- * 1) A multithread application with threads in different cpusets is
- * wholely migrated to a new cpuset.
- * 2) Disabling v2 cpuset controller will move all the tasks in child
- * cpusets to the parent cpuset.
- *
- * In the former case, it is the mm setting of the group leader that
- * really matters. So cpuset_attach_old_cs should track the oldcs of the
- * group leader. It falls back to the oldcs of the first task if there
- * is no group leader in the taskset. In the latter case, effective_mems
- * of child cpusets must always be a subset of the parent. So no real
- * page migration will be necessary no matter which child cpuset is
- * selected as cpuset_attach_old_cs.
- */
cgroup_taskset_for_each(task, css, tset) {
+ struct cpuset *newcs = css_cs(css);
+ struct cpuset *new_oldcs = task_cs(task);
+
+ if (newcs != cs || new_oldcs != oldcs) {
+ if (new_oldcs != oldcs)
+ attach_many_sources = true;
+ cs = newcs;
+ oldcs = new_oldcs;
+ ret = cpuset_can_attach_check(cs, oldcs,
+ &setsched_check);
+ if (ret)
+ goto out_rollback;
+ }
+
ret = task_can_attach(task);
if (ret)
- goto out_unlock;
+ goto out_rollback;
- /* Update cpuset_attach_old_cs to the latest group leader */
if (task == task->group_leader)
cpuset_attach_old_cs = task_cs(task);
if (setsched_check) {
ret = security_task_setscheduler(task);
if (ret)
- goto out_unlock;
+ goto out_rollback;
}
if (dl_task(task)) {
- /*
- * Count all migrating DL tasks for cpuset task accounting.
- * Only tasks that need a root-domain bandwidth move
- * contribute to sum_migrate_dl_bw.
- */
- cs->nr_migrate_dl_tasks++;
- if (dl_task_needs_bw_move(task, cs->effective_cpus))
- cs->sum_migrate_dl_bw += task->dl.dl_bw;
+ cs->nr_deadline_tasks++;
+ oldcs->nr_deadline_tasks--;
+
+ if (dl_task_needs_bw_move(task, cs->effective_cpus)) {
+ int cpu = cpumask_any_and(cpu_active_mask,
+ cs->effective_cpus);
+ if (unlikely(cpu >= nr_cpu_ids)) {
+ ret = -EINVAL;
+ goto out_rollback;
+ }
+ ret = dl_bw_alloc(cpu, task->dl.dl_bw);
+ if (ret)
+ goto out_rollback;
+ }
}
- }
-
- ret = cpuset_reserve_dl_bw(cs);
-out_unlock:
- if (ret) {
- reset_migrate_dl_data(cs);
- } else {
- /*
- * Mark attach is in progress. This makes validate_change() fail
- * changes which zero cpus/mems_allowed.
- */
cs->attach_in_progress++;
}
+ goto out_unlock;
+
+out_rollback:
+ cpuset_undo_attach(tset, task);
+
+out_unlock:
mutex_unlock(&cpuset_mutex);
return ret;
}
static void cpuset_cancel_attach(struct cgroup_taskset *tset)
{
- struct cgroup_subsys_state *css;
- struct cpuset *cs;
-
- cgroup_taskset_first(tset, &css);
- cs = css_cs(css);
-
mutex_lock(&cpuset_mutex);
- dec_attach_in_progress_locked(cs);
-
- if (cs->dl_bw_cpu >= 0)
- dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
-
- if (cs->nr_migrate_dl_tasks)
- reset_migrate_dl_data(cs);
-
+ cpuset_undo_attach(tset, NULL);
mutex_unlock(&cpuset_mutex);
}
@@ -3232,8 +3219,15 @@ static void cpuset_attach(struct cgroup_taskset *tset)
mutex_lock(&cpuset_mutex);
queue_task_work = false;
- attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
- attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+ if (attach_many_sources) {
+ attach_cpus_updated = true;
+ attach_mems_updated = true;
+ } else {
+ attach_cpus_updated = !cpumask_equal(cs->effective_cpus,
+ oldcs->effective_cpus);
+ attach_mems_updated = !nodes_equal(cs->effective_mems,
+ oldcs->effective_mems);
+ }
/*
* In the default hierarchy, enabling cpuset in the child cgroups
@@ -3249,21 +3243,28 @@ static void cpuset_attach(struct cgroup_taskset *tset)
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
}
- cgroup_taskset_for_each(task, css, tset)
+ cgroup_taskset_for_each(task, css, tset) {
+ struct cpuset *newcs = css_cs(css);
+
+ if (newcs != cs) {
+ cs->old_mems_allowed = cpuset_attach_nodemask_to;
+ cs = newcs;
+ if (cpuset_v2())
+ cpuset_attach_nodemask_to = cs->effective_mems;
+ else
+ guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+ }
cpuset_attach_task(cs, task);
+ }
out:
if (queue_task_work)
schedule_flush_migrate_mm();
cs->old_mems_allowed = cpuset_attach_nodemask_to;
- if (cs->nr_migrate_dl_tasks) {
- cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
- oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
- reset_migrate_dl_data(cs);
- }
-
- dec_attach_in_progress_locked(cs);
+ /* Decrement per-task attach_in_progress */
+ cgroup_taskset_for_each(task, css, tset)
+ dec_attach_in_progress_locked(css_cs(css));
mutex_unlock(&cpuset_mutex);
}
--
2.43.0
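The per-task reserve and partial-rollback pattern used by the patch above can be sketched in isolation. This is a user-space model under stated assumptions, not the patch's code: `struct toy_task`, `struct toy_dest`, `toy_can_attach()`, and `toy_undo_attach()` are hypothetical names, bandwidth is a plain counter, and the rollback boundary is an array index rather than the `stop_at` task pointer the patch uses.

```c
#include <assert.h>
#include <stddef.h>

/* Toy task: only the deadline bandwidth it needs (0 for non-DL tasks). */
struct toy_task {
	unsigned long dl_bw;
};

/* Toy destination cpuset: free bandwidth and the in-progress counter. */
struct toy_dest {
	unsigned long bw_free;
	int attach_in_progress;
};

/* Undo the per-task operations for the first n tasks only. */
static void toy_undo_attach(struct toy_dest *dst,
			    const struct toy_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		dst->bw_free += tasks[i].dl_bw;
		dst->attach_in_progress--;
	}
}

/*
 * Reserve bandwidth task by task. On the first failure, roll back
 * everything reserved so far and report an error.
 */
static int toy_can_attach(struct toy_dest *dst,
			  const struct toy_task *tasks, size_t ntasks)
{
	for (size_t i = 0; i < ntasks; i++) {
		if (tasks[i].dl_bw > dst->bw_free) {
			toy_undo_attach(dst, tasks, i);	/* tasks before i */
			return -1;
		}
		dst->bw_free -= tasks[i].dl_bw;
		dst->attach_in_progress++;
	}
	return 0;
}
```

Calling `toy_undo_attach()` with the full task count models the full reversal done for a cancelled attach, while a smaller count models the partial rollback after a mid-loop failure.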
* Re: [PATCH] cgroup/cpuset: Support multiple source/destination cpusets using pids pattern
2026-06-03 10:26 ` [PATCH] cgroup/cpuset: Support multiple source/destination cpusets using pids pattern Ridong Chen
@ 2026-06-03 10:32 ` Ridong Chen
2026-06-03 18:47 ` Waiman Long
1 sibling, 0 replies; 15+ messages in thread
From: Ridong Chen @ 2026-06-03 10:32 UTC (permalink / raw)
To: Waiman Long; +Cc: cgroups, Tejun Heo, Johannes Weiner, linux-kernel
Hi Longman,

I used AI to generate a patch that fixes this issue, following the same
approach as the pids subsystem. I think this may be much simpler. Just a
heads-up: this patch is only for discussion and hasn't been tested.

On 2026/6/3 18:26, Ridong Chen wrote:
> The current cpuset_can_attach() and cpuset_attach() functions assume task
> migration is from one source cpuset to one destination cpuset. This can be
> wrong in several scenarios:
> - Moving a multi-threaded process with threads in different cpusets
> - Disabling the cpuset controller (many children to one parent)
> - Enabling the cpuset controller (one parent to many children)
>
> Fix this by adopting the pids subsystem's per-task accounting pattern.
> In cpuset_can_attach(), use task_cs(task) to get the correct source cpuset
> for each task (like pids_can_attach uses task_css), adjust nr_deadline_tasks
> and reserve DL bandwidth per-task, and increment attach_in_progress per-task
> on the destination cpuset. In cpuset_attach(), handle destination cpuset
> changes within the task iteration loop.
>
> A shared helper cpuset_undo_attach() reverses the per-task operations for
> both partial rollback in cpuset_can_attach() and full reversal in
> cpuset_cancel_attach().
>
> When multiple source cpusets are detected in can_attach(), set
> attach_many_sources so that cpuset_attach() forces cpus_updated and
> mems_updated to true, ensuring all tasks get properly updated regardless
> of which source cpuset cpuset_attach_old_cs points to.
>
> This eliminates the need for nr_migrate_dl_tasks, sum_migrate_dl_bw, and
> dl_bw_cpu fields in struct cpuset.
>
> Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
> Signed-off-by: Ridong Chen <ridong.chen@linux.dev>
> ---
> kernel/cgroup/cpuset-internal.h | 8 --
> kernel/cgroup/cpuset.c | 177 ++++++++++++++++----------------
> 2 files changed, 89 insertions(+), 96 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index f7aaf01f7cd5..601e38b3c75b 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -166,14 +166,6 @@ struct cpuset {
> * know when to rebuild associated root domain bandwidth information.
> */
> int nr_deadline_tasks;
> - int nr_migrate_dl_tasks;
> - /* DL bandwidth that needs destination reservation for this attach. */
> - u64 sum_migrate_dl_bw;
> - /*
> - * CPU used for temporary DL bandwidth allocation during attach;
> - * -1 if no DL bandwidth was allocated in the current attach.
> - */
> - int dl_bw_cpu;
>
> /* Invalid partition error code, not lock protected */
> enum prs_errcode prs_err;
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index e52a5a40d607..be222eb6078c 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -288,7 +288,6 @@ struct cpuset top_cpuset = {
> .flags = BIT(CS_CPU_EXCLUSIVE) |
> BIT(CS_MEM_EXCLUSIVE) | BIT(CS_SCHED_LOAD_BALANCE),
> .partition_root_state = PRS_ROOT,
> - .dl_bw_cpu = -1,
> };
>
> /**
> @@ -580,8 +579,6 @@ static struct cpuset *dup_or_alloc_cpuset(struct cpuset *cs)
> if (!trial)
> return NULL;
>
> - trial->dl_bw_cpu = -1;
> -
> /* Setup cpumask pointer array */
> cpumask_var_t *pmask[4] = {
> &trial->cpus_allowed,
> @@ -2984,6 +2981,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
> static struct cpuset *cpuset_attach_old_cs;
> static bool attach_cpus_updated;
> static bool attach_mems_updated;
> +static bool attach_many_sources;
>
> /*
> * Check to see if a cpuset can accept a new task
> @@ -3026,30 +3024,36 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
> return 0;
> }
>
> -static int cpuset_reserve_dl_bw(struct cpuset *cs)
> +/*
> + * Reverse per-task operations done in cpuset_can_attach().
> + * If @stop_at is non-NULL, only undo tasks before it (partial rollback).
> + * If @stop_at is NULL, undo all tasks (full reversal for cancel_attach).
> + * Must be called with cpuset_mutex held.
> + */
> +static void cpuset_undo_attach(struct cgroup_taskset *tset,
> + struct task_struct *stop_at)
> {
> - int cpu, ret;
> -
> - if (!cs->sum_migrate_dl_bw)
> - return 0;
> -
> - cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> - if (unlikely(cpu >= nr_cpu_ids))
> - return -EINVAL;
> + struct cgroup_subsys_state *css;
> + struct task_struct *task;
>
> - ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> - if (ret)
> - return ret;
> + cgroup_taskset_for_each(task, css, tset) {
> + struct cpuset *cs = css_cs(css);
> + struct cpuset *oldcs = task_cs(task);
>
> - cs->dl_bw_cpu = cpu;
> - return 0;
> -}
> + if (task == stop_at)
> + break;
>
> -static void reset_migrate_dl_data(struct cpuset *cs)
> -{
> - cs->nr_migrate_dl_tasks = 0;
> - cs->sum_migrate_dl_bw = 0;
> - cs->dl_bw_cpu = -1;
> + if (dl_task(task)) {
> + cs->nr_deadline_tasks--;
> + oldcs->nr_deadline_tasks++;
> + if (dl_task_needs_bw_move(task, cs->effective_cpus)) {
> + int cpu = cpumask_any_and(cpu_active_mask,
> + cs->effective_cpus);
> + dl_bw_free(cpu, task->dl.dl_bw);
> + }
> + }
> + dec_attach_in_progress_locked(cs);
> + }
> }
>
> /* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
> @@ -3061,96 +3065,79 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> bool setsched_check;
> int ret;
>
> - /* used later by cpuset_attach() */
> cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
> oldcs = cpuset_attach_old_cs;
> cs = css_cs(css);
>
> mutex_lock(&cpuset_mutex);
> + attach_many_sources = false;
>
> - /* Check to see if task is allowed in the cpuset */
> ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
> if (ret)
> goto out_unlock;
>
> - /*
> - * The cpuset_attach_old_cs is used mainly by cpuset_migrate_mm() to get
> - * the old_mems_allowed value. There are two ways that many-to-one
> - * cpuset migration can happen:
> - * 1) A multithread application with threads in different cpusets is
> - * wholely migrated to a new cpuset.
> - * 2) Disabling v2 cpuset controller will move all the tasks in child
> - * cpusets to the parent cpuset.
> - *
> - * In the former case, it is the mm setting of the group leader that
> - * really matters. So cpuset_attach_old_cs should track the oldcs of the
> - * group leader. It falls back to the oldcs of the first task if there
> - * is no group leader in the taskset. In the latter case, effective_mems
> - * of child cpusets must always be a subset of the parent. So no real
> - * page migration will be necessary no matter which child cpuset is
> - * selected as cpuset_attach_old_cs.
> - */
> cgroup_taskset_for_each(task, css, tset) {
> + struct cpuset *newcs = css_cs(css);
> + struct cpuset *new_oldcs = task_cs(task);
> +
> + if (newcs != cs || new_oldcs != oldcs) {
> + if (new_oldcs != oldcs)
> + attach_many_sources = true;
> + cs = newcs;
> + oldcs = new_oldcs;
> + ret = cpuset_can_attach_check(cs, oldcs,
> + &setsched_check);
> + if (ret)
> + goto out_rollback;
> + }
> +
> ret = task_can_attach(task);
> if (ret)
> - goto out_unlock;
> + goto out_rollback;
>
> - /* Update cpuset_attach_old_cs to the latest group leader */
> if (task == task->group_leader)
> cpuset_attach_old_cs = task_cs(task);
>
> if (setsched_check) {
> ret = security_task_setscheduler(task);
> if (ret)
> - goto out_unlock;
> + goto out_rollback;
> }
>
> if (dl_task(task)) {
> - /*
> - * Count all migrating DL tasks for cpuset task accounting.
> - * Only tasks that need a root-domain bandwidth move
> - * contribute to sum_migrate_dl_bw.
> - */
> - cs->nr_migrate_dl_tasks++;
> - if (dl_task_needs_bw_move(task, cs->effective_cpus))
> - cs->sum_migrate_dl_bw += task->dl.dl_bw;
> + cs->nr_deadline_tasks++;
> + oldcs->nr_deadline_tasks--;
> +
> + if (dl_task_needs_bw_move(task, cs->effective_cpus)) {
> + int cpu = cpumask_any_and(cpu_active_mask,
> + cs->effective_cpus);
> + if (unlikely(cpu >= nr_cpu_ids)) {
> + ret = -EINVAL;
> + goto out_rollback;
> + }
> + ret = dl_bw_alloc(cpu, task->dl.dl_bw);
> + if (ret)
> + goto out_rollback;
> + }
> }
> - }
> -
> - ret = cpuset_reserve_dl_bw(cs);
>
> -out_unlock:
> - if (ret) {
> - reset_migrate_dl_data(cs);
> - } else {
> - /*
> - * Mark attach is in progress. This makes validate_change() fail
> - * changes which zero cpus/mems_allowed.
> - */
> cs->attach_in_progress++;
> }
>
> + goto out_unlock;
> +
> +out_rollback:
> + cpuset_undo_attach(tset, task);
> +
> +out_unlock:
> mutex_unlock(&cpuset_mutex);
> return ret;
> }
>
> static void cpuset_cancel_attach(struct cgroup_taskset *tset)
> {
> - struct cgroup_subsys_state *css;
> - struct cpuset *cs;
> -
> - cgroup_taskset_first(tset, &css);
> - cs = css_cs(css);
> -
> mutex_lock(&cpuset_mutex);
> - dec_attach_in_progress_locked(cs);
> -
> - if (cs->dl_bw_cpu >= 0)
> - dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
> -
> - if (cs->nr_migrate_dl_tasks)
> - reset_migrate_dl_data(cs);
> -
> + cpuset_undo_attach(tset, NULL);
> mutex_unlock(&cpuset_mutex);
> }
>
> @@ -3232,8 +3219,15 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> mutex_lock(&cpuset_mutex);
> queue_task_work = false;
>
> - attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
> - attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> + if (attach_many_sources) {
> + attach_cpus_updated = true;
> + attach_mems_updated = true;
> + } else {
> + attach_cpus_updated = !cpumask_equal(cs->effective_cpus,
> + oldcs->effective_cpus);
> + attach_mems_updated = !nodes_equal(cs->effective_mems,
> + oldcs->effective_mems);
> + }
>
> /*
> * In the default hierarchy, enabling cpuset in the child cgroups
> @@ -3249,21 +3243,28 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
> }
>
> - cgroup_taskset_for_each(task, css, tset)
> + cgroup_taskset_for_each(task, css, tset) {
> + struct cpuset *newcs = css_cs(css);
> +
> + if (newcs != cs) {
> + cs->old_mems_allowed = cpuset_attach_nodemask_to;
> + cs = newcs;
> + if (cpuset_v2())
> + cpuset_attach_nodemask_to = cs->effective_mems;
> + else
> + guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
> + }
> cpuset_attach_task(cs, task);
> + }
>
> out:
> if (queue_task_work)
> schedule_flush_migrate_mm();
> cs->old_mems_allowed = cpuset_attach_nodemask_to;
>
> - if (cs->nr_migrate_dl_tasks) {
> - cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
> - oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
> - reset_migrate_dl_data(cs);
> - }
> -
> - dec_attach_in_progress_locked(cs);
> + /* Decrement per-task attach_in_progress */
> + cgroup_taskset_for_each(task, css, tset)
> + dec_attach_in_progress_locked(css_cs(css));
>
> mutex_unlock(&cpuset_mutex);
> }
--
Best regards,
Ridong
* Re: [PATCH] cgroup/cpuset: Support multiple source/destination cpusets using pids pattern
2026-06-03 10:26 ` [PATCH] cgroup/cpuset: Support multiple source/destination cpusets using pids pattern Ridong Chen
2026-06-03 10:32 ` Ridong Chen
@ 2026-06-03 18:47 ` Waiman Long
1 sibling, 0 replies; 15+ messages in thread
From: Waiman Long @ 2026-06-03 18:47 UTC (permalink / raw)
To: Ridong Chen; +Cc: cgroups, Tejun Heo, Johannes Weiner, linux-kernel
On 6/3/26 6:26 AM, Ridong Chen wrote:
> The current cpuset_can_attach() and cpuset_attach() functions assume task
> migration is from one source cpuset to one destination cpuset. This can be
> wrong in several scenarios:
> - Moving a multi-threaded process with threads in different cpusets
> - Disabling the cpuset controller (many children to one parent)
> - Enabling the cpuset controller (one parent to many children)
>
> Fix this by adopting the pids subsystem's per-task accounting pattern.
> In cpuset_can_attach(), use task_cs(task) to get the correct source cpuset
> for each task (like pids_can_attach uses task_css), adjust nr_deadline_tasks
> and reserve DL bandwidth per-task, and increment attach_in_progress per-task
> on the destination cpuset. In cpuset_attach(), handle destination cpuset
> changes within the task iteration loop.
>
> A shared helper cpuset_undo_attach() reverses the per-task operations for
> both partial rollback in cpuset_can_attach() and full reversal in
> cpuset_cancel_attach().
>
> When multiple source cpusets are detected in can_attach(), set
> attach_many_sources so that cpuset_attach() forces cpus_updated and
> mems_updated to true, ensuring all tasks get properly updated regardless
> of which source cpuset cpuset_attach_old_cs points to.
>
> This eliminates the need for nr_migrate_dl_tasks, sum_migrate_dl_bw, and
> dl_bw_cpu fields in struct cpuset.
>
> Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
> Signed-off-by: Ridong Chen <ridong.chen@linux.dev>
Doing per-task DL BW allocation and eliminating the *dl_bw* fields is not
a problem. However, updating nr_deadline_tasks before the attach is
committed can be problematic.

nr_deadline_tasks is used in dl_rebuild_rd_accounting(), which is called
by partition_sched_domains_locked(). After cpuset_mutex is released at
the end of cpuset_can_attach() and before cpuset_attach() or
cpuset_cancel_attach() is called, partition_sched_domains_locked() can
be called, in which case dl_rebuild_rd_accounting() will not get the
right DL BW accounting information. So unless there is a way to confirm
that this situation cannot happen, we can't change nr_deadline_tasks
before the attach is committed.
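
To make the concern concrete, here is a small user-space model of the staged approach that the current code takes with nr_migrate_dl_tasks. All names are hypothetical stand-ins, and `toy_reader()` only plays the role of a consumer such as dl_rebuild_rd_accounting(); it is a sketch of the idea, not of the kernel's locking.

```c
#include <assert.h>

/* Toy cpuset: a committed count plus a count staged for the attach. */
struct toy_cpuset {
	int nr_deadline_tasks;		/* committed; what readers see */
	int nr_migrate_dl_tasks;	/* staged by can_attach, not visible */
};

/* Phase 1: only stage the number of migrating DL tasks. */
static void toy_can_attach(struct toy_cpuset *dst, int nr_dl)
{
	dst->nr_migrate_dl_tasks += nr_dl;
}

/* Phase 2 (commit): fold the staged count into the visible counters. */
static void toy_attach(struct toy_cpuset *dst, struct toy_cpuset *src)
{
	dst->nr_deadline_tasks += dst->nr_migrate_dl_tasks;
	src->nr_deadline_tasks -= dst->nr_migrate_dl_tasks;
	dst->nr_migrate_dl_tasks = 0;
}

/* Phase 2 (cancel): drop the staged count; nothing visible changed. */
static void toy_cancel_attach(struct toy_cpuset *dst)
{
	dst->nr_migrate_dl_tasks = 0;
}

/* Stand-in for a reader that runs between the two phases. */
static int toy_reader(const struct toy_cpuset *cs)
{
	return cs->nr_deadline_tasks;
}
```

In this model a reader that runs between `toy_can_attach()` and `toy_attach()` still sees the pre-migration counts, which is the property that would be lost if nr_deadline_tasks were updated directly in the first phase.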
Cheers,
Longman
end of thread, other threads:[~2026-06-03 18:47 UTC | newest]
Thread overview: 15+ messages
2026-06-02 2:31 [PATCH-next v5 0/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-02 2:31 ` [PATCH-next v5 1/6] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
2026-06-02 13:37 ` Ridong Chen
2026-06-02 18:43 ` Waiman Long
2026-06-02 2:31 ` [PATCH-next v5 2/6] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
2026-06-02 13:40 ` Ridong Chen
2026-06-02 2:32 ` [PATCH-next v5 3/6] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
2026-06-02 13:51 ` Ridong Chen
2026-06-02 2:32 ` [PATCH-next v5 4/6] cgroup/cpuset: Make cpuset_attach_old_cs track task group leaders Waiman Long
2026-06-02 13:58 ` Ridong Chen
2026-06-02 2:32 ` [PATCH-next v5 5/6] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
2026-06-02 2:32 ` [PATCH-next v5 6/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-03 10:26 ` [PATCH] cgroup/cpuset: Support multiple source/destination cpusets using pids pattern Ridong Chen
2026-06-03 10:32 ` Ridong Chen
2026-06-03 18:47 ` Waiman Long