* [PATCH v7 0/9] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
@ 2026-06-21 3:28 Waiman Long
2026-06-21 3:28 ` [PATCH v7 1/9] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed Waiman Long
` (8 more replies)
0 siblings, 9 replies; 10+ messages in thread
From: Waiman Long @ 2026-06-21 3:28 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Li Zefan, Farhad Alemi, Andrew Morton
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Gregory Price,
David Hildenbrand, Waiman Long
v7:
- Include the fix patch from Farhad Alemi to fix a div/0 crash that
was part of the old patch 1.
- Integrated v6 patch 7 into earlier patches.
- Add a new "cgroup/cpuset: Prevent race between task attach and
cpuset state change" patch to ensure that there will be no cpuset
state change between cpuset_can_attach() and cpuset_attach() or
cpuset_cancel_attach().
- Break v6 patch 6 into 2 separate patches for supporting multiple
source cpusets and multiple destination cpusets respectively and
further simplify and streamline the code.
v6:
- Make guarantee_online_mems() only return cs->effective_mems with v2
in patch 1.
- Remove obsolete commit description text from patch 3.
- Add Reviewed-by tags.
- In patch 6, add WARN_ON_ONCE() test in cpuset_can_attach() to
confirm that cs != oldcs.
v5:
- Remove the WARN_ON() call as it can be triggered in a corner case.
- Instead of passing the attach_cpus_updated and attach_mems_updated
flags from cpuset_can_attach() to cpuset_attach(), re-evaluate the
flags at the beginning of cpuset_attach() based on data in the source &
destination cpusets in the singly linked lists to eliminate the
Time-of-Check to Time-of-Use (TOCTOU) race condition & simplify the
code changes.
- Add back the dropped optimization in patch 5.
A Sashiko AI review of another cpuset patch found that cpuset_attach()
and cpuset_can_attach() can be passed a cgroup_taskset with tasks
migrating from one source cpuset to multiple destination cpusets and
vice versa. Further testing of the cpuset code indicates that this is
indeed the case when the v2 cpuset controller is enabled or disabled.
Unfortunately, cpuset_attach() and cpuset_can_attach() still assume that
there will be one source and one destination cpuset, which may result in
incorrect behavior.
This patch series is created to fix this issue.
Patch 1 fixes a cgroup v2 div/0 crash due to a bug in
cpuset_update_tasks_nodemask().
Patch 2 is to fix an inconsistency in the way node mask update is being
handled in cpuset_update_tasks_nodemask() and cpuset_attach() so that
they match each other.
Patch 3 makes any cpuset state change wait for the completion of the
pending cpuset attach operation, if any.
Patches 4 and 5 are just preparatory patches to make the remaining
patches easier to review.
Patch 6 makes cpuset_attach_old_cs track the group leader for use by
cpuset_migrate_mm().
Patch 7 moves mpol_rebind_mm() and cpuset_migrate_mm() inside
cpuset_attach_task() to make the CLONE_INTO_CGROUP flag of clone(2) work
more like moving a task from one cpuset to another, while also making
it easier to support multiple source and destination cpusets.
Patch 8 makes the necessary changes to enable the support of multiple
source cpusets by keeping all the source cpusets found during task
iterations in a singly linked list.
Patch 9 enables the support of multiple destination cpusets during the
enabling of the cpuset v2 controller.
Farhad Alemi (1):
cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
Waiman Long (8):
cgroup/cpuset: Fix node inconsistencies between
cpuset_update_tasks_nodemask() and cpuset_attach()
cgroup/cpuset: Prevent race between task attach and cpuset state
change
cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper
cgroup/cpuset: Expand the scope of cpuset_can_attach_check()
cgroup/cpuset: Make cpuset_attach_old_cs track task group leaders
cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside
cpuset_attach_task()
cgroup/cpuset: Support multiple source cpusets for cpuset_*attach()
cgroup/cpuset: Support multiple destination cpusets for
cpuset_*attach()
kernel/cgroup/cpuset-internal.h | 2 +
kernel/cgroup/cpuset.c | 400 ++++++++++++++++++++++----------
2 files changed, 277 insertions(+), 125 deletions(-)
--
2.54.0
* [PATCH v7 1/9] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Waiman Long @ 2026-06-21 3:28 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Li Zefan, Farhad Alemi, Andrew Morton
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Gregory Price,
David Hildenbrand, Waiman Long, stable
From: Farhad Alemi <farhad.alemi@berkeley.edu>
Creating a child cpuset where cpuset.mems is never set leads to a div/0
when a VMA mempolicy with MPOL_F_RELATIVE_NODES rebinds in response to a
CPU hotplug event.
Reproduction steps:
1) Create a cgroup w/ cpuset controls (do not set cpuset.mems)
2) Move the task into the child cpuset
3) Create a VMA mempolicy for that task with MPOL_F_RELATIVE_NODES
4) unplug and hotplug a cpu
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online
5) mempolicy rebind does a div/0 in mpol_relative_nodemask on the
call to __nodes_fold()
The cpuset code passes cs->mems_allowed, which is not guaranteed to have
any nodes set, to the rebind routine. Use cs->effective_mems instead,
which is guaranteed to have a non-empty nodemask.
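To see why an empty mask is fatal here, a minimal userspace sketch of the
fold step may help. It assumes that mpol_relative_nodemask() folds the
policy's nodes modulo the weight of the mask it is handed, so an empty mask
gives a modulus of zero. The names mask_weight() and fold_nodes() and the
64-bit masks are stand-ins for illustration only, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits; stands in for nodes_weight() in this model. */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Model of the fold step: every set bit b of @orig maps to bit (b % sz).
 * With sz == 0 the modulo would be a division by zero, so this model
 * reports that case with -1 instead of trapping.
 */
static int fold_nodes(uint64_t orig, unsigned int sz, uint64_t *dst)
{
	unsigned int bit;

	assert(dst != 0);
	if (sz == 0)
		return -1;

	*dst = 0;
	for (bit = 0; bit < 64; bit++)
		if (orig & (1ULL << bit))
			*dst |= 1ULL << (bit % sz);
	return 0;
}
```

In this model, an unset cpuset.mems corresponds to calling
fold_nodes(policy, mask_weight(0), &out), which is the sz == 0 case.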
Closes: https://lore.kernel.org/linux-mm/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
Link: https://lore.kernel.org/all/CA+0ovCiEz6SP_sn3kN4Tb+_oC=eHMXy_Ffj=usV3wREdQrUtww@mail.gmail.com/
Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}")
Suggested-by: Gregory Price <gourry@gourry.net>
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu>
Acked-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org
---
kernel/cgroup/cpuset.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 591e3aa487fc..b21c31650583 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2653,7 +2653,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
migrate = is_memory_migrate(cs);
- mpol_rebind_mm(mm, &cs->mems_allowed);
+ mpol_rebind_mm(mm, &cs->effective_mems);
if (migrate)
cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
else
--
2.54.0
* [PATCH v7 2/9] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach()
From: Waiman Long @ 2026-06-21 3:28 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Li Zefan, Farhad Alemi, Andrew Morton
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Gregory Price,
David Hildenbrand, Waiman Long
Whenever the memory node mask is changed, there are 4 places where the
node mask has to be updated or used.
1) task's node mask via cpuset_change_task_nodemask()
2) memory policy binding via mpol_rebind_mm()
3) if memory migration is enabled, migrate from old_mems_allowed to
the new node mask via cpuset_migrate_mm().
4) setting old_mems_allowed
These memory actions are done in cpuset_update_tasks_nodemask() and
cpuset_attach(). However there are inconsistencies in what node masks
are being used in these 2 functions.
In cpuset_update_tasks_nodemask(),
- cpuset_change_task_nodemask(): guarantee_online_mems()
- mpol_rebind_mm(): mems_allowed
- cpuset_migrate_mm(): guarantee_online_mems()
- old_mems_allowed: guarantee_online_mems()
In cpuset_attach(),
- cpuset_change_task_nodemask(): guarantee_online_mems()
- mpol_rebind_mm(): effective_mems
- cpuset_migrate_mm(): effective_mems
- old_mems_allowed: effective_mems
These inconsistencies date back quite a long time and it is hard to say
what the correct values should be.
The guarantee_online_mems() function returns a node mask from current or
an ancestor cpuset that is a subset of node_states[N_MEMORY]. Nodes in
node_states[N_MEMORY] are all online, i.e. in node_states[N_ONLINE].
However, a node in node_states[N_ONLINE] may not have memory. So
node_states[N_MEMORY] should be a subset of node_states[N_ONLINE].
The guarantee_online_mems() function should mostly be useful for v1
where mems_allowed is the same as effective_mems. With v2, the memory
nodes in effective_mems should be a subset of node_states[N_MEMORY]
except when a memory hot-unplug operation is in progress and a memory
node has been removed from node_states[N_MEMORY] but the removal is not
yet reflected in effective_mems, as cpuset_handle_hotplug() has not been
called from cpuset_track_online_nodes().
Let's use the following setup for both of them and make them consistent.
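The walk described above can be sketched in userspace as follows. This
is only a model under stated assumptions: struct cs_model and
online_mems() are hypothetical names, node masks are plain 64-bit words,
and the top cpuset is assumed to always intersect the set of nodes with
memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpuset: one mask and a parent link. */
struct cs_model {
	uint64_t effective_mems;
	struct cs_model *parent;	/* NULL for the top cpuset */
};

/*
 * Walk up from @cs until a cpuset's effective_mems intersects the nodes
 * with memory (@n_memory), then return that intersection. The result is
 * therefore always a non-empty subset of @n_memory.
 */
static uint64_t online_mems(const struct cs_model *cs, uint64_t n_memory)
{
	while (!(cs->effective_mems & n_memory)) {
		assert(cs->parent != NULL);	/* top cpuset must have memory */
		cs = cs->parent;
	}
	return cs->effective_mems & n_memory;
}
```

A child whose only node has just lost its memory falls through to the
parent's mask, which models the hot-unplug transition window.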
- cpuset_change_task_nodemask(): guarantee_online_mems()
- mpol_rebind_mm(): effective_mems
- cpuset_migrate_mm(): guarantee_online_mems()
- old_mems_allowed: guarantee_online_mems()
So for v2, it is effectively all effective_mems most of the time. For
v1, mpol_rebind_mm() uses mems_allowed which may differ from what
guarantee_online_mems() returns, but it conforms to what the cpuset v1
documentation says with respect to setting memory policy.
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b21c31650583..a1c8890d3519 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -489,7 +489,10 @@ static void guarantee_active_cpus(struct task_struct *tsk,
* Return in *pmask the portion of a cpusets's mems_allowed that
* are online, with memory. If none are online with memory, walk
* up the cpuset hierarchy until we find one that does have some
- * online mems. The top cpuset always has some mems online.
+ * online mems. The top cpuset always has some mems online. With v2,
+ * effective_mems should always contain online memory nodes except
+ * during the transition period where a memory node hotunplug operation
+ * is in progress.
*
* One way or another, we guarantee to return some non-empty subset
* of node_states[N_MEMORY].
@@ -2619,6 +2622,14 @@ static void *cpuset_being_rebound;
* Iterate through each task of @cs updating its mems_allowed to the
* effective cpuset's. As this function is called with cpuset_mutex held,
* cpuset membership stays stable.
+ *
+ * - cpuset_change_task_nodemask(): guarantee_online_mems()
+ * - mpol_rebind_mm(): effective_mems
+ * - cpuset_migrate_mm(): guarantee_online_mems()
+ * - old_mems_allowed: guarantee_online_mems()
+ *
+ * For v2, guarantee_online_mems() should return a node mask that is the same
+ * as the effective_mems of current cpuset.
*/
void cpuset_update_tasks_nodemask(struct cpuset *cs)
{
@@ -2627,7 +2638,6 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
struct task_struct *task;
cpuset_being_rebound = cs; /* causes mpol_dup() rebind */
-
guarantee_online_mems(cs, &newmems);
/*
@@ -3148,19 +3158,16 @@ static void cpuset_attach(struct cgroup_taskset *tset)
cpus_updated = !cpumask_equal(cs->effective_cpus,
oldcs->effective_cpus);
mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+ guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
/*
* In the default hierarchy, enabling cpuset in the child cgroups
- * will trigger a number of cpuset_attach() calls with no change
- * in effective cpus and mems. In that case, we can optimize out
- * by skipping the task iteration and update.
+ * will trigger a cpuset_attach() call with no change in effective cpus
+ * and mems. In that case, we can optimize out by skipping the task
+ * iteration and update.
*/
- if (cpuset_v2() && !cpus_updated && !mems_updated) {
- cpuset_attach_nodemask_to = cs->effective_mems;
+ if (cpuset_v2() && !cpus_updated && !mems_updated)
goto out;
- }
-
- guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
cgroup_taskset_for_each(task, css, tset)
cpuset_attach_task(cs, task);
@@ -3171,7 +3178,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* if there is no change in effective_mems and CS_MEMORY_MIGRATE is
* not set.
*/
- cpuset_attach_nodemask_to = cs->effective_mems;
if (!is_memory_migrate(cs) && !mems_updated)
goto out;
@@ -3179,7 +3185,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
struct mm_struct *mm = get_task_mm(leader);
if (mm) {
- mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
+ mpol_rebind_mm(mm, &cs->effective_mems);
/*
* old_mems_allowed is the same with mems_allowed
--
2.54.0
* [PATCH v7 3/9] cgroup/cpuset: Prevent race between task attach and cpuset state change
From: Waiman Long @ 2026-06-21 3:28 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Li Zefan, Farhad Alemi, Andrew Morton
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Gregory Price,
David Hildenbrand, Waiman Long
Commit e44193d39e8d ("cpuset: let hotplug propagation work wait for
task attaching") was introduced to let hotplug operations wait
until the completion of the task attaching operation. However, it is
still possible that the states of the source or destination cpuset
can be changed between the cpuset_can_attach() call and the subsequent
cpuset_attach()/cpuset_cancel_attach() call.
As a result, data gathered during cpuset_can_attach() cannot be reliably
used in the subsequent cpuset_attach()/cpuset_cancel_attach()
call at all. Make the task attach operation more robust
and allow the sharing of data between cpuset_can_attach() and
cpuset_attach()/cpuset_cancel_attach() by making cpuset_write_resmask()
and cpuset_partition_write() wait for the completion of task attach
and set the attach_in_progress flag in the source cpuset as well.
The comments about validate_change() are no longer valid as it won't
be called at all if an attach operation is in progress. So the comments
can be removed.
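The wait, lock, recheck and retry sequence can be modelled in userspace
roughly as below. This is a sketch, not the kernel code: it assumes
pthread mutexes in place of cpuset_full_lock(), a condition variable in
place of cpuset_attach_wq, and the hypothetical helpers attach_begin(),
attach_end() and write_state().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t cs_lock = PTHREAD_MUTEX_INITIALIZER;	/* models the cpuset lock */
static pthread_mutex_t wq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t attach_wq = PTHREAD_COND_INITIALIZER;	/* models the wait queue */
static atomic_int attach_in_progress;
static int cs_state;

/* can_attach side: mark an attach as pending under the main lock. */
static void attach_begin(void)
{
	pthread_mutex_lock(&cs_lock);
	atomic_fetch_add(&attach_in_progress, 1);
	pthread_mutex_unlock(&cs_lock);
}

/* attach/cancel_attach side: drop the count and wake any waiters. */
static void attach_end(void)
{
	int left;

	pthread_mutex_lock(&cs_lock);
	left = atomic_fetch_sub(&attach_in_progress, 1) - 1;
	pthread_mutex_unlock(&cs_lock);
	assert(left >= 0);

	if (left == 0) {
		pthread_mutex_lock(&wq_lock);
		pthread_cond_broadcast(&attach_wq);
		pthread_mutex_unlock(&wq_lock);
	}
}

/*
 * State change: wait without the main lock, take the lock, recheck, and
 * go around again if an attach slipped in. Returns the number of retries.
 */
static int write_state(int val)
{
	int retries = 0;

	for (;;) {
		pthread_mutex_lock(&wq_lock);
		while (atomic_load(&attach_in_progress) != 0)
			pthread_cond_wait(&attach_wq, &wq_lock);
		pthread_mutex_unlock(&wq_lock);

		pthread_mutex_lock(&cs_lock);
		if (atomic_load(&attach_in_progress) == 0)
			break;
		/* An attach started after the wait returned. */
		pthread_mutex_unlock(&cs_lock);
		retries++;
	}

	cs_state = val;		/* safe: no attach can be pending here */
	pthread_mutex_unlock(&cs_lock);
	return retries;
}
```

The recheck under the lock is what closes the window between the end of
the wait and the lock acquisition.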
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 28 ++++++++++++++++++++--------
1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a1c8890d3519..65d095dcada1 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3080,11 +3080,8 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
cs->dl_bw_cpu = cpu;
out_success:
- /*
- * Mark attach is in progress. This makes validate_change() fail
- * changes which zero cpus/mems_allowed.
- */
cs->attach_in_progress++;
+ oldcs->attach_in_progress++;
out_unlock:
if (ret)
@@ -3235,10 +3232,19 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
return -EACCES;
buf = strstrip(buf);
+retry:
+ wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
+
cpuset_full_lock();
if (!is_cpuset_online(cs))
goto out_unlock;
+ /* Don't race with task attach */
+ if (cs->attach_in_progress) {
+ cpuset_full_unlock();
+ goto retry;
+ }
+
trialcs = dup_or_alloc_cpuset(cs);
if (!trialcs) {
retval = -ENOMEM;
@@ -3366,7 +3372,17 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf,
else
return -EINVAL;
+retry:
+ wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
+
cpuset_full_lock();
+
+ /* Don't race with task attach */
+ if (cs->attach_in_progress) {
+ cpuset_full_unlock();
+ goto retry;
+ }
+
if (is_cpuset_online(cs))
retval = update_prstate(cs, val);
cpuset_update_sd_hk_unlock();
@@ -3605,10 +3621,6 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
if (ret)
goto out_unlock;
- /*
- * Mark attach is in progress. This makes validate_change() fail
- * changes which zero cpus/mems_allowed.
- */
cs->attach_in_progress++;
out_unlock:
mutex_unlock(&cpuset_mutex);
--
2.54.0
* [PATCH v7 4/9] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper
From: Waiman Long @ 2026-06-21 3:28 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Li Zefan, Farhad Alemi, Andrew Morton
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Gregory Price,
David Hildenbrand, Waiman Long
Extract the DL bandwidth allocation code in cpuset_attach() to a new
cpuset_reserve_dl_bw() helper to simplify code.
No functional change is expected.
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 47 ++++++++++++++++++++++++------------------
1 file changed, 27 insertions(+), 20 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 65d095dcada1..2ffc66baedf3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2994,6 +2994,25 @@ static int cpuset_can_attach_check(struct cpuset *cs)
return 0;
}
+static int cpuset_reserve_dl_bw(struct cpuset *cs)
+{
+ int cpu, ret;
+
+ if (!cs->sum_migrate_dl_bw)
+ return 0;
+
+ cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+ if (unlikely(cpu >= nr_cpu_ids))
+ return -EINVAL;
+
+ ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+ if (ret)
+ return ret;
+
+ cs->dl_bw_cpu = cpu;
+ return 0;
+}
+
static void reset_migrate_dl_data(struct cpuset *cs)
{
cs->nr_migrate_dl_tasks = 0;
@@ -3008,7 +3027,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
struct cpuset *cs, *oldcs;
struct task_struct *task;
bool setsched_check;
- int cpu, ret;
+ int ret;
/* used later by cpuset_attach() */
cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
@@ -3064,28 +3083,16 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
}
}
- if (!cs->sum_migrate_dl_bw)
- goto out_success;
-
- cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
- if (unlikely(cpu >= nr_cpu_ids)) {
- ret = -EINVAL;
- goto out_unlock;
- }
-
- ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
- if (ret)
- goto out_unlock;
-
- cs->dl_bw_cpu = cpu;
-
-out_success:
- cs->attach_in_progress++;
- oldcs->attach_in_progress++;
+ ret = cpuset_reserve_dl_bw(cs);
out_unlock:
- if (ret)
+ if (ret) {
reset_migrate_dl_data(cs);
+ } else {
+ cs->attach_in_progress++;
+ oldcs->attach_in_progress++;
+ }
+
mutex_unlock(&cpuset_mutex);
return ret;
}
--
2.54.0
* [PATCH v7 5/9] cgroup/cpuset: Expand the scope of cpuset_can_attach_check()
From: Waiman Long @ 2026-06-21 3:28 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Li Zefan, Farhad Alemi, Andrew Morton
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Gregory Price,
David Hildenbrand, Waiman Long
Expand the scope of cpuset_can_attach_check() by including the setting
of the setsched flag inside cpuset_can_attach_check() with the new @oldcs
and @psetsched arguments. As cpuset_can_attach_check() is also called
from cpuset_can_fork(), set the new arguments to NULL from that caller.
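The flag computation being moved can be read as a small truth function.
The sketch below is a userspace model with a hypothetical name,
need_setsched_check(), and boolean inputs standing in for the
cpuset_v2(), is_in_v2_mode() and mask comparison results used in the
patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether the security_task_setscheduler() check is needed.
 * @is_v2:          models cpuset_v2()
 * @in_v2_mode:     models is_in_v2_mode()
 * @cpus_equal:     source and destination effective_cpus match
 * @mems_equal:     source and destination effective_mems match
 * @old_cpus_empty: source cpuset has no effective CPU left
 */
static bool need_setsched_check(bool is_v2, bool in_v2_mode,
				bool cpus_equal, bool mems_equal,
				bool old_cpus_empty)
{
	/* In v2, skip the check only when nothing changes. */
	bool check = !is_v2 || !cpus_equal || !mems_equal;

	/* Let tasks escape a v1 cpuset that has lost its last CPU. */
	if (!in_v2_mode && old_cpus_empty)
		check = false;

	assert(!(is_v2 && cpus_equal && mems_equal && check));
	return check;
}
```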
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 52 ++++++++++++++++++++++++------------------
1 file changed, 30 insertions(+), 22 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2ffc66baedf3..b7b5072f2fdd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2985,12 +2985,39 @@ static struct cpuset *cpuset_attach_old_cs;
* For v1, cpus_allowed and mems_allowed can't be empty.
* For v2, effective_cpus can't be empty.
* Note that in v1, effective_cpus = cpus_allowed.
+ *
+ * Also set the boolean flag passed in by @psetsched depending on if
+ * security_task_setscheduler() call is needed and @oldcs is not NULL.
*/
-static int cpuset_can_attach_check(struct cpuset *cs)
+static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
+ bool *psetsched)
{
if (cpumask_empty(cs->effective_cpus) ||
(!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
return -ENOSPC;
+
+ if (!oldcs)
+ return 0;
+
+ /*
+ * Skip rights over task setsched check in v2 when nothing changes,
+ * migration permission derives from hierarchy ownership in
+ * cgroup_procs_write_permission()).
+ */
+ *psetsched = !cpuset_v2() ||
+ !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
+ !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+ /*
+ * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
+ * brings the last online CPU offline as users are not allowed to empty
+ * cpuset.cpus when there are active tasks inside. When that happens,
+ * we should allow tasks to migrate out without security check to make
+ * sure they will be able to run after migration.
+ */
+ if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus))
+ *psetsched = false;
+
return 0;
}
@@ -3037,29 +3064,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
mutex_lock(&cpuset_mutex);
/* Check to see if task is allowed in the cpuset */
- ret = cpuset_can_attach_check(cs);
+ ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
if (ret)
goto out_unlock;
- /*
- * Skip rights over task setsched check in v2 when nothing changes,
- * migration permission derives from hierarchy ownership in
- * cgroup_procs_write_permission()).
- */
- setsched_check = !cpuset_v2() ||
- !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
- !nodes_equal(cs->effective_mems, oldcs->effective_mems);
-
- /*
- * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
- * brings the last online CPU offline as users are not allowed to empty
- * cpuset.cpus when there are active tasks inside. When that happens,
- * we should allow tasks to migrate out without security check to make
- * sure they will be able to run after migration.
- */
- if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus))
- setsched_check = false;
-
cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task);
if (ret)
@@ -3616,7 +3624,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
mutex_lock(&cpuset_mutex);
/* Check to see if task is allowed in the cpuset */
- ret = cpuset_can_attach_check(cs);
+ ret = cpuset_can_attach_check(cs, NULL, NULL);
if (ret)
goto out_unlock;
--
2.54.0
* [PATCH v7 6/9] cgroup/cpuset: Make cpuset_attach_old_cs track task group leaders
From: Waiman Long @ 2026-06-21 3:28 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Li Zefan, Farhad Alemi, Andrew Morton
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Gregory Price,
David Hildenbrand, Waiman Long
There are two possible ways that migration of tasks from multiple source
cpusets to a target cpuset can happen. Either a multithreaded application
with threads in different cpusets is wholly moved to a new cpuset,
or disabling the v2 cpuset controller will move all the tasks in child
cpusets to the parent cpuset.
In the former case, it is the mm setting of the group leader that really
matters. So cpuset_attach_old_cs should track the oldcs of the group
leader. In the latter case, effective_mems of child cpusets must always
be a subset of the parent. So no real page migration will be necessary
no matter which child cpuset is selected as cpuset_attach_old_cs.
IOW, cpuset_attach_old_cs should be updated to match the latest task
group leader in cpuset_can_attach(), but fall back to that of the first
task if there is no group leader in the taskset.
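The selection rule can be sketched as a simple scan over the taskset. In
this userspace model, struct task_model and pick_old_cs() are
hypothetical, and each cpuset is represented by an integer id rather
than a pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task in the taskset. */
struct task_model {
	bool is_group_leader;
	int old_cs;		/* id of the cpuset the task is leaving */
};

/*
 * Return the source cpuset to remember: that of the latest group leader
 * seen, or that of the first task if the taskset has no group leader.
 */
static int pick_old_cs(const struct task_model *tasks, size_t n)
{
	int old_cs;
	size_t i;

	assert(n > 0);
	old_cs = tasks[0].old_cs;	/* fallback: first task */

	for (i = 0; i < n; i++)
		if (tasks[i].is_group_leader)
			old_cs = tasks[i].old_cs;
	return old_cs;
}
```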
Suggested-by: Ridong Chen <ridong.chen@linux.dev>
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b7b5072f2fdd..0375dae26d0b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2978,6 +2978,10 @@ static int update_prstate(struct cpuset *cs, int new_prs)
return 0;
}
+/*
+ * cpuset_can_attach() and cpuset_attach() specific internal data
+ * Protected by cpuset_mutex
+ */
static struct cpuset *cpuset_attach_old_cs;
/*
@@ -3068,11 +3072,32 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
if (ret)
goto out_unlock;
+ /*
+ * The cpuset_attach_old_cs is used mainly by cpuset_migrate_mm() to get
+ * the old_mems_allowed value. There are two ways that many-to-one
+ * cpuset migration can happen:
+ * 1) A multithread application with threads in different cpusets is
+ * wholely migrated to a new cpuset.
+ * 2) Disabling v2 cpuset controller will move all the tasks in child
+ * cpusets to the parent cpuset.
+ *
+ * In the former case, it is the mm setting of the group leader that
+ * really matters. So cpuset_attach_old_cs should track the oldcs of the
+ * group leader. It falls back to the oldcs of the first task if there
+ * is no group leader in the taskset. In the latter case, effective_mems
+ * of child cpusets must always be a subset of the parent. So no real
+ * page migration will be necessary no matter which child cpuset is
+ * selected as cpuset_attach_old_cs.
+ */
cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task);
if (ret)
goto out_unlock;
+ /* Update cpuset_attach_old_cs to the latest group leader */
+ if (task == task->group_leader)
+ cpuset_attach_old_cs = task_cs(task);
+
if (setsched_check) {
ret = security_task_setscheduler(task);
if (ret)
--
2.54.0
* [PATCH v7 7/9] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task()
From: Waiman Long @ 2026-06-21 3:28 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Li Zefan, Farhad Alemi, Andrew Morton
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Gregory Price,
David Hildenbrand, Waiman Long
cpuset_attach_task() was introduced in commit 42a11bf5c543
("cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly")
to enable the CLONE_INTO_CGROUP flag of clone(2) to behave more like
moving a task from one cpuset into another one. That commit didn't
move the mpol_rebind_mm() and cpuset_migrate_mm() calls for the group
leader into cpuset_attach_task().
When the CLONE_INTO_CGROUP flag is used without CLONE_THREAD, the new
task is its own group leader. So it is still not equivalent to moving
a task between cpusets in this case. Make CLONE_INTO_CGROUP behave
more like cpuset_attach() by moving the mpol_rebind_mm() and
cpuset_migrate_mm() calls inside cpuset_attach_task(). As a result,
the following static variables will have to be updated in cpuset_fork().
- cpuset_attach_old_cs
- attach_cpus_updated
- attach_mems_updated
- queue_task_work
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 105 ++++++++++++++++++++++++-----------------
1 file changed, 62 insertions(+), 43 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0375dae26d0b..511afb077e2d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2981,8 +2981,13 @@ static int update_prstate(struct cpuset *cs, int new_prs)
/*
* cpuset_can_attach() and cpuset_attach() specific internal data
* Protected by cpuset_mutex
+ *
+ * The attach_cpus_updated/attach_mems_updated flags are set in either
+ * cpuset_attach() or cpuset_fork() and used in cpuset_attach_task().
*/
static struct cpuset *cpuset_attach_old_cs;
+static bool attach_cpus_updated;
+static bool attach_mems_updated;
/*
* Check to see if a cpuset can accept a new task
@@ -3157,9 +3162,12 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
*/
static cpumask_var_t cpus_attach;
static nodemask_t cpuset_attach_nodemask_to;
+static bool queue_task_work;
static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
{
+ struct mm_struct *mm;
+
lockdep_assert_cpuset_lock_held();
if (cs != &top_cpuset)
@@ -3173,28 +3181,60 @@ static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
*/
WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+ if (cpuset_v2() && !attach_mems_updated)
+ return;
+
cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
cpuset1_update_task_spread_flags(cs, task);
+
+ if ((task != task->group_leader) ||
+ (!is_memory_migrate(cs) && !attach_mems_updated))
+ return;
+
+ /*
+ * Change mm for threadgroup leader. This is expensive and may
+ * sleep and should be moved outside migration path proper.
+ */
+ mm = get_task_mm(task);
+ if (mm) {
+ struct cpuset *oldcs = cpuset_attach_old_cs;
+
+ mpol_rebind_mm(mm, &cs->effective_mems);
+
+ /*
+ * old_mems_allowed is the same with mems_allowed
+ * here, except if this task is being moved
+ * automatically due to hotplug. In that case
+ * @mems_allowed has been updated and is empty, so
+ * @old_mems_allowed is the right nodesets that we
+ * migrate mm from.
+ */
+ if (is_memory_migrate(cs)) {
+ cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
+ &cpuset_attach_nodemask_to);
+ queue_task_work = true;
+ } else {
+ mmput(mm);
+ }
+ }
}
static void cpuset_attach(struct cgroup_taskset *tset)
{
struct task_struct *task;
- struct task_struct *leader;
struct cgroup_subsys_state *css;
struct cpuset *cs;
struct cpuset *oldcs = cpuset_attach_old_cs;
- bool cpus_updated, mems_updated;
- bool queue_task_work = false;
cgroup_taskset_first(tset, &css);
cs = css_cs(css);
lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
mutex_lock(&cpuset_mutex);
- cpus_updated = !cpumask_equal(cs->effective_cpus,
- oldcs->effective_cpus);
- mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+ queue_task_work = false;
+
+ attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
+ attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
/*
@@ -3203,44 +3243,12 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* and mems. In that case, we can optimize out by skipping the task
* iteration and update.
*/
- if (cpuset_v2() && !cpus_updated && !mems_updated)
+ if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated)
goto out;
cgroup_taskset_for_each(task, css, tset)
cpuset_attach_task(cs, task);
- /*
- * Change mm for all threadgroup leaders. This is expensive and may
- * sleep and should be moved outside migration path proper. Skip it
- * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
- * not set.
- */
- if (!is_memory_migrate(cs) && !mems_updated)
- goto out;
-
- cgroup_taskset_for_each_leader(leader, css, tset) {
- struct mm_struct *mm = get_task_mm(leader);
-
- if (mm) {
- mpol_rebind_mm(mm, &cs->effective_mems);
-
- /*
- * old_mems_allowed is the same with mems_allowed
- * here, except if this task is being moved
- * automatically due to hotplug. In that case
- * @mems_allowed has been updated and is empty, so
- * @old_mems_allowed is the right nodesets that we
- * migrate mm from.
- */
- if (is_memory_migrate(cs)) {
- cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
- &cpuset_attach_nodemask_to);
- queue_task_work = true;
- } else
- mmput(mm);
- }
- }
-
out:
if (queue_task_work)
schedule_flush_migrate_mm();
@@ -3689,15 +3697,14 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
*/
static void cpuset_fork(struct task_struct *task)
{
- struct cpuset *cs;
- bool same_cs;
+ struct cpuset *cs, *oldcs;
rcu_read_lock();
cs = task_cs(task);
- same_cs = (cs == task_cs(current));
+ oldcs = task_cs(current);
rcu_read_unlock();
- if (same_cs) {
+ if (cs == oldcs) {
if (cs == &top_cpuset)
return;
@@ -3709,7 +3716,19 @@ static void cpuset_fork(struct task_struct *task)
/* CLONE_INTO_CGROUP */
mutex_lock(&cpuset_mutex);
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+ cs->old_mems_allowed = cpuset_attach_nodemask_to;
+
+ /*
+ * Assume CPUs and memory nodes are updated
+ * A CLONE_INTO_CGROUP operation should have taken the cgroup mutex
+ * and so there shouldn't be a competing cpuset_attach() operation.
+ */
+ attach_cpus_updated = attach_mems_updated = true;
+ queue_task_work = false;
+ cpuset_attach_old_cs = oldcs;
cpuset_attach_task(cs, task);
+ if (queue_task_work)
+ schedule_flush_migrate_mm();
dec_attach_in_progress_locked(cs);
mutex_unlock(&cpuset_mutex);
--
2.54.0
* [PATCH v7 8/9] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach()
From: Waiman Long @ 2026-06-21 3:28 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Li Zefan, Farhad Alemi, Andrew Morton
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Gregory Price,
David Hildenbrand, Waiman Long
There are two scenarios in which the cgroup_taskset structure passed
into the cgroup can_attach() and attach() methods can contain task
migration data with multiple source cpusets:
- A multithreaded application with threads in different cpusets is
fully migrated into a new cpuset.
- Disabling the v2 cpuset controller moves all the tasks in child
cpusets to the parent cpuset.
The current cpuset_can_attach() and cpuset_attach() functions still
expect task migration to go from one source cpuset to one destination
cpuset.
Fix that by tracking the set of source (old) cpusets in a singly linked
list. A cpuset's attach_in_progress count is incremented when it is
inserted into the list. The list is iterated when necessary to properly
update the internal data.
To ensure proper DL task accounting, nr_migrate_dl_tasks is decremented
in the source cpusets and incremented in the destination cpuset, and
the values are added to nr_deadline_tasks when the migration succeeds.
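The list tracking and DL accounting described above can be illustrated
with a minimal user-space sketch. It is not the kernel code: struct
toy_cpuset, toy_src_head and the toy_*() helpers are hypothetical
stand-ins for struct cpuset, src_cs_head and the llist API, and locking
is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct cpuset, not the kernel type. */
struct toy_cpuset {
	struct toy_cpuset *next;	/* plays the role of attach_node */
	bool on_list;			/* plays the role of llist_on_list() */
	int attach_in_progress;
	int nr_deadline_tasks;
	int nr_migrate_dl_tasks;	/* goes negative for a source cpuset */
};

/* Head of the toy source list, analogous to src_cs_head. */
struct toy_cpuset *toy_src_head;

/* Insert @cs once; attach_in_progress is bumped only on insertion. */
void toy_track_source(struct toy_cpuset *cs)
{
	if (cs->on_list)
		return;
	cs->next = toy_src_head;
	toy_src_head = cs;
	cs->on_list = true;
	cs->attach_in_progress++;
}

/* Record one DL task leaving @src for @dst; nothing is committed yet. */
void toy_move_dl_task(struct toy_cpuset *src, struct toy_cpuset *dst)
{
	dst->nr_migrate_dl_tasks++;
	src->nr_migrate_dl_tasks--;
}

/* Detach the whole list; commit pending DL counts unless @cancel is set. */
void toy_clear_attach_data(bool cancel)
{
	struct toy_cpuset *cs = toy_src_head, *next;

	toy_src_head = NULL;
	for (; cs; cs = next) {
		next = cs->next;
		cs->next = NULL;
		cs->on_list = false;
		if (!cancel)
			cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
		cs->nr_migrate_dl_tasks = 0;
		cs->attach_in_progress--;
	}
}
```

Tracking the same source twice leaves its attach_in_progress at 1, and
a cancelled attach drops the pending counts without touching
nr_deadline_tasks.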
The setting of the global attach_cpus_updated and attach_mems_updated
flags is also moved from cpuset_attach() to cpuset_can_attach(), as the
correct source cpuset can no longer be determined in cpuset_attach().
An earlier patch in this series ensures that cpuset states will not
change between cpuset_can_attach() and cpuset_attach().
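As a rough illustration of how the global flags accumulate over several
source cpusets, here is a small user-space sketch. The bitmask types and
toy_check_source() are hypothetical; the kernel compares cpumask_t and
nodemask_t values with cpumask_equal() and nodes_equal() instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bitmask stand-ins for effective_cpus / effective_mems. */
struct toy_masks {
	uint64_t cpus;
	uint64_t mems;
};

/* Sticky flags, analogous to attach_cpus_updated / attach_mems_updated. */
struct toy_attach_flags {
	bool cpus_updated;
	bool mems_updated;
};

/*
 * Compare one source against the destination. A flag is only ever set
 * here, never cleared, so it records "at least one source differed".
 * Returns the per-source result (the patch combines the equivalent
 * value with a cpuset_v2() test to fill in *psetsched).
 */
bool toy_check_source(struct toy_attach_flags *flags,
		      const struct toy_masks *dst,
		      const struct toy_masks *src)
{
	bool cpus_updated = dst->cpus != src->cpus;
	bool mems_updated = dst->mems != src->mems;

	if (cpus_updated)
		flags->cpus_updated = true;
	if (mems_updated)
		flags->mems_updated = true;

	return cpus_updated || mems_updated;
}
```

The caller is expected to reset both flags once before walking the
taskset, as cpuset_can_attach() does in this patch.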
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset-internal.h | 1 +
kernel/cgroup/cpuset.c | 66 ++++++++++++++++++++++++++++-----
2 files changed, 57 insertions(+), 10 deletions(-)
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index f7aaf01f7cd5..011993b1f756 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -149,6 +149,7 @@ struct cpuset {
* Tasks are being attached to this cpuset. Used to prevent
* zeroing cpus/mems_allowed between ->can_attach() and ->attach().
*/
+ struct llist_node attach_node;
int attach_in_progress;
/* partition root state */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 511afb077e2d..c2d172873166 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -37,6 +37,7 @@
#include <linux/wait.h>
#include <linux/workqueue.h>
#include <linux/task_work.h>
+#include <linux/llist.h>
DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -584,6 +585,7 @@ static struct cpuset *dup_or_alloc_cpuset(struct cpuset *cs)
return NULL;
trial->dl_bw_cpu = -1;
+ init_llist_node(&trial->attach_node);
/* Setup cpumask pointer array */
cpumask_var_t *pmask[4] = {
@@ -2983,9 +2985,10 @@ static int update_prstate(struct cpuset *cs, int new_prs)
* Protected by cpuset_mutex
*
* The attach_cpus_updated/attach_mems_updated flags are set in either
- * cpuset_attach() or cpuset_fork() and used in cpuset_attach_task().
+ * cpuset_can_attach() or cpuset_fork() and used in cpuset_attach_task().
*/
static struct cpuset *cpuset_attach_old_cs;
+static LLIST_HEAD(src_cs_head);
static bool attach_cpus_updated;
static bool attach_mems_updated;
@@ -3001,6 +3004,8 @@ static bool attach_mems_updated;
static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
bool *psetsched)
{
+ bool cpus_updated, mems_updated;
+
if (cpumask_empty(cs->effective_cpus) ||
(!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
return -ENOSPC;
@@ -3008,14 +3013,25 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
if (!oldcs)
return 0;
+ if (!llist_on_list(&oldcs->attach_node)) {
+ llist_add(&oldcs->attach_node, &src_cs_head);
+ oldcs->attach_in_progress++;
+ }
+
+ cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
+ mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+ if (cpus_updated)
+ attach_cpus_updated = true;
+ if (mems_updated)
+ attach_mems_updated = true;
+
/*
* Skip rights over task setsched check in v2 when nothing changes,
* migration permission derives from hierarchy ownership in
* cgroup_procs_write_permission()).
*/
- *psetsched = !cpuset_v2() ||
- !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
- !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+ *psetsched = !cpuset_v2() || cpus_updated || mems_updated;
/*
* A v1 cpuset with tasks will have no CPU left only when CPU hotplug
@@ -3056,6 +3072,26 @@ static void reset_migrate_dl_data(struct cpuset *cs)
cs->dl_bw_cpu = -1;
}
+/*
+ * Clear and optionally apply (@cancel is false) the attach related data in the
+ * source cpusets.
+ */
+static void clear_attach_data(struct llist_head *head, bool cancel)
+{
+ struct cpuset *cs, *next;
+ struct llist_node *lnode = __llist_del_all(head);
+
+ llist_for_each_entry_safe(cs, next, lnode, attach_node) {
+ init_llist_node(&cs->attach_node);
+ if (cs->nr_migrate_dl_tasks) {
+ if (!cancel)
+ cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+ cs->nr_migrate_dl_tasks = 0;
+ }
+ dec_attach_in_progress_locked(cs);
+ }
+}
+
/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
static int cpuset_can_attach(struct cgroup_taskset *tset)
{
@@ -3071,6 +3107,8 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
cs = css_cs(css);
mutex_lock(&cpuset_mutex);
+ attach_cpus_updated = false;
+ attach_mems_updated = false;
/* Check to see if task is allowed in the cpuset */
ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
@@ -3095,6 +3133,15 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* selected as cpuset_attach_old_cs.
*/
cgroup_taskset_for_each(task, css, tset) {
+ struct cpuset *new_oldcs = task_cs(task);
+
+ if (new_oldcs != oldcs) {
+ oldcs = new_oldcs;
+ ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
+ if (ret)
+ goto out_unlock;
+ }
+
ret = task_can_attach(task);
if (ret)
goto out_unlock;
@@ -3116,6 +3163,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* contribute to sum_migrate_dl_bw.
*/
cs->nr_migrate_dl_tasks++;
+ oldcs->nr_migrate_dl_tasks--;
if (dl_task_needs_bw_move(task, cs->effective_cpus))
cs->sum_migrate_dl_bw += task->dl.dl_bw;
}
@@ -3126,9 +3174,9 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
out_unlock:
if (ret) {
reset_migrate_dl_data(cs);
+ clear_attach_data(&src_cs_head, true);
} else {
cs->attach_in_progress++;
- oldcs->attach_in_progress++;
}
mutex_unlock(&cpuset_mutex);
@@ -3145,6 +3193,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
mutex_lock(&cpuset_mutex);
dec_attach_in_progress_locked(cs);
+ clear_attach_data(&src_cs_head, true);
if (cs->dl_bw_cpu >= 0)
dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
@@ -3224,7 +3273,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
struct task_struct *task;
struct cgroup_subsys_state *css;
struct cpuset *cs;
- struct cpuset *oldcs = cpuset_attach_old_cs;
cgroup_taskset_first(tset, &css);
cs = css_cs(css);
@@ -3232,9 +3280,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
mutex_lock(&cpuset_mutex);
queue_task_work = false;
-
- attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
- attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
/*
@@ -3256,10 +3301,10 @@ static void cpuset_attach(struct cgroup_taskset *tset)
if (cs->nr_migrate_dl_tasks) {
cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
- oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
reset_migrate_dl_data(cs);
}
+ clear_attach_data(&src_cs_head, false);
dec_attach_in_progress_locked(cs);
mutex_unlock(&cpuset_mutex);
@@ -3777,6 +3822,7 @@ int __init cpuset_init(void)
cpumask_setall(top_cpuset.effective_xcpus);
cpumask_setall(top_cpuset.exclusive_cpus);
nodes_setall(top_cpuset.effective_mems);
+ init_llist_node(&top_cpuset.attach_node);
cpuset1_init(&top_cpuset);
--
2.54.0
* [PATCH v7 9/9] cgroup/cpuset: Support multiple destination cpusets for cpuset_*attach()
From: Waiman Long @ 2026-06-21 3:28 UTC (permalink / raw)
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
Li Zefan, Farhad Alemi, Andrew Morton
Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Gregory Price,
David Hildenbrand, Waiman Long
The only case where the cgroup_taskset structure requires task migration
to multiple cpusets is when enabling a cpuset controller in cgroup v2,
where the newly created child cpusets inherit the same effective CPUs
and memory nodes from the parent. In that case, task migration can happen
directly with no update to the tasks' CPU and memory node assignments.
No further work is needed from the cpuset side except updating
nr_deadline_tasks when DL tasks are involved and setting old_mems_allowed
in the child cpusets.
Do that by tracking all the destination cpusets with a new dst_cs_head
singly linked list. As with the source list, attach_in_progress is
incremented when a cpuset is inserted into the list.
It is assumed that a given cpuset cannot be both a source and a
destination cpuset. If that happens, or if there are multiple
destination cpusets with CPU or memory node changes, the current code
will not handle it correctly. In these unexpected cases, print a warning
and fail the attach operation. The code will have to be enhanced to
support them if such use cases turn out to be valid and not coding bugs.
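The two rejection conditions above can be sketched in user space as
follows. struct toy_attach_state and both helpers are hypothetical names
used only for illustration; the patch itself performs these tests with
WARN_ON_ONCE() on the cpuset fields and the attach_* globals.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state gathered while walking the taskset. */
struct toy_attach_state {
	bool v2;		/* cgroup v2 cpuset semantics in effect */
	bool many_dest;		/* more than one destination cpuset seen */
	bool cpus_updated;	/* some source/destination CPUs differ */
	bool mems_updated;	/* some source/destination nodes differ */
};

/* A cpuset recorded in one role may not show up in the other role. */
int toy_check_roles(bool dst_is_source, bool src_is_destination)
{
	return (dst_is_source || src_is_destination) ? -EINVAL : 0;
}

/*
 * Multiple destinations are accepted only for the "enable the v2
 * controller" case, where nothing changes for the migrated tasks.
 */
int toy_check_many_dest(const struct toy_attach_state *st)
{
	if (st->many_dest &&
	    (!st->v2 || st->cpus_updated || st->mems_updated))
		return -EINVAL;
	return 0;
}
```

A single-destination attach passes toy_check_many_dest() regardless of
the other fields, which matches the pre-existing behavior for ordinary
migrations.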
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset-internal.h | 1 +
kernel/cgroup/cpuset.c | 121 +++++++++++++++++++-------------
2 files changed, 75 insertions(+), 47 deletions(-)
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index 011993b1f756..900e74ac3538 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -151,6 +151,7 @@ struct cpuset {
*/
struct llist_node attach_node;
int attach_in_progress;
+ bool attach_source;
/* partition root state */
int partition_root_state;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c2d172873166..aff86acea701 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2986,11 +2986,16 @@ static int update_prstate(struct cpuset *cs, int new_prs)
*
* The attach_cpus_updated/attach_mems_updated flags are set in either
* cpuset_can_attach() or cpuset_fork() and used in cpuset_attach_task().
+ *
+ * The attach_many_dest_cs flag is set when there are multiple destination
+ * cpusets for task migration.
*/
static struct cpuset *cpuset_attach_old_cs;
static LLIST_HEAD(src_cs_head);
+static LLIST_HEAD(dst_cs_head);
static bool attach_cpus_updated;
static bool attach_mems_updated;
+static bool attach_many_dest_cs;
/*
* Check to see if a cpuset can accept a new task
@@ -3013,9 +3018,25 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
if (!oldcs)
return 0;
+ /*
+ * The same cpuset cannot be both a source and a destination.
+ * The current code does not support that, print a warning and
+ * fail the attach if so.
+ */
+ if (WARN_ON_ONCE((!oldcs->attach_source &&
+ llist_on_list(&oldcs->attach_node)) ||
+ cs->attach_source))
+ return -EINVAL;
+
if (!llist_on_list(&oldcs->attach_node)) {
llist_add(&oldcs->attach_node, &src_cs_head);
oldcs->attach_in_progress++;
+ oldcs->attach_source = true;
+ }
+
+ if (!llist_on_list(&cs->attach_node)) {
+ llist_add(&cs->attach_node, &dst_cs_head);
+ cs->attach_in_progress++;
}
cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
@@ -3046,35 +3067,31 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
return 0;
}
-static int cpuset_reserve_dl_bw(struct cpuset *cs)
+static int cpuset_reserve_dl_bw(void)
{
+ struct cpuset *cs;
int cpu, ret;
- if (!cs->sum_migrate_dl_bw)
- return 0;
+ llist_for_each_entry(cs, dst_cs_head.first, attach_node) {
+ if (!cs->sum_migrate_dl_bw)
+ continue;
- cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
- if (unlikely(cpu >= nr_cpu_ids))
- return -EINVAL;
+ cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+ if (unlikely(cpu >= nr_cpu_ids))
+ return -EINVAL;
- ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
- if (ret)
- return ret;
+ ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+ if (ret)
+ return ret;
- cs->dl_bw_cpu = cpu;
+ cs->dl_bw_cpu = cpu;
+ }
return 0;
}
-static void reset_migrate_dl_data(struct cpuset *cs)
-{
- cs->nr_migrate_dl_tasks = 0;
- cs->sum_migrate_dl_bw = 0;
- cs->dl_bw_cpu = -1;
-}
-
/*
* Clear and optionally apply (@cancel is false) the attach related data in the
- * source cpusets.
+ * source or destination cpuset.
*/
static void clear_attach_data(struct llist_head *head, bool cancel)
{
@@ -3086,9 +3103,14 @@ static void clear_attach_data(struct llist_head *head, bool cancel)
if (cs->nr_migrate_dl_tasks) {
if (!cancel)
cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+ else if (cs->dl_bw_cpu >= 0) /* && cancel */
+ dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
cs->nr_migrate_dl_tasks = 0;
+ cs->sum_migrate_dl_bw = 0;
+ cs->dl_bw_cpu = -1;
}
dec_attach_in_progress_locked(cs);
+ cs->attach_source = false;
}
}
@@ -3109,6 +3131,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
mutex_lock(&cpuset_mutex);
attach_cpus_updated = false;
attach_mems_updated = false;
+ attach_many_dest_cs = false;
/* Check to see if task is allowed in the cpuset */
ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
@@ -3133,9 +3156,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* selected as cpuset_attach_old_cs.
*/
cgroup_taskset_for_each(task, css, tset) {
+ struct cpuset *new_cs = css_cs(css);
struct cpuset *new_oldcs = task_cs(task);
- if (new_oldcs != oldcs) {
+ if ((new_oldcs != oldcs) || (new_cs != cs)) {
+ if (new_cs != cs)
+ attach_many_dest_cs = true;
+ cs = new_cs;
oldcs = new_oldcs;
ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
if (ret)
@@ -3169,14 +3196,28 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
}
}
- ret = cpuset_reserve_dl_bw(cs);
+ /*
+ * The only case where there are multiple destination cpusets for
+ * task migration is when enabling a v2 cpuset controller where
+ * tasks will be migrated to multiple child cpusets from a parent
+ * cpuset with the same effective CPUs and memory nodes. IOW,
+ * both attach_cpus_updated and attach_mems_updated should be false.
+ * If not, it is a condition that the current code cannot handle.
+ * Print a warning and abort the attach operation as further code
+ * change will be needed.
+ */
+ if (WARN_ON_ONCE(attach_many_dest_cs && (!cpuset_v2() ||
+ attach_cpus_updated || attach_mems_updated))) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ ret = cpuset_reserve_dl_bw();
out_unlock:
if (ret) {
- reset_migrate_dl_data(cs);
clear_attach_data(&src_cs_head, true);
- } else {
- cs->attach_in_progress++;
+ clear_attach_data(&dst_cs_head, true);
}
mutex_unlock(&cpuset_mutex);
@@ -3185,22 +3226,9 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
static void cpuset_cancel_attach(struct cgroup_taskset *tset)
{
- struct cgroup_subsys_state *css;
- struct cpuset *cs;
-
- cgroup_taskset_first(tset, &css);
- cs = css_cs(css);
-
mutex_lock(&cpuset_mutex);
- dec_attach_in_progress_locked(cs);
clear_attach_data(&src_cs_head, true);
-
- if (cs->dl_bw_cpu >= 0)
- dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
-
- if (cs->nr_migrate_dl_tasks)
- reset_migrate_dl_data(cs);
-
+ clear_attach_data(&dst_cs_head, true);
mutex_unlock(&cpuset_mutex);
}
@@ -3286,26 +3314,25 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* In the default hierarchy, enabling cpuset in the child cgroups
* will trigger a cpuset_attach() call with no change in effective cpus
* and mems. In that case, we can optimize out by skipping the task
- * iteration and update.
+ * iteration and update, but the destination cpuset list is iterated to
+ * set old_mems_allowed.
*/
- if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated)
+ if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated) {
+ llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+ cs->old_mems_allowed = cpuset_attach_nodemask_to;
goto out;
+ }
+ /* Task iteration shouldn't happen with attach_many_dest_cs set */
cgroup_taskset_for_each(task, css, tset)
cpuset_attach_task(cs, task);
-out:
if (queue_task_work)
schedule_flush_migrate_mm();
cs->old_mems_allowed = cpuset_attach_nodemask_to;
-
- if (cs->nr_migrate_dl_tasks) {
- cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
- reset_migrate_dl_data(cs);
- }
-
+out:
clear_attach_data(&src_cs_head, false);
- dec_attach_in_progress_locked(cs);
+ clear_attach_data(&dst_cs_head, false);
mutex_unlock(&cpuset_mutex);
}
--
2.54.0