Linux cgroups development
* [PATCH cgroup/for-next 0/4] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
@ 2026-05-14 17:02 Waiman Long
  2026-05-14 17:02 ` [PATCH cgroup/for-next 1/4] cgroup/cpuset: Add an alloc_dl_bw() helper Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Waiman Long @ 2026-05-14 17:02 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný
  Cc: cgroups, linux-kernel, Dietmar Eggemann, Aaron Tomlin, Juri Lelli,
	Waiman Long

A Sashiko AI review of another cpuset patch found that cpuset_attach()
and cpuset_can_attach() can be passed a cgroup_taskset with tasks
migrating from one source cpuset to multiple destination cpusets, and
vice versa.  Further testing of the cpuset code confirms that this is
indeed the case when the v2 cpuset controller is enabled or disabled.

Unfortunately, cpuset_attach() and cpuset_can_attach() still assume that
there will be one source and one destination cpuset, which may result in
incorrect behavior. This patch series fixes that issue. The first three
patches are preparatory, making the last patch, which fixes the problem,
easier to review.

Waiman Long (4):
  cgroup/cpuset: Add an alloc_dl_bw() helper
  cgroup/cpuset: Expand the scope of cpuset_can_attach_check()
  cgroup/cpuset: Optimize cpuset_attach_task()
  cgroup/cpuset: Support multiple source/destination cpusets for
    cpuset_*attach()

 kernel/cgroup/cpuset-internal.h |   6 +
 kernel/cgroup/cpuset.c          | 315 +++++++++++++++++++++++---------
 2 files changed, 230 insertions(+), 91 deletions(-)

-- 
2.54.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH cgroup/for-next 1/4] cgroup/cpuset: Add an alloc_dl_bw() helper
  2026-05-14 17:02 [PATCH cgroup/for-next 0/4] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
@ 2026-05-14 17:02 ` Waiman Long
  2026-05-14 17:02 ` [PATCH cgroup/for-next 2/4] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2026-05-14 17:02 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný
  Cc: cgroups, linux-kernel, Dietmar Eggemann, Aaron Tomlin, Juri Lelli,
	Waiman Long

Extract the deadline (DL) bandwidth allocation code in cpuset_can_attach()
into a new alloc_dl_bw() helper to simplify the code.

No functional change is expected.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 50 +++++++++++++++++++++++-------------------
 1 file changed, 28 insertions(+), 22 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bcefc9f50ac5..9de3c907436f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2980,6 +2980,25 @@ static int cpuset_can_attach_check(struct cpuset *cs)
 	return 0;
 }
 
+static int alloc_dl_bw(struct cpuset *cs)
+{
+	int cpu, ret;
+
+	if (!cs->sum_migrate_dl_bw)
+		return 0;
+
+	cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+	if (unlikely(cpu >= nr_cpu_ids))
+		return -EINVAL;
+
+	ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+	if (ret)
+		return ret;
+
+	cs->dl_bw_cpu = cpu;
+	return 0;
+}
+
 static void reset_migrate_dl_data(struct cpuset *cs)
 {
 	cs->nr_migrate_dl_tasks = 0;
@@ -2994,7 +3013,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	struct cpuset *cs, *oldcs;
 	struct task_struct *task;
 	bool setsched_check;
-	int cpu, ret;
+	int ret;
 
 	/* used later by cpuset_attach() */
 	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
@@ -3050,31 +3069,18 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 		}
 	}
 
-	if (!cs->sum_migrate_dl_bw)
-		goto out_success;
-
-	cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
-	if (unlikely(cpu >= nr_cpu_ids)) {
-		ret = -EINVAL;
-		goto out_unlock;
-	}
-
-	ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
-	if (ret)
-		goto out_unlock;
-
-	cs->dl_bw_cpu = cpu;
-
-out_success:
-	/*
-	 * Mark attach is in progress.  This makes validate_change() fail
-	 * changes which zero cpus/mems_allowed.
-	 */
-	cs->attach_in_progress++;
+	ret = alloc_dl_bw(cs);
 
 out_unlock:
 	if (ret)
 		reset_migrate_dl_data(cs);
+	else
+		/*
+		 * Mark attach is in progress.  This makes validate_change() fail
+		 * changes which zero cpus/mems_allowed.
+		 */
+		cs->attach_in_progress++;
+
 	mutex_unlock(&cpuset_mutex);
 	return ret;
 }
-- 
2.54.0



* [PATCH cgroup/for-next 2/4] cgroup/cpuset: Expand the scope of cpuset_can_attach_check()
  2026-05-14 17:02 [PATCH cgroup/for-next 0/4] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
  2026-05-14 17:02 ` [PATCH cgroup/for-next 1/4] cgroup/cpuset: Add an alloc_dl_bw() helper Waiman Long
@ 2026-05-14 17:02 ` Waiman Long
  2026-05-14 17:02 ` [PATCH cgroup/for-next 3/4] cgroup/cpuset: Optimize cpuset_attach_task() Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2026-05-14 17:02 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný
  Cc: cgroups, linux-kernel, Dietmar Eggemann, Aaron Tomlin, Juri Lelli,
	Waiman Long

Expand the scope of cpuset_can_attach_check() by moving the setting of
the setsched flag into it via the new @oldcs and @psetsched arguments.
As cpuset_can_attach_check() is also called from cpuset_can_fork(), that
caller passes NULL for the new arguments.

While at it, expose the source/destination cpuset CPU and memory
comparison results in the new attach_cpus_updated and attach_mems_updated
static flags so that cpuset_attach() can use them directly without
redoing the same computations.

No functional change is expected.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 69 +++++++++++++++++++++++++-----------------
 1 file changed, 41 insertions(+), 28 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9de3c907436f..68392cf6429b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2964,19 +2964,55 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	return 0;
 }
 
+/*
+ * cpuset_can_attach() and cpuset_attach() specific internal data
+ */
 static struct cpuset *cpuset_attach_old_cs;
+static bool attach_cpus_updated;
+static bool attach_mems_updated;
 
 /*
  * Check to see if a cpuset can accept a new task
  * For v1, cpus_allowed and mems_allowed can't be empty.
  * For v2, effective_cpus can't be empty.
  * Note that in v1, effective_cpus = cpus_allowed.
+ *
+ * Also set the boolean flag passed in by @psetsched depending on if
+ * security_task_setscheduler() call is needed and @oldcs is not NULL.
  */
-static int cpuset_can_attach_check(struct cpuset *cs)
+static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
+				   bool *psetsched)
 {
 	if (cpumask_empty(cs->effective_cpus) ||
 	   (!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
 		return -ENOSPC;
+
+	if (!oldcs)
+		return 0;
+
+	/*
+	 * Update attach specific data
+	 */
+	attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
+	attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+	/*
+	 * Skip rights over task setsched check in v2 when nothing changes,
+	 * migration permission derives from hierarchy ownership in
+	 * cgroup_procs_write_permission()).
+	 */
+	*psetsched = !cpuset_v2() || attach_cpus_updated || attach_mems_updated;
+
+	/*
+	 * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
+	 * brings the last online CPU offline as users are not allowed to empty
+	 * cpuset.cpus when there are active tasks inside. When that happens,
+	 * we should allow tasks to migrate out without security check to make
+	 * sure they will be able to run after migration.
+	 */
+	if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus))
+		*psetsched = false;
+
 	return 0;
 }
 
@@ -3023,29 +3059,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	mutex_lock(&cpuset_mutex);
 
 	/* Check to see if task is allowed in the cpuset */
-	ret = cpuset_can_attach_check(cs);
+	ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
 	if (ret)
 		goto out_unlock;
 
-	/*
-	 * Skip rights over task setsched check in v2 when nothing changes,
-	 * migration permission derives from hierarchy ownership in
-	 * cgroup_procs_write_permission()).
-	 */
-	setsched_check = !cpuset_v2() ||
-		!cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
-		!nodes_equal(cs->effective_mems, oldcs->effective_mems);
-
-	/*
-	 * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
-	 * brings the last online CPU offline as users are not allowed to empty
-	 * cpuset.cpus when there are active tasks inside. When that happens,
-	 * we should allow tasks to migrate out without security check to make
-	 * sure they will be able to run after migration.
-	 */
-	if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus))
-		setsched_check = false;
-
 	cgroup_taskset_for_each(task, css, tset) {
 		ret = task_can_attach(task);
 		if (ret)
@@ -3139,7 +3156,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
 	struct cpuset *oldcs = cpuset_attach_old_cs;
-	bool cpus_updated, mems_updated;
 	bool queue_task_work = false;
 
 	cgroup_taskset_first(tset, &css);
@@ -3147,9 +3163,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	lockdep_assert_cpus_held();	/* see cgroup_attach_lock() */
 	mutex_lock(&cpuset_mutex);
-	cpus_updated = !cpumask_equal(cs->effective_cpus,
-				      oldcs->effective_cpus);
-	mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
 
 	/*
 	 * In the default hierarchy, enabling cpuset in the child cgroups
@@ -3157,7 +3170,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * in effective cpus and mems. In that case, we can optimize out
 	 * by skipping the task iteration and update.
 	 */
-	if (cpuset_v2() && !cpus_updated && !mems_updated) {
+	if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated) {
 		cpuset_attach_nodemask_to = cs->effective_mems;
 		goto out;
 	}
@@ -3174,7 +3187,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * not set.
 	 */
 	cpuset_attach_nodemask_to = cs->effective_mems;
-	if (!is_memory_migrate(cs) && !mems_updated)
+	if (!is_memory_migrate(cs) && !attach_mems_updated)
 		goto out;
 
 	cgroup_taskset_for_each_leader(leader, css, tset) {
@@ -3589,7 +3602,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
 	mutex_lock(&cpuset_mutex);
 
 	/* Check to see if task is allowed in the cpuset */
-	ret = cpuset_can_attach_check(cs);
+	ret = cpuset_can_attach_check(cs, NULL, NULL);
 	if (ret)
 		goto out_unlock;
 
-- 
2.54.0



* [PATCH cgroup/for-next 3/4] cgroup/cpuset: Optimize cpuset_attach_task()
  2026-05-14 17:02 [PATCH cgroup/for-next 0/4] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
  2026-05-14 17:02 ` [PATCH cgroup/for-next 1/4] cgroup/cpuset: Add an alloc_dl_bw() helper Waiman Long
  2026-05-14 17:02 ` [PATCH cgroup/for-next 2/4] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
@ 2026-05-14 17:02 ` Waiman Long
  2026-05-14 17:02 ` [PATCH cgroup/for-next 4/4] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
  2026-05-14 21:46 ` [PATCH cgroup/for-next 0/4] " Tejun Heo
  4 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2026-05-14 17:02 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný
  Cc: cgroups, linux-kernel, Dietmar Eggemann, Aaron Tomlin, Juri Lelli,
	Waiman Long

Within cpuset_attach(), cpuset_attach_task() is called only if the CPU
and/or the memory settings are updated. If only one of the two is
updated, cpuset_attach_task() still updates both the CPU and the memory
node settings of each task. Further optimize it by checking
attach_cpus_updated and attach_mems_updated for v2 to skip the
unnecessary update.

While at it, also move the mpol_rebind_mm() call for mm group leader
to cpuset_attach_task(). This change shouldn't affect the cpuset_fork()
caller as the newly cloned task isn't the group leader. For that caller,
it is assumed that both CPU and memory nodes are updated to keep the
existing behavior.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 68392cf6429b..8ced1fa0900f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3132,8 +3132,13 @@ static nodemask_t cpuset_attach_nodemask_to;
 
 static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
 {
+	struct mm_struct *mm;
+
 	lockdep_assert_cpuset_lock_held();
 
+	if (cpuset_v2() && !attach_cpus_updated)
+		goto update_mem;
+
 	if (cs != &top_cpuset)
 		guarantee_active_cpus(task, cpus_attach);
 	else
@@ -3145,8 +3150,21 @@ static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
 	 */
 	WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
 
+update_mem:
+	if (cpuset_v2() && !attach_mems_updated)
+		return;
+
 	cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
 	cpuset1_update_task_spread_flags(cs, task);
+
+	if (task != task->group_leader)
+		return;
+
+	mm = get_task_mm(task);
+	if (mm) {
+		mpol_rebind_mm(mm, &cs->effective_mems);
+		mmput(mm);
+	}
 }
 
 static void cpuset_attach(struct cgroup_taskset *tset)
@@ -3187,15 +3205,13 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * not set.
 	 */
 	cpuset_attach_nodemask_to = cs->effective_mems;
-	if (!is_memory_migrate(cs) && !attach_mems_updated)
+	if (!is_memory_migrate(cs))
 		goto out;
 
 	cgroup_taskset_for_each_leader(leader, css, tset) {
 		struct mm_struct *mm = get_task_mm(leader);
 
 		if (mm) {
-			mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
-
 			/*
 			 * old_mems_allowed is the same with mems_allowed
 			 * here, except if this task is being moved
@@ -3204,18 +3220,15 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 			 * @old_mems_allowed is the right nodesets that we
 			 * migrate mm from.
 			 */
-			if (is_memory_migrate(cs)) {
-				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
-						  &cpuset_attach_nodemask_to);
-				queue_task_work = true;
-			} else
-				mmput(mm);
+			cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
+					  &cpuset_attach_nodemask_to);
+			queue_task_work = true;
 		}
 	}
 
-out:
 	if (queue_task_work)
 		schedule_flush_migrate_mm();
+out:
 	cs->old_mems_allowed = cpuset_attach_nodemask_to;
 
 	if (cs->nr_migrate_dl_tasks) {
@@ -3666,7 +3679,10 @@ static void cpuset_fork(struct task_struct *task)
 	/* CLONE_INTO_CGROUP */
 	mutex_lock(&cpuset_mutex);
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+	/* Assume CPUs and memory nodes are updated */
+	attach_cpus_updated = attach_mems_updated = true;
 	cpuset_attach_task(cs, task);
+	attach_cpus_updated = attach_mems_updated = false;
 
 	dec_attach_in_progress_locked(cs);
 	mutex_unlock(&cpuset_mutex);
-- 
2.54.0



* [PATCH cgroup/for-next 4/4] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
  2026-05-14 17:02 [PATCH cgroup/for-next 0/4] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
                   ` (2 preceding siblings ...)
  2026-05-14 17:02 ` [PATCH cgroup/for-next 3/4] cgroup/cpuset: Optimize cpuset_attach_task() Waiman Long
@ 2026-05-14 17:02 ` Waiman Long
  2026-05-14 21:46 ` [PATCH cgroup/for-next 0/4] " Tejun Heo
  4 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2026-05-14 17:02 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný
  Cc: cgroups, linux-kernel, Dietmar Eggemann, Aaron Tomlin, Juri Lelli,
	Waiman Long

With cgroup v2, the cgroup_taskset structure passed into the cgroup
can_attach() and attach() methods can contain task migration data with
multiple destination or source cpusets when the cpuset controller is
enabled or disabled respectively.

Since the cpuset controller is threaded, another possible way to cause a
many-to-one migration is to move a whole process with multiple threads
in different cpuset-enabled threaded cgroups into another cpuset-enabled
cgroup. Alternatively, multiple processes from different cpusets can be
written into cgroup.procs as a single operation.

The current cpuset_can_attach() and cpuset_attach() functions still
expect task migration is from one source cpuset to one destination
cpuset. This has been the case since cpuset was enabled for cgroup v2
in commit 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default
hierarchy").

This problem is less of an issue when enabling the cpuset controller, as
all the newly created child cpusets will have exactly the same set of
CPUs and memory nodes, except when deadline tasks are involved in the
migration, in which case the deadline task accounting data can be off.

It can be more problematic when the cpuset controller is disabled, as
the child cpusets' sets of CPUs and memory nodes may differ from their
parent's, or when moving a multi-threaded process across different
threaded cgroups.

Fix that by tracking the set of source (old) and destination cpusets in
singly linked lists and iterating over all of them to properly update
the internal data. Also keep the current cs and oldcs variables up to
date with the css and task iterators. cpuset_attach_old_cs is dropped
as the old cpusets are now tracked on a list.

To ensure proper DL task accounting, nr_migrate_dl_tasks is incremented
in the destination cpuset and decremented in the source cpuset for each
migrating DL task, and these values are folded into nr_deadline_tasks
when the migration is successful.

Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset-internal.h |   6 +
 kernel/cgroup/cpuset.c          | 206 +++++++++++++++++++++++---------
 2 files changed, 158 insertions(+), 54 deletions(-)

diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index f7aaf01f7cd5..4c2772a7fd5e 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -161,6 +161,12 @@ struct cpuset {
 	 */
 	bool remote_partition;
 
+	/*
+	 * cpuset_can_attach() and cpuset_attach() specific data
+	 */
+	bool			attach_node_in_llist;
+	struct llist_node	attach_node;
+
 	/*
 	 * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
 	 * know when to rebuild associated root domain bandwidth information.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8ced1fa0900f..c46454b29d74 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -37,6 +37,7 @@
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 #include <linux/task_work.h>
+#include <linux/llist.h>
 
 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -2967,7 +2968,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 /*
  * cpuset_can_attach() and cpuset_attach() specific internal data
  */
-static struct cpuset *cpuset_attach_old_cs;
+static LLIST_HEAD(src_cs_head);
+static LLIST_HEAD(dst_cs_head);
 static bool attach_cpus_updated;
 static bool attach_mems_updated;
 
@@ -2980,9 +2982,10 @@ static bool attach_mems_updated;
  * Also set the boolean flag passed in by @psetsched depending on if
  * security_task_setscheduler() call is needed and @oldcs is not NULL.
  */
-static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
-				   bool *psetsched)
+static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs, bool *psetsched)
 {
+	bool cpu_match, mem_match;
+
 	if (cpumask_empty(cs->effective_cpus) ||
 	   (!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
 		return -ENOSPC;
@@ -2993,15 +2996,34 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
 	/*
 	 * Update attach specific data
 	 */
-	attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
-	attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+	if (!cs->attach_node_in_llist) {
+		llist_add(&cs->attach_node, &dst_cs_head);
+		cs->attach_node_in_llist = true;
+	}
+	if (!oldcs->attach_node_in_llist) {
+		llist_add(&oldcs->attach_node, &src_cs_head);
+		oldcs->attach_node_in_llist = true;
+	}
+
+	cpu_match = cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
+	mem_match = nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+	/*
+	 * Set the updated flags whenever there is a mismatch in any of the
+	 * src/dst pairs.
+	 */
+	if (!attach_cpus_updated)
+		attach_cpus_updated = !cpu_match;
+
+	if (!attach_mems_updated)
+		attach_mems_updated = !mem_match;
 
 	/*
 	 * Skip rights over task setsched check in v2 when nothing changes,
 	 * migration permission derives from hierarchy ownership in
 	 * cgroup_procs_write_permission()).
 	 */
-	*psetsched = !cpuset_v2() || attach_cpus_updated || attach_mems_updated;
+	*psetsched = !cpuset_v2() || !cpu_match || !mem_match;
 
 	/*
 	 * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
@@ -3016,33 +3038,105 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
 	return 0;
 }
 
-static int alloc_dl_bw(struct cpuset *cs)
+/*
+ * If reset_dl_bw is set, reset the previous dl_bw_alloc() call. Otherwise,
+ * update nr_deadline_tasks according to nr_migrate_dl_tasks in both source
+ * and destination cpusets.
+ */
+static void clear_attach_data(bool reset_dl_bw)
+{
+	struct cpuset *cs, *next;
+
+	llist_for_each_entry_safe(cs, next, src_cs_head.first, attach_node) {
+		cs->attach_node.next = NULL;
+		cs->attach_node_in_llist = false;
+		if (cs->nr_migrate_dl_tasks && !reset_dl_bw)
+			cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+		cs->nr_migrate_dl_tasks = 0;
+	}
+
+	llist_for_each_entry_safe(cs, next, dst_cs_head.first, attach_node) {
+		cs->attach_node.next = NULL;
+		cs->attach_node_in_llist = false;
+		if (reset_dl_bw && cs->dl_bw_cpu >= 0)
+			dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
+		if (cs->nr_migrate_dl_tasks && !reset_dl_bw)
+			cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+		cs->nr_migrate_dl_tasks = 0;
+		cs->sum_migrate_dl_bw = 0;
+		cs->dl_bw_cpu = -1;
+	}
+
+	src_cs_head.first = NULL;
+	dst_cs_head.first = NULL;
+	attach_cpus_updated = false;
+	attach_mems_updated = false;
+}
+
+static int alloc_dl_bw(void)
 {
+	struct cpuset *cs;
 	int cpu, ret;
 
-	if (!cs->sum_migrate_dl_bw)
-		return 0;
+	llist_for_each_entry(cs, dst_cs_head.first, attach_node) {
+		if (!cs->sum_migrate_dl_bw)
+			continue;
 
-	cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
-	if (unlikely(cpu >= nr_cpu_ids))
-		return -EINVAL;
+		cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+		if (unlikely(cpu >= nr_cpu_ids))
+			return -EINVAL;
 
-	ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
-	if (ret)
-		return ret;
+		ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+		if (ret)
+			return ret;
 
-	cs->dl_bw_cpu = cpu;
+		cs->dl_bw_cpu = cpu;
+	}
 	return 0;
 }
 
-static void reset_migrate_dl_data(struct cpuset *cs)
+static void set_attach_in_progress(void)
 {
-	cs->nr_migrate_dl_tasks = 0;
-	cs->sum_migrate_dl_bw = 0;
-	cs->dl_bw_cpu = -1;
+	struct cpuset *cs;
+
+	/*
+	 * Mark attach is in progress.  This makes validate_change() fail
+	 * changes which zero cpus/mems_allowed.
+	 */
+	llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+		cs->attach_in_progress++;
 }
 
-/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
+static void reset_attach_in_progress(void)
+{
+	struct cpuset *cs;
+
+	llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+		dec_attach_in_progress_locked(cs);
+}
+
+/*
+ * Called by cgroups to determine if a cpuset is usable; cpuset_mutex held.
+ *
+ * With cgroup v2, enabling of cpuset controller in a cgroup subtree can
+ * cause @tset to contain task migration data from one parent cpuset to multiple
+ * child cpusets. Not much is needed to be done here other than tracking the
+ * number of DL tasks in each cpuset as the CPUs and memory nodes of the child
+ * cpusets are exactly the same as the parent.
+ *
+ * Conversely, disabling of cpuset controller can cause @tset to contain task
+ * migration data from multiple child cpusets to one parent cpuset. Here, the
+ * CPUs and memory nodes of the child cpusets may be different from the parent,
+ * but must be a subset of its parent.
+ *
+ * Another possible many-to-one migration is the moving of the whole
+ * multithreaded process with threads in different cpusets to another cpuset.
+ * Alternatively, multiple processes from multiple cpusets can be moved to
+ * another cpuset in a single operation.
+ *
+ * For all other use cases including cgroup v1, @tset task migration data
+ * should be from one source cpuset to one destination cpuset.
+ */
 static int cpuset_can_attach(struct cgroup_taskset *tset)
 {
 	struct cgroup_subsys_state *css;
@@ -3052,8 +3146,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	int ret;
 
 	/* used later by cpuset_attach() */
-	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
-	oldcs = cpuset_attach_old_cs;
+	oldcs = task_cs(cgroup_taskset_first(tset, &css));
 	cs = css_cs(css);
 
 	mutex_lock(&cpuset_mutex);
@@ -3064,6 +3157,16 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 		goto out_unlock;
 
 	cgroup_taskset_for_each(task, css, tset) {
+		struct cpuset *newcs = css_cs(css);
+		struct cpuset *new_oldcs = task_cs(task);
+
+		if ((newcs != cs) || (new_oldcs != oldcs)) {
+			cs = newcs;
+			oldcs = new_oldcs;
+			ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
+			if (ret)
+				goto out_unlock;
+		}
 		ret = task_can_attach(task);
 		if (ret)
 			goto out_unlock;
@@ -3081,23 +3184,18 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 			 * contribute to sum_migrate_dl_bw.
 			 */
 			cs->nr_migrate_dl_tasks++;
+			oldcs->nr_migrate_dl_tasks--;
 			if (dl_task_needs_bw_move(task, cs->effective_cpus))
 				cs->sum_migrate_dl_bw += task->dl.dl_bw;
 		}
 	}
 
-	ret = alloc_dl_bw(cs);
-
+	ret = alloc_dl_bw();
 out_unlock:
 	if (ret)
-		reset_migrate_dl_data(cs);
+		clear_attach_data(true);
 	else
-		/*
-		 * Mark attach is in progress.  This makes validate_change() fail
-		 * changes which zero cpus/mems_allowed.
-		 */
-		cs->attach_in_progress++;
-
+		set_attach_in_progress();
 	mutex_unlock(&cpuset_mutex);
 	return ret;
 }
@@ -3111,14 +3209,8 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 	cs = css_cs(css);
 
 	mutex_lock(&cpuset_mutex);
-	dec_attach_in_progress_locked(cs);
-
-	if (cs->dl_bw_cpu >= 0)
-		dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
-
-	if (cs->nr_migrate_dl_tasks)
-		reset_migrate_dl_data(cs);
-
+	reset_attach_in_progress();
+	clear_attach_data(true);
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -3172,8 +3264,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	struct task_struct *task;
 	struct task_struct *leader;
 	struct cgroup_subsys_state *css;
-	struct cpuset *cs;
-	struct cpuset *oldcs = cpuset_attach_old_cs;
+	struct cpuset *cs, *oldcs;
 	bool queue_task_work = false;
 
 	cgroup_taskset_first(tset, &css);
@@ -3184,9 +3275,9 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	/*
 	 * In the default hierarchy, enabling cpuset in the child cgroups
-	 * will trigger a number of cpuset_attach() calls with no change
-	 * in effective cpus and mems. In that case, we can optimize out
-	 * by skipping the task iteration and update.
+	 * will trigger a cpuset_attach() call with no change in effective cpus
+	 * and mems. In that case, we can optimize out by skipping the task
+	 * iteration and update.
 	 */
 	if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated) {
 		cpuset_attach_nodemask_to = cs->effective_mems;
@@ -3195,8 +3286,16 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 
-	cgroup_taskset_for_each(task, css, tset)
+	cgroup_taskset_for_each(task, css, tset) {
+		struct cpuset *newcs = css_cs(css);
+
+		if (newcs != cs) {
+			cs->old_mems_allowed = cs->effective_mems;
+			cs = newcs;
+			guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+		}
 		cpuset_attach_task(cs, task);
+	}
 
 	/*
 	 * Change mm for all threadgroup leaders. This is expensive and may
@@ -3208,6 +3307,11 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	if (!is_memory_migrate(cs))
 		goto out;
 
+	/*
+	 * Only v1 supports memory_migrate and there should only be one source
+	 * and one destination cpuset.
+	 */
+	oldcs = llist_entry(src_cs_head.first, struct cpuset, attach_node);
 	cgroup_taskset_for_each_leader(leader, css, tset) {
 		struct mm_struct *mm = get_task_mm(leader);
 
@@ -3231,14 +3335,8 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 out:
 	cs->old_mems_allowed = cpuset_attach_nodemask_to;
 
-	if (cs->nr_migrate_dl_tasks) {
-		cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
-		oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
-		reset_migrate_dl_data(cs);
-	}
-
-	dec_attach_in_progress_locked(cs);
-
+	reset_attach_in_progress();
+	clear_attach_data(false);
 	mutex_unlock(&cpuset_mutex);
 }
 
-- 
2.54.0



* Re: [PATCH cgroup/for-next 0/4] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
  2026-05-14 17:02 [PATCH cgroup/for-next 0/4] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
                   ` (3 preceding siblings ...)
  2026-05-14 17:02 ` [PATCH cgroup/for-next 4/4] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
@ 2026-05-14 21:46 ` Tejun Heo
  4 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-14 21:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Johannes Weiner, Michal Koutný, cgroups,
	linux-kernel, Dietmar Eggemann, Aaron Tomlin, Juri Lelli

Hello,

Quick AI-assisted review pass; passing the points along for human eyes.

Patch 4:

- The leader loop comment says "Only v1 supports memory_migrate", but
  CS_MEMORY_MIGRATE is set unconditionally on v2 cpusets in
  cpuset_css_alloc(). With v2 controller-disable folding children with
  differing effective_mems into the parent, picking a single
  llist_entry(src_cs_head.first, ...) as oldcs passes the wrong source
  nodemask to cpuset_migrate_mm() for every leader whose actual source
  differs. Looks like the source needs to be looked up per leader.

- cs->old_mems_allowed updates are inconsistent across destinations: the
  mid-loop transition assigns cs->effective_mems (raw) while the tail
  assignment uses cpuset_attach_nodemask_to (after guarantee_online_mems).
  The v2 fast-path also updates only the first-task cs, leaving other
  destinations on dst_cs_head stale.

Patch 3:

- Changelog says "the newly cloned task isn't the group leader", but for
  CLONE_INTO_CGROUP without CLONE_THREAD the new task is its own
  group_leader, so the new mpol_rebind_mm() block in cpuset_attach_task()
  does run from cpuset_fork(). Either acknowledge as an incidental
  improvement or guard the new path.

Patch 1:

- alloc_dl_bw() reads confusingly next to the scheduler's dl_bw_alloc()
  while doing more (pick cpu, call dl_bw_alloc, record cs->dl_bw_cpu).
  Something like cpuset_reserve_dl_bw() would be clearer.

- The relocated "Mark attach is in progress" comment sits inside a
  braceless else; either move it above the if (ret) or brace both arms.

Patch 2 looked clean.

Thanks.

--
tejun


