From: Waiman Long <longman@redhat.com>
To: "Ridong Chen" <ridong.chen@linux.dev>,
"Tejun Heo" <tj@kernel.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Koutný" <mkoutny@suse.com>,
"Shuah Khan" <shuah@kernel.org>,
"Juri Lelli" <juri.lelli@redhat.com>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-kselftest@vger.kernel.org,
Aaron Tomlin <atomlin@atomlin.com>,
Guopeng Zhang <guopeng.zhang@linux.dev>,
Waiman Long <longman@redhat.com>
Subject: [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach()
Date: Mon, 29 Jun 2026 23:33:42 -0400
Message-ID: <20260630033344.352702-10-longman@redhat.com>
In-Reply-To: <20260630033344.352702-1-longman@redhat.com>
There are two scenarios in which the cgroup_taskset structure passed to
the cgroup can_attach() and attach() methods can contain tasks migrating
from multiple source cpusets:
- A multithreaded application whose threads are in different cpusets is
migrated as a whole into a new cpuset.
- Disabling the v2 cpuset controller moves all the tasks in child
cpusets to the parent cpuset.
The current cpuset_can_attach() and cpuset_attach() functions still
assume that all tasks migrate from a single source cpuset to a single
destination cpuset.
Fix that by tracking the set of source (old) cpusets in a singly linked
list (llist). Each source cpuset is added at most once, and the list is
walked when the attach operation completes, fails or is cancelled to
update or reset the per-cpuset data.
To ensure proper DL task accounting, nr_migrate_dl_tasks is incremented
in the destination cpuset and decremented in the source cpuset of each
migrating DL task. When the migration succeeds, these values are added
to nr_deadline_tasks of the respective cpusets; otherwise they are
simply reset.
The setting of the global attach_ctx.cpus_updated and
attach_ctx.mems_updated flags is also moved from cpuset_attach() to
cpuset_can_attach(), because the correct source cpuset can no longer be
determined in cpuset_attach(). This is safe because an earlier patch in
this series ensures that cpuset states cannot change between
cpuset_can_attach() and cpuset_attach().
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset-internal.h | 5 +++
kernel/cgroup/cpuset.c | 65 ++++++++++++++++++++++++++++-----
2 files changed, 60 insertions(+), 10 deletions(-)
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index df662c7fd1a4..e7d010661fd3 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -145,6 +145,11 @@ struct cpuset {
*/
nodemask_t old_mems_allowed;
+ /*
+ * For linking impacted cpusets during an attach operation.
+ */
+ struct llist_node attach_node;
+
/* partition root state */
int partition_root_state;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0b9df38e9a63..b201f4ba18b6 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -37,6 +37,7 @@
#include <linux/wait.h>
#include <linux/workqueue.h>
#include <linux/task_work.h>
+#include <linux/llist.h>
DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -368,6 +369,7 @@ static struct {
struct cpuset *old_cs; /* Source cpuset */
nodemask_t nodemask_to;
} attach_ctx;
+static LLIST_HEAD(src_cs_head);
/*
* Wait if task attach is in progress until it is done and then acquire
@@ -615,6 +617,7 @@ static struct cpuset *dup_or_alloc_cpuset(struct cpuset *cs)
return NULL;
trial->dl_bw_cpu = -1;
+ init_llist_node(&trial->attach_node);
/* Setup cpumask pointer array */
cpumask_var_t *pmask[4] = {
@@ -3032,6 +3035,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
bool *psetsched)
{
+ bool cpus_updated, mems_updated;
+
if (cpumask_empty(cs->effective_cpus) ||
(!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
return -ENOSPC;
@@ -3039,14 +3044,23 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
if (!oldcs)
return 0;
+ if (!llist_on_list(&oldcs->attach_node))
+ llist_add(&oldcs->attach_node, &src_cs_head);
+
+ cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
+ mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+ if (cpus_updated)
+ attach_ctx.cpus_updated = true;
+ if (mems_updated)
+ attach_ctx.mems_updated = true;
+
/*
* Skip rights over task setsched check in v2 when nothing changes,
* migration permission derives from hierarchy ownership in
* cgroup_procs_write_permission()).
*/
- *psetsched = !cpuset_v2() ||
- !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
- !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+ *psetsched = !cpuset_v2() || cpus_updated || mems_updated;
/*
* A v1 cpuset with tasks will have no CPU left only when CPU hotplug
@@ -3087,6 +3101,25 @@ static void reset_migrate_dl_data(struct cpuset *cs)
cs->dl_bw_cpu = -1;
}
+/*
+ * Apply (when @cancel is false) and then clear the attach-related data in
+ * the source cpusets.
+ */
+static void clear_attach_data(struct llist_head *head, bool cancel)
+{
+ struct cpuset *cs, *next;
+ struct llist_node *lnode = __llist_del_all(head);
+
+ llist_for_each_entry_safe(cs, next, lnode, attach_node) {
+ init_llist_node(&cs->attach_node);
+ if (cs->nr_migrate_dl_tasks) {
+ if (!cancel)
+ atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
+ cs->nr_migrate_dl_tasks = 0;
+ }
+ }
+}
+
/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
static int cpuset_can_attach(struct cgroup_taskset *tset)
{
@@ -3102,6 +3135,8 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
cs = css_cs(css);
mutex_lock(&cpuset_mutex);
+ attach_ctx.cpus_updated = false;
+ attach_ctx.mems_updated = false;
/* Check to see if task is allowed in the cpuset */
ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
@@ -3126,6 +3161,15 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* selected as attach_ctx.old_cs.
*/
cgroup_taskset_for_each(task, css, tset) {
+ struct cpuset *new_oldcs = task_cs(task);
+
+ if (new_oldcs != oldcs) {
+ oldcs = new_oldcs;
+ ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
+ if (ret)
+ goto out_unlock;
+ }
+
ret = task_can_attach(task);
if (ret)
goto out_unlock;
@@ -3147,6 +3191,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* contribute to sum_migrate_dl_bw.
*/
cs->nr_migrate_dl_tasks++;
+ oldcs->nr_migrate_dl_tasks--;
if (dl_task_needs_bw_move(task, cs->effective_cpus))
cs->sum_migrate_dl_bw += task->dl.dl_bw;
}
@@ -3155,10 +3200,12 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
ret = cpuset_reserve_dl_bw(cs);
out_unlock:
- if (ret)
+ if (ret) {
reset_migrate_dl_data(cs);
- else
+ clear_attach_data(&src_cs_head, true);
+ } else {
attach_ctx.in_progress++;
+ }
mutex_unlock(&cpuset_mutex);
return ret;
@@ -3174,6 +3221,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
mutex_lock(&cpuset_mutex);
dec_attach_in_progress_locked();
+ clear_attach_data(&src_cs_head, true);
if (cs->dl_bw_cpu >= 0)
dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
@@ -3251,7 +3299,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
struct task_struct *task;
struct cgroup_subsys_state *css;
struct cpuset *cs;
- struct cpuset *oldcs = attach_ctx.old_cs;
cgroup_taskset_first(tset, &css);
cs = css_cs(css);
@@ -3259,9 +3306,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
mutex_lock(&cpuset_mutex);
attach_ctx.task_work_queued = false;
-
- attach_ctx.cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
- attach_ctx.mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
guarantee_online_mems(cs, &attach_ctx.nodemask_to);
/*
@@ -3283,10 +3327,10 @@ static void cpuset_attach(struct cgroup_taskset *tset)
if (cs->nr_migrate_dl_tasks) {
atomic_add(cs->nr_migrate_dl_tasks, &cs->nr_deadline_tasks);
- atomic_sub(cs->nr_migrate_dl_tasks, &oldcs->nr_deadline_tasks);
reset_migrate_dl_data(cs);
}
+ clear_attach_data(&src_cs_head, false);
dec_attach_in_progress_locked();
mutex_unlock(&cpuset_mutex);
@@ -3793,6 +3837,7 @@ int __init cpuset_init(void)
cpumask_setall(top_cpuset.effective_xcpus);
cpumask_setall(top_cpuset.exclusive_cpus);
nodes_setall(top_cpuset.effective_mems);
+ init_llist_node(&top_cpuset.attach_node);
cpuset1_init(&top_cpuset);
--
2.54.0
Thread overview: 24+ messages
2026-06-30 3:33 [PATCH-next v9 00/11] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach() Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 01/11] cgroup/cpuset: Make nr_deadline_tasks an atomic_t Waiman Long
2026-06-30 14:01 ` Juri Lelli
2026-06-30 17:56 ` Waiman Long
2026-07-01 9:00 ` Juri Lelli
2026-07-01 1:19 ` Ridong Chen
2026-06-30 3:33 ` [PATCH-next v9 02/11] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach() Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 03/11] cgroup/cpuset: Prevent race between task attach and cpuset state change Waiman Long
2026-07-01 1:41 ` Ridong Chen
2026-07-01 20:19 ` Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 04/11] cgroup/cpuset: Put all task attach related variables into attach_ctx Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 05/11] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 06/11] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 07/11] cgroup/cpuset: Make attach_ctx.old_cs track task group leader Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 08/11] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Waiman Long
2026-07-01 2:14 ` Ridong Chen
2026-07-01 20:30 ` Waiman Long
2026-06-30 3:33 ` Waiman Long [this message]
2026-07-01 2:35 ` [PATCH-next v9 09/11] cgroup/cpuset: Support multiple source cpusets for cpuset_*attach() Ridong Chen
2026-07-01 20:44 ` Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 10/11] cgroup/cpuset: Support multiple destination " Waiman Long
2026-07-01 2:51 ` Ridong Chen
2026-07-01 21:16 ` Waiman Long
2026-06-30 3:33 ` [PATCH-next v9 11/11] selftests/cgroup: Add test for cpuset affinity on controller disable Waiman Long