Sched_ext development
* [PATCH 0/2] cgroup/cpuset: fix DL rollback accounting and add a selftest
@ 2026-04-17  3:37 Guopeng Zhang
  2026-04-17  3:37 ` [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback Guopeng Zhang
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Guopeng Zhang @ 2026-04-17  3:37 UTC (permalink / raw)
  To: longman, tj, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong
  Cc: cgroups, sched-ext, linux-kselftest, linux-kernel, Guopeng Zhang

Hi,

This series fixes a cpuset DL bandwidth rollback bug and adds a selftest to cover it.

cpuset_can_attach() only reserves deadline bandwidth when migrating DL
tasks to a disjoint CPU mask, but cpuset_cancel_attach() rolls back based
only on the presence of migrating DL tasks. This makes the rollback path
asymmetric: it can call dl_bw_free() even when no dl_bw_alloc() was done.
Because the alloc and free CPU selection also differed, rollback could
return bandwidth to a different root domain than the one originally
charged.
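
For reference, a condensed sketch of the pre-fix asymmetry (paraphrased
from kernel/cgroup/cpuset.c; locking elided, exact helpers may differ
slightly):

	/* cpuset_can_attach(): allocate only for a disjoint destination mask */
	if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)) {
		int cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);

		ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
		if (ret) {
			reset_migrate_dl_data(cs);
			goto out_unlock;
		}
	}

	/* cpuset_cancel_attach(): free whenever DL tasks were migrating */
	if (cs->nr_migrate_dl_tasks) {
		int cpu = cpumask_any(cs->effective_cpus);

		dl_bw_free(cpu, cs->sum_migrate_dl_bw);
		reset_migrate_dl_data(cs);
	}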

Patch 1 fixes the rollback accounting by recording the CPU used for
dl_bw_alloc() during attach and using that exact state in
cpuset_cancel_attach(). If no allocation happened, rollback skips
dl_bw_free(). Successful attach behavior is unchanged.

Patch 2 adds a sched_ext selftest which exercises the real attach
rollback path from userspace. The test uses a sched_ext scheduler whose
cgroup_prep_move() rejects SCHED_DEADLINE tasks so that the cpu
controller fails after cpuset has already accepted the move. It then
checks that dl_bw->total_bw on the target CPU remains unchanged across
the failed move.
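
The test runs through the standard sched_ext selftest runner. Assuming
the usual in-tree selftest build flow, something like:

	$ make -C tools/testing/selftests/sched_ext
	$ sudo ./runner -t cpuset_dl_rollback

(-t filters by test name; drop it to run the whole suite.)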

Local testing:

With patch 1:
ok 1 cpuset_dl_rollback

Without patch 1:
ERR: cpuset_dl_rollback.c:760
Expected total_bw for CPU0 to remain unchanged (1677696 != 2027221)
not ok 1 cpuset_dl_rollback
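
(For reference, the 349525 gap between the two values above is exactly
the child's DL bandwidth: runtime/period = 10ms/30ms in the kernel's
<<20 fixed-point representation, i.e. 1048576 / 3 ~= 349525. The buggy
rollback freed bandwidth that was never allocated.)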

Guopeng Zhang (2):
  cgroup/cpuset: record DL BW alloc CPU for attach rollback
  selftests/sched_ext: add cpuset DL rollback test

 kernel/cgroup/cpuset-internal.h               |   5 +
 kernel/cgroup/cpuset.c                        |  13 +-
 tools/testing/selftests/sched_ext/Makefile    |   1 +
 .../sched_ext/cpuset_dl_rollback.bpf.c        |  28 +
 .../selftests/sched_ext/cpuset_dl_rollback.c  | 810 ++++++++++++++++++
 5 files changed, 853 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/cpuset_dl_rollback.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/cpuset_dl_rollback.c

-- 
2.43.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback
  2026-04-17  3:37 [PATCH 0/2] cgroup/cpuset: fix DL rollback accounting and add a selftest Guopeng Zhang
@ 2026-04-17  3:37 ` Guopeng Zhang
  2026-04-17 18:51   ` Waiman Long
  2026-04-20  2:23   ` Chen Ridong
  2026-04-17  3:37 ` [PATCH 2/2] selftests/sched_ext: add cpuset DL rollback test Guopeng Zhang
  2026-04-17 19:03 ` [PATCH 0/2] cgroup/cpuset: fix DL rollback accounting and add a selftest Tejun Heo
  2 siblings, 2 replies; 13+ messages in thread
From: Guopeng Zhang @ 2026-04-17  3:37 UTC (permalink / raw)
  To: longman, tj, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong
  Cc: cgroups, sched-ext, linux-kselftest, linux-kernel, Guopeng Zhang

cpuset_can_attach() allocates DL bandwidth only when migrating
deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach()
rolls back based only on nr_migrate_dl_tasks. This makes the DL
bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free()
even when no dl_bw_alloc() was done.

Rollback also needs to undo the reservation against the same CPU/root
domain that was charged. Record the CPU used by dl_bw_alloc() and use
that state in cpuset_cancel_attach(). If no allocation happened,
dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation
did happen, bandwidth is returned to the same CPU/root domain.

Successful attach paths are unchanged. This only fixes failed attach
rollback accounting.

Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
---
 kernel/cgroup/cpuset-internal.h |  5 +++++
 kernel/cgroup/cpuset.c          | 13 +++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index fd7d19842ded..bb4e692bea30 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -168,6 +168,11 @@ struct cpuset {
 	int nr_deadline_tasks;
 	int nr_migrate_dl_tasks;
 	u64 sum_migrate_dl_bw;
+	/*
+	 * CPU used for temporary DL bandwidth allocation during attach;
+	 * -1 if no DL bandwidth was allocated in the current attach.
+	 */
+	int dl_bw_cpu;
 
 	/* Invalid partition error code, not lock protected */
 	enum prs_errcode prs_err;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 1335e437098e..e3a081a07c6d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -288,6 +288,7 @@ struct cpuset top_cpuset = {
 	.flags = BIT(CS_CPU_EXCLUSIVE) |
 		 BIT(CS_MEM_EXCLUSIVE) | BIT(CS_SCHED_LOAD_BALANCE),
 	.partition_root_state = PRS_ROOT,
+	.dl_bw_cpu = -1,
 };
 
 /**
@@ -579,6 +580,8 @@ static struct cpuset *dup_or_alloc_cpuset(struct cpuset *cs)
 	if (!trial)
 		return NULL;
 
+	trial->dl_bw_cpu = -1;
+
 	/* Setup cpumask pointer array */
 	cpumask_var_t *pmask[4] = {
 		&trial->cpus_allowed,
@@ -2980,6 +2983,7 @@ static void reset_migrate_dl_data(struct cpuset *cs)
 {
 	cs->nr_migrate_dl_tasks = 0;
 	cs->sum_migrate_dl_bw = 0;
+	cs->dl_bw_cpu = -1;
 }
 
 /* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
@@ -3056,6 +3060,8 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 			reset_migrate_dl_data(cs);
 			goto out_unlock;
 		}
+
+		cs->dl_bw_cpu = cpu;
 	}
 
 out_success:
@@ -3080,12 +3086,11 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 	mutex_lock(&cpuset_mutex);
 	dec_attach_in_progress_locked(cs);
 
-	if (cs->nr_migrate_dl_tasks) {
-		int cpu = cpumask_any(cs->effective_cpus);
+	if (cs->dl_bw_cpu >= 0)
+		dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
 
-		dl_bw_free(cpu, cs->sum_migrate_dl_bw);
+	if (cs->nr_migrate_dl_tasks)
 		reset_migrate_dl_data(cs);
-	}
 
 	mutex_unlock(&cpuset_mutex);
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] selftests/sched_ext: add cpuset DL rollback test
  2026-04-17  3:37 [PATCH 0/2] cgroup/cpuset: fix DL rollback accounting and add a selftest Guopeng Zhang
  2026-04-17  3:37 ` [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback Guopeng Zhang
@ 2026-04-17  3:37 ` Guopeng Zhang
  2026-04-17 16:23   ` Cheng-Yang Chou
  2026-04-17 19:03 ` [PATCH 0/2] cgroup/cpuset: fix DL rollback accounting and add a selftest Tejun Heo
  2 siblings, 1 reply; 13+ messages in thread
From: Guopeng Zhang @ 2026-04-17  3:37 UTC (permalink / raw)
  To: longman, tj, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong
  Cc: cgroups, sched-ext, linux-kselftest, linux-kernel, Guopeng Zhang

The cpuset DL rollback bug only shows up when another controller rejects
a migration after cpuset_can_attach() has already succeeded. Use a
sched_ext scheduler whose cgroup_prep_move() rejects SCHED_DEADLINE
tasks so that the cpu controller fails after cpuset has already accepted
the move, driving the real attach rollback path from userspace.

Create overlapping source and destination cpusets under the current
cgroup, using that cgroup's effective CPU and memory masks. Constrain
the destination cpuset to a single CPU so the rollback accounting target
is deterministic, then compare dl_bw->total_bw for that CPU before and
after the failed move. Restore the parent subtree_control state during
cleanup so the test does not leave the cgroup tree changed.
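
The resulting hierarchy looks like this (directory names as created by
the test, with XXXXXX filled in by mkdtemp):

	<current cgroup>
	`-- scx-cpuset-dl-rollback-XXXXXX  (cpuset.cpus = all effective CPUs)
	    |-- src                        (cpuset.cpus = all effective CPUs)
	    `-- ovl                        (cpuset.cpus = one CPU, the
	                                    rollback observation point)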

This catches the old behavior where cpuset_cancel_attach() could free DL
bandwidth even though cpuset_can_attach() never allocated it. The test
reads sched/debug because that debugfs output exposes the per-CPU
dl_bw->total_bw accounting that the rollback perturbs.
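
For reference, the parser keys off sched/debug sections of roughly this
shape (abridged, values illustrative):

	cpu#0, 2400.000 MHz
	  dl_rq[0]:
	    .dl_bw->bw                     : 996147
	    .dl_bw->total_bw               : 1677696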

Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
---
 tools/testing/selftests/sched_ext/Makefile    |   1 +
 .../sched_ext/cpuset_dl_rollback.bpf.c        |  28 +
 .../selftests/sched_ext/cpuset_dl_rollback.c  | 810 ++++++++++++++++++
 3 files changed, 839 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/cpuset_dl_rollback.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/cpuset_dl_rollback.c

diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index 789037be44c7..2a54d15552bd 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -162,6 +162,7 @@ endef
 all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubst %.c,%.skel.h,$(prog)))
 
 auto-test-targets :=			\
+	cpuset_dl_rollback		\
 	create_dsq			\
 	dequeue				\
 	enq_last_no_enq_fails		\
diff --git a/tools/testing/selftests/sched_ext/cpuset_dl_rollback.bpf.c b/tools/testing/selftests/sched_ext/cpuset_dl_rollback.bpf.c
new file mode 100644
index 000000000000..ca5758a7361f
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/cpuset_dl_rollback.bpf.c
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A sched_ext scheduler used to trigger attach rollback after cpuset has
+ * already accepted the migration.
+ *
+ * Reject moving SCHED_DEADLINE tasks between cgroups from cgroup_prep_move(),
+ * which makes the cpu controller fail after cpuset has already succeeded.
+ */
+
+#include <scx/common.bpf.h>
+
+#define SCHED_DEADLINE 6
+
+char _license[] SEC("license") = "GPL";
+
+s32 BPF_STRUCT_OPS(cpuset_dl_rollback_cgroup_prep_move, struct task_struct *p,
+		   struct cgroup *from, struct cgroup *to)
+{
+	if (p->policy == SCHED_DEADLINE)
+		return -EAGAIN;
+
+	return 0;
+}
+SEC(".struct_ops.link")
+struct sched_ext_ops cpuset_dl_rollback_ops = {
+	.cgroup_prep_move	= (void *)cpuset_dl_rollback_cgroup_prep_move,
+	.name			= "cpuset_dl_rollback",
+};
diff --git a/tools/testing/selftests/sched_ext/cpuset_dl_rollback.c b/tools/testing/selftests/sched_ext/cpuset_dl_rollback.c
new file mode 100644
index 000000000000..44b6cad77d3e
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/cpuset_dl_rollback.c
@@ -0,0 +1,810 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Verify that rollback from cpu_cgroup_can_attach() failure doesn't perturb DL
+ * bandwidth accounting when cpuset_can_attach() didn't allocate DL bandwidth in
+ * the first place.
+ *
+ * The test uses a sched_ext scheduler whose cgroup_prep_move() rejects
+ * SCHED_DEADLINE task migration. That makes the cpu controller fail after the
+ * cpuset controller has already accepted the move, which triggers the cgroup
+ * rollback path without any kernel fault injection.
+ */
+#define _GNU_SOURCE
+
+#include <bpf/bpf.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <linux/sched/types.h>
+#include <scx/common.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "cpuset_dl_rollback.bpf.skel.h"
+#include "scx_test.h"
+
+#ifndef SYS_sched_setattr
+#if defined(__x86_64__)
+#define SYS_sched_setattr 314
+#elif defined(__i386__)
+#define SYS_sched_setattr 351
+#elif defined(__aarch64__)
+#define SYS_sched_setattr 274
+#else
+#error "Unknown architecture: please define SYS_sched_setattr"
+#endif
+#endif
+
+#ifndef SCHED_DEADLINE
+#define SCHED_DEADLINE 6
+#endif
+
+#define CGROUP2_ROOT "/sys/fs/cgroup"
+#define SCHED_DEBUG "/sys/kernel/debug/sched/debug"
+
+struct cpuset_dl_rollback_ctx {
+	struct cpuset_dl_rollback *skel;
+	struct bpf_link *link;
+	pid_t child;
+	/* The only CPU in dst, and the rollback accounting observation point. */
+	int target_cpu;
+	bool restore_parent_subtree;
+	char parent[PATH_MAX];
+	char root[PATH_MAX];
+	char src[PATH_MAX];
+	char dst[PATH_MAX];
+	char src_rel[PATH_MAX];
+	char parent_subtree[256];
+	char cpu_list[1024];
+	char mem_list[256];
+	char dst_cpu[32];
+};
+
+static void cleanup(void *arg);
+
+static int sched_setattr(pid_t pid, const struct sched_attr *attr,
+			 unsigned int flags)
+{
+	return syscall(SYS_sched_setattr, pid, attr, flags);
+}
+
+static void trim_trailing_ws(char *buf)
+{
+	size_t len = strlen(buf);
+
+	while (len > 0) {
+		char c = buf[len - 1];
+
+		if (c != '\n' && c != ' ' && c != '\t')
+			break;
+		buf[--len] = '\0';
+	}
+}
+
+static int read_text(const char *path, char *buf, size_t size)
+{
+	ssize_t len;
+	int fd;
+
+	fd = open(path, O_RDONLY | O_CLOEXEC);
+	if (fd < 0)
+		return -errno;
+
+	len = read(fd, buf, size - 1);
+	close(fd);
+	if (len < 0)
+		return -errno;
+
+	buf[len] = '\0';
+	trim_trailing_ws(buf);
+	return 0;
+}
+
+static int write_text(const char *path, const char *buf)
+{
+	size_t len = strlen(buf);
+	ssize_t ret;
+	int fd;
+
+	fd = open(path, O_WRONLY | O_CLOEXEC);
+	if (fd < 0)
+		return -errno;
+
+	ret = write(fd, buf, len);
+	close(fd);
+	if (ret < 0)
+		return -errno;
+	if ((size_t)ret != len)
+		return -EIO;
+
+	return 0;
+}
+
+static int build_path(char *buf, size_t size, const char *dir, const char *file)
+{
+	int ret;
+
+	ret = snprintf(buf, size, "%s/%s", dir, file);
+	if (ret < 0 || (size_t)ret >= size)
+		return -ENAMETOOLONG;
+
+	return 0;
+}
+
+static int build_cgroup_dir(const char *rel, char *buf, size_t size)
+{
+	int ret;
+
+	if (!strcmp(rel, "/"))
+		ret = snprintf(buf, size, "%s", CGROUP2_ROOT);
+	else
+		ret = snprintf(buf, size, "%s%s", CGROUP2_ROOT, rel);
+
+	if (ret < 0 || (size_t)ret >= size)
+		return -ENAMETOOLONG;
+
+	return 0;
+}
+
+static int read_cgroup_relpath(const char *path, char *buf, size_t size)
+{
+	char line[PATH_MAX];
+	FILE *fp;
+	int ret;
+
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	while (fgets(line, sizeof(line), fp)) {
+		char *first, *second, *rel;
+
+		trim_trailing_ws(line);
+
+		first = strchr(line, ':');
+		if (!first) {
+			fclose(fp);
+			return -EINVAL;
+		}
+
+		second = strchr(first + 1, ':');
+		if (!second) {
+			fclose(fp);
+			return -EINVAL;
+		}
+
+		*first = '\0';
+		*second = '\0';
+
+		/* Match the cgroup v2 entry, which is formatted as 0::/path. */
+		if (strcmp(line, "0") || first[1] != '\0')
+			continue;
+
+		rel = second + 1;
+		if (rel[0] != '/') {
+			fclose(fp);
+			return -EINVAL;
+		}
+
+		ret = snprintf(buf, size, "%s", rel);
+		fclose(fp);
+		if (ret < 0 || (size_t)ret >= size)
+			return -ENAMETOOLONG;
+
+		return 0;
+	}
+
+	if (ferror(fp)) {
+		fclose(fp);
+		return -EIO;
+	}
+
+	fclose(fp);
+	return -EOPNOTSUPP;
+}
+
+static bool has_token(const char *list, const char *token)
+{
+	size_t len = strlen(token);
+	const char *pos = list;
+
+	while ((pos = strstr(pos, token))) {
+		bool left_ok = pos == list || pos[-1] == ' ';
+		bool right_ok = pos[len] == '\0' || pos[len] == ' ';
+
+		if (left_ok && right_ok)
+			return true;
+		pos += len;
+	}
+
+	return false;
+}
+
+static int enable_controllers(const char *dir, char *orig, size_t orig_sz,
+			      bool *changed)
+{
+	char ctrl_path[PATH_MAX];
+	char subtree_path[PATH_MAX];
+	char controllers[256];
+	char subtree[256];
+	char enable[64];
+	size_t len = 0;
+	int ret;
+
+	ret = build_path(ctrl_path, sizeof(ctrl_path), dir, "cgroup.controllers");
+	if (ret)
+		return ret;
+	ret = build_path(subtree_path, sizeof(subtree_path), dir,
+			 "cgroup.subtree_control");
+	if (ret)
+		return ret;
+
+	ret = read_text(ctrl_path, controllers, sizeof(controllers));
+	if (ret == -ENOENT)
+		return -EOPNOTSUPP;
+	if (ret)
+		return ret;
+	if (!has_token(controllers, "cpu") || !has_token(controllers, "cpuset"))
+		return -EOPNOTSUPP;
+
+	ret = read_text(subtree_path, subtree, sizeof(subtree));
+	if (ret == -ENOENT)
+		return -EOPNOTSUPP;
+	if (ret)
+		return ret;
+
+	enable[0] = '\0';
+	if (!has_token(subtree, "cpu"))
+		len += snprintf(enable + len, sizeof(enable) - len, "+cpu ");
+	if (!has_token(subtree, "cpuset"))
+		len += snprintf(enable + len, sizeof(enable) - len, "+cpuset ");
+	if (len >= sizeof(enable))
+		return -EOVERFLOW;
+
+	if (!enable[0]) {
+		if (orig && orig_sz) {
+			ret = snprintf(orig, orig_sz, "%s", subtree);
+			if (ret < 0 || (size_t)ret >= orig_sz)
+				return -ENAMETOOLONG;
+		}
+		if (changed)
+			*changed = false;
+		return 0;
+	}
+
+	if (orig && orig_sz) {
+		ret = snprintf(orig, orig_sz, "%s", subtree);
+		if (ret < 0 || (size_t)ret >= orig_sz)
+			return -ENAMETOOLONG;
+	}
+
+	trim_trailing_ws(enable);
+	ret = write_text(subtree_path, enable);
+	if (!ret && changed)
+		*changed = true;
+	return ret;
+}
+
+static int restore_controllers(const char *dir, const char *orig)
+{
+	char subtree_path[PATH_MAX];
+	char subtree[256];
+	char disable[64];
+	size_t len = 0;
+	int ret;
+
+	ret = build_path(subtree_path, sizeof(subtree_path), dir,
+			 "cgroup.subtree_control");
+	if (ret)
+		return ret;
+
+	ret = read_text(subtree_path, subtree, sizeof(subtree));
+	if (ret)
+		return ret;
+
+	/*
+	 * Only undo controllers that this test turned on. If "cpu" or "cpuset"
+	 * was already present in the original subtree_control state, leave it
+	 * alone.
+	 */
+	disable[0] = '\0';
+	if (has_token(subtree, "cpu") && !has_token(orig, "cpu"))
+		len += snprintf(disable + len, sizeof(disable) - len, "-cpu ");
+	if (has_token(subtree, "cpuset") && !has_token(orig, "cpuset"))
+		len += snprintf(disable + len, sizeof(disable) - len,
+				"-cpuset ");
+	if (len >= sizeof(disable))
+		return -EOVERFLOW;
+
+	if (!disable[0])
+		return 0;
+
+	trim_trailing_ws(disable);
+	return write_text(subtree_path, disable);
+}
+
+static int mkdir_one(const char *path)
+{
+	if (mkdir(path, 0755) && errno != EEXIST)
+		return -errno;
+	return 0;
+}
+
+static int write_pid(const char *path, pid_t pid)
+{
+	char buf[32];
+	int ret;
+
+	ret = snprintf(buf, sizeof(buf), "%d", pid);
+	if (ret < 0 || (size_t)ret >= sizeof(buf))
+		return -EOVERFLOW;
+
+	return write_text(path, buf);
+}
+
+/* Parse the first CPU from a cpulist-style string such as "0-3,8". */
+static int first_list_item(const char *list, char *buf, size_t size, int *valp)
+{
+	char *end;
+	long val;
+	int ret;
+
+	errno = 0;
+	val = strtol(list, &end, 10);
+	if (errno || end == list || val < 0)
+		return -EINVAL;
+
+	if (valp)
+		*valp = val;
+
+	ret = snprintf(buf, size, "%ld", val);
+	if (ret < 0 || (size_t)ret >= size)
+		return -EOVERFLOW;
+
+	return 0;
+}
+
+/*
+ * sched/debug reports dl_bw->total_bw inside each CPU section.
+ *
+ * This test constrains dst to a single CPU and stores that CPU number in
+ * ctx->target_cpu. cpuset_cancel_attach() rolls back accounting against a
+ * CPU selected from the destination effective mask, so with a single-CPU dst
+ * that exact CPU becomes the rollback site and the matching observation point.
+ *
+ * Reading only the target CPU's dl_bw->total_bw avoids assuming that every CPU
+ * in the system shares one root domain. Unlike sched_ext/total_bw.c, this test
+ * has to identify one specific CPU section, so it also relies on the current
+ * sched/debug "cpu#<n>" section header format.
+ */
+static int read_cpu_total_bw(int target_cpu, long long *bw)
+{
+	char line[256];
+	FILE *fp;
+	bool in_target = false;
+
+	fp = fopen(SCHED_DEBUG, "r");
+	if (!fp)
+		return -errno;
+
+	while (fgets(line, sizeof(line), fp)) {
+		int header_cpu;
+		char *val;
+
+		if (sscanf(line, "cpu#%d", &header_cpu) == 1) {
+			if (in_target)
+				break;
+
+			in_target = header_cpu == target_cpu;
+			continue;
+		}
+		if (!in_target)
+			continue;
+
+		val = strstr(line, "dl_bw->total_bw");
+		if (!val)
+			continue;
+
+		val = strchr(val, ':');
+		if (!val) {
+			fclose(fp);
+			return -EINVAL;
+		}
+
+		*bw = strtoll(val + 1, NULL, 10);
+		fclose(fp);
+		return 0;
+	}
+
+	fclose(fp);
+	return -ENOENT;
+}
+
+static int set_deadline_policy(void)
+{
+	struct sched_attr attr = {
+		.size = sizeof(attr),
+		.sched_policy = SCHED_DEADLINE,
+		.sched_runtime = 10 * 1000 * 1000ULL,
+		.sched_deadline = 30 * 1000 * 1000ULL,
+		.sched_period = 30 * 1000 * 1000ULL,
+	};
+
+	return sched_setattr(0, &attr, 0);
+}
+
+static int spawn_dl_child(struct cpuset_dl_rollback_ctx *ctx)
+{
+	char procs_path[PATH_MAX];
+	int pipefd[2];
+	pid_t pid;
+	int child_ret;
+	int ret;
+
+	ret = build_path(procs_path, sizeof(procs_path), ctx->src, "cgroup.procs");
+	if (ret)
+		return ret;
+
+	if (pipe(pipefd))
+		return -errno;
+
+	pid = fork();
+	if (pid < 0) {
+		ret = -errno;
+		close(pipefd[0]);
+		close(pipefd[1]);
+		return ret;
+	}
+
+	if (!pid) {
+		int err = 0;
+
+		close(pipefd[0]);
+
+		err = write_pid(procs_path, getpid());
+		if (!err && set_deadline_policy())
+			err = -errno;
+
+		if (write(pipefd[1], &err, sizeof(err)) != sizeof(err))
+			_exit(1);
+
+		if (err)
+			_exit(1);
+
+		for (;;)
+			pause();
+	}
+
+	close(pipefd[1]);
+	ret = read(pipefd[0], &child_ret, sizeof(child_ret));
+	close(pipefd[0]);
+	if (ret != sizeof(child_ret)) {
+		kill(pid, SIGKILL);
+		waitpid(pid, NULL, 0);
+		return -EIO;
+	}
+
+	if (child_ret) {
+		kill(pid, SIGKILL);
+		waitpid(pid, NULL, 0);
+		return child_ret;
+	}
+
+	ctx->child = pid;
+	return 0;
+}
+
+static int create_cgroups(struct cpuset_dl_rollback_ctx *ctx)
+{
+	char parent_rel[PATH_MAX];
+	char path[PATH_MAX];
+	char tmpl[PATH_MAX];
+	int ret;
+
+	ret = read_cgroup_relpath("/proc/self/cgroup", parent_rel,
+				  sizeof(parent_rel));
+	if (ret)
+		return ret;
+
+	ret = build_cgroup_dir(parent_rel, ctx->parent, sizeof(ctx->parent));
+	if (ret)
+		return ret;
+
+	ret = enable_controllers(ctx->parent, ctx->parent_subtree,
+				 sizeof(ctx->parent_subtree),
+				 &ctx->restore_parent_subtree);
+	if (ret)
+		return ret;
+
+	ret = build_path(path, sizeof(path), ctx->parent, "cpuset.cpus.effective");
+	if (ret)
+		return ret;
+	ret = read_text(path, ctx->cpu_list, sizeof(ctx->cpu_list));
+	if (ret == -ENOENT)
+		return -EOPNOTSUPP;
+	if (ret)
+		return ret;
+
+	ret = build_path(path, sizeof(path), ctx->parent, "cpuset.mems.effective");
+	if (ret)
+		return ret;
+	ret = read_text(path, ctx->mem_list, sizeof(ctx->mem_list));
+	if (ret == -ENOENT)
+		return -EOPNOTSUPP;
+	if (ret)
+		return ret;
+	if (!ctx->cpu_list[0] || !ctx->mem_list[0])
+		return -ENOSPC;
+
+	/*
+	 * Keep dst on a single CPU so the rollback accounting target is
+	 * deterministic. That same CPU is later sampled from sched/debug.
+	 */
+	ret = first_list_item(ctx->cpu_list, ctx->dst_cpu, sizeof(ctx->dst_cpu),
+			      &ctx->target_cpu);
+	if (ret)
+		return ret;
+
+	ret = snprintf(tmpl, sizeof(tmpl), "%s/scx-cpuset-dl-rollback-XXXXXX",
+		       ctx->parent);
+	if (ret < 0 || (size_t)ret >= sizeof(tmpl))
+		return -ENAMETOOLONG;
+
+	if (!mkdtemp(tmpl))
+		return -errno;
+
+	ret = snprintf(ctx->root, sizeof(ctx->root), "%s", tmpl);
+	if (ret < 0 || (size_t)ret >= sizeof(ctx->root))
+		return -EOVERFLOW;
+
+	ret = snprintf(ctx->src, sizeof(ctx->src), "%s/src", ctx->root);
+	if (ret < 0 || (size_t)ret >= sizeof(ctx->src))
+		return -EOVERFLOW;
+	ret = snprintf(ctx->dst, sizeof(ctx->dst), "%s/ovl", ctx->root);
+	if (ret < 0 || (size_t)ret >= sizeof(ctx->dst))
+		return -EOVERFLOW;
+	ret = snprintf(ctx->src_rel, sizeof(ctx->src_rel), "%s/src",
+		       ctx->root + strlen(CGROUP2_ROOT));
+	if (ret < 0 || (size_t)ret >= sizeof(ctx->src_rel))
+		return -EOVERFLOW;
+
+	ret = build_path(path, sizeof(path), ctx->root, "cpuset.cpus");
+	if (ret)
+		return ret;
+	ret = write_text(path, ctx->cpu_list);
+	if (ret)
+		return ret;
+
+	ret = build_path(path, sizeof(path), ctx->root, "cpuset.mems");
+	if (ret)
+		return ret;
+	ret = write_text(path, ctx->mem_list);
+	if (ret)
+		return ret;
+
+	ret = enable_controllers(ctx->root, NULL, 0, NULL);
+	if (ret)
+		return ret;
+
+	ret = mkdir_one(ctx->src);
+	if (ret)
+		return ret;
+	ret = mkdir_one(ctx->dst);
+	if (ret)
+		return ret;
+
+	ret = build_path(path, sizeof(path), ctx->src, "cpuset.cpus");
+	if (ret)
+		return ret;
+	ret = write_text(path, ctx->cpu_list);
+	if (ret)
+		return ret;
+
+	ret = build_path(path, sizeof(path), ctx->src, "cpuset.mems");
+	if (ret)
+		return ret;
+	ret = write_text(path, ctx->mem_list);
+	if (ret)
+		return ret;
+
+	ret = build_path(path, sizeof(path), ctx->dst, "cpuset.cpus");
+	if (ret)
+		return ret;
+	ret = write_text(path, ctx->dst_cpu);
+	if (ret)
+		return ret;
+
+	ret = build_path(path, sizeof(path), ctx->dst, "cpuset.mems");
+	if (ret)
+		return ret;
+	return write_text(path, ctx->mem_list);
+}
+
+static bool child_in_src(const struct cpuset_dl_rollback_ctx *ctx)
+{
+	char path[PATH_MAX];
+	char cgroup[PATH_MAX];
+	int ret;
+
+	ret = snprintf(path, sizeof(path), "/proc/%d/cgroup", ctx->child);
+	if (ret < 0 || (size_t)ret >= sizeof(path))
+		return false;
+
+	if (read_cgroup_relpath(path, cgroup, sizeof(cgroup)))
+		return false;
+
+	return strcmp(cgroup, ctx->src_rel) == 0;
+}
+
+static enum scx_test_status setup(void **out_ctx)
+{
+	struct cpuset_dl_rollback_ctx *ctx;
+	int ret;
+
+	if (geteuid()) {
+		fprintf(stderr, "Skipping test: root privileges required\n");
+		return SCX_TEST_SKIP;
+	}
+
+	if (access(SCHED_DEBUG, R_OK)) {
+		fprintf(stderr, "Skipping test: %s not accessible\n", SCHED_DEBUG);
+		return SCX_TEST_SKIP;
+	}
+
+	ctx = calloc(1, sizeof(*ctx));
+	if (!ctx)
+		return SCX_TEST_FAIL;
+
+	ret = create_cgroups(ctx);
+	switch (ret) {
+	case -EOPNOTSUPP:
+		fprintf(stderr,
+			"Skipping test: cgroup v2 cpu/cpuset controllers unavailable in current cgroup tree\n");
+		cleanup(ctx);
+		return SCX_TEST_SKIP;
+	case -EPERM:
+	case -EACCES:
+	case -EROFS:
+		fprintf(stderr,
+			"Skipping test: current cgroup tree does not allow cpu/cpuset writes\n");
+		cleanup(ctx);
+		return SCX_TEST_SKIP;
+	case -EBUSY:
+		fprintf(stderr,
+			"Skipping test: current cgroup tree does not allow enabling cpu/cpuset controllers here\n");
+		cleanup(ctx);
+		return SCX_TEST_SKIP;
+	case -ENOSPC:
+		fprintf(stderr,
+			"Skipping test: current cgroup does not expose enough effective cpuset resources\n");
+		cleanup(ctx);
+		return SCX_TEST_SKIP;
+	}
+	if (ret) {
+		SCX_ERR("Failed to create cgroups (%d)", ret);
+		cleanup(ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	ctx->skel = cpuset_dl_rollback__open();
+	if (!ctx->skel) {
+		SCX_ERR("Failed to open skel");
+		cleanup(ctx);
+		return SCX_TEST_FAIL;
+	}
+	SCX_ENUM_INIT(ctx->skel);
+	if (cpuset_dl_rollback__load(ctx->skel)) {
+		SCX_ERR("Failed to load skel");
+		cleanup(ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	*out_ctx = ctx;
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *arg)
+{
+	struct cpuset_dl_rollback_ctx *ctx = arg;
+	char procs_path[PATH_MAX];
+	long long before_bw, after_bw;
+	int ret;
+
+	ret = read_cpu_total_bw(ctx->target_cpu, &before_bw);
+	SCX_FAIL_IF(ret, "Failed to read baseline total_bw (%d)", ret);
+
+	ctx->link = bpf_map__attach_struct_ops(ctx->skel->maps.cpuset_dl_rollback_ops);
+	SCX_FAIL_IF(!ctx->link, "Failed to attach scheduler");
+
+	ret = spawn_dl_child(ctx);
+	switch (ret) {
+	case -EACCES:
+	case -EPERM:
+		fprintf(stderr,
+			"Skipping test: unable to place child in the source cgroup or enable SCHED_DEADLINE due to permissions (%d)\n",
+			ret);
+		return SCX_TEST_SKIP;
+	case -EBUSY:
+		fprintf(stderr,
+			"Skipping test: SCHED_DEADLINE admission control rejected the child (%d)\n",
+			ret);
+		return SCX_TEST_SKIP;
+	case -EINVAL:
+		fprintf(stderr,
+			"Skipping test: unable to enable SCHED_DEADLINE for the child in this environment (%d)\n",
+			ret);
+		return SCX_TEST_SKIP;
+	}
+	SCX_FAIL_IF(ret, "Failed to start SCHED_DEADLINE child (%d)", ret);
+
+	ret = read_cpu_total_bw(ctx->target_cpu, &before_bw);
+	SCX_FAIL_IF(ret, "Failed to read pre-move total_bw (%d)", ret);
+
+	ret = build_path(procs_path, sizeof(procs_path), ctx->dst, "cgroup.procs");
+	SCX_FAIL_IF(ret, "Failed to build cgroup.procs path (%d)", ret);
+
+	ret = write_pid(procs_path, ctx->child);
+	SCX_FAIL_IF(ret != -EAGAIN,
+		    "Expected cgroup move failure with -EAGAIN, got %d", ret);
+	SCX_FAIL_IF(!child_in_src(ctx), "Child left source cgroup after rollback");
+
+	ret = read_cpu_total_bw(ctx->target_cpu, &after_bw);
+	SCX_FAIL_IF(ret, "Failed to read post-move total_bw (%d)", ret);
+	SCX_FAIL_IF(after_bw != before_bw,
+		    "Expected total_bw for CPU%d to remain unchanged (%lld != %lld)",
+		    ctx->target_cpu, after_bw, before_bw);
+
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *arg)
+{
+	struct cpuset_dl_rollback_ctx *ctx = arg;
+	int ret;
+
+	if (!ctx)
+		return;
+
+	if (ctx->child > 0) {
+		kill(ctx->child, SIGKILL);
+		waitpid(ctx->child, NULL, 0);
+	}
+
+	if (ctx->link)
+		bpf_link__destroy(ctx->link);
+	if (ctx->skel)
+		cpuset_dl_rollback__destroy(ctx->skel);
+
+	if (ctx->dst[0])
+		rmdir(ctx->dst);
+	if (ctx->src[0])
+		rmdir(ctx->src);
+	if (ctx->root[0])
+		rmdir(ctx->root);
+
+	if (ctx->restore_parent_subtree) {
+		ret = restore_controllers(ctx->parent, ctx->parent_subtree);
+		if (ret)
+			fprintf(stderr,
+				"%s: failed to restore %s/cgroup.subtree_control (%d)\n",
+				__func__, ctx->parent, ret);
+	}
+
+	free(ctx);
+}
+
+struct scx_test cpuset_dl_rollback = {
+	.name = "cpuset_dl_rollback",
+	.description = "Verify attach rollback after cpuset preserves DL bandwidth accounting",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&cpuset_dl_rollback)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] selftests/sched_ext: add cpuset DL rollback test
  2026-04-17  3:37 ` [PATCH 2/2] selftests/sched_ext: add cpuset DL rollback test Guopeng Zhang
@ 2026-04-17 16:23   ` Cheng-Yang Chou
  2026-04-20  1:56     ` Guopeng Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Cheng-Yang Chou @ 2026-04-17 16:23 UTC (permalink / raw)
  To: Guopeng Zhang
  Cc: longman, tj, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong, cgroups, sched-ext, linux-kselftest, linux-kernel,
	jserv, chia7712

Hi Guopeng,

On Fri, Apr 17, 2026 at 11:37:42AM +0800, Guopeng Zhang wrote:
[...]
> +
> +#ifndef SYS_sched_setattr
> +#if defined(__x86_64__)
> +#define SYS_sched_setattr 314
> +#elif defined(__i386__)
> +#define SYS_sched_setattr 351
> +#elif defined(__aarch64__)

Nit: Since RISC-V uses the same assigned syscall number as ARM64, it
would be nice to support it here as well,

  +#elif defined(__aarch64__) || defined(__riscv)

> +#define SYS_sched_setattr 274
> +#else
> +#error "Unknown architecture: please define SYS_sched_setattr"
> +#endif
> +#endif

[...]

> +static enum scx_test_status run(void *arg)
> +{
> +	struct cpuset_dl_rollback_ctx *ctx = arg;
> +	char procs_path[PATH_MAX];
> +	long long before_bw, after_bw;
> +	int ret;
> +
> +	ret = read_cpu_total_bw(ctx->target_cpu, &before_bw);
> +	SCX_FAIL_IF(ret, "Failed to read baseline total_bw (%d)", ret);

The first read_cpu_total_bw() call is redundant because before_bw is
overwritten after spawn_dl_child(). Use a separate variable for the
baseline if needed. Otherwise, the second call already handles the same
error check.

> +
> +	ctx->link = bpf_map__attach_struct_ops(ctx->skel->maps.cpuset_dl_rollback_ops);
> +	SCX_FAIL_IF(!ctx->link, "Failed to attach scheduler");
> +
> +	ret = spawn_dl_child(ctx);
> +	switch (ret) {
> +	case -EACCES:
> +	case -EPERM:
> +		fprintf(stderr,
> +			"Skipping test: unable to place child in the source cgroup or enable SCHED_DEADLINE due to permissions (%d)\n",
> +			ret);
> +		return SCX_TEST_SKIP;
> +	case -EBUSY:
> +		fprintf(stderr,
> +			"Skipping test: SCHED_DEADLINE admission control rejected the child (%d)\n",
> +			ret);
> +		return SCX_TEST_SKIP;
> +	case -EINVAL:
> +		fprintf(stderr,
> +			"Skipping test: unable to enable SCHED_DEADLINE for the child in this environment (%d)\n",
> +			ret);
> +		return SCX_TEST_SKIP;
> +	}
> +	SCX_FAIL_IF(ret, "Failed to start SCHED_DEADLINE child (%d)", ret);
> +
> +	ret = read_cpu_total_bw(ctx->target_cpu, &before_bw);

Overwritten here.

> +	SCX_FAIL_IF(ret, "Failed to read pre-move total_bw (%d)", ret);

-- 
Thanks,
Cheng-Yang

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback
  2026-04-17  3:37 ` [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback Guopeng Zhang
@ 2026-04-17 18:51   ` Waiman Long
  2026-04-20  2:21     ` Guopeng Zhang
  2026-04-20  2:23   ` Chen Ridong
  1 sibling, 1 reply; 13+ messages in thread
From: Waiman Long @ 2026-04-17 18:51 UTC (permalink / raw)
  To: Guopeng Zhang, tj, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong, Juri Lelli, Valentin Schneider, Dietmar Eggemann
  Cc: cgroups, sched-ext, linux-kselftest, linux-kernel

On 4/16/26 11:37 PM, Guopeng Zhang wrote:
> cpuset_can_attach() allocates DL bandwidth only when migrating
> deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach()
> rolls back based only on nr_migrate_dl_tasks. This makes the DL
> bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free()
> even when no dl_bw_alloc() was done.
>
> Rollback also needs to undo the reservation against the same CPU/root
> domain that was charged. Record the CPU used by dl_bw_alloc() and use
> that state in cpuset_cancel_attach(). If no allocation happened,
> dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation
> did happen, bandwidth is returned to the same CPU/root domain.
>
> Successful attach paths are unchanged. This only fixes failed attach
> rollback accounting.
>
> Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
> Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
> ---
>   kernel/cgroup/cpuset-internal.h |  5 +++++
>   kernel/cgroup/cpuset.c          | 13 +++++++++----
>   2 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index fd7d19842ded..bb4e692bea30 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -168,6 +168,11 @@ struct cpuset {
>   	int nr_deadline_tasks;
>   	int nr_migrate_dl_tasks;
>   	u64 sum_migrate_dl_bw;
> +	/*
> +	 * CPU used for temporary DL bandwidth allocation during attach;
> +	 * -1 if no DL bandwidth was allocated in the current attach.
> +	 */
> +	int dl_bw_cpu;
>   
>   	/* Invalid partition error code, not lock protected */
>   	enum prs_errcode prs_err;
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 1335e437098e..e3a081a07c6d 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -288,6 +288,7 @@ struct cpuset top_cpuset = {
>   	.flags = BIT(CS_CPU_EXCLUSIVE) |
>   		 BIT(CS_MEM_EXCLUSIVE) | BIT(CS_SCHED_LOAD_BALANCE),
>   	.partition_root_state = PRS_ROOT,
> +	.dl_bw_cpu = -1,
>   };
>   
>   /**
> @@ -579,6 +580,8 @@ static struct cpuset *dup_or_alloc_cpuset(struct cpuset *cs)
>   	if (!trial)
>   		return NULL;
>   
> +	trial->dl_bw_cpu = -1;
> +
>   	/* Setup cpumask pointer array */
>   	cpumask_var_t *pmask[4] = {
>   		&trial->cpus_allowed,
> @@ -2980,6 +2983,7 @@ static void reset_migrate_dl_data(struct cpuset *cs)
>   {
>   	cs->nr_migrate_dl_tasks = 0;
>   	cs->sum_migrate_dl_bw = 0;
> +	cs->dl_bw_cpu = -1;
>   }
>   
>   /* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
> @@ -3056,6 +3060,8 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   			reset_migrate_dl_data(cs);
>   			goto out_unlock;
>   		}
> +
> +		cs->dl_bw_cpu = cpu;
>   	}
>   
>   out_success:
> @@ -3080,12 +3086,11 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>   	mutex_lock(&cpuset_mutex);
>   	dec_attach_in_progress_locked(cs);
>   
> -	if (cs->nr_migrate_dl_tasks) {
> -		int cpu = cpumask_any(cs->effective_cpus);
> +	if (cs->dl_bw_cpu >= 0)
> +		dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
>   
> -		dl_bw_free(cpu, cs->sum_migrate_dl_bw);
> +	if (cs->nr_migrate_dl_tasks)
>   		reset_migrate_dl_data(cs);
> -	}
>   
>   	mutex_unlock(&cpuset_mutex);
>   }

The patch looks correct to me.

Reviewed-by: Waiman Long <longman@redhat.com>

However, I have a DL bandwidth accounting question unrelated to this 
patch that I would like the scheduler people to clarify. The allocation 
of additional DL BW is based on the condition

         if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)) {

IOW, additional DL BW will need to be allocated when the old and new 
cpusets don't overlap. However, they could still be in the same root 
domain. Does that mean we will be double counting it?

Looking from the other side, the root domain may have enough DL BW for 
the task migration, but the subset of CPUs in the cpuset itself may not 
have enough total DL BW to host all the DL tasks to be migrated, is that 
a problem?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] cgroup/cpuset: fix DL rollback accounting and add a selftest
  2026-04-17  3:37 [PATCH 0/2] cgroup/cpuset: fix DL rollback accounting and add a selftest Guopeng Zhang
  2026-04-17  3:37 ` [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback Guopeng Zhang
  2026-04-17  3:37 ` [PATCH 2/2] selftests/sched_ext: add cpuset DL rollback test Guopeng Zhang
@ 2026-04-17 19:03 ` Tejun Heo
  2026-04-20  2:35   ` Guopeng Zhang
  2 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2026-04-17 19:03 UTC (permalink / raw)
  To: Guopeng Zhang
  Cc: longman, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong, cgroups, sched-ext, linux-kselftest, linux-kernel

Hello,

On Fri, Apr 17, 2026 at 11:37:40AM +0800, Guopeng Zhang wrote:
> Guopeng Zhang (2):
>   cgroup/cpuset: record DL BW alloc CPU for attach rollback
>   selftests/sched_ext: add cpuset DL rollback test

Applied 1 to cgroup/for-7.1-fixes.

For 2, a cpuset test doesn't belong under selftests/sched_ext, but
selftests/cgroup doesn't have the machinery to host a sched_ext-based test
either, so there's no good home for it right now. If the rollback path can
be exercised without a BPF scheduler, please rewrite it that way and resend
targeting selftests/cgroup.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] selftests/sched_ext: add cpuset DL rollback test
  2026-04-17 16:23   ` Cheng-Yang Chou
@ 2026-04-20  1:56     ` Guopeng Zhang
  0 siblings, 0 replies; 13+ messages in thread
From: Guopeng Zhang @ 2026-04-20  1:56 UTC (permalink / raw)
  To: Cheng-Yang Chou
  Cc: longman, tj, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong, cgroups, sched-ext, linux-kselftest, linux-kernel,
	jserv, chia7712



On 2026/4/18 0:23, Cheng-Yang Chou wrote:
> Hi Guopeng,
> 
> On Fri, Apr 17, 2026 at 11:37:42AM +0800, Guopeng Zhang wrote:
> [...]
>> +
>> +#ifndef SYS_sched_setattr
>> +#if defined(__x86_64__)
>> +#define SYS_sched_setattr 314
>> +#elif defined(__i386__)
>> +#define SYS_sched_setattr 351
>> +#elif defined(__aarch64__)
> 
> Nit: Since RISC-V uses the same assigned syscall number as ARM64, it
> would be nice to support it here as well,
> 
...
> 
>> +static enum scx_test_status run(void *arg)
>> +{
>> +	struct cpuset_dl_rollback_ctx *ctx = arg;
>> +	char procs_path[PATH_MAX];
>> +	long long before_bw, after_bw;
>> +	int ret;
>> +
>> +	ret = read_cpu_total_bw(ctx->target_cpu, &before_bw);
>> +	SCX_FAIL_IF(ret, "Failed to read baseline total_bw (%d)", ret);
> 
> The first read_cpu_total_bw() call is redundant because before_bw is
> overwritten after spawn_dl_child(). Use a separate variable for the
> baseline if needed. Otherwise, the second call already handles the same
> error check.
> 
...
>> +	SCX_FAIL_IF(ret, "Failed to start SCHED_DEADLINE child (%d)", ret);
>> +
>> +	ret = read_cpu_total_bw(ctx->target_cpu, &before_bw);
> 
> Overwritten here.
> 
>> +	SCX_FAIL_IF(ret, "Failed to read pre-move total_bw (%d)", ret);
> 
Hi Cheng-Yang,

Thanks for the review.

Both comments make sense. The current test is implemented under
`selftests/sched_ext`. If this test case is later moved to
`selftests/cgroup`, I’ll address both points in the next revision.

Thanks,
Guopeng

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback
  2026-04-17 18:51   ` Waiman Long
@ 2026-04-20  2:21     ` Guopeng Zhang
  2026-04-20  2:31       ` Waiman Long
  0 siblings, 1 reply; 13+ messages in thread
From: Guopeng Zhang @ 2026-04-20  2:21 UTC (permalink / raw)
  To: Waiman Long, tj, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong, Juri Lelli, Valentin Schneider, Dietmar Eggemann
  Cc: cgroups, sched-ext, linux-kselftest, linux-kernel



On 2026/4/18 2:51, Waiman Long wrote:
> On 4/16/26 11:37 PM, Guopeng Zhang wrote:
>> cpuset_can_attach() allocates DL bandwidth only when migrating
>> deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach()
>> rolls back based only on nr_migrate_dl_tasks. This makes the DL
>> bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free()
>> even when no dl_bw_alloc() was done.
>>
>> Rollback also needs to undo the reservation against the same CPU/root
>> domain that was charged. Record the CPU used by dl_bw_alloc() and use
>> that state in cpuset_cancel_attach(). If no allocation happened,
>> dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation
>> did happen, bandwidth is returned to the same CPU/root domain.
>>
>> Successful attach paths are unchanged. This only fixes failed attach
>> rollback accounting.
>>
>> Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
>> Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
...
> 
> The patch looks correct to me.
> 
> Reviewed-by: Waiman Long <longman@redhat.com>
Hi Waiman,

Thank you for the review and for the Reviewed-by.
> 
> However, I have a DL bandwidth accounting question unrelated to this patch that I would like the scheduler people to clarify. The allocation of additional DL BW is based on the condition
> 
>         if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)) {
> 
> IOW, additional DL BW will need to be allocated when the old and new cpusets don't overlap. However, they could still be in the same root domain. Does that mean we will be double counting it?
I think you are right to call this out. Looking at the
current logic, !cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)
does not obviously guarantee that the migration is crossing into a different
root domain. If the old and new cpusets are disjoint but still belong to the
same root domain, it does look possible that we reserve bandwidth on the
destination side without a corresponding subtraction from the source side.
I will try to reproduce that configuration and follow up with results.
> 
> Looking from the other side, the root domain may have enough DL BW for the task migration, but the subset of CPUs in the cpuset itself may not have enough total DL BW to host all the DL tasks to be migrated, is that a problem?
My current understanding is that the DL bandwidth
accounting is done at root-domain granularity, not at arbitrary cpuset-subset
granularity. That also seems consistent with
Documentation/scheduler/sched-deadline.rst, which says that deadline tasks
cannot have a CPU affinity mask smaller than the root domain they are created
on, and that a restricted CPU set should be achieved by creating a restricted
root domain with cpuset.

So if a cpuset is only a subset inside a larger root domain, it does not seem
to get an independent DL bandwidth limit of its own. If that understanding is
correct, then the smaller cpuset not having enough bandwidth by itself would
be a limitation of that model rather than something this code checks
separately. I'd appreciate confirmation from the scheduler folks on that
point.

Thanks,
Guopeng
> 
> Cheers,
> Longman


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback
  2026-04-17  3:37 ` [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback Guopeng Zhang
  2026-04-17 18:51   ` Waiman Long
@ 2026-04-20  2:23   ` Chen Ridong
  1 sibling, 0 replies; 13+ messages in thread
From: Chen Ridong @ 2026-04-20  2:23 UTC (permalink / raw)
  To: Guopeng Zhang, longman, tj, hannes, mkoutny, void, arighi,
	changwoo, shuah
  Cc: cgroups, sched-ext, linux-kselftest, linux-kernel



On 2026/4/17 11:37, Guopeng Zhang wrote:
> cpuset_can_attach() allocates DL bandwidth only when migrating
> deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach()
> rolls back based only on nr_migrate_dl_tasks. This makes the DL
> bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free()
> even when no dl_bw_alloc() was done.
> 
> Rollback also needs to undo the reservation against the same CPU/root
> domain that was charged. Record the CPU used by dl_bw_alloc() and use
> that state in cpuset_cancel_attach(). If no allocation happened,
> dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation
> did happen, bandwidth is returned to the same CPU/root domain.
> 
> Successful attach paths are unchanged. This only fixes failed attach
> rollback accounting.
> 
> Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
> Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
> ---
>  kernel/cgroup/cpuset-internal.h |  5 +++++
>  kernel/cgroup/cpuset.c          | 13 +++++++++----
>  2 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index fd7d19842ded..bb4e692bea30 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -168,6 +168,11 @@ struct cpuset {
>  	int nr_deadline_tasks;
>  	int nr_migrate_dl_tasks;
>  	u64 sum_migrate_dl_bw;
> +	/*
> +	 * CPU used for temporary DL bandwidth allocation during attach;
> +	 * -1 if no DL bandwidth was allocated in the current attach.
> +	 */
> +	int dl_bw_cpu;
>  
>  	/* Invalid partition error code, not lock protected */
>  	enum prs_errcode prs_err;
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 1335e437098e..e3a081a07c6d 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -288,6 +288,7 @@ struct cpuset top_cpuset = {
>  	.flags = BIT(CS_CPU_EXCLUSIVE) |
>  		 BIT(CS_MEM_EXCLUSIVE) | BIT(CS_SCHED_LOAD_BALANCE),
>  	.partition_root_state = PRS_ROOT,
> +	.dl_bw_cpu = -1,
>  };
>  
>  /**
> @@ -579,6 +580,8 @@ static struct cpuset *dup_or_alloc_cpuset(struct cpuset *cs)
>  	if (!trial)
>  		return NULL;
>  
> +	trial->dl_bw_cpu = -1;
> +
>  	/* Setup cpumask pointer array */
>  	cpumask_var_t *pmask[4] = {
>  		&trial->cpus_allowed,
> @@ -2980,6 +2983,7 @@ static void reset_migrate_dl_data(struct cpuset *cs)
>  {
>  	cs->nr_migrate_dl_tasks = 0;
>  	cs->sum_migrate_dl_bw = 0;
> +	cs->dl_bw_cpu = -1;
>  }
>  
>  /* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
> @@ -3056,6 +3060,8 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>  			reset_migrate_dl_data(cs);
>  			goto out_unlock;
>  		}
> +
> +		cs->dl_bw_cpu = cpu;
>  	}
>  
>  out_success:
> @@ -3080,12 +3086,11 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>  	mutex_lock(&cpuset_mutex);
>  	dec_attach_in_progress_locked(cs);
>  
> -	if (cs->nr_migrate_dl_tasks) {
> -		int cpu = cpumask_any(cs->effective_cpus);
> +	if (cs->dl_bw_cpu >= 0)
> +		dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
>  
> -		dl_bw_free(cpu, cs->sum_migrate_dl_bw);
> +	if (cs->nr_migrate_dl_tasks)
>  		reset_migrate_dl_data(cs);
> -	}
>  
>  	mutex_unlock(&cpuset_mutex);
>  }

Good catch.

Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback
  2026-04-20  2:21     ` Guopeng Zhang
@ 2026-04-20  2:31       ` Waiman Long
  2026-04-20  7:58         ` Juri Lelli
  2026-04-21  8:53         ` Guopeng Zhang
  0 siblings, 2 replies; 13+ messages in thread
From: Waiman Long @ 2026-04-20  2:31 UTC (permalink / raw)
  To: Guopeng Zhang, tj, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong, Juri Lelli, Valentin Schneider, Dietmar Eggemann
  Cc: cgroups, sched-ext, linux-kselftest, linux-kernel

On 4/19/26 10:21 PM, Guopeng Zhang wrote:
>
> On 2026/4/18 2:51, Waiman Long wrote:
>> On 4/16/26 11:37 PM, Guopeng Zhang wrote:
>>> cpuset_can_attach() allocates DL bandwidth only when migrating
>>> deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach()
>>> rolls back based only on nr_migrate_dl_tasks. This makes the DL
>>> bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free()
>>> even when no dl_bw_alloc() was done.
>>>
>>> Rollback also needs to undo the reservation against the same CPU/root
>>> domain that was charged. Record the CPU used by dl_bw_alloc() and use
>>> that state in cpuset_cancel_attach(). If no allocation happened,
>>> dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation
>>> did happen, bandwidth is returned to the same CPU/root domain.
>>>
>>> Successful attach paths are unchanged. This only fixes failed attach
>>> rollback accounting.
>>>
>>> Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
>>> Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
> ...
>> The patch looks correct to me.
>>
>> Reviewed-by: Waiman Long <longman@redhat.com>
> Hi Waiman,
>
> Thank you for the review and for the Reviewed-by.
>> However, I have a DL bandwidth accounting question unrelated to this patch that I would like the scheduler people to clarify. The allocation of additional DL BW is based on the condition
>>
>>          if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)) {
>>
>> IOW, additional DL BW will need to be allocated when the old and new cpusets don't overlap. However, they could still be in the same root domain. Does that mean we will be double counting it?
> I think you are right to call this out. Looking at the
> current logic, !cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)
> does not obviously guarantee that the migration is crossing into a different
> root domain. If the old and new cpusets are disjoint but still belong to the
> same root domain, it does look possible that we reserve bandwidth on the
> destination side without a corresponding subtraction from the source side.
> I will try to reproduce that configuration and follow up with results.
>> Looking from the other side, the root domain may have enough DL BW for the task migration, but the subset of CPUs in the cpuset itself may not have enough total DL BW to host all the DL tasks to be migrated, is that a problem?
> My current understanding is that the DL bandwidth
> accounting is done at root-domain granularity, not at arbitrary cpuset-subset
> granularity.
That is my understanding too.
> That also seems consistent with
> Documentation/scheduler/sched-deadline.rst, which says that deadline tasks
> cannot have a CPU affinity mask smaller than the root domain they are created
> on, and that a restricted CPU set should be achieved by creating a restricted
> root domain with cpuset.

A root domain should be created by creating a cpuset root partition for v2 
or using the cpuset.cpu_exclusive flag in v1.

What is listed in the documentation is the ideal case, but users may not 
strictly follow the rule.

Cheers,
Longman

>
> So if a cpuset is only a subset inside a larger root domain, it does not seem
> to get an independent DL bandwidth limit of its own. If that understanding is
> correct, then the smaller cpuset not having enough bandwidth by itself would
> be a limitation of that model rather than something this code checks
> separately. I'd appreciate confirmation from the scheduler folks on that
> point.
>
> Thanks,
> Guopeng
>> Cheers,
>> Longman


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] cgroup/cpuset: fix DL rollback accounting and add a selftest
  2026-04-17 19:03 ` [PATCH 0/2] cgroup/cpuset: fix DL rollback accounting and add a selftest Tejun Heo
@ 2026-04-20  2:35   ` Guopeng Zhang
  0 siblings, 0 replies; 13+ messages in thread
From: Guopeng Zhang @ 2026-04-20  2:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: longman, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong, cgroups, sched-ext, linux-kselftest, linux-kernel



On 2026/4/18 3:03, Tejun Heo wrote:
> Hello,
> 
> On Fri, Apr 17, 2026 at 11:37:40AM +0800, Guopeng Zhang wrote:
>> Guopeng Zhang (2):
>>   cgroup/cpuset: record DL BW alloc CPU for attach rollback
>>   selftests/sched_ext: add cpuset DL rollback test
> 
> Applied 1 to cgroup/for-7.1-fixes.
Hello Tejun,

Thanks for applying patch 1.
> 
> For 2, a cpuset test doesn't belong under selftests/sched_ext, but
> selftests/cgroup doesn't have the machinery to host a sched_ext-based test
> either, so there's no good home for it right now. If the rollback path can
> be exercised without a BPF scheduler, please rewrite it that way and resend
> targeting selftests/cgroup.
I will look into whether this can be rewritten without a BPF scheduler
and resent under selftests/cgroup.

Thanks,
Guopeng
> 
> Thanks.
> 
> --
> tejun


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback
  2026-04-20  2:31       ` Waiman Long
@ 2026-04-20  7:58         ` Juri Lelli
  2026-04-21  8:53         ` Guopeng Zhang
  1 sibling, 0 replies; 13+ messages in thread
From: Juri Lelli @ 2026-04-20  7:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Guopeng Zhang, tj, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong, Juri Lelli, Valentin Schneider, Dietmar Eggemann,
	cgroups, sched-ext, linux-kselftest, linux-kernel

Hi!

On 2026-04-19 22:31:29-04:00, Waiman Long wrote:
> On 4/19/26 10:21 PM, Guopeng Zhang wrote:
> 
> > On 2026/4/18 2:51, Waiman Long wrote:
> > ...
> > Hi Waiman,
> >
> > Thank you for the review and for the Reviewed-by.
> > I think you are right to call this out. Looking at the
> > current logic, !cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)
> > does not obviously guarantee that the migration is crossing into a different
> > root domain. If the old and new cpusets are disjoint but still belong to the
> > same root domain, it does look possible that we reserve bandwidth on the
> > destination side without a corresponding subtraction from the source side.
> > I will try to reproduce that configuration and follow up with results.
> > My current understanding is that the DL bandwidth
> > accounting is done at root-domain granularity, not at arbitrary cpuset-subset
> > granularity.
> 
> That is my understanding too.
> 
> > That also seems consistent with
> > Documentation/scheduler/sched-deadline.rst, which says that deadline tasks
> > cannot have a CPU affinity mask smaller than the root domain they are created
> > on, and that a restricted CPU set should be achieved by creating a restricted
> > root domain with cpuset.
> 
> A root domain should be created by creating a cpuset root partition for v2 
> or using the cpuset.cpu_exclusive flag in v1.
> 
> What is listed in the documentation is the ideal case, but users may not 
> strictly follow the rule.

But, if they don't and try to create a DEADLINE task on a cpuset that
is a subset of a root domain, that should fail, as the affinity mask
won't cover the entire root domain. So no BW is allocated (and no
additional data structures) for subsets either.

Thanks,
Juri


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback
  2026-04-20  2:31       ` Waiman Long
  2026-04-20  7:58         ` Juri Lelli
@ 2026-04-21  8:53         ` Guopeng Zhang
  1 sibling, 0 replies; 13+ messages in thread
From: Guopeng Zhang @ 2026-04-21  8:53 UTC (permalink / raw)
  To: Waiman Long, tj, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong, Juri Lelli, Valentin Schneider, Dietmar Eggemann
  Cc: cgroups, sched-ext, linux-kselftest, linux-kernel



On 2026/4/20 10:31, Waiman Long wrote:
> On 4/19/26 10:21 PM, Guopeng Zhang wrote:
>>
>> On 2026/4/18 2:51, Waiman Long wrote:
>>> On 4/16/26 11:37 PM, Guopeng Zhang wrote:
>>>> cpuset_can_attach() allocates DL bandwidth only when migrating
>>>> deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach()
>>>> rolls back based only on nr_migrate_dl_tasks. This makes the DL
>>>> bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free()
>>>> even when no dl_bw_alloc() was done.
>>>>
>>>> Rollback also needs to undo the reservation against the same CPU/root
>>>> domain that was charged. Record the CPU used by dl_bw_alloc() and use
>>>> that state in cpuset_cancel_attach(). If no allocation happened,
>>>> dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation
>>>> did happen, bandwidth is returned to the same CPU/root domain.
>>>>
>>>> Successful attach paths are unchanged. This only fixes failed attach
>>>> rollback accounting.
>>>>
>>>> Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
>>>> Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
>> ...
>>> The patch looks correct to me.
>>>
>>> Reviewed-by: Waiman Long <longman@redhat.com>
>> Hi Waiman,
>>
>> Thank you for the review and for the Reviewed-by.
>>> However, I have a DL bandwidth accounting question unrelated to this patch that I would like the scheduler people to clarify. The allocation of additional DL BW is based on the condition
>>>
>>>          if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)) {
>>>
>>> IOW, additional DL BW will need to be allocated when the old and new cpusets don't overlap. However, they could still be in the same root domain. Does that mean we will be double counting it?
>> I think you are right to call this out. Looking at the
>> current logic, !cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)
>> does not obviously guarantee that the migration is crossing into a different
>> root domain. If the old and new cpusets are disjoint but still belong to the
>> same root domain, it does look possible that we reserve bandwidth on the
>> destination side without a corresponding subtraction from the source side.
>> I will try to reproduce that configuration and follow up with results.
Hi Waiman,

I reproduced the issue you pointed out, and the result does support
your concern.

I also tested the follow-up fix here:
https://lore.kernel.org/all/20260421083449.95750-1-zhangguopeng@kylinos.cn/

I tested two cases:
1. disjoint member cpusets that still belong to the same root domain
2. disjoint partition-root cpusets that do cross root domains

The results look consistent with the bug and with the fix.

Case 1: disjoint member cpusets
Setup:
src: cpuset.cpus = 1-15
dst: cpuset.cpus = 0
both remained "member"

Without the fix, successful back-and-forth migration of the same
SCHED_DEADLINE task caused dl_bw->total_bw on CPU0 to increase
monotonically:
BW0 = 2027221
BW1 = 2376746
BW2 = 2726271

So:
BW1 - BW0 = 349525
BW2 - BW0 = 699050

That is, after src -> dst, dl_bw->total_bw increased, and after
dst -> src it increased again by about the same amount instead of
returning to the original value.
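
(For scale: total_bw is the fixed-point sum to_ratio(period, runtime)
= (runtime << BW_SHIFT) / period with BW_SHIFT = 20, so a per-move
delta of 349525 ~= (1 << 20) / 3 is what a test task with
runtime/period = 1/3 would leak on each migration. The exact dl_test
parameters are an assumption on my part; only the ratio matters here.)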

With the fix applied, the same reproducer no longer shows any net
increase, while the attach path still succeeds:

BW0 = 2027221
BW1 = 2027221
BW2 = 2027221

So:
BW1 - BW0 = 0
BW2 - BW0 = 0

Case 2: disjoint partition-root cpusets (true cross-root-domain move)
I also tested a configuration where src and dst are separate partition
roots:
src: cpuset.cpus = 0-6, cpuset.cpus.partition = root
dst: cpuset.cpus = 8-14, cpuset.cpus.partition = root

Then I started the DL task in src and migrated it to dst.

The accounting moved as expected:
Before src -> dst:
SRC0 = 1083517
DST0 =  733992

After src -> dst:
SRC1 =  733992
DST1 = 1083517

Deltas:
SRC delta after src -> dst = -349525
DST delta after src -> dst = +349525

After moving the same task back to src:
SRC2 = 1083517
DST2 =  733992

So both sides returned to baseline:
SRC2 - SRC0 = 0
DST2 - DST0 = 0

So with the fix applied, the same-root-domain case no longer leaves
persistent extra DL bandwidth accounting, while the true cross-root-domain
case still moves the bandwidth accounting as expected.

Shortened reproducers and observed values are below.

Case 1: disjoint member cpusets

echo "+cpu +cpuset" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/dl-rd-test
echo 0-15 > /sys/fs/cgroup/dl-rd-test/cpuset.cpus
echo 0 > /sys/fs/cgroup/dl-rd-test/cpuset.mems
echo "+cpu +cpuset" > /sys/fs/cgroup/dl-rd-test/cgroup.subtree_control

mkdir -p /sys/fs/cgroup/dl-rd-test/src
mkdir -p /sys/fs/cgroup/dl-rd-test/dst
echo 1-15 > /sys/fs/cgroup/dl-rd-test/src/cpuset.cpus
echo 0 > /sys/fs/cgroup/dl-rd-test/src/cpuset.mems
echo 0 > /sys/fs/cgroup/dl-rd-test/dst/cpuset.cpus
echo 0 > /sys/fs/cgroup/dl-rd-test/dst/cpuset.mems

/tmp/dl_test &
PID=$!
echo $PID > /sys/fs/cgroup/dl-rd-test/src/cgroup.procs

# read BW0 from cpu0 dl_bw->total_bw
# move src -> dst
# read BW1
# move dst -> src
# read BW2

Observed without fix:
BW0=2027221
BW1=2376746
BW2=2726271

Observed with fix:
BW0=2027221
BW1=2027221
BW2=2027221
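
(/tmp/dl_test itself is not included in this thread; for anyone trying
to reproduce, a minimal stand-in is below. It just enters
SCHED_DEADLINE and idles so it can be migrated between cpusets. The
10ms/30ms parameters are my guess, chosen to match the 349525 deltas
above; any admissible runtime/period pair should work. Run it as root.)

#include <linux/types.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Matches include/uapi/linux/sched/types.h (VER0 fields only) */
struct sched_attr {
	__u32 size;
	__u32 sched_policy;
	__u64 sched_flags;
	__s32 sched_nice;
	__u32 sched_priority;
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_runtime	= 10 * 1000 * 1000,	/* 10 ms */
		.sched_deadline	= 30 * 1000 * 1000,	/* 30 ms */
		.sched_period	= 30 * 1000 * 1000,	/* 30 ms */
	};

	/* glibc has no wrapper for sched_setattr(); use the raw syscall. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	/* Idle forever; cgroup migrations are driven from the shell. */
	for (;;)
		sleep(1);
}

Built with: gcc -o /tmp/dl_test dl_test.c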

Case 2: disjoint partition-root cpusets
echo "+cpu +cpuset" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/dl-rd-part-test
echo 0-15 > /sys/fs/cgroup/dl-rd-part-test/cpuset.cpus
echo 0 > /sys/fs/cgroup/dl-rd-part-test/cpuset.mems
echo 0-15 > /sys/fs/cgroup/dl-rd-part-test/cpuset.cpus.exclusive
echo "+cpu +cpuset" > /sys/fs/cgroup/dl-rd-part-test/cgroup.subtree_control

mkdir -p /sys/fs/cgroup/dl-rd-part-test/src
echo 0-6 > /sys/fs/cgroup/dl-rd-part-test/src/cpuset.cpus
echo 0 > /sys/fs/cgroup/dl-rd-part-test/src/cpuset.mems
echo 0-6 > /sys/fs/cgroup/dl-rd-part-test/src/cpuset.cpus.exclusive
echo root > /sys/fs/cgroup/dl-rd-part-test/src/cpuset.cpus.partition

mkdir -p /sys/fs/cgroup/dl-rd-part-test/dst
echo 8-14 > /sys/fs/cgroup/dl-rd-part-test/dst/cpuset.cpus
echo 0 > /sys/fs/cgroup/dl-rd-part-test/dst/cpuset.mems
echo 8-14 > /sys/fs/cgroup/dl-rd-part-test/dst/cpuset.cpus.exclusive
echo root > /sys/fs/cgroup/dl-rd-part-test/dst/cpuset.cpus.partition

sh -c 'echo $$ > /sys/fs/cgroup/dl-rd-part-test/src/cgroup.procs; exec /tmp/dl_test' &
PID=$!

# read source-side and destination-side dl_bw->total_bw
# move src -> dst
# read both again
# move dst -> src
# read both again

Observed with fix:
SRC0=1083517
DST0=733992
SRC1=733992
DST1=1083517
SRC2=1083517
DST2=733992

This matches the intended behavior: no persistent increase within one
root domain, and correct bandwidth transfer across root domains.

Cheers,
Guopeng

> >>> Looking from the other side, the root domain may have enough DL BW for the task migration, but the subset of CPUs in the cpuset itself may not have enough total DL BW to host all the DL tasks to be migrated. Is that a problem?
>> My current understanding is that the DL bandwidth
>> accounting is done at root-domain granularity, not at arbitrary cpuset-subset
>> granularity.
> That is my understanding too.
>> That also seems consistent with
>> Documentation/scheduler/sched-deadline.rst, which says that deadline tasks
>> cannot have a CPU affinity mask smaller than the root domain they are created
>> on, and that a restricted CPU set should be achieved by creating a restricted
>> root domain with cpuset.
> 
> A root domain should be created by creating cpuset root partition for v2 or using the cpuset.cpu_exclusive flag in v1.
> 
> What is listed in the documentation is the ideal case, but users may not strictly follow the rule.
> 
> Cheers,
> Longman
> 
>>
>> So if a cpuset is only a subset inside a larger root domain, it does not seem
>> to get an independent DL bandwidth limit of its own. If that understanding is
>> correct, then the smaller cpuset not having enough bandwidth by itself would
>> be a limitation of that model rather than something this code checks
>> separately. I'd appreciate confirmation from the scheduler folks on that
>> point.
>>
>> Thanks,
>> Guopeng
>>> Cheers,
>>> Longman


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-04-21  8:53 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-17  3:37 [PATCH 0/2] cgroup/cpuset: fix DL rollback accounting and add a selftest Guopeng Zhang
2026-04-17  3:37 ` [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback Guopeng Zhang
2026-04-17 18:51   ` Waiman Long
2026-04-20  2:21     ` Guopeng Zhang
2026-04-20  2:31       ` Waiman Long
2026-04-20  7:58         ` Juri Lelli
2026-04-21  8:53         ` Guopeng Zhang
2026-04-20  2:23   ` Chen Ridong
2026-04-17  3:37 ` [PATCH 2/2] selftests/sched_ext: add cpuset DL rollback test Guopeng Zhang
2026-04-17 16:23   ` Cheng-Yang Chou
2026-04-20  1:56     ` Guopeng Zhang
2026-04-17 19:03 ` [PATCH 0/2] cgroup/cpuset: fix DL rollback accounting and add a selftest Tejun Heo
2026-04-20  2:35   ` Guopeng Zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox