* [PATCH] cpuset: Fix multi-source deadline task accounting and bandwidth bypass
@ 2026-05-12 1:03 Aaron Tomlin
From: Aaron Tomlin @ 2026-05-12 1:03 UTC (permalink / raw)
To: longman, tj, hannes, mkoutny; +Cc: chenridong, neelx, cgroups, linux-kernel
During a batch migration where threads in a taskset originate from
multiple source cpusets (e.g., via cgroup.procs), cpuset_can_attach()
and cpuset_attach() currently evaluate the source cpuset exactly once
by caching the first task's oldcs.
This creates two distinct critical flaws for SCHED_DEADLINE tasks:
1. oldcs->nr_deadline_tasks is decremented only against the first
source cpuset. Tasks that originated in other cpusets leave their
source counts permanently inflated, while the first cpuset's
count permanently underflows.
2. cpumask_intersects() is evaluated only against the first
task's source cpuset. This allows tasks originating from
entirely isolated root domains to silently bypass
dl_bw_alloc() admission control.
This patch refactors the deadline accounting to evaluate task_cs(task)
on a per-task basis inside the cgroup_taskset_for_each() loops. To
keep the accounting accurate before the core cgroup migration actually
executes, the permanent nr_deadline_tasks increments/decrements are
moved into cpuset_can_attach(). If the migration aborts, the counts
are reverted either by an internal rollback loop or by the
cpuset_cancel_attach() callback.
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
kernel/cgroup/cpuset.c | 53 +++++++++++++++++++++++++++++++-----------
1 file changed, 39 insertions(+), 14 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e3a081a07c6d..36f1d28f8ade 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3034,32 +3034,36 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
if (setsched_check) {
ret = security_task_setscheduler(task);
if (ret)
- goto out_unlock;
+ goto out_unlock_reset;
}
if (dl_task(task)) {
+ struct cpuset *old_cs = task_cs(task);
+
cs->nr_migrate_dl_tasks++;
- cs->sum_migrate_dl_bw += task->dl.dl_bw;
+ old_cs->nr_deadline_tasks--;
+ cs->nr_deadline_tasks++;
+
+ if (!cpumask_intersects(old_cs->effective_cpus,
+ cs->effective_cpus))
+ cs->sum_migrate_dl_bw += task->dl.dl_bw;
}
}
if (!cs->nr_migrate_dl_tasks)
goto out_success;
- if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)) {
+ if (cs->sum_migrate_dl_bw) {
int cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
if (unlikely(cpu >= nr_cpu_ids)) {
- reset_migrate_dl_data(cs);
ret = -EINVAL;
- goto out_unlock;
+ goto out_unlock_reset;
}
ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
- if (ret) {
- reset_migrate_dl_data(cs);
- goto out_unlock;
- }
+ if (ret)
+ goto out_unlock_reset;
cs->dl_bw_cpu = cpu;
}
@@ -3070,6 +3074,22 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* changes which zero cpus/mems_allowed.
*/
cs->attach_in_progress++;
+ goto out_unlock;
+
+out_unlock_reset:
+ if (cs->nr_migrate_dl_tasks) {
+ struct task_struct *t;
+
+ cgroup_taskset_for_each(t, css, tset) {
+ if (t == task)
+ break;
+ if (dl_task(t)) {
+ task_cs(t)->nr_deadline_tasks++;
+ cs->nr_deadline_tasks--;
+ }
+ }
+ reset_migrate_dl_data(cs);
+ }
out_unlock:
mutex_unlock(&cpuset_mutex);
return ret;
@@ -3079,6 +3099,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
{
struct cgroup_subsys_state *css;
struct cpuset *cs;
+ struct task_struct *task;
cgroup_taskset_first(tset, &css);
cs = css_cs(css);
@@ -3089,8 +3110,15 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
if (cs->dl_bw_cpu >= 0)
dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
- if (cs->nr_migrate_dl_tasks)
+ if (cs->nr_migrate_dl_tasks) {
+ cgroup_taskset_for_each(task, css, tset) {
+ if (dl_task(task)) {
+ task_cs(task)->nr_deadline_tasks++;
+ cs->nr_deadline_tasks--;
+ }
+ }
reset_migrate_dl_data(cs);
+ }
mutex_unlock(&cpuset_mutex);
}
@@ -3195,11 +3223,8 @@ static void cpuset_attach(struct cgroup_taskset *tset)
schedule_flush_migrate_mm();
cs->old_mems_allowed = cpuset_attach_nodemask_to;
- if (cs->nr_migrate_dl_tasks) {
- cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
- oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
+ if (cs->nr_migrate_dl_tasks)
reset_migrate_dl_data(cs);
- }
dec_attach_in_progress_locked(cs);
--
2.51.0
* Re: [PATCH] cpuset: Fix multi-source deadline task accounting and bandwidth bypass
From: Dietmar Eggemann @ 2026-05-13 16:22 UTC (permalink / raw)
To: Aaron Tomlin, longman, tj, hannes, mkoutny
Cc: chenridong, neelx, cgroups, linux-kernel
On 12.05.26 03:03, Aaron Tomlin wrote:
> During a batch migration where threads in a taskset originate from
> multiple source cpusets (e.g., via cgroup.procs), cpuset_can_attach()
> and cpuset_attach() currently evaluate the source cpuset exactly once
> by caching the first task's oldcs.
>
> This creates two distinct critical flaws for SCHED_DEADLINE tasks:
>
> 1. oldcs->nr_deadline_tasks is decremented solely on the first
> source cpuset. If tasks originated from other cpusets, their
> counts are permanently leaked, and the first cpuset permanently
> underflows.
>
> 2. cpumask_intersects() is evaluated strictly against the first
> task's source cpuset. This allows tasks originating from
> entirely isolated root domains to silently bypass the
> dl_bw_alloc() admission control.
>
> This patch refactors the deadline accounting to evaluate task_cs(task)
> on a per-task basis during the cgroup_taskset_for_each() loops. To
> achieve accurate accounting before the core cgroup migration actually
> executes, the permanent nr_deadline_tasks increments/decrements are
> shifted into cpuset_can_attach(). If the migration aborts, the counts
> are gracefully reverted via an internal rollback loop or the
> cpuset_cancel_attach() callback.
Is there a testcase to provoke this issue in the current code?
I tried to move a process with 6 DL tasks from one cpuset to another by:
echo $PID > /sys/fs/cgroup/B/cgroup.procs
but in this case old_cs is the same for all these tasks.
[ 1991.852034] cgroup_migrate() (7) leader=[dl_batch_cgroup 823] threadgroup=1
[ 1991.852068] cgroup_migrate_execute() tset->nr_tasks=7
[ 1991.852238] cpuset_can_attach() (4) [dl_batch_cgroup 832] nr_migrate_dl_tasks=1 sum_migrate_dl_bw=104857 old_cs=ffff0000c4955200
[ 1991.852246] cpuset_can_attach() (4) [dl_batch_cgroup 833] nr_migrate_dl_tasks=2 sum_migrate_dl_bw=209714 old_cs=ffff0000c4955200
[ 1991.852248] cpuset_can_attach() (4) [dl_batch_cgroup 834] nr_migrate_dl_tasks=3 sum_migrate_dl_bw=314571 old_cs=ffff0000c4955200
[ 1991.852249] cpuset_can_attach() (4) [dl_batch_cgroup 835] nr_migrate_dl_tasks=4 sum_migrate_dl_bw=419428 old_cs=ffff0000c4955200
[ 1991.852249] cpuset_can_attach() (4) [dl_batch_cgroup 836] nr_migrate_dl_tasks=5 sum_migrate_dl_bw=524285 old_cs=ffff0000c4955200
[ 1991.852250] cpuset_can_attach() (4) [dl_batch_cgroup 837] nr_migrate_dl_tasks=6 sum_migrate_dl_bw=629142 old_cs=ffff0000c4955200
[ 1991.852328] cpuset_attach() (5) cs=ffff0000c1e9fc00 oldcs=ffff0000c4955200 cs->nr_deadline_tasks=6 oldcs->nr_deadline_tasks=6 cs->nr_migrate_dl_tasks=6
dl_batch_cgroup 823 823 19 - 0 TS
dl_batch_cgroup 823 832 140 0 - DLN
dl_batch_cgroup 823 833 140 0 - DLN
dl_batch_cgroup 823 834 140 0 - DLN
dl_batch_cgroup 823 835 140 0 - DLN
dl_batch_cgroup 823 836 140 0 - DLN
dl_batch_cgroup 823 837 140 0 - DLN
[...]
* Re: [PATCH] cpuset: Fix multi-source deadline task accounting and bandwidth bypass
From: Aaron Tomlin @ 2026-05-13 23:09 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: longman, tj, hannes, mkoutny, chenridong, neelx, cgroups,
linux-kernel
On Wed, May 13, 2026 at 06:22:25PM +0200, Dietmar Eggemann wrote:
> Is there a testcase to provoke this issue in the current code?
>
> I tried to move a process with 6 DL tasks from one cpuset to another by:
>
> echo $PID > /sys/fs/cgroup/B/cgroup.procs
>
> but in this case old_cs is the same for all these tasks.
>
> [ 1991.852034] cgroup_migrate() (7) leader=[dl_batch_cgroup 823] threadgroup=1
> [ 1991.852068] cgroup_migrate_execute() tset->nr_tasks=7
> [ 1991.852238] cpuset_can_attach() (4) [dl_batch_cgroup 832] nr_migrate_dl_tasks=1 sum_migrate_dl_bw=104857 old_cs=ffff0000c4955200
> [ 1991.852246] cpuset_can_attach() (4) [dl_batch_cgroup 833] nr_migrate_dl_tasks=2 sum_migrate_dl_bw=209714 old_cs=ffff0000c4955200
> [ 1991.852248] cpuset_can_attach() (4) [dl_batch_cgroup 834] nr_migrate_dl_tasks=3 sum_migrate_dl_bw=314571 old_cs=ffff0000c4955200
> [ 1991.852249] cpuset_can_attach() (4) [dl_batch_cgroup 835] nr_migrate_dl_tasks=4 sum_migrate_dl_bw=419428 old_cs=ffff0000c4955200
> [ 1991.852249] cpuset_can_attach() (4) [dl_batch_cgroup 836] nr_migrate_dl_tasks=5 sum_migrate_dl_bw=524285 old_cs=ffff0000c4955200
> [ 1991.852250] cpuset_can_attach() (4) [dl_batch_cgroup 837] nr_migrate_dl_tasks=6 sum_migrate_dl_bw=629142 old_cs=ffff0000c4955200
> [ 1991.852328] cpuset_attach() (5) cs=ffff0000c1e9fc00 oldcs=ffff0000c4955200 cs->nr_deadline_tasks=6 oldcs->nr_deadline_tasks=6 cs->nr_migrate_dl_tasks=6
>
> dl_batch_cgroup 823 823 19 - 0 TS
> dl_batch_cgroup 823 832 140 0 - DLN
> dl_batch_cgroup 823 833 140 0 - DLN
> dl_batch_cgroup 823 834 140 0 - DLN
> dl_batch_cgroup 823 835 140 0 - DLN
> dl_batch_cgroup 823 836 140 0 - DLN
> dl_batch_cgroup 823 837 140 0 - DLN
>
> [...]
Hi Dietmar,
Thank you for your feedback.
When you write a PID to cgroup.procs, the cgroup core gathers all threads
in that threadgroup into a single cgroup_taskset. If those threads were
spawned normally and never individually moved, they will all share the
exact same old_cs, which is why your test yielded identical source cpusets.
To provoke this specific bug, you have to split the threads across
different cgroups before triggering the batch migration that pulls them
all back together.
Here is the test case to reproduce the multi-source edge case:
1. Create two source cpusets and one target cpuset
mkdir /sys/fs/cgroup/SRC_A
mkdir /sys/fs/cgroup/SRC_B
mkdir /sys/fs/cgroup/TARGET
2. Start your multithreaded DL application
Run your dl_batch_cgroup app. Let's assume it has PID 1000 and
spawns two SCHED_DEADLINE threads: TID 1001 and TID 1002.
3. Split the threads
Instead of moving the whole process, move the individual threads
into different source cpusets using the thread-level interface:
echo 1001 > /sys/fs/cgroup/SRC_A/cgroup.threads
echo 1002 > /sys/fs/cgroup/SRC_B/cgroup.threads
At this point, SRC_A has nr_deadline_tasks = 1 and SRC_B has
nr_deadline_tasks = 1.
4. Trigger the batch migration
Now trigger a process-level migration by writing the main
threadgroup ID to the target cpuset's cgroup.procs file:
echo 1000 > /sys/fs/cgroup/TARGET/cgroup.procs
Now, when you execute Step 4, the cgroup core gathers TID 1001 and 1002 into a
single cgroup_taskset. Because they originated from different cgroups, they
have different old_cs pointers.
However, the unpatched cpuset_can_attach() loops through the taskset, finds
the oldcs of the first task (e.g., SRC_A), and caches it. It then counts
that there are 2 migrating DL tasks (i.e., cs->nr_migrate_dl_tasks = 2).
In cpuset_attach(), it blindly subtracts the total migrating DL count from
the cached oldcs:
oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
/*
* SRC_A count becomes 1 - 2 = -1 (Underflow)
* SRC_B count remains 1 (Permanent leak)
*/
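The arithmetic above can be modelled in userspace. The sketch below is a
toy model only (plain dicts standing in for cpusets; none of these names
exist in the kernel), but it reproduces the underflow and the leak, and
shows why per-task accounting fixes both:

```python
# Toy model of nr_deadline_tasks accounting. NOT kernel code: cpusets are
# modelled as dict keys, tasks as dicts with a "cpuset" and a "dl" flag.

def migrate_cached_oldcs(tasks, counts, target):
    """Unpatched behaviour: cache the first task's source cpuset and
    subtract the whole migrating DL count from it alone."""
    oldcs = tasks[0]["cpuset"]               # cached once, like oldcs
    nr_migrate = sum(1 for t in tasks if t["dl"])
    counts[target] += nr_migrate
    counts[oldcs] -= nr_migrate              # wrong when sources differ

def migrate_per_task(tasks, counts, target):
    """Patched behaviour: account against each task's own source cpuset."""
    for t in tasks:
        if t["dl"]:
            counts[t["cpuset"]] -= 1
            counts[target] += 1

# Two DL tasks split across different source cpusets, as in the test case.
tasks = [{"cpuset": "SRC_A", "dl": True}, {"cpuset": "SRC_B", "dl": True}]

buggy = {"SRC_A": 1, "SRC_B": 1, "TARGET": 0}
migrate_cached_oldcs(tasks, buggy, "TARGET")
# buggy: SRC_A = -1 (underflow), SRC_B = 1 (leak), TARGET = 2

fixed = {"SRC_A": 1, "SRC_B": 1, "TARGET": 0}
migrate_per_task(tasks, fixed, "TARGET")
# fixed: SRC_A = 0, SRC_B = 0, TARGET = 2
```

With a single source cpuset both versions agree, which is why the
whole-process test above could not provoke the bug.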
The patch resolves this by evaluating task_cs(task) individually for
each task as the loop iterates through the cgroup_taskset.
Kind regards,
--
Aaron Tomlin
* Re: [PATCH] cpuset: Fix multi-source deadline task accounting and bandwidth bypass
From: Waiman Long @ 2026-05-13 23:19 UTC (permalink / raw)
To: Dietmar Eggemann, Aaron Tomlin, tj, hannes, mkoutny
Cc: chenridong, neelx, cgroups, linux-kernel
On 5/13/26 12:22 PM, Dietmar Eggemann wrote:
> On 12.05.26 03:03, Aaron Tomlin wrote:
>> During a batch migration where threads in a taskset originate from
>> multiple source cpusets (e.g., via cgroup.procs), cpuset_can_attach()
>> and cpuset_attach() currently evaluate the source cpuset exactly once
>> by caching the first task's oldcs.
>>
>> This creates two distinct critical flaws for SCHED_DEADLINE tasks:
>>
>> 1. oldcs->nr_deadline_tasks is decremented solely on the first
>> source cpuset. If tasks originated from other cpusets, their
>> counts are permanently leaked, and the first cpuset permanently
>> underflows.
>>
>> 2. cpumask_intersects() is evaluated strictly against the first
>> task's source cpuset. This allows tasks originating from
>> entirely isolated root domains to silently bypass the
>> dl_bw_alloc() admission control.
>>
>> This patch refactors the deadline accounting to evaluate task_cs(task)
>> on a per-task basis during the cgroup_taskset_for_each() loops. To
>> achieve accurate accounting before the core cgroup migration actually
>> executes, the permanent nr_deadline_tasks increments/decrements are
>> shifted into cpuset_can_attach(). If the migration aborts, the counts
>> are gracefully reverted via an internal rollback loop or the
>> cpuset_cancel_attach() callback.
> Is there a testcase to provoke this issue in the current code?
>
> I tried to move a process with 6 DL tasks from one cpuset to another by:
>
> echo $PID > /sys/fs/cgroup/B/cgroup.procs
>
> but in this case old_cs is the same for all these tasks.
>
> [ 1991.852034] cgroup_migrate() (7) leader=[dl_batch_cgroup 823] threadgroup=1
> [ 1991.852068] cgroup_migrate_execute() tset->nr_tasks=7
> [ 1991.852238] cpuset_can_attach() (4) [dl_batch_cgroup 832] nr_migrate_dl_tasks=1 sum_migrate_dl_bw=104857 old_cs=ffff0000c4955200
> [ 1991.852246] cpuset_can_attach() (4) [dl_batch_cgroup 833] nr_migrate_dl_tasks=2 sum_migrate_dl_bw=209714 old_cs=ffff0000c4955200
> [ 1991.852248] cpuset_can_attach() (4) [dl_batch_cgroup 834] nr_migrate_dl_tasks=3 sum_migrate_dl_bw=314571 old_cs=ffff0000c4955200
> [ 1991.852249] cpuset_can_attach() (4) [dl_batch_cgroup 835] nr_migrate_dl_tasks=4 sum_migrate_dl_bw=419428 old_cs=ffff0000c4955200
> [ 1991.852249] cpuset_can_attach() (4) [dl_batch_cgroup 836] nr_migrate_dl_tasks=5 sum_migrate_dl_bw=524285 old_cs=ffff0000c4955200
> [ 1991.852250] cpuset_can_attach() (4) [dl_batch_cgroup 837] nr_migrate_dl_tasks=6 sum_migrate_dl_bw=629142 old_cs=ffff0000c4955200
> [ 1991.852328] cpuset_attach() (5) cs=ffff0000c1e9fc00 oldcs=ffff0000c4955200 cs->nr_deadline_tasks=6 oldcs->nr_deadline_tasks=6 cs->nr_migrate_dl_tasks=6
>
> dl_batch_cgroup 823 823 19 - 0 TS
> dl_batch_cgroup 823 832 140 0 - DLN
> dl_batch_cgroup 823 833 140 0 - DLN
> dl_batch_cgroup 823 834 140 0 - DLN
> dl_batch_cgroup 823 835 140 0 - DLN
> dl_batch_cgroup 823 836 140 0 - DLN
> dl_batch_cgroup 823 837 140 0 - DLN
>
> [...]
Multiple source or destination cpusets in a task migration can only
happen when the cpuset controller is enabled or disabled in a cgroup
subtree. If there are DL tasks in two or more child cgroups, enabling or
disabling the cpuset controller for those child cgroups may lead to
incorrect DL task accounting. This patch will probably fix the DL
accounting aspect. However, there are also other issues, unrelated to DL
tasks, that need to be addressed as well, so this patch is incomplete in
that regard. I am working on a patch series to address these issues.
Hopefully I can send it out in a day or two.
Cheers,
Longman
* Re: [PATCH] cpuset: Fix multi-source deadline task accounting and bandwidth bypass
From: Aaron Tomlin @ 2026-05-13 23:39 UTC (permalink / raw)
To: Waiman Long
Cc: Dietmar Eggemann, tj, hannes, mkoutny, chenridong, neelx, cgroups,
linux-kernel
On Wed, May 13, 2026 at 07:19:18PM -0400, Waiman Long wrote:
> Multiple source or destination cpusets in task migration can only happens
> when the cpuset controller is enabled or disabled in a cgroup subtree. If
> there are DL tasks in 2 or more child cgroups, enabling or disabling of the
> cpuset controller for those child cgroups may lead to incorrect DL task
> accounting. This patch will probably fix the DL accounting aspect. However,
> there are also other issues unrelated to DL tasks that need to be addressed
> as well. So this patch is incomplete in this regard. I am working on a patch
> series to address these issues. Hopefully I can send it out in a day or 2.
>
Hi Longman,
Acknowledged.
Also, the Sashiko AI review reported: "TOCTOU race on dl_task(task) during
rollback causes state corruption."
A concurrent sched_setscheduler() could alter the scheduling class of a
task between the initial pass and a rollback. This assertion seems valid to
me. Currently, neither cgroup_mutex nor cpuset_mutex prevents scheduling
class changes.
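As a toy model of why re-evaluating dl_task() during rollback is fragile
(userspace-only sketch; the dict-based tasks and counters are
illustrative, not kernel structures):

```python
# Toy model of the TOCTOU hazard: if rollback re-checks each task's
# *current* scheduling class, a concurrent class change between the
# admission pass and the rollback corrupts the counters. NOT kernel code.

def rollback_reevaluate(tasks, counts):
    """Fragile rollback: re-checks the class at rollback time."""
    for t in tasks:
        if t["dl"]:                      # may differ from the first pass
            counts[t["cpuset"]] += 1

def rollback_recorded(moved, counts):
    """Robust rollback: undo exactly what the first pass recorded."""
    for t in moved:
        counts[t["cpuset"]] += 1

task = {"cpuset": "SRC_A", "dl": True}
counts = {"SRC_A": 0}                    # first pass decremented from 1
moved = [task]                           # record of what was changed

task["dl"] = False                       # concurrent "sched_setscheduler()"

broken = dict(counts)
rollback_reevaluate([task], broken)      # skips the task; count stays 0

correct = dict(counts)
rollback_recorded(moved, correct)        # restores the count to 1
```

Recording the per-task accounting decisions at admission time, rather
than re-checking the scheduling class, would make the rollback immune to
a concurrent class change.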
Should I let you handle this too?
Kind regards,
--
Aaron Tomlin
* Re: [PATCH] cpuset: Fix multi-source deadline task accounting and bandwidth bypass
From: Waiman Long @ 2026-05-14 4:26 UTC (permalink / raw)
To: Aaron Tomlin
Cc: Dietmar Eggemann, tj, hannes, mkoutny, chenridong, neelx, cgroups,
linux-kernel
On 5/13/26 7:39 PM, Aaron Tomlin wrote:
> On Wed, May 13, 2026 at 07:19:18PM -0400, Waiman Long wrote:
>> Multiple source or destination cpusets in task migration can only happens
>> when the cpuset controller is enabled or disabled in a cgroup subtree. If
>> there are DL tasks in 2 or more child cgroups, enabling or disabling of the
>> cpuset controller for those child cgroups may lead to incorrect DL task
>> accounting. This patch will probably fix the DL accounting aspect. However,
>> there are also other issues unrelated to DL tasks that need to be addressed
>> as well. So this patch is incomplete in this regard. I am working on a patch
>> series to address these issues. Hopefully I can send it out in a day or 2.
>>
> Hi Longman,
>
> Acknowledged.
>
> Also, the Sashiko AI review reported: "TOCTOU race on dl_task(task) during
> rollback causes state corruption."
>
> A concurrent sched_setscheduler() could alter the scheduling class of a
> task between the initial pass and a rollback. This assertion seems valid to
> me. Currently, neither cgroup_mutex or cpuset_mutex prevents scheduling
> class changes.
>
> Should I let you handle this too?
No, you can handle it if you want. I am more familiar with the cpuset
code, but the scheduler is much more complex. I don't think I have
enough understanding of that code to handle it correctly.
Cheers,
Longman