* [PATCH v2] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging
@ 2026-06-08 7:24 Wanwu Li
2026-06-08 7:43 ` sashiko-bot
2026-06-24 21:28 ` Tejun Heo
0 siblings, 2 replies; 3+ messages in thread
From: Wanwu Li @ 2026-06-08 7:24 UTC (permalink / raw)
To: sched-ext; +Cc: tj, void, arighi, changwoo, linux-kernel, Wanwu Li
From: Wanwu Li <liwanwu@kylinos.cn>
1. cgrp_cap_budget() used __sync_fetch_and_sub(&cgc->cvtime_delta,
cgc->cvtime_delta) to atomically read and clear cvtime_delta. However,
this is not a true atomic read-clear operation: the second argument
(cgc->cvtime_delta) is evaluated as a normal read before the atomic
fetch_and_sub executes. If a concurrent __sync_fetch_and_add() happens
between the read and the sub, the added value gets included in the
returned delta AND remains in cvtime_delta, causing double charging.
Example:
CPU 0 runs cgrp_cap_budget(), CPU 1 runs fcg_stopping().
Assume cvtime_delta = 100 initially.
T1 CPU 0: sub_val = cvtime_delta = 100 cvtime_delta = 100
T2 CPU 1: __sync_fetch_and_add(&cvtime_delta, 10) cvtime_delta = 110
T3 CPU 0: __sync_fetch_and_sub(&cvtime_delta, sub_val) cvtime_delta = 10
returns old=110
delta = 110 (includes the 10 from CPU 1), but cvtime_delta = 10
(the 10 also remains). The 10 is charged twice: once in delta
(applied to cgv_node->cvtime) and once in the residual cvtime_delta
(fetched again next time).
Fix by using __sync_fetch_and_and(&cgc->cvtime_delta, 0).
Disassembly comparison:
(1) delta = __sync_fetch_and_sub(&cgc->cvtime_delta, cgc->cvtime_delta);
228: (79) r7 = *(u64 *)(r9 +40)
229: (87) r7 = -r7
230: (db) r7 = atomic64_fetch_add((u64 *)(r9 +40), r7) //r9 may be changed
(2) delta = __sync_fetch_and_and(&cgc->cvtime_delta, 0);
228: (b7) r8 = 0
229: (db) r8 = atomic64_xchg((u64 *)(r9 +40), r8)
2. The bypass charging path in fcg_stopping() charges raw execution time
to cvtime_delta without scaling by the inverse of the cgroup hweight.
Since cvtime_delta is eventually applied to cgv_node->cvtime which is
in vtime space (weight-scaled), the bypass path should also scale by
FCG_HWEIGHT_ONE / hweight to match the units used by the dispatch path.
Fixes: a4103eacc2ab ("sched_ext: Add a cgroup scheduler which uses flattened hierarchy")
Signed-off-by: Wanwu Li <liwanwu@kylinos.cn>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
tools/sched_ext/scx_flatcg.bpf.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
index fec359581826..ffabf0d9f32e 100644
--- a/tools/sched_ext/scx_flatcg.bpf.c
+++ b/tools/sched_ext/scx_flatcg.bpf.c
@@ -256,7 +256,7 @@ static void cgrp_cap_budget(struct cgv_node *cgv_node, struct fcg_cgrp_ctx *cgc)
* and thus can't be updated and repositioned. Instead, we collect the
* vtime deltas separately and apply it asynchronously here.
*/
- delta = __sync_fetch_and_sub(&cgc->cvtime_delta, cgc->cvtime_delta);
+ delta = __sync_fetch_and_and(&cgc->cvtime_delta, 0);
cvtime = cgv_node->cvtime + delta;
/*
@@ -570,7 +570,8 @@ void BPF_STRUCT_OPS(fcg_stopping, struct task_struct *p, bool runnable)
cgc = find_cgrp_ctx(cgrp);
if (cgc) {
__sync_fetch_and_add(&cgc->cvtime_delta,
- p->se.sum_exec_runtime - taskc->bypassed_at);
+ (p->se.sum_exec_runtime - taskc->bypassed_at) *
+ FCG_HWEIGHT_ONE / (cgc->hweight ?: 1));
taskc->bypassed_at = 0;
}
bpf_cgroup_release(cgrp);
--
2.25.1
^ permalink raw reply related [flat|nested] 3+ messages in thread* Re: [PATCH v2] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging
2026-06-08 7:24 [PATCH v2] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging Wanwu Li
@ 2026-06-08 7:43 ` sashiko-bot
2026-06-24 21:28 ` Tejun Heo
1 sibling, 0 replies; 3+ messages in thread
From: sashiko-bot @ 2026-06-08 7:43 UTC (permalink / raw)
To: Wanwu Li; +Cc: sched-ext
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
Pre-existing issues:
- [Critical] In `fcg_dispatch()`, unused slice time is calculated with inverted logic and unsigned division on a wrapped-around integer, severely corrupting cgroup virtual time.
--
commit 037c76e7851e9209a73d28f57c667ccb6303434e
Author: Wanwu Li <liwanwu@kylinos.cn>
sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging
Fixes a cvtime_delta atomic read-clear race in cgrp_cap_budget() by replacing
__sync_fetch_and_sub() with __sync_fetch_and_and(). It also scales execution
time by cgroup hweight in the fcg_stopping() bypass charging path.
> diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
> index fec3595818269..0fd214cc61dae 100644
> --- a/tools/sched_ext/scx_flatcg.bpf.c
> +++ b/tools/sched_ext/scx_flatcg.bpf.c
[ ... ]
> @@ -570,7 +570,8 @@ void BPF_STRUCT_OPS(fcg_stopping, struct task_struct *p, bool runnable)
> cgc = find_cgrp_ctx(cgrp);
> if (cgc) {
> __sync_fetch_and_add(&cgc->cvtime_delta,
> - p->se.sum_exec_runtime - taskc->bypassed_at);
> + (p->se.sum_exec_runtime - taskc->bypassed_at) *
> + FCG_HWEIGHT_ONE / (cgc->hweight ?: 1));
> taskc->bypassed_at = 0;
> }
> bpf_cgroup_release(cgrp);
While reviewing this fix, I noticed a similar pre-existing issue in fcg_dispatch():
tools/sched_ext/scx_flatcg.bpf.c:fcg_dispatch() {
...
__sync_fetch_and_add(&cgc->cvtime_delta,
(cpuc->cur_at + cgrp_slice_ns - now) *
FCG_HWEIGHT_ONE / (cgc->hweight ?: 1));
...
}
[Severity: Critical]
This is a pre-existing issue, but does this calculate unused slice time with
inverted logic and an unsigned division on a wrapped-around integer?
In try_pick_next_cgroup(), a cgroup is charged its full slice upfront.
Here in fcg_dispatch(), unused time is seemingly refunded by adding
(expected_end - now) * ... to cvtime_delta.
However, if yielding early (expected_end > now), this adds a positive value to
cvtime_delta, which ultimately increases cvtime, heavily penalizing the cgroup
instead of refunding it.
Conversely, if a cgroup overuses its slice (now > expected_end), the unsigned
subtraction underflows to a massive u64.
When multiplied by FCG_HWEIGHT_ONE and divided by hweight via unsigned division,
it yields an enormous positive number. Adding this huge value to cvtime_delta
pushes the cgroup's virtual time roughly 3.25 days into the future, permanently
starving it.
Should the time delta be properly ordered (e.g., now - expected_end) and both
operands to the multiplication and division be cast to signed 64-bit integers
(s64) to preserve negative values through the division?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260608072457.3-1-liwanwu9113@163.com?part=1
^ permalink raw reply [flat|nested] 3+ messages in thread* Re: [PATCH v2] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging
2026-06-08 7:24 [PATCH v2] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging Wanwu Li
2026-06-08 7:43 ` sashiko-bot
@ 2026-06-24 21:28 ` Tejun Heo
1 sibling, 0 replies; 3+ messages in thread
From: Tejun Heo @ 2026-06-24 21:28 UTC (permalink / raw)
To: Wanwu Li
Cc: sched-ext, void, arighi, changwoo, linux-kernel, Wanwu Li,
Emil Tsalapatis
Applied to sched_ext/for-7.3.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2026-06-24 21:28 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-08 7:24 [PATCH v2] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging Wanwu Li
2026-06-08 7:43 ` sashiko-bot
2026-06-24 21:28 ` Tejun Heo
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.