[PATCH] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging
@ 2026-06-08  4:22 Wanwu Li
  2026-06-08  4:35 ` sashiko-bot
  2026-06-08  5:58 ` Andrea Righi
  0 siblings, 2 replies; 3+ messages in thread
From: Wanwu Li @ 2026-06-08  4:22 UTC (permalink / raw)
  To: sched-ext; +Cc: tj, void, arighi, changwoo, linux-kernel, Wanwu Li

From: Wanwu Li <liwanwu@kylinos.cn>

1. cgrp_cap_budget() used __sync_fetch_and_sub(&cgc->cvtime_delta,
   cgc->cvtime_delta) to atomically read and clear cvtime_delta. However,
   this is not a true atomic read-clear operation: the second argument
   (cgc->cvtime_delta) is evaluated as a normal read before the atomic
   fetch_and_sub executes. If a concurrent __sync_fetch_and_add() happens
   between the read and the sub, the added value gets included in the
   returned delta AND remains in cvtime_delta, causing double charging.

   Example: 
   CPU 0 runs cgrp_cap_budget(), CPU 1 runs fcg_stopping().
   Assume cvtime_delta = 100 initially.

     T1  CPU 0: sub_val = cvtime_delta = 100                 cvtime_delta = 100
     T2  CPU 1: __sync_fetch_and_add(&cvtime_delta, 10)      cvtime_delta = 110
     T3  CPU 0: __sync_fetch_and_sub(&cvtime_delta, sub_val) cvtime_delta = 10
                                                                returns old=110

   delta = 110 (includes the 10 from CPU 1), but cvtime_delta = 10
   (the 10 also remains). The 10 is charged twice: once in delta
   (applied to cgv_node->cvtime) and once in the residual cvtime_delta
   (fetched again next time).

   Fix by using __sync_fetch_and_and(&cgc->cvtime_delta, 0).
   Disassembly comparison:
   (1) delta = __sync_fetch_and_sub(&cgc->cvtime_delta, cgc->cvtime_delta);
   228: (79) r7 = *(u64 *)(r9 +40)
   229: (87) r7 = -r7
   230: (db) r7 = atomic64_fetch_add((u64 *)(r9 +40), r7) //r9 may be changed

   (2) delta = __sync_fetch_and_and(&cgc->cvtime_delta, 0);
   228: (b7) r8 = 0
   229: (db) r8 = atomic64_xchg((u64 *)(r9 +40), r8)

2. The bypass charging path in fcg_stopping() charges raw execution time
   to cvtime_delta without scaling by the inverse of the cgroup hweight.
   Since cvtime_delta is eventually applied to cgv_node->cvtime which is
   in vtime space (weight-scaled), the bypass path should also scale by
   FCG_HWEIGHT_ONE / hweight to match the units used by the dispatch path.

Signed-off-by: Wanwu Li <liwanwu@kylinos.cn>
---
 tools/sched_ext/scx_flatcg.bpf.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
index fec359581826..ffabf0d9f32e 100644
--- a/tools/sched_ext/scx_flatcg.bpf.c
+++ b/tools/sched_ext/scx_flatcg.bpf.c
@@ -256,7 +256,7 @@ static void cgrp_cap_budget(struct cgv_node *cgv_node, struct fcg_cgrp_ctx *cgc)
 	 * and thus can't be updated and repositioned. Instead, we collect the
 	 * vtime deltas separately and apply it asynchronously here.
 	 */
-	delta = __sync_fetch_and_sub(&cgc->cvtime_delta, cgc->cvtime_delta);
+	delta = __sync_fetch_and_and(&cgc->cvtime_delta, 0);
 	cvtime = cgv_node->cvtime + delta;

 	/*
@@ -570,7 +570,8 @@ void BPF_STRUCT_OPS(fcg_stopping, struct task_struct *p, bool runnable)
 	cgc = find_cgrp_ctx(cgrp);
 	if (cgc) {
 		__sync_fetch_and_add(&cgc->cvtime_delta,
-				     p->se.sum_exec_runtime - taskc->bypassed_at);
+				     (p->se.sum_exec_runtime - taskc->bypassed_at) *
+				     FCG_HWEIGHT_ONE / (cgc->hweight ?: 1));
 		taskc->bypassed_at = 0;
 	}
 	bpf_cgroup_release(cgrp);
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging
  2026-06-08  4:22 [PATCH] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging Wanwu Li
@ 2026-06-08  4:35 ` sashiko-bot
  2026-06-08  5:58 ` Andrea Righi
  1 sibling, 0 replies; 3+ messages in thread
From: sashiko-bot @ 2026-06-08  4:35 UTC (permalink / raw)
  To: Wanwu Li; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Incorrect tracking of cgroup virtual time delta in `fcg_dispatch` leads to double charging on early yield and permanent starvation on slice overrun due to sign inversion and unsigned division of negative values.
--

commit 26a5b29cfee7b890fe54a82847b9652c4700912d
Author: Wanwu Li <liwanwu@kylinos.cn>

sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging

This commit fixes a double-charging race condition in cgrp_cap_budget() by
using __sync_fetch_and_and() to read and clear cvtime_delta atomically. It
also updates the bypass charging path in fcg_stopping() to scale the raw
execution time by the inverse of the cgroup hweight before adding it to
cvtime_delta.

> diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
> index fec3595818269..0fd214cc61dae 100644
> --- a/tools/sched_ext/scx_flatcg.bpf.c
> +++ b/tools/sched_ext/scx_flatcg.bpf.c

[ ... ]

> @@ -570,7 +570,8 @@ void BPF_STRUCT_OPS(fcg_stopping, struct task_struct *p, bool runnable)
>  	cgc = find_cgrp_ctx(cgrp);
>  	if (cgc) {
>  		__sync_fetch_and_add(&cgc->cvtime_delta,
> -				     p->se.sum_exec_runtime - taskc->bypassed_at);
> +				     (p->se.sum_exec_runtime - taskc->bypassed_at) *
> +				     FCG_HWEIGHT_ONE / (cgc->hweight ?: 1));

[Severity: High]
This is a pre-existing issue, but seeing the hweight scaling logic added
here brings to mind a potential bug with the same pattern in fcg_dispatch().
Does the tracking of the cgroup virtual time delta in fcg_dispatch() suffer
from double-charging and underflow issues?

In fcg_dispatch(), the unused slice time is calculated as:

(cpuc->cur_at + cgrp_slice_ns - now)

Because the cgroup is charged a full slice upfront in try_pick_next_cgroup(),
should this unused time be refunded by subtracting it from cvtime_delta? The
current code uses __sync_fetch_and_add() which adds the positive unused time
and penalizes the cgroup a second time when it yields early.

Also, if the cgroup overruns its slice (now > cur_at + slice), the
calculation underflows. Because FCG_HWEIGHT_ONE is an unsigned value
(1LLU << 16), does the subsequent multiplication and division by cgc->hweight
destroy the two's complement representation of the negative value?

This unsigned division would shift zeros into the most significant bits,
resulting in a huge positive number being added to cvtime_delta. Could this
push the cgroup's virtual time days into the future and cause permanent
starvation?

>  		taskc->bypassed_at = 0;
>  	}
>  	bpf_cgroup_release(cgrp);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608042230.3-1-liwanwu9113@163.com?part=1

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging
  2026-06-08  4:22 [PATCH] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging Wanwu Li
  2026-06-08  4:35 ` sashiko-bot
@ 2026-06-08  5:58 ` Andrea Righi
  1 sibling, 0 replies; 3+ messages in thread
From: Andrea Righi @ 2026-06-08  5:58 UTC (permalink / raw)
  To: Wanwu Li; +Cc: sched-ext, tj, void, changwoo, linux-kernel, Wanwu Li

Hi Wanwu,

On Mon, Jun 08, 2026 at 12:22:30PM +0800, Wanwu Li wrote:
> From: Wanwu Li <liwanwu@kylinos.cn>
> 
> 1. cgrp_cap_budget() used __sync_fetch_and_sub(&cgc->cvtime_delta,
>    cgc->cvtime_delta) to atomically read and clear cvtime_delta. However,
>    this is not a true atomic read-clear operation: the second argument
>    (cgc->cvtime_delta) is evaluated as a normal read before the atomic
>    fetch_and_sub executes. If a concurrent __sync_fetch_and_add() happens
>    between the read and the sub, the added value gets included in the
>    returned delta AND remains in cvtime_delta, causing double charging.
> 
>    Example: 
>    CPU 0 runs cgrp_cap_budget(), CPU 1 runs fcg_stopping().
>    Assume cvtime_delta = 100 initially.
> 
>      T1  CPU 0: sub_val = cvtime_delta = 100                 cvtime_delta = 100
>      T2  CPU 1: __sync_fetch_and_add(&cvtime_delta, 10)      cvtime_delta = 110
>      T3  CPU 0: __sync_fetch_and_sub(&cvtime_delta, sub_val) cvtime_delta = 10
>                                                                 returns old=110
> 
>    delta = 110 (includes the 10 from CPU 1), but cvtime_delta = 10
>    (the 10 also remains). The 10 is charged twice: once in delta
>    (applied to cgv_node->cvtime) and once in the residual cvtime_delta
>    (fetched again next time).
> 
>    Fix by using __sync_fetch_and_and(&cgc->cvtime_delta, 0).
>    Disassembly comparison:
>    (1) delta = __sync_fetch_and_sub(&cgc->cvtime_delta, cgc->cvtime_delta);
>    228: (79) r7 = *(u64 *)(r9 +40)
>    229: (87) r7 = -r7
>    230: (db) r7 = atomic64_fetch_add((u64 *)(r9 +40), r7) //r9 may be changed
> 
>    (2) delta = __sync_fetch_and_and(&cgc->cvtime_delta, 0);
>    228: (b7) r8 = 0
>    229: (db) r8 = atomic64_xchg((u64 *)(r9 +40), r8)
> 
> 2. The bypass charging path in fcg_stopping() charges raw execution time
>    to cvtime_delta without scaling by the inverse of the cgroup hweight.
>    Since cvtime_delta is eventually applied to cgv_node->cvtime which is
>    in vtime space (weight-scaled), the bypass path should also scale by
>    FCG_HWEIGHT_ONE / hweight to match the units used by the dispatch path.
> 
> Signed-off-by: Wanwu Li <liwanwu@kylinos.cn>

We should probably add:

Fixes: a4103eacc2ab ("sched_ext: Add a cgroup scheduler which uses flattened hierarchy")

Other than that, looks good to me:

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
>  tools/sched_ext/scx_flatcg.bpf.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
> index fec359581826..ffabf0d9f32e 100644
> --- a/tools/sched_ext/scx_flatcg.bpf.c
> +++ b/tools/sched_ext/scx_flatcg.bpf.c
> @@ -256,7 +256,7 @@ static void cgrp_cap_budget(struct cgv_node *cgv_node, struct fcg_cgrp_ctx *cgc)
>  	 * and thus can't be updated and repositioned. Instead, we collect the
>  	 * vtime deltas separately and apply it asynchronously here.
>  	 */
> -	delta = __sync_fetch_and_sub(&cgc->cvtime_delta, cgc->cvtime_delta);
> +	delta = __sync_fetch_and_and(&cgc->cvtime_delta, 0);
>  	cvtime = cgv_node->cvtime + delta;
>  
>  	/*
> @@ -570,7 +570,8 @@ void BPF_STRUCT_OPS(fcg_stopping, struct task_struct *p, bool runnable)
>  	cgc = find_cgrp_ctx(cgrp);
>  	if (cgc) {
>  		__sync_fetch_and_add(&cgc->cvtime_delta,
> -				     p->se.sum_exec_runtime - taskc->bypassed_at);
> +				     (p->se.sum_exec_runtime - taskc->bypassed_at) *
> +				     FCG_HWEIGHT_ONE / (cgc->hweight ?: 1));
>  		taskc->bypassed_at = 0;
>  	}
>  	bpf_cgroup_release(cgrp);
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-06-08  5:58 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-08  4:22 [PATCH] sched_ext/scx_flatcg: Fix cvtime_delta race and add hweight scaling to bypass charging Wanwu Li
2026-06-08  4:35 ` sashiko-bot
2026-06-08  5:58 ` Andrea Righi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.