[PATCH] sched/fair: Fix overflow in vruntime

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] sched/fair: Fix overflow in vruntime_eligible()
@ 2026-04-15 14:57 Zhan Xusheng
  2026-04-28 14:49 ` [PATCH RESEND] " Zhan Xusheng
  0 siblings, 1 reply; 17+ messages in thread
From: Zhan Xusheng @ 2026-04-15 14:57 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Zhan Xusheng

After commit 556146ce5e94 ("sched/fair: Avoid overflow in
enqueue_entity()"), place_entity() can shift cfs_rq->zero_vruntime
towards a newly enqueued heavy entity. This can make (vruntime -
zero_vruntime) very large for other entities and cause key * load in
vruntime_eligible() to overflow s64, flipping the eligibility result.

Use check_mul_overflow() for the multiplication and fall back to a
sign-based result on overflow.

Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
---
 kernel/sched/fair.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69361c63353a..9c186c34b2a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -891,6 +891,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 	struct sched_entity *curr = cfs_rq->curr;
 	s64 avg = cfs_rq->sum_w_vruntime;
 	long load = cfs_rq->sum_weight;
+	s64 key, rhs;
 
 	if (curr && curr->on_rq) {
 		unsigned long weight = avg_vruntime_weight(cfs_rq, curr->load.weight);
@@ -899,7 +900,21 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+	key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+	/*
+	 * The multiplication key * load can overflow s64 when a heavy entity
+	 * enqueue shifts zero_vruntime far from lighter entities (see the
+	 * weight > load condition in place_entity()).
+	 *
+	 * On overflow, the sign of key tells us the correct answer: a large
+	 * positive key means vruntime >> V, so not eligible; a large negative
+	 * key means vruntime << V, so eligible.
+	 */
+	if (check_mul_overflow(key, (s64)load, &rhs))
+		return key <= 0;
+
+	return avg >= rhs;
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-04-15 14:57 [PATCH] sched/fair: Fix overflow in vruntime_eligible() Zhan Xusheng
@ 2026-04-28 14:49 ` Zhan Xusheng
  2026-04-28 16:17   ` K Prateek Nayak
  0 siblings, 1 reply; 17+ messages in thread
From: Zhan Xusheng @ 2026-04-28 14:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: K Prateek Nayak, linux-kernel, Zhan Xusheng

After commit 556146ce5e94 ("sched/fair: Avoid overflow in
enqueue_entity()"), place_entity() can shift cfs_rq->zero_vruntime
towards a newly enqueued heavy entity. This can make (vruntime -
zero_vruntime) very large for other entities and cause key * load in
vruntime_eligible() to overflow s64, flipping the eligibility result.

Use check_mul_overflow() for the multiplication and fall back to a
sign-based result on overflow.

Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
---
 kernel/sched/fair.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69361c63353a..9c186c34b2a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -891,6 +891,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 	struct sched_entity *curr = cfs_rq->curr;
 	s64 avg = cfs_rq->sum_w_vruntime;
 	long load = cfs_rq->sum_weight;
+	s64 key, rhs;
 
 	if (curr && curr->on_rq) {
 		unsigned long weight = avg_vruntime_weight(cfs_rq, curr->load.weight);
@@ -899,7 +900,21 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+	key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+	/*
+	 * The multiplication key * load can overflow s64 when a heavy entity
+	 * enqueue shifts zero_vruntime far from lighter entities (see the
+	 * weight > load condition in place_entity()).
+	 *
+	 * On overflow, the sign of key tells us the correct answer: a large
+	 * positive key means vruntime >> V, so not eligible; a large negative
+	 * key means vruntime << V, so eligible.
+	 */
+	if (check_mul_overflow(key, (s64)load, &rhs))
+		return key <= 0;
+
+	return avg >= rhs;
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-04-28 14:49 ` [PATCH RESEND] " Zhan Xusheng
@ 2026-04-28 16:17   ` K Prateek Nayak
  2026-04-28 17:35     ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: K Prateek Nayak @ 2026-04-28 16:17 UTC (permalink / raw)
  To: Zhan Xusheng, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli
  Cc: linux-kernel, Zhan Xusheng, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Steven Rostedt, Mel Gorman

(+ scheduler folks)

Hello Zhan,

On 4/28/2026 8:19 PM, Zhan Xusheng wrote:
> After commit 556146ce5e94 ("sched/fair: Avoid overflow in
> enqueue_entity()"), place_entity() can shift cfs_rq->zero_vruntime
> towards a newly enqueued heavy entity. This can make (vruntime -
> zero_vruntime) very large for other entities and cause key * load in
> vruntime_eligible() to overflow s64, flipping the eligibility result.

So the commit in question moves the zero_vruntime only when the
load > sum_weight.

You seem to have found a case where the entity_key() is already large
enough that moving the zero_vruntime farther will make the eligibility
check overflow which we were hoping will not be the case.

Do you have a reproducer that fails pick_eevdf() after introduction of
commit 556146ce5e94? Also, do you see any splats in the dmesg since we
have a defensive WARN_ON() to catch an overflow.

> 
> Use check_mul_overflow() for the multiplication and fall back to a
> sign-based result on overflow.

Don't we have PARANOID_AVG for that? Can you do:

    echo PARANOID_AVG > /sys/kernel/debug/sched/features
    # run your workload
    grep "sum_shift" /sys/kernel/debug/sched/debug

and check if the "sum_shift" turns non-zero.

> 
> Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")

So we avoid a warning with that optimization but didn't see a crash
anywhere without it during my testing.

If you have a workload that crashes from those changes, perhaps we can
see if there is a cheap enough way to move the zero_vruntime closer to
the true average.

-- 
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-04-28 16:17   ` K Prateek Nayak
@ 2026-04-28 17:35     ` Peter Zijlstra
  2026-04-29  5:03       ` K Prateek Nayak
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2026-04-28 17:35 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Zhan Xusheng, Ingo Molnar, Vincent Guittot, Juri Lelli,
	linux-kernel, Zhan Xusheng, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Steven Rostedt, Mel Gorman

On Tue, Apr 28, 2026 at 09:47:11PM +0530, K Prateek Nayak wrote:
> (+ scheduler folks)
> 
> Hello Zhan,
> 
> On 4/28/2026 8:19 PM, Zhan Xusheng wrote:
> > After commit 556146ce5e94 ("sched/fair: Avoid overflow in
> > enqueue_entity()"), place_entity() can shift cfs_rq->zero_vruntime
> > towards a newly enqueued heavy entity. This can make (vruntime -
> > zero_vruntime) very large for other entities and cause key * load in
> > vruntime_eligible() to overflow s64, flipping the eligibility result.
> 
> So the commit in question moves the zero_vruntime only when the
> load > sum_weight.
> 
> You seem to have found a case where the entity_key() is already large
> enough that moving the zero_vruntime farther will make the eligibility
> check overflow which we were hoping will not be the case.
> 
> Do you have a reproducer that fails pick_eevdf() after introduction of
> commit 556146ce5e94? Also, do you see any splats in the dmesg since we
> have a defensive WARN_ON() to catch an overflow.

Right, either a reproducer or a trace showing the values leading up to
this are the absolute minimum you need to provide. That is, you need a
definite description of how you got there, otherwise you cannot judge
the solution is sound.

Anyway... let me construct a worst case.

So if you have this cgroup crap on, then you can have an entity of
weight 2, and vlag should then be bounded by: (slice+TICK_NSEC) *
NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().

The other end is 100*NICE_0_LOAD, so lets wake that, then you get:

{key, weight}[] := {
  puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2               },
  max:  { 0,                                 100*NICE_0_LOAD },
}

The avg_vruntime() would end up being very close to 0 (which is
zero_vruntime), so no real help making that more accurate.

vruntime_eligible(puny) ends up with:

 avg    = 2 * puny.key (+ 0)
 weight = 2 + 100 * NICE_0_LOAD

 avg >= puny.key * weight

And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100

and yes, I suppose that can exceed 64bit :-(

Can someone double check that? I have a silly head-ache and could easily
have made mistake. Not sure what the best solution is either. I think we
can show avg cannot overflow, and if this multiplication overflows, then
it must be larger, so perhaps the proposed patch is the best option.

Dunno, need clear head :/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-04-28 17:35     ` Peter Zijlstra
@ 2026-04-29  5:03       ` K Prateek Nayak
  2026-04-29  7:31         ` Zhan Xusheng
  2026-05-01 10:40         ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: K Prateek Nayak @ 2026-04-29  5:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhan Xusheng, Ingo Molnar, Vincent Guittot, Juri Lelli,
	linux-kernel, Zhan Xusheng, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Steven Rostedt, Mel Gorman

Hello Peter,

On 4/28/2026 11:05 PM, Peter Zijlstra wrote:
> On Tue, Apr 28, 2026 at 09:47:11PM +0530, K Prateek Nayak wrote:
>> (+ scheduler folks)
>>
>> Hello Zhan,
>>
>> On 4/28/2026 8:19 PM, Zhan Xusheng wrote:
>>> After commit 556146ce5e94 ("sched/fair: Avoid overflow in
>>> enqueue_entity()"), place_entity() can shift cfs_rq->zero_vruntime
>>> towards a newly enqueued heavy entity. This can make (vruntime -
>>> zero_vruntime) very large for other entities and cause key * load in
>>> vruntime_eligible() to overflow s64, flipping the eligibility result.
>>
>> So the commit in question moves the zero_vruntime only when the
>> load > sum_weight.
>>
>> You seem to have found a case where the entity_key() is already large
>> enough that moving the zero_vruntime farther will make the eligibility
>> check overflow which we were hoping will not be the case.
>>
>> Do you have a reproducer that fails pick_eevdf() after introduction of
>> commit 556146ce5e94? Also, do you see any splats in the dmesg since we
>> have a defensive WARN_ON() to catch an overflow.
> 
> Right, either a reproducer or a trace showing the values leading up to
> this are the absolute minimum you need to provide. That is, you need a
> definite description of how you got there, otherwise you cannot judge
> the solution is sound.
> 
> Anyway... let me construct a worst case.
> 
> So if you have this cgroup crap on, then you can have an entity of
> weight 2, and vlag should then be bounded by: (slice+TICK_NSEC) *
> NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
> 
> The other end is 100*NICE_0_LOAD, so lets wake that, then you get:
> 
> {key, weight}[] := {
>   puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2               },
>   max:  { 0,                                 100*NICE_0_LOAD },
> }

So MAX_SHARES is (1UL << 18) and we then scale it up so the weight can
can go up to (1 << 28) I suppose in UP setting.

So let us assume there is a single task with the largest custom slice on
each cgroup making the min_slice equal to 100ms. Things start off in
such a way that puny having a vruntime of say -1 is picked over max which
had a vruntime of 0 for the sake of simplicity.

Deadline then becomes:

    puny->deadline = -1 + __calc_delta(100ms, NICE_0_LOAD, 2)

which is 52428800000000 - 1.

At 10HZ tick + RUN_TO_PARITY, we get 10ms error so that puts the
entity_key() at 57671680000000 (46 bits)

> 
> The avg_vruntime() would end up being very close to 0 (which is
> zero_vruntime), so no real help making that more accurate.
> 
> vruntime_eligible(puny) ends up with:
> 
>  avg    = 2 * puny.key (+ 0)
>  weight = 2 + 100 * NICE_0_LOAD
> 
>  avg >= puny.key * weight
> 
> And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100
> > and yes, I suppose that can exceed 64bit :-(

Then the worst case right side would be:

    57671680000000 * load /* load = ((1 << 28) + 2) */

and yes that goes beyond 64-bits at 15481123719086080000000.

I'm running with this configurations:

    echo 2 > /sys/fs/cgroup/cg0/cpu.weight
    echo 10000 > /sys/fs/cgroup/cg1/cpu.weight

Weight of 2 actually gets scaled up to 20480 for UP scenario so I go try
to make it even smaller with SMP and wait until it gets picked again.

So it goes from ...

cfs_rq[0]:/cg0
  .se->vruntime                  : 11602839.678164
  .se->deadline                  : 19092472.390575
  .se->sum_exec_runtime          : 4243.207300
  .se->slice                     : 100.000000
  .se->custom_slice              : 1
  .se->load.weight               : 14
  .se->avg.load_avg              : 2
  .se->avg.util_avg              : 0
  .se->avg.runnable_avg          : 1024

cfs_rq[0]:/cg1
  .se->vruntime                  : 11602835.026387
  .se->deadline                  : 11602835.706227
  .se->sum_exec_runtime          : 2646719.706164
  .se->slice                     : 100.000000
  .se->custom_slice              : 1
  .se->load.weight               : 104857600
  .se->avg.load_avg              : 102389
  .se->avg.util_avg              : 1024
  .se->avg.runnable_avg          : 1024

cfs_rq[0]:/
  .left_deadline                 : 19092472.390575
  .left_vruntime                 : 11602839.678164
  .zero_vruntime                 : 11602835.026387
  .sum_w_vruntime                : 65124878 (25 bits)
  .sum_weight                    : 14
  .sum_shift                     : 0
  .avg_vruntime                  : 11602835.026387
  .right_vruntime                : 11602839.678164
  .spread                        : 0.000000


... to ...


cfs_rq[0]:/cg0
  .se->vruntime                  : 19392969.031442
  .se->deadline                  : 26881950.428361
  .se->sum_exec_runtime          : 4347.216748
  .se->slice                     : 100.000000
  .se->custom_slice              : 1
  .se->load.weight               : 14
  .se->avg.load_avg              : 1
  .se->avg.util_avg              : 904
  .se->avg.runnable_avg          : 1024

cfs_rq[0]:/cg1
  .se->vruntime                  : 11602851.024250
  .se->deadline                  : 11602851.144214
  .se->sum_exec_runtime          : 2648319.534241
  .se->slice                     : 100.000000
  .se->custom_slice              : 1
  .se->load.weight               : 104857600
  .se->avg.load_avg              : 102386
  .se->avg.util_avg              : 1024
  .se->avg.runnable_avg          : 1024

cfs_rq[0]:/
  .left_deadline                 : 26881950.428361
  .left_vruntime                 : 19392969.031442
  .zero_vruntime                 : 11602852.064342
  .sum_w_vruntime                : 109061637539400 (46 bits)
  .sum_weight                    : 14
  .sum_shift                     : 0
  .avg_vruntime                  : 11602852.064342
  .right_vruntime                : 19392969.031442
  .spread                        : 0.000000


without any splat so I'm not sure if there is something that prevents
a possible crash since a weight of 104857600 should have definitely
made that entity_key() overflow.

> 
> Can someone double check that? I have a silly head-ache and could easily
> have made mistake. Not sure what the best solution is either. I think we
> can show avg cannot overflow, and if this multiplication overflows, then
> it must be larger, so perhaps the proposed patch is the best option.
> 
> Dunno, need clear head :/

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-04-29  5:03       ` K Prateek Nayak
@ 2026-04-29  7:31         ` Zhan Xusheng
  2026-05-01 10:40         ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Zhan Xusheng @ 2026-04-29  7:31 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
	linux-kernel, Zhan Xusheng

Hi Prateek, Hi Peter,

On Wed, 29 Apr 2026, K Prateek Nayak wrote:
> ... the worst case right side would be:
>     57671680000000 * load /* load = ((1 << 28) + 2) */
> and yes that goes beyond 64-bits at 15481123719086080000000.
> ...
>   cfs_rq[0]:/
>     .sum_w_vruntime : 109061637539400 (46 bits)
>     .sum_weight     : 14
>     .sum_shift      : 0
> ...
> without any splat so I'm not sure if there is something that prevents
> a possible crash since a weight of 104857600 should have definitely
> made that entity_key() overflow.

Thanks for running this.

On root's sum_weight being 14 despite cg1 having se->load.weight =
104857600: sum_weight is incremented by __enqueue_entity() only for
entities currently on_rq on this cfs_rq.  In your snapshot only cg0
(weight 14) is on_rq on root; cg1's group_se is not, so it does not
contribute to root's sum_weight -- even though its se->load.weight
is 104857600.

If a workload kept both group_ses on_rq on root simultaneously,
root's sum_weight would be 14 + 104857600.  I have not constructed
a workload that reliably holds both on_rq long enough to observe a
sum_shift bump; that is outside what I can test in my current
environment.

On the upper-bound analysis, I re-checked both of your derivations:

  - Peter's (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100
    evaluates to ~1.87e20 with HZ=1000 and the 64-bit NICE_0_LOAD
    shift of 20, using 100 * NICE_0_LOAD as the heavy-side sum_weight.

  - Your refinement uses load = (1 << 28) + 2, giving ~1.55e22.
    (1 << 28) corresponds to scale_load(MAX_SHARES) and is reachable
    via sched_setattr / non-cgroup paths, but is not reachable
    through cpu.weight alone, which caps load.weight at ~1.05e8 via
    sched_weight_from_cgroup(CGROUP_WEIGHT_MAX).

Both bounds cross S64_MAX.  Working backwards from the smaller
bound: with Peter's puny.key ~1.78e12, the multiplication crosses
S64_MAX once load exceeds ~5.2e6, which a single cpu.weight=10000
cgroup clears on its own (load.weight ~1.05e8).  So in cgroup-only
setups the overflow threshold on load is low; the reason your
snapshot stayed well under it appears to be the on_rq pattern, not
the weight cap.

To be upfront about the patch's basis:

  - I do not have a reproducer.
  - I have not observed the WARN_ON() in __enqueue_entity().
  - The patch came from static inspection of vruntime_eligible()
    after 556146ce5e94.  The observation is that key * load
    overflows specifically when *both* key is large (the entity
    being checked is far from v0) and load is large (an unrelated
    heavy entity is on_rq).  The v0-move change keeps individual
    (v_i - v0) * w_i terms in sum_w_vruntime bounded, but key and
    load in vruntime_eligible() refer to different entities, so v0
    placement cannot simultaneously control both factors.

Given the bounds above, I think the defensive check is justified
as hardening against a provable overflow, with the acknowledgement
that I have not triggered it in practice.  I'm happy to respin with
a commit message built on these analyses if that would help.

Thanks,
Zhan Xusheng

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-04-29  5:03       ` K Prateek Nayak
  2026-04-29  7:31         ` Zhan Xusheng
@ 2026-05-01 10:40         ` Peter Zijlstra
  2026-05-01 10:48           ` Peter Zijlstra
                             ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-01 10:40 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Zhan Xusheng, Ingo Molnar, Vincent Guittot, Juri Lelli,
	linux-kernel, Zhan Xusheng, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Steven Rostedt, Mel Gorman

On Wed, Apr 29, 2026 at 10:33:48AM +0530, K Prateek Nayak wrote:

> > Anyway... let me construct a worst case.
> > 
> > So if you have this cgroup crap on, then you can have an entity of
> > weight 2, and vlag should then be bounded by: (slice+TICK_NSEC) *
> > NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
> > 
> > The other end is 100*NICE_0_LOAD, so lets wake that, then you get:
> > 
> > {key, weight}[] := {
> >   puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2               },
> >   max:  { 0,                                 100*NICE_0_LOAD },
> > }
> 
> So MAX_SHARES is (1UL << 18) and we then scale it up so the weight can
> can go up to (1 << 28) I suppose in UP setting.
> 
> So let us assume there is a single task with the largest custom slice on
> each cgroup making the min_slice equal to 100ms. Things start off in
> such a way that puny having a vruntime of say -1 is picked over max which
> had a vruntime of 0 for the sake of simplicity.
> 
> Deadline then becomes:
> 
>     puny->deadline = -1 + __calc_delta(100ms, NICE_0_LOAD, 2)
> 
> which is 52428800000000 - 1.
> 
> At 10HZ tick + RUN_TO_PARITY, we get 10ms error so that puts the
> entity_key() at 57671680000000 (46 bits)
> 
> > 
> > The avg_vruntime() would end up being very close to 0 (which is
> > zero_vruntime), so no real help making that more accurate.
> > 
> > vruntime_eligible(puny) ends up with:
> > 
> >  avg    = 2 * puny.key (+ 0)
> >  weight = 2 + 100 * NICE_0_LOAD
> > 
> >  avg >= puny.key * weight
> > 
> > And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100
> > > and yes, I suppose that can exceed 64bit :-(
> 
> Then the worst case right side would be:
> 
>     57671680000000 * load /* load = ((1 << 28) + 2) */
> 
> and yes that goes beyond 64-bits at 15481123719086080000000.
> 
> I'm running with this configurations:
> 
>     echo 2 > /sys/fs/cgroup/cg0/cpu.weight
>     echo 10000 > /sys/fs/cgroup/cg1/cpu.weight
> 
> Weight of 2 actually gets scaled up to 20480 for UP scenario so I go try
> to make it even smaller with SMP and wait until it gets picked again.

Make sure all of the actual activity of that cgroup happens 'elsewhere'
and you can get it down to an actual 2 per calc_group_shares().

> without any splat so I'm not sure if there is something that prevents
> a possible crash since a weight of 104857600 should have definitely
> made that entity_key() overflow.

Right.

Anyway, I had a poke around with godbolt, and the below seems to
generate the best code for things like x86_64 and arm64.

Specifically, the __builtin_mul_overflow() already has to compute the
128 bit product anyway for most architectures, so using that directly
then leads to saner asm and easier to understand code.

AFAICT HPPA64 is the only 64bit architecture that doesn't implement
__int128 and will thus be demoted to doing what we do on 32bit.

Now all I need to do is write a coherent Changelog to go with it ...
*sigh*

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..214fe9d99834 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -904,7 +904,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+	return avg >= (sched_double_long_t)vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d..f15f4468efe5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -146,7 +146,7 @@ extern struct list_head asym_cap_list;
  * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
  * increase coverage and consistency always enable it on 64-bit platforms.
  */
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) && defined(__SIZEOF_INT128__)
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w)					\
@@ -157,10 +157,12 @@ extern struct list_head asym_cap_list;
 		__w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT);	\
 	__w;							\
 })
+typedef __int128 sched_double_long_t;
 #else
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
+typedef s64 sched_double_long_t;
 #endif
 
 /*

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-05-01 10:40         ` Peter Zijlstra
@ 2026-05-01 10:48           ` Peter Zijlstra
  2026-05-01 13:05           ` David Laight
  2026-05-04 11:22           ` Peter Zijlstra
  2 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-01 10:48 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Zhan Xusheng, Ingo Molnar, Vincent Guittot, Juri Lelli,
	linux-kernel, Zhan Xusheng, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Steven Rostedt, Mel Gorman

On Fri, May 01, 2026 at 12:40:06PM +0200, Peter Zijlstra wrote:

> Anyway, I had a poke around with godbolt, and the below seems to
> generate the best code for things like x86_64 and arm64.

https://godbolt.org/z/T389jd7bn

> Specifically, the __builtin_mul_overflow() already has to compute the
> 128 bit product anyway for most architectures, so using that directly
> then leads to saner asm and easier to understand code.
> 
> AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> __int128 and will thus be demoted to doing what we do on 32bit.
> 
> Now all I need to do is write a coherent Changelog to go with it ...
> *sigh*
> 
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..214fe9d99834 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -904,7 +904,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
>  		load += weight;
>  	}
>  
> -	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
> +	return avg >= (sched_double_long_t)vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
>  }
>  
>  int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9f63b15d309d..f15f4468efe5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -146,7 +146,7 @@ extern struct list_head asym_cap_list;
>   * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
>   * increase coverage and consistency always enable it on 64-bit platforms.
>   */
> -#ifdef CONFIG_64BIT
> +#if defined(CONFIG_64BIT) && defined(__SIZEOF_INT128__)
>  # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
>  # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
>  # define scale_load_down(w)					\
> @@ -157,10 +157,12 @@ extern struct list_head asym_cap_list;
>  		__w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT);	\
>  	__w;							\
>  })
> +typedef __int128 sched_double_long_t;
>  #else
>  # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
>  # define scale_load(w)		(w)
>  # define scale_load_down(w)	(w)
> +typedef s64 sched_double_long_t;
>  #endif
>  
>  /*

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-05-01 10:40         ` Peter Zijlstra
  2026-05-01 10:48           ` Peter Zijlstra
@ 2026-05-01 13:05           ` David Laight
  2026-05-01 13:43             ` Peter Zijlstra
  2026-05-04 11:22           ` Peter Zijlstra
  2 siblings, 1 reply; 17+ messages in thread
From: David Laight @ 2026-05-01 13:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: K Prateek Nayak, Zhan Xusheng, Ingo Molnar, Vincent Guittot,
	Juri Lelli, linux-kernel, Zhan Xusheng, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman

On Fri, 1 May 2026 12:40:06 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Apr 29, 2026 at 10:33:48AM +0530, K Prateek Nayak wrote:
> 
> > > Anyway... let me construct a worst case.
> > > 
> > > So if you have this cgroup crap on, then you can have an entity of
> > > weight 2, and vlag should then be bounded by: (slice+TICK_NSEC) *
> > > NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
> > > 
> > > The other end is 100*NICE_0_LOAD, so lets wake that, then you get:
> > > 
> > > {key, weight}[] := {
> > >   puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2               },
> > >   max:  { 0,                                 100*NICE_0_LOAD },
> > > }  
> > 
> > So MAX_SHARES is (1UL << 18) and we then scale it up so the weight can
> > can go up to (1 << 28) I suppose in UP setting.
> > 
> > So let us assume there is a single task with the largest custom slice on
> > each cgroup making the min_slice equal to 100ms. Things start off in
> > such a way that puny having a vruntime of say -1 is picked over max which
> > had a vruntime of 0 for the sake of simplicity.
> > 
> > Deadline then becomes:
> > 
> >     puny->deadline = -1 + __calc_delta(100ms, NICE_0_LOAD, 2)
> > 
> > which is 52428800000000 - 1.
> > 
> > At 10HZ tick + RUN_TO_PARITY, we get 10ms error so that puts the
> > entity_key() at 57671680000000 (46 bits)
> >   
> > > 
> > > The avg_vruntime() would end up being very close to 0 (which is
> > > zero_vruntime), so no real help making that more accurate.
> > > 
> > > vruntime_eligible(puny) ends up with:
> > > 
> > >  avg    = 2 * puny.key (+ 0)
> > >  weight = 2 + 100 * NICE_0_LOAD
> > > 
> > >  avg >= puny.key * weight
> > > 
> > > And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100  
> > > > and yes, I suppose that can exceed 64bit :-(  
> > 
> > Then the worst case right side would be:
> > 
> >     57671680000000 * load /* load = ((1 << 28) + 2) */
> > 
> > and yes that goes beyond 64-bits at 15481123719086080000000.
> > 
> > I'm running with this configurations:
> > 
> >     echo 2 > /sys/fs/cgroup/cg0/cpu.weight
> >     echo 10000 > /sys/fs/cgroup/cg1/cpu.weight
> > 
> > Weight of 2 actually gets scaled up to 20480 for UP scenario so I go try
> > to make it even smaller with SMP and wait until it gets picked again.  
> 
> Make sure all of the actual activity of that cgroup happens 'elsewhere'
> and you can get it down to an actual 2 per calc_group_shares().
> 
> > without any splat so I'm not sure if there is something that prevents
> > a possible crash since a weight of 104857600 should have definitely
> > made that entity_key() overflow.  
> 
> Right.
> 
> Anyway, I had a poke around with godbolt, and the below seems to
> generate the best code for things like x86_64 and arm64.
> 
> Specifically, the __builtin_mul_overflow() already has to compute the
> 128 bit product anyway for most architectures, so using that directly
> then leads to saner asm and easier to understand code.

Older versions of gcc (maybe before 10?) generate a call to the full
128 bit multiply (_muldid3 ?) on mips64 - rather than just using the instruction
that generates the high part of the product.
The library function is now included but is slightly more complex because
it does the high*low multiples as well.

The same happens on sparc64 and the asm multiply code looks strange.

> 
> AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> __int128 and will thus be demoted to doing what we do on 32bit.
> 
> Now all I need to do is write a coherent Changelog to go with it ...
> *sigh*
> 
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..214fe9d99834 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -904,7 +904,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
>  		load += weight;
>  	}
>  
> -	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
> +	return avg >= (sched_double_long_t)vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
>  }
>  
>  int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9f63b15d309d..f15f4468efe5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -146,7 +146,7 @@ extern struct list_head asym_cap_list;
>   * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
>   * increase coverage and consistency always enable it on 64-bit platforms.
>   */
> -#ifdef CONFIG_64BIT
> +#if defined(CONFIG_64BIT) && defined(__SIZEOF_INT128__)
>  # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
>  # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
>  # define scale_load_down(w)					\
> @@ -157,10 +157,12 @@ extern struct list_head asym_cap_list;
>  		__w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT);	\
>  	__w;							\
>  })
> +typedef __int128 sched_double_long_t;
>  #else
>  # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
>  # define scale_load(w)		(w)
>  # define scale_load_down(w)	(w)
> +typedef s64 sched_double_long_t;

Seems strange that this is signed but the 64bit one is unsigned.

-- David

>  #endif
>  
>  /*
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-05-01 13:05           ` David Laight
@ 2026-05-01 13:43             ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-01 13:43 UTC (permalink / raw)
  To: David Laight
  Cc: K Prateek Nayak, Zhan Xusheng, Ingo Molnar, Vincent Guittot,
	Juri Lelli, linux-kernel, Zhan Xusheng, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman

On Fri, May 01, 2026 at 02:05:21PM +0100, David Laight wrote:

> > Anyway, I had a poke around with godbolt, and the below seems to
> > generate the best code for things like x86_64 and arm64.
> > 
> > Specifically, the __builtin_mul_overflow() already has to compute the
> > 128 bit product anyway for most architectures, so using that directly
> > then leads to saner asm and easier to understand code.
> 
> Older versions of gcc (maybe before 10?) generate a call to the full
> 128 bit multiply (_muldid3 ?) on mips64 - rather than just using the instruction
> that generates the high part of the product.
> The library function is now included but is slightly more complex because
> it does the high*low multiples as well.
> 
> The same happens on sparc64 and the asm multiply code looks strange.

godbolt only has mips64 gcc-9.5 (no 8.1 which is the very oldest gcc we
support) and that doesn't seem to generate calls to library functions.

I don't particularly care about the quality of old compiler output, nor
do I really care for the performance aspect on these fringe
architectures. As long as it builds and isn't obviously broken, things
are good.

Using the gcc-8.5 toolchains from kernel.org (the oldest still supported
set) I see no library calls for MIPS64.

I do see them for Sparc64, but that actually has that library call
implemented.

Anyway, I'll throw it at the robots, see what if anything pops out.

> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 9f63b15d309d..f15f4468efe5 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -146,7 +146,7 @@ extern struct list_head asym_cap_list;
> >   * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
> >   * increase coverage and consistency always enable it on 64-bit platforms.
> >   */
> > -#ifdef CONFIG_64BIT
> > +#if defined(CONFIG_64BIT) && defined(__SIZEOF_INT128__)
> >  # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
> >  # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
> >  # define scale_load_down(w)					\
> > @@ -157,10 +157,12 @@ extern struct list_head asym_cap_list;
> >  		__w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT);	\
> >  	__w;							\
> >  })
> > +typedef __int128 sched_double_long_t;
> >  #else
> >  # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
> >  # define scale_load(w)		(w)
> >  # define scale_load_down(w)	(w)
> > +typedef s64 sched_double_long_t;
> 
> Seems strange that this is signed but the 64bit one is unsigned.

They both signed. __int128 is a signed type.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-05-01 10:40         ` Peter Zijlstra
  2026-05-01 10:48           ` Peter Zijlstra
  2026-05-01 13:05           ` David Laight
@ 2026-05-04 11:22           ` Peter Zijlstra
  2026-05-04 13:16             ` Heiko Carstens
  2026-05-05 10:31             ` Peter Zijlstra
  2 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-04 11:22 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Zhan Xusheng, Ingo Molnar, Vincent Guittot, Juri Lelli,
	linux-kernel, Zhan Xusheng, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Steven Rostedt, Mel Gorman, hca, gor, agordeev,
	borntraeger, svens, maddy, mpe, chleroy

On Fri, May 01, 2026 at 12:40:06PM +0200, Peter Zijlstra wrote:

> Anyway, I had a poke around with godbolt, and the below seems to
> generate the best code for things like x86_64 and arm64.
> 
> Specifically, the __builtin_mul_overflow() already has to compute the
> 128 bit product anyway for most architectures, so using that directly
> then leads to saner asm and easier to understand code.
> 
> AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> __int128 and will thus be demoted to doing what we do on 32bit.

I forgot we had ARCH_SUPPORTS_INT128, and I suppose this had better
check that. Now, s390 is a bit weird and excludes GCC even though that
definitely supports __int128. Supposedly there was a issue, but perhaps
modern GCC has this fixed?

Also, PowerPC does not seem to set this at all.

Anyway, I've then ended up with this, it keeps all 64bit architectures
on the extra precision, but incurs an extra division for all that do not
have __int128 selected.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..01405f6011b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,11 +882,11 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
  *
  * lag_i >= 0 -> V >= v_i
  *
- *     \Sum (v_i - v)*w_i
- * V = ------------------ + v
+ *     \Sum (v_i - v0)*w_i
+ * V = ------------------- + v0
  *          \Sum w_i
  *
- * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ * lag_i >= 0 -> \Sum (v_i - v0)*w_i >= (v_i - v0)*(\Sum w_i)
  *
  * Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
  *       to the loss in precision caused by the division.
@@ -894,7 +894,7 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	s64 avg = cfs_rq->sum_w_vruntime;
+	s64 key, avg = cfs_rq->sum_w_vruntime;
 	long load = cfs_rq->sum_weight;
 
 	if (curr && curr->on_rq) {
@@ -904,7 +904,35 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+	key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+	/*
+	 * The worst case term for @key includes 'NSEC_TICK * NICE_0_LOAD'
+	 * and @load obviously includes NICE_0_LOAD. NSEC_TICK is around 24
+	 * bits, while NICE_0_LOAD is 20 on 64bit and 10 otherwise.
+	 *
+	 * This gives that on 64bit the product will be at least 64bit which
+	 * overflows s64, while on 32bit it will only be 44bits and should fit
+	 * comfortably.
+	 */
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+	/* An __int128 mult should be cheaper than a division. */
+	return avg >= (__int128)key * load;
+#else
+	/*
+	 * Since the divisor is @load, which is guaranteed positive, the
+	 * inequality: avg >= key * load, can be rewritten into a division
+	 * like: avg/load > key || (avg/load == key && avg%load >= 0).
+	 */
+	s64 div = avg / load;
+	if (div > key)
+		return true;
+	return div == key && avg % load >= 0;
+#endif
+#else /* 32bit */
+	return avg >= key * load;
+#endif
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-05-04 11:22           ` Peter Zijlstra
@ 2026-05-04 13:16             ` Heiko Carstens
  2026-05-04 14:57               ` Peter Zijlstra
  2026-05-04 17:40               ` Stefan Schulze Frielinghaus
  2026-05-05 10:31             ` Peter Zijlstra
  1 sibling, 2 replies; 17+ messages in thread
From: Heiko Carstens @ 2026-05-04 13:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: K Prateek Nayak, Zhan Xusheng, Ingo Molnar, Vincent Guittot,
	Juri Lelli, linux-kernel, Zhan Xusheng, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman, gor,
	agordeev, borntraeger, svens, maddy, mpe, chleroy,
	Stefan Schulze Frielinghaus, Juergen Christ

On Mon, May 04, 2026 at 01:22:39PM +0200, Peter Zijlstra wrote:
> On Fri, May 01, 2026 at 12:40:06PM +0200, Peter Zijlstra wrote:
> 
> > Anyway, I had a poke around with godbolt, and the below seems to
> > generate the best code for things like x86_64 and arm64.
> > 
> > Specifically, the __builtin_mul_overflow() already has to compute the
> > 128 bit product anyway for most architectures, so using that directly
> > then leads to saner asm and easier to understand code.
> > 
> > AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> > __int128 and will thus be demoted to doing what we do on 32bit.
> 
> I forgot we had ARCH_SUPPORTS_INT128, and I suppose this had better
> check that. Now, s390 is a bit weird and excludes GCC even though that
> definitely supports __int128. Supposedly there was a issue, but perhaps
> modern GCC has this fixed?

The reason was not a bug (in terms of incorrect code), but gcc generated a
larger than 6kb stack frame for one of the crypto functions - see commit
fbac266f095d ("s390: select ARCH_SUPPORTS_INT128"). That's just too large to
be acceptable.

If I remember correctly gcc generated code which did not reuse known to be
unused stack slots, but created for every variable a new stack slot, for
whatever reason. Which then resulted in such a huge stack frame. With clang
the stack frame size was only 1,5kb.

I just checked: with gcc 15.2.0 we are down to 4.5kb. Still too large :)

Adding s390 compiler folks; but I seem to remember I discussed that back then
with them.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-05-04 13:16             ` Heiko Carstens
@ 2026-05-04 14:57               ` Peter Zijlstra
  2026-05-04 17:40               ` Stefan Schulze Frielinghaus
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-04 14:57 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: K Prateek Nayak, Zhan Xusheng, Ingo Molnar, Vincent Guittot,
	Juri Lelli, linux-kernel, Zhan Xusheng, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman, gor,
	agordeev, borntraeger, svens, maddy, mpe, chleroy,
	Stefan Schulze Frielinghaus, Juergen Christ

On Mon, May 04, 2026 at 03:16:09PM +0200, Heiko Carstens wrote:
> On Mon, May 04, 2026 at 01:22:39PM +0200, Peter Zijlstra wrote:
> > On Fri, May 01, 2026 at 12:40:06PM +0200, Peter Zijlstra wrote:
> > 
> > > Anyway, I had a poke around with godbolt, and the below seems to
> > > generate the best code for things like x86_64 and arm64.
> > > 
> > > Specifically, the __builtin_mul_overflow() already has to compute the
> > > 128 bit product anyway for most architectures, so using that directly
> > > then leads to saner asm and easier to understand code.
> > > 
> > > AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> > > __int128 and will thus be demoted to doing what we do on 32bit.
> > 
> > I forgot we had ARCH_SUPPORTS_INT128, and I suppose this had better
> > check that. Now, s390 is a bit weird and excludes GCC even though that
> > definitely supports __int128. Supposedly there was a issue, but perhaps
> > modern GCC has this fixed?
> 
> The reason was not a bug (in terms of incorrect code), but gcc generated a
> larger than 6kb stack frame for one of the crypto functions - see commit
> fbac266f095d ("s390: select ARCH_SUPPORTS_INT128"). That's just too large to
> be acceptable.
> 
> If I remember correctly gcc generated code which did not reuse known to be
> unused stack slots, but created for every variable a new stack slot, for
> whatever reason. Which then resulted in such a huge stack frame. With clang
> the stack frame size was only 1,5kb.
> 
> I just checked: with gcc 15.2.0 we are down to 4.5kb. Still too large :)
> 
> Adding s390 compiler folks; but I seem to remember I discussed that back then
> with them.

Right, I had indeed found that commit, but since it was from 2023, I had
some hope that gcc-1[56] might have it resolved already. Oh well.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-05-04 13:16             ` Heiko Carstens
  2026-05-04 14:57               ` Peter Zijlstra
@ 2026-05-04 17:40               ` Stefan Schulze Frielinghaus
  1 sibling, 0 replies; 17+ messages in thread
From: Stefan Schulze Frielinghaus @ 2026-05-04 17:40 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Peter Zijlstra, K Prateek Nayak, Zhan Xusheng, Ingo Molnar,
	Vincent Guittot, Juri Lelli, linux-kernel, Zhan Xusheng,
	Dietmar Eggemann, Valentin Schneider, Ben Segall, Steven Rostedt,
	Mel Gorman, gor, agordeev, borntraeger, svens, maddy, mpe,
	chleroy, Juergen Christ

On Mon, May 04, 2026 at 03:16:09PM +0200, Heiko Carstens wrote:
> On Mon, May 04, 2026 at 01:22:39PM +0200, Peter Zijlstra wrote:
> > On Fri, May 01, 2026 at 12:40:06PM +0200, Peter Zijlstra wrote:
> > 
> > > Anyway, I had a poke around with godbolt, and the below seems to
> > > generate the best code for things like x86_64 and arm64.
> > > 
> > > Specifically, the __builtin_mul_overflow() already has to compute the
> > > 128 bit product anyway for most architectures, so using that directly
> > > then leads to saner asm and easier to understand code.
> > > 
> > > AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> > > __int128 and will thus be demoted to doing what we do on 32bit.
> > 
> > I forgot we had ARCH_SUPPORTS_INT128, and I suppose this had better
> > check that. Now, s390 is a bit weird and excludes GCC even though that
> > definitely supports __int128. Supposedly there was a issue, but perhaps
> > modern GCC has this fixed?
> 
> The reason was not a bug (in terms of incorrect code), but gcc generated a
> larger than 6kb stack frame for one of the crypto functions - see commit
> fbac266f095d ("s390: select ARCH_SUPPORTS_INT128"). That's just too large to
> be acceptable.
> 
> If I remember correctly gcc generated code which did not reuse known to be
> unused stack slots, but created for every variable a new stack slot, for
> whatever reason. Which then resulted in such a huge stack frame. With clang
> the stack frame size was only 1,5kb.
> 
> I just checked: with gcc 15.2.0 we are down to 4.5kb. Still too large :)
> 
> Adding s390 compiler folks; but I seem to remember I discussed that back then
> with them.

I briefly remember our discussion.  The short story is that I'm not
aware of any change in that area.  The long story is that gcc is
conservative when it comes to re-using stack slots for different
objects.  Thus, if gcc cannot prove that a stack slot is at some point
in time not referred to anymore, the stack slot is not reused for some
other object.  In the particular case, due to inlining and IIRC loop
unrolling, a huge amount of stack slots had to be allocated and couldn't
be coalesced.

Cheers,
Stefan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
  2026-05-04 11:22           ` Peter Zijlstra
  2026-05-04 13:16             ` Heiko Carstens
@ 2026-05-05 10:31             ` Peter Zijlstra
  2026-05-05 10:50               ` [tip: sched/urgent] " tip-bot2 for Zhan Xusheng
  2026-05-05 14:13               ` tip-bot2 for Zhan Xusheng
  1 sibling, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-05 10:31 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Zhan Xusheng, Ingo Molnar, Vincent Guittot, Juri Lelli,
	linux-kernel, Zhan Xusheng, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Steven Rostedt, Mel Gorman, hca, gor, agordeev,
	borntraeger, svens, maddy, mpe, chleroy

On Mon, May 04, 2026 at 01:22:39PM +0200, Peter Zijlstra wrote:

> +#ifdef CONFIG_64BIT
> +#ifdef CONFIG_ARCH_SUPPORTS_INT128
> +	/* An __int128 mult should be cheaper than a division. */
> +	return avg >= (__int128)key * load;
> +#else
> +	/*
> +	 * Since the divisor is @load, which is guaranteed positive, the
> +	 * inequality: avg >= key * load, can be rewritten into a division
> +	 * like: avg/load > key || (avg/load == key && avg%load >= 0).
> +	 */
> +	s64 div = avg / load;
> +	if (div > key)
> +		return true;
> +	return div == key && avg % load >= 0;

Bah, I could of course have put the __builtin_mul_overflow() thing in
here. That probably generates better code than this for those
architectures not doing SUPPORT_INT128.

Just got focussed on keeping the thing exact, but that isn't needed.
That overflow thing is good enough.

---
Subject: sched/fair: Fix overflow in vruntime_eligible()
From: Zhan Xusheng <zhanxusheng@xiaomi.com>
Date: Fri, 1 May 2026 12:40:06 +0200

Zhan Xusheng reported running into sporadic a s64 mult overflow in
vruntime_eligible().

When constructing a worst case scenario:

If you have cgroups, then you can have an entity of weight 2 (per
calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC)
* NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().

The other extreme is 100*NICE_0_LOAD, thus you get:

{key, weight}[] := {
  puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2               },
  max:  { 0,                                 100*NICE_0_LOAD },
}

The avg_vruntime() would end up being very close to 0 (which is
zero_vruntime), so no real help making that more accurate.

vruntime_eligible(puny) ends up with:

 avg  = 2 * puny.key (+ 0)
 load = 2 + 100 * NICE_0_LOAD

 avg >= puny.key * load

And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will
overflow s64.

Zhan suggested using __builtin_mul_overflow(), however after staring at
compiler output for various architectures using godbolt, it seems that using an
__int128 multiplication often results in better code.

Specifically, a number of architectures already compute the __int128 product to
determine the overflow. Eg. arm64 already has the 'smulh' instruction used. By
explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern,
which modern cores can fuse (armv8-a clang-22.1.0). x86_64 has less branches
(no OF handling).

Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the
__builtin_mul_overflow() variant as a fallback.

[peterz: Changelog and __int128 bits]
Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")
Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com>
Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,11 +882,11 @@ bool update_entity_lag(struct cfs_rq *cf
  *
  * lag_i >= 0 -> V >= v_i
  *
- *     \Sum (v_i - v)*w_i
- * V = ------------------ + v
+ *     \Sum (v_i - v0)*w_i
+ * V = ------------------- + v0
  *          \Sum w_i
  *
- * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ * lag_i >= 0 -> \Sum (v_i - v0)*w_i >= (v_i - v0)*(\Sum w_i)
  *
  * Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
  *       to the loss in precision caused by the division.
@@ -894,7 +894,7 @@ bool update_entity_lag(struct cfs_rq *cf
 static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	s64 avg = cfs_rq->sum_w_vruntime;
+	s64 key, avg = cfs_rq->sum_w_vruntime;
 	long load = cfs_rq->sum_weight;
 
 	if (curr && curr->on_rq) {
@@ -904,7 +904,36 @@ static int vruntime_eligible(struct cfs_
 		load += weight;
 	}
 
-	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+	key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+	/*
+	 * The worst case term for @key includes 'NSEC_TICK * NICE_0_LOAD'
+	 * and @load obviously includes NICE_0_LOAD. NSEC_TICK is around 24
+	 * bits, while NICE_0_LOAD is 20 on 64bit and 10 otherwise.
+	 *
+	 * This gives that on 64bit the product will be at least 64bit which
+	 * overflows s64, while on 32bit it will only be 44bits and should fit
+	 * comfortably.
+	 */
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+	/* This often results in simpler code than __builtin_mul_overflow(). */
+	return avg >= (__int128)key * load;
+#else
+	s64 rhs;
+	/*
+	 * On overflow, the sign of key tells us the correct answer: a large
+	 * positive key means vruntime >> V, so not eligible; a large negative
+	 * key means vruntime << V, so eligible.
+	 */
+	if (check_mul_overflow(key, load, &rhs))
+		return key <= 0;
+
+	return avg >= rhs;
+#endif
+#else /* 32bit */
+	return avg >= key * load;
+#endif
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [tip: sched/urgent] sched/fair: Fix overflow in vruntime_eligible()
  2026-05-05 10:31             ` Peter Zijlstra
@ 2026-05-05 10:50               ` tip-bot2 for Zhan Xusheng
  2026-05-05 14:13               ` tip-bot2 for Zhan Xusheng
  1 sibling, 0 replies; 17+ messages in thread
From: tip-bot2 for Zhan Xusheng @ 2026-05-05 10:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Zhan Xusheng, Zhan Xusheng, Peter Zijlstra (Intel), x86,
	linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     416dd8fc352b6027210b42de5889ee280e4bc40c
Gitweb:        https://git.kernel.org/tip/416dd8fc352b6027210b42de5889ee280e4bc40c
Author:        Zhan Xusheng <zhanxusheng@xiaomi.com>
AuthorDate:    Fri, 01 May 2026 12:40:06 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 05 May 2026 12:44:24 +02:00

sched/fair: Fix overflow in vruntime_eligible()

Zhan Xusheng reported running into sporadic a s64 mult overflow in
vruntime_eligible().

When constructing a worst case scenario:

If you have cgroups, then you can have an entity of weight 2 (per
calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC)
* NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().

The other extreme is 100*NICE_0_LOAD, thus you get:

{key, weight}[] := {
  puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2               },
  max:  { 0,                                 100*NICE_0_LOAD },
}

The avg_vruntime() would end up being very close to 0 (which is
zero_vruntime), so no real help making that more accurate.

vruntime_eligible(puny) ends up with:

 avg  = 2 * puny.key (+ 0)
 load = 2 + 100 * NICE_0_LOAD

 avg >= puny.key * load

And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will
overflow s64.

Zhan suggested using __builtin_mul_overflow(), however after staring at
compiler output for various architectures using godbolt, it seems that using an
__int128 multiplication often results in better code.

Specifically, a number of architectures already compute the __int128 product to
determine the overflow. Eg. arm64 already has the 'smulh' instruction used. By
explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern,
which modern cores can fuse (armv8-a clang-22.1.0). x86_64 has less branches
(no OF handling).

Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the
__builtin_mul_overflow() variant as a fallback.

[peterz: Changelog and __int128 bits]
Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")
Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com>
Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260505103155.GN3102924%40noisy.programming.kicks-ass.net
---
 kernel/sched/fair.c | 39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7289658..b91c8b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,11 +882,11 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
  *
  * lag_i >= 0 -> V >= v_i
  *
- *     \Sum (v_i - v)*w_i
- * V = ------------------ + v
+ *     \Sum (v_i - v0)*w_i
+ * V = ------------------- + v0
  *          \Sum w_i
  *
- * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ * lag_i >= 0 -> \Sum (v_i - v0)*w_i >= (v_i - v0)*(\Sum w_i)
  *
  * Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
  *       to the loss in precision caused by the division.
@@ -894,7 +894,7 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	s64 avg = cfs_rq->sum_w_vruntime;
+	s64 key, avg = cfs_rq->sum_w_vruntime;
 	long load = cfs_rq->sum_weight;
 
 	if (curr && curr->on_rq) {
@@ -904,7 +904,36 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+	key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+	/*
+	 * The worst case term for @key includes 'NSEC_TICK * NICE_0_LOAD'
+	 * and @load obviously includes NICE_0_LOAD. NSEC_TICK is around 24
+	 * bits, while NICE_0_LOAD is 20 on 64bit and 10 otherwise.
+	 *
+	 * This gives that on 64bit the product will be at least 64bit which
+	 * overflows s64, while on 32bit it will only be 44bits and should fit
+	 * comfortably.
+	 */
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+	/* This often results in simpler code than __builtin_mul_overflow(). */
+	return avg >= (__int128)key * load;
+#else
+	s64 rhs;
+	/*
+	 * On overflow, the sign of key tells us the correct answer: a large
+	 * positive key means vruntime >> V, so not eligible; a large negative
+	 * key means vruntime << V, so eligible.
+	 */
+	if (check_mul_overflow(key, load, &rhs))
+		return key <= 0;
+
+	return avg >= rhs;
+#endif
+#else /* 32bit */
+	return avg >= key * load;
+#endif
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [tip: sched/urgent] sched/fair: Fix overflow in vruntime_eligible()
  2026-05-05 10:31             ` Peter Zijlstra
  2026-05-05 10:50               ` [tip: sched/urgent] " tip-bot2 for Zhan Xusheng
@ 2026-05-05 14:13               ` tip-bot2 for Zhan Xusheng
  1 sibling, 0 replies; 17+ messages in thread
From: tip-bot2 for Zhan Xusheng @ 2026-05-05 14:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Zhan Xusheng, Zhan Xusheng, Peter Zijlstra (Intel), x86,
	linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     13abe4ce230f22a522facdaba87d1c137054c3a6
Gitweb:        https://git.kernel.org/tip/13abe4ce230f22a522facdaba87d1c137054c3a6
Author:        Zhan Xusheng <zhanxusheng@xiaomi.com>
AuthorDate:    Fri, 01 May 2026 12:40:06 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 05 May 2026 16:03:12 +02:00

sched/fair: Fix overflow in vruntime_eligible()

Zhan Xusheng reported running into sporadic a s64 mult overflow in
vruntime_eligible().

When constructing a worst case scenario:

If you have cgroups, then you can have an entity of weight 2 (per
calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC)
* NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().

The other extreme is 100*NICE_0_LOAD, thus you get:

{key, weight}[] := {
  puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2               },
  max:  { 0,                                 100*NICE_0_LOAD },
}

The avg_vruntime() would end up being very close to 0 (which is
zero_vruntime), so no real help making that more accurate.

vruntime_eligible(puny) ends up with:

 avg  = 2 * puny.key (+ 0)
 load = 2 + 100 * NICE_0_LOAD

 avg >= puny.key * load

And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will
overflow s64.

Zhan suggested using __builtin_mul_overflow(), however after staring at
compiler output for various architectures using godbolt, it seems that using an
__int128 multiplication often results in better code.

Specifically, a number of architectures already compute the __int128 product to
determine the overflow. Eg. arm64 already has the 'smulh' instruction used. By
explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern,
which modern cores can fuse (armv8-a clang-22.1.0). x86_64 has less branches
(no OF handling).

Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the
__builtin_mul_overflow() variant as a fallback.

[peterz: Changelog and __int128 bits]
Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")
Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com>
Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260505103155.GN3102924%40noisy.programming.kicks-ass.net
---
 kernel/sched/fair.c | 39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7289658..b91c8b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,11 +882,11 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
  *
  * lag_i >= 0 -> V >= v_i
  *
- *     \Sum (v_i - v)*w_i
- * V = ------------------ + v
+ *     \Sum (v_i - v0)*w_i
+ * V = ------------------- + v0
  *          \Sum w_i
  *
- * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ * lag_i >= 0 -> \Sum (v_i - v0)*w_i >= (v_i - v0)*(\Sum w_i)
  *
  * Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
  *       to the loss in precision caused by the division.
@@ -894,7 +894,7 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	s64 avg = cfs_rq->sum_w_vruntime;
+	s64 key, avg = cfs_rq->sum_w_vruntime;
 	long load = cfs_rq->sum_weight;
 
 	if (curr && curr->on_rq) {
@@ -904,7 +904,36 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+	key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+	/*
+	 * The worst case term for @key includes 'NSEC_TICK * NICE_0_LOAD'
+	 * and @load obviously includes NICE_0_LOAD. NSEC_TICK is around 24
+	 * bits, while NICE_0_LOAD is 20 on 64bit and 10 otherwise.
+	 *
+	 * This gives that on 64bit the product will be at least 64bit which
+	 * overflows s64, while on 32bit it will only be 44bits and should fit
+	 * comfortably.
+	 */
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+	/* This often results in simpler code than __builtin_mul_overflow(). */
+	return avg >= (__int128)key * load;
+#else
+	s64 rhs;
+	/*
+	 * On overflow, the sign of key tells us the correct answer: a large
+	 * positive key means vruntime >> V, so not eligible; a large negative
+	 * key means vruntime << V, so eligible.
+	 */
+	if (check_mul_overflow(key, load, &rhs))
+		return key <= 0;
+
+	return avg >= rhs;
+#endif
+#else /* 32bit */
+	return avg >= key * load;
+#endif
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2026-05-05 14:13 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-15 14:57 [PATCH] sched/fair: Fix overflow in vruntime_eligible() Zhan Xusheng
2026-04-28 14:49 ` [PATCH RESEND] " Zhan Xusheng
2026-04-28 16:17   ` K Prateek Nayak
2026-04-28 17:35     ` Peter Zijlstra
2026-04-29  5:03       ` K Prateek Nayak
2026-04-29  7:31         ` Zhan Xusheng
2026-05-01 10:40         ` Peter Zijlstra
2026-05-01 10:48           ` Peter Zijlstra
2026-05-01 13:05           ` David Laight
2026-05-01 13:43             ` Peter Zijlstra
2026-05-04 11:22           ` Peter Zijlstra
2026-05-04 13:16             ` Heiko Carstens
2026-05-04 14:57               ` Peter Zijlstra
2026-05-04 17:40               ` Stefan Schulze Frielinghaus
2026-05-05 10:31             ` Peter Zijlstra
2026-05-05 10:50               ` [tip: sched/urgent] " tip-bot2 for Zhan Xusheng
2026-05-05 14:13               ` tip-bot2 for Zhan Xusheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox