* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
2026-05-01 10:40 ` Peter Zijlstra
@ 2026-05-01 10:48 ` Peter Zijlstra
2026-05-01 13:05 ` David Laight
2026-05-04 11:22 ` Peter Zijlstra
2 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-01 10:48 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Zhan Xusheng, Ingo Molnar, Vincent Guittot, Juri Lelli,
linux-kernel, Zhan Xusheng, Dietmar Eggemann, Valentin Schneider,
Ben Segall, Steven Rostedt, Mel Gorman
On Fri, May 01, 2026 at 12:40:06PM +0200, Peter Zijlstra wrote:
> Anyway, I had a poke around with godbolt, and the below seems to
> generate the best code for things like x86_64 and arm64.
https://godbolt.org/z/T389jd7bn
> Specifically, the __builtin_mul_overflow() already has to compute the
> 128 bit product anyway for most architectures, so using that directly
> then leads to saner asm and easier to understand code.
>
> AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> __int128 and will thus be demoted to doing what we do on 32bit.
>
> Now all I need to do is write a coherent Changelog to go with it ...
> *sigh*
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..214fe9d99834 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -904,7 +904,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
> load += weight;
> }
>
> - return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
> + return avg >= (sched_double_long_t)vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
> }
>
> int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9f63b15d309d..f15f4468efe5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -146,7 +146,7 @@ extern struct list_head asym_cap_list;
> * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
> * increase coverage and consistency always enable it on 64-bit platforms.
> */
> -#ifdef CONFIG_64BIT
> +#if defined(CONFIG_64BIT) && defined(__SIZEOF_INT128__)
> # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
> # define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
> # define scale_load_down(w) \
> @@ -157,10 +157,12 @@ extern struct list_head asym_cap_list;
> __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \
> __w; \
> })
> +typedef __int128 sched_double_long_t;
> #else
> # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
> # define scale_load(w) (w)
> # define scale_load_down(w) (w)
> +typedef s64 sched_double_long_t;
> #endif
>
> /*
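For reference, a minimal standalone sketch (not from the patch; the
helper name is made up) of what the widened comparison boils down to on
an architecture with __int128:

#include <stdint.h>

/* With sched_double_long_t == __int128, the eligibility test is a
 * single 64x64->128 widening multiply plus a 128-bit compare; x86_64
 * and arm64 do this in a handful of instructions. */
static int eligible_wide(int64_t avg, int64_t key, long load)
{
        return avg >= (__int128)key * load;
}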
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
2026-05-01 10:40 ` Peter Zijlstra
2026-05-01 10:48 ` Peter Zijlstra
@ 2026-05-01 13:05 ` David Laight
2026-05-01 13:43 ` Peter Zijlstra
2026-05-04 11:22 ` Peter Zijlstra
2 siblings, 1 reply; 17+ messages in thread
From: David Laight @ 2026-05-01 13:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: K Prateek Nayak, Zhan Xusheng, Ingo Molnar, Vincent Guittot,
Juri Lelli, linux-kernel, Zhan Xusheng, Dietmar Eggemann,
Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman
On Fri, 1 May 2026 12:40:06 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Apr 29, 2026 at 10:33:48AM +0530, K Prateek Nayak wrote:
>
> > > Anyway... let me construct a worst case.
> > >
> > > So if you have this cgroup crap on, then you can have an entity of
> > > weight 2, and vlag should then be bounded by: (slice+TICK_NSEC) *
> > > NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
> > >
> > > The other end is 100*NICE_0_LOAD, so lets wake that, then you get:
> > >
> > > {key, weight}[] := {
> > > puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2 },
> > > max: { 0, 100*NICE_0_LOAD },
> > > }
> >
> > So MAX_SHARES is (1UL << 18) and we then scale it up so the weight
> > can go up to (1 << 28), I suppose, in the UP setting.
> >
> > So let us assume there is a single task with the largest custom slice
> > on each cgroup, making the min_slice equal to 100ms. For the sake of
> > simplicity, things start off in such a way that puny, having a
> > vruntime of say -1, is picked over max, which had a vruntime of 0.
> >
> > Deadline then becomes:
> >
> > puny->deadline = -1 + __calc_delta(100ms, NICE_0_LOAD, 2)
> >
> > which is 52428800000000 - 1.
> >
> > At a 100HZ (10ms) tick + RUN_TO_PARITY, we get a 10ms error, so that
> > puts the entity_key() at 57671680000000 (46 bits)
> >
> > >
> > > The avg_vruntime() would end up being very close to 0 (which is
> > > zero_vruntime), so no real help making that more accurate.
> > >
> > > vruntime_eligible(puny) ends up with:
> > >
> > > avg = 2 * puny.key (+ 0)
> > > weight = 2 + 100 * NICE_0_LOAD
> > >
> > > avg >= puny.key * weight
> > >
> > > And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100
> > > and yes, I suppose that can exceed 64bit :-(
> >
> > Then the worst case right side would be:
> >
> > 57671680000000 * load /* load = ((1 << 28) + 2) */
> >
> > and yes, that goes beyond 64 bits at 15481123719086080000000.
> >
> > I'm running with the following configuration:
> >
> > echo 2 > /sys/fs/cgroup/cg0/cpu.weight
> > echo 10000 > /sys/fs/cgroup/cg1/cpu.weight
> >
> > A weight of 2 actually gets scaled up to 20480 in the UP scenario, so
> > I try to make it even smaller with SMP and wait until it gets picked
> > again.
>
> Make sure all of the actual activity of that cgroup happens 'elsewhere'
> and you can get it down to an actual 2 per calc_group_shares().
>
> > without any splat, so I'm not sure if there is something that prevents
> > a possible crash, since a weight of 104857600 should definitely have
> > made that entity_key() overflow.
>
> Right.
>
> Anyway, I had a poke around with godbolt, and the below seems to
> generate the best code for things like x86_64 and arm64.
>
> Specifically, the __builtin_mul_overflow() already has to compute the
> 128 bit product anyway for most architectures, so using that directly
> then leads to saner asm and easier to understand code.
Older versions of gcc (maybe before 10?) generate a call to the full
128 bit multiply (_muldid3 ?) on mips64 - rather than just using the instruction
that generates the high part of the product.
The library function is now included but is slightly more complex because
it does the high*low multiplies as well.
The same happens on sparc64 and the asm multiply code looks strange.
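For illustration, a standalone sketch (not from the thread's patches): a
64x64->128 multiply needs only the regular low-part mul plus one
high-part multiply instruction (arm64's smulh, per the message above),
whereas a __multi3-style library routine implements a full 128x128
multiply including the high*low cross products mentioned here:

/* Compilers with native __int128 support reduce this to a single
 * high-part multiply instruction; no library call is needed. */
static long mul_high64(long a, long b)
{
        return (long)(((__int128)a * b) >> 64);
}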
>
> AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> __int128 and will thus be demoted to doing what we do on 32bit.
>
> Now all I need to do is write a coherent Changelog to go with it ...
> *sigh*
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..214fe9d99834 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -904,7 +904,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
> load += weight;
> }
>
> - return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
> + return avg >= (sched_double_long_t)vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
> }
>
> int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9f63b15d309d..f15f4468efe5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -146,7 +146,7 @@ extern struct list_head asym_cap_list;
> * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
> * increase coverage and consistency always enable it on 64-bit platforms.
> */
> -#ifdef CONFIG_64BIT
> +#if defined(CONFIG_64BIT) && defined(__SIZEOF_INT128__)
> # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
> # define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
> # define scale_load_down(w) \
> @@ -157,10 +157,12 @@ extern struct list_head asym_cap_list;
> __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \
> __w; \
> })
> +typedef __int128 sched_double_long_t;
> #else
> # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
> # define scale_load(w) (w)
> # define scale_load_down(w) (w)
> +typedef s64 sched_double_long_t;
Seems strange that this is signed but the 64bit one is unsigned.
-- David
> #endif
>
> /*
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
2026-05-01 13:05 ` David Laight
@ 2026-05-01 13:43 ` Peter Zijlstra
0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-01 13:43 UTC (permalink / raw)
To: David Laight
Cc: K Prateek Nayak, Zhan Xusheng, Ingo Molnar, Vincent Guittot,
Juri Lelli, linux-kernel, Zhan Xusheng, Dietmar Eggemann,
Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman
On Fri, May 01, 2026 at 02:05:21PM +0100, David Laight wrote:
> > Anyway, I had a poke around with godbolt, and the below seems to
> > generate the best code for things like x86_64 and arm64.
> >
> > Specifically, the __builtin_mul_overflow() already has to compute the
> > 128 bit product anyway for most architectures, so using that directly
> > then leads to saner asm and easier to understand code.
>
> Older versions of gcc (maybe before 10?) generate a call to the full
> 128 bit multiply (_muldid3 ?) on mips64 - rather than just using the instruction
> that generates the high part of the product.
> The library function is now included but is slightly more complex because
> it does the high*low multiples as well.
>
> The same happens on sparc64 and the asm multiply code looks strange.
godbolt only has mips64 gcc-9.5 (no 8.1, which is the very oldest gcc we
support) and that doesn't seem to generate calls to library functions.
I don't particularly care about the quality of old compiler output, nor
do I really care for the performance aspect on these fringe
architectures. As long as it builds and isn't obviously broken, things
are good.
Using the gcc-8.5 toolchains from kernel.org (the oldest still supported
set) I see no library calls for MIPS64.
I do see them for Sparc64, but that actually has that library call
implemented.
Anyway, I'll throw it at the robots and see what, if anything, pops out.
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 9f63b15d309d..f15f4468efe5 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -146,7 +146,7 @@ extern struct list_head asym_cap_list;
> > * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
> > * increase coverage and consistency always enable it on 64-bit platforms.
> > */
> > -#ifdef CONFIG_64BIT
> > +#if defined(CONFIG_64BIT) && defined(__SIZEOF_INT128__)
> > # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
> > # define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
> > # define scale_load_down(w) \
> > @@ -157,10 +157,12 @@ extern struct list_head asym_cap_list;
> > __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \
> > __w; \
> > })
> > +typedef __int128 sched_double_long_t;
> > #else
> > # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
> > # define scale_load(w) (w)
> > # define scale_load_down(w) (w)
> > +typedef s64 sched_double_long_t;
>
> Seems strange that this is signed but the 64bit one is unsigned.
They're both signed. __int128 is a signed type.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
2026-05-01 10:40 ` Peter Zijlstra
2026-05-01 10:48 ` Peter Zijlstra
2026-05-01 13:05 ` David Laight
@ 2026-05-04 11:22 ` Peter Zijlstra
2026-05-04 13:16 ` Heiko Carstens
2026-05-05 10:31 ` Peter Zijlstra
2 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-04 11:22 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Zhan Xusheng, Ingo Molnar, Vincent Guittot, Juri Lelli,
linux-kernel, Zhan Xusheng, Dietmar Eggemann, Valentin Schneider,
Ben Segall, Steven Rostedt, Mel Gorman, hca, gor, agordeev,
borntraeger, svens, maddy, mpe, chleroy
On Fri, May 01, 2026 at 12:40:06PM +0200, Peter Zijlstra wrote:
> Anyway, I had a poke around with godbolt, and the below seems to
> generate the best code for things like x86_64 and arm64.
>
> Specifically, the __builtin_mul_overflow() already has to compute the
> 128 bit product anyway for most architectures, so using that directly
> then leads to saner asm and easier to understand code.
>
> AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> __int128 and will thus be demoted to doing what we do on 32bit.
I forgot we had ARCH_SUPPORTS_INT128, and I suppose this had better
check that. Now, s390 is a bit weird and excludes GCC even though that
definitely supports __int128. Supposedly there was an issue, but perhaps
modern GCC has this fixed?
Also, PowerPC does not seem to set this at all.
Anyway, I've then ended up with this; it keeps all 64bit architectures
on the extra precision, but incurs an extra division for those that do
not have __int128 selected.
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..01405f6011b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,11 +882,11 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
*
* lag_i >= 0 -> V >= v_i
*
- * \Sum (v_i - v)*w_i
- * V = ------------------ + v
+ * \Sum (v_i - v0)*w_i
+ * V = ------------------- + v0
* \Sum w_i
*
- * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ * lag_i >= 0 -> \Sum (v_i - v0)*w_i >= (v_i - v0)*(\Sum w_i)
*
* Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
* to the loss in precision caused by the division.
@@ -894,7 +894,7 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
{
struct sched_entity *curr = cfs_rq->curr;
- s64 avg = cfs_rq->sum_w_vruntime;
+ s64 key, avg = cfs_rq->sum_w_vruntime;
long load = cfs_rq->sum_weight;
if (curr && curr->on_rq) {
@@ -904,7 +904,35 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
load += weight;
}
- return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+ key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+ /*
> + * The worst case term for @key includes 'TICK_NSEC * NICE_0_LOAD'
> + * and @load obviously includes NICE_0_LOAD. TICK_NSEC is around 24
> + * bits, while NICE_0_LOAD is 20 on 64bit and 10 otherwise.
> + *
> + * This means that on 64bit the product can be a full 64 bits, which
> + * overflows s64, while on 32bit it will only be 44 bits and should fit
+ * comfortably.
+ */
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+ /* An __int128 mult should be cheaper than a division. */
+ return avg >= (__int128)key * load;
+#else
+ /*
+ * Since the divisor is @load, which is guaranteed positive, the
+ * inequality: avg >= key * load, can be rewritten into a division
+ * like: avg/load > key || (avg/load == key && avg%load >= 0).
+ */
+ s64 div = avg / load;
+ if (div > key)
+ return true;
+ return div == key && avg % load >= 0;
+#endif
+#else /* 32bit */
+ return avg >= key * load;
+#endif
}
int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
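The division rewrite above can be checked by brute force over a small
range; a throwaway test sketch (not part of the patch), relying only on
C's truncating division and load > 0:

#include <assert.h>

int main(void)
{
        for (long avg = -50; avg <= 50; avg++) {
                for (long key = -10; key <= 10; key++) {
                        for (long load = 1; load <= 10; load++) {
                                long div = avg / load;
                                int exact = avg >= key * load;
                                int rewrite = div > key ||
                                              (div == key && avg % load >= 0);

                                /* The two forms must always agree. */
                                assert(exact == rewrite);
                        }
                }
        }
        return 0;
}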
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
2026-05-04 11:22 ` Peter Zijlstra
@ 2026-05-04 13:16 ` Heiko Carstens
2026-05-04 14:57 ` Peter Zijlstra
2026-05-04 17:40 ` Stefan Schulze Frielinghaus
2026-05-05 10:31 ` Peter Zijlstra
1 sibling, 2 replies; 17+ messages in thread
From: Heiko Carstens @ 2026-05-04 13:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: K Prateek Nayak, Zhan Xusheng, Ingo Molnar, Vincent Guittot,
Juri Lelli, linux-kernel, Zhan Xusheng, Dietmar Eggemann,
Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman, gor,
agordeev, borntraeger, svens, maddy, mpe, chleroy,
Stefan Schulze Frielinghaus, Juergen Christ
On Mon, May 04, 2026 at 01:22:39PM +0200, Peter Zijlstra wrote:
> On Fri, May 01, 2026 at 12:40:06PM +0200, Peter Zijlstra wrote:
>
> > Anyway, I had a poke around with godbolt, and the below seems to
> > generate the best code for things like x86_64 and arm64.
> >
> > Specifically, the __builtin_mul_overflow() already has to compute the
> > 128 bit product anyway for most architectures, so using that directly
> > then leads to saner asm and easier to understand code.
> >
> > AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> > __int128 and will thus be demoted to doing what we do on 32bit.
>
> I forgot we had ARCH_SUPPORTS_INT128, and I suppose this had better
> check that. Now, s390 is a bit weird and excludes GCC even though that
> definitely supports __int128. Supposedly there was a issue, but perhaps
> modern GCC has this fixed?
The reason was not a bug (in terms of incorrect code), but gcc generated a
larger than 6kb stack frame for one of the crypto functions - see commit
fbac266f095d ("s390: select ARCH_SUPPORTS_INT128"). That's just too large to
be acceptable.
If I remember correctly, gcc generated code which did not reuse stack
slots known to be unused, but created a new stack slot for every
variable, for whatever reason, which then resulted in such a huge stack
frame. With clang the stack frame size was only 1.5kb.
I just checked: with gcc 15.2.0 we are down to 4.5kb. Still too large :)
Adding s390 compiler folks; but I seem to remember I discussed that back then
with them.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
2026-05-04 13:16 ` Heiko Carstens
@ 2026-05-04 14:57 ` Peter Zijlstra
2026-05-04 17:40 ` Stefan Schulze Frielinghaus
1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-04 14:57 UTC (permalink / raw)
To: Heiko Carstens
Cc: K Prateek Nayak, Zhan Xusheng, Ingo Molnar, Vincent Guittot,
Juri Lelli, linux-kernel, Zhan Xusheng, Dietmar Eggemann,
Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman, gor,
agordeev, borntraeger, svens, maddy, mpe, chleroy,
Stefan Schulze Frielinghaus, Juergen Christ
On Mon, May 04, 2026 at 03:16:09PM +0200, Heiko Carstens wrote:
> On Mon, May 04, 2026 at 01:22:39PM +0200, Peter Zijlstra wrote:
> > On Fri, May 01, 2026 at 12:40:06PM +0200, Peter Zijlstra wrote:
> >
> > > Anyway, I had a poke around with godbolt, and the below seems to
> > > generate the best code for things like x86_64 and arm64.
> > >
> > > Specifically, the __builtin_mul_overflow() already has to compute the
> > > 128 bit product anyway for most architectures, so using that directly
> > > then leads to saner asm and easier to understand code.
> > >
> > > AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> > > __int128 and will thus be demoted to doing what we do on 32bit.
> >
> > I forgot we had ARCH_SUPPORTS_INT128, and I suppose this had better
> > check that. Now, s390 is a bit weird and excludes GCC even though that
> > definitely supports __int128. Supposedly there was a issue, but perhaps
> > modern GCC has this fixed?
>
> The reason was not a bug (in terms of incorrect code), but gcc generated a
> larger than 6kb stack frame for one of the crypto functions - see commit
> fbac266f095d ("s390: select ARCH_SUPPORTS_INT128"). That's just too large to
> be acceptable.
>
> If I remember correctly gcc generated code which did not reuse known to be
> unused stack slots, but created for every variable a new stack slot, for
> whatever reason. Which then resulted in such a huge stack frame. With clang
> the stack frame size was only 1,5kb.
>
> I just checked: with gcc 15.2.0 we are down to 4.5kb. Still too large :)
>
> Adding s390 compiler folks; but I seem to remember I discussed that back then
> with them.
Right, I had indeed found that commit, but since it was from 2023, I had
some hope that gcc-1[56] might have it resolved already. Oh well.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
2026-05-04 13:16 ` Heiko Carstens
2026-05-04 14:57 ` Peter Zijlstra
@ 2026-05-04 17:40 ` Stefan Schulze Frielinghaus
1 sibling, 0 replies; 17+ messages in thread
From: Stefan Schulze Frielinghaus @ 2026-05-04 17:40 UTC (permalink / raw)
To: Heiko Carstens
Cc: Peter Zijlstra, K Prateek Nayak, Zhan Xusheng, Ingo Molnar,
Vincent Guittot, Juri Lelli, linux-kernel, Zhan Xusheng,
Dietmar Eggemann, Valentin Schneider, Ben Segall, Steven Rostedt,
Mel Gorman, gor, agordeev, borntraeger, svens, maddy, mpe,
chleroy, Juergen Christ
On Mon, May 04, 2026 at 03:16:09PM +0200, Heiko Carstens wrote:
> On Mon, May 04, 2026 at 01:22:39PM +0200, Peter Zijlstra wrote:
> > On Fri, May 01, 2026 at 12:40:06PM +0200, Peter Zijlstra wrote:
> >
> > > Anyway, I had a poke around with godbolt, and the below seems to
> > > generate the best code for things like x86_64 and arm64.
> > >
> > > Specifically, the __builtin_mul_overflow() already has to compute the
> > > 128 bit product anyway for most architectures, so using that directly
> > > then leads to saner asm and easier to understand code.
> > >
> > > AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> > > __int128 and will thus be demoted to doing what we do on 32bit.
> >
> > I forgot we had ARCH_SUPPORTS_INT128, and I suppose this had better
> > check that. Now, s390 is a bit weird and excludes GCC even though that
> > definitely supports __int128. Supposedly there was a issue, but perhaps
> > modern GCC has this fixed?
>
> The reason was not a bug (in terms of incorrect code), but gcc generated a
> larger than 6kb stack frame for one of the crypto functions - see commit
> fbac266f095d ("s390: select ARCH_SUPPORTS_INT128"). That's just too large to
> be acceptable.
>
> If I remember correctly gcc generated code which did not reuse known to be
> unused stack slots, but created for every variable a new stack slot, for
> whatever reason. Which then resulted in such a huge stack frame. With clang
> the stack frame size was only 1,5kb.
>
> I just checked: with gcc 15.2.0 we are down to 4.5kb. Still too large :)
>
> Adding s390 compiler folks; but I seem to remember I discussed that back then
> with them.
I vaguely remember our discussion. The short story is that I'm not
aware of any change in that area. The long story is that gcc is
conservative when it comes to re-using stack slots for different
objects. Thus, if gcc cannot prove that a stack slot is no longer
referred to at some point in time, the stack slot is not reused for
some other object. In the particular case, due to inlining and IIRC
loop unrolling, a huge number of stack slots had to be allocated and
couldn't be coalesced.
Cheers,
Stefan
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
2026-05-04 11:22 ` Peter Zijlstra
2026-05-04 13:16 ` Heiko Carstens
@ 2026-05-05 10:31 ` Peter Zijlstra
2026-05-05 10:50 ` [tip: sched/urgent] " tip-bot2 for Zhan Xusheng
2026-05-05 14:13 ` tip-bot2 for Zhan Xusheng
1 sibling, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-05-05 10:31 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Zhan Xusheng, Ingo Molnar, Vincent Guittot, Juri Lelli,
linux-kernel, Zhan Xusheng, Dietmar Eggemann, Valentin Schneider,
Ben Segall, Steven Rostedt, Mel Gorman, hca, gor, agordeev,
borntraeger, svens, maddy, mpe, chleroy
On Mon, May 04, 2026 at 01:22:39PM +0200, Peter Zijlstra wrote:
> +#ifdef CONFIG_64BIT
> +#ifdef CONFIG_ARCH_SUPPORTS_INT128
> + /* An __int128 mult should be cheaper than a division. */
> + return avg >= (__int128)key * load;
> +#else
> + /*
> + * Since the divisor is @load, which is guaranteed positive, the
> + * inequality: avg >= key * load, can be rewritten into a division
> + * like: avg/load > key || (avg/load == key && avg%load >= 0).
> + */
> + s64 div = avg / load;
> + if (div > key)
> + return true;
> + return div == key && avg % load >= 0;
Bah, I could of course have put the __builtin_mul_overflow() thing in
here. That probably generates better code than this for those
architectures not doing SUPPORT_INT128.
Just got focussed on keeping the thing exact, but that isn't needed.
That overflow thing is good enough.
---
Subject: sched/fair: Fix overflow in vruntime_eligible()
From: Zhan Xusheng <zhanxusheng@xiaomi.com>
Date: Fri, 1 May 2026 12:40:06 +0200
Zhan Xusheng reported running into a sporadic s64 mult overflow in
vruntime_eligible().
When constructing a worst case scenario:
If you have cgroups, then you can have an entity of weight 2 (per
calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC)
* NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
The other extreme is 100*NICE_0_LOAD, thus you get:
{key, weight}[] := {
puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2 },
max: { 0, 100*NICE_0_LOAD },
}
The avg_vruntime() would end up being very close to 0 (which is
zero_vruntime), so no real help making that more accurate.
vruntime_eligible(puny) ends up with:
avg = 2 * puny.key (+ 0)
load = 2 + 100 * NICE_0_LOAD
avg >= puny.key * load
And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will
overflow s64.
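As a rough magnitude check (assuming HZ=100, so TICK_NSEC = 10^7 ns, a
100ms slice, and NICE_0_LOAD = 1 << 20):

  (10^8 + 10^7) * 2^20 * 2^20 * 100 ~= 1.2 * 10^22

while S64_MAX = 2^63 - 1 ~= 9.2 * 10^18, so the worst case product blows
past s64 by roughly three orders of magnitude.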
Zhan suggested using __builtin_mul_overflow(); however, after staring at
compiler output for various architectures on godbolt, it seems that using an
__int128 multiplication often results in better code.
Specifically, a number of architectures already compute the __int128 product to
determine the overflow. E.g. arm64 already uses the 'smulh' instruction. By
explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern,
which modern cores can fuse (armv8-a clang-22.1.0). x86_64 gets fewer branches
(no OF handling).
Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the
__builtin_mul_overflow() variant as a fallback.
[peterz: Changelog and __int128 bits]
Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")
Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com>
Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 39 ++++++++++++++++++++++++++++++++++-----
1 file changed, 34 insertions(+), 5 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,11 +882,11 @@ bool update_entity_lag(struct cfs_rq *cf
*
* lag_i >= 0 -> V >= v_i
*
- * \Sum (v_i - v)*w_i
- * V = ------------------ + v
+ * \Sum (v_i - v0)*w_i
+ * V = ------------------- + v0
* \Sum w_i
*
- * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ * lag_i >= 0 -> \Sum (v_i - v0)*w_i >= (v_i - v0)*(\Sum w_i)
*
* Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
* to the loss in precision caused by the division.
@@ -894,7 +894,7 @@ bool update_entity_lag(struct cfs_rq *cf
static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
{
struct sched_entity *curr = cfs_rq->curr;
- s64 avg = cfs_rq->sum_w_vruntime;
+ s64 key, avg = cfs_rq->sum_w_vruntime;
long load = cfs_rq->sum_weight;
if (curr && curr->on_rq) {
@@ -904,7 +904,36 @@ static int vruntime_eligible(struct cfs_
load += weight;
}
- return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+ key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+ /*
+ * The worst case term for @key includes 'TICK_NSEC * NICE_0_LOAD'
+ * and @load obviously includes NICE_0_LOAD. TICK_NSEC is around 24
+ * bits, while NICE_0_LOAD is 20 on 64bit and 10 otherwise.
+ *
+ * This means that on 64bit the product can be a full 64 bits, which
+ * overflows s64, while on 32bit it will only be 44 bits and should fit
+ * comfortably.
+ */
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+ /* This often results in simpler code than __builtin_mul_overflow(). */
+ return avg >= (__int128)key * load;
+#else
+ s64 rhs;
+ /*
+ * On overflow, the sign of key tells us the correct answer: a large
+ * positive key means vruntime >> V, so not eligible; a large negative
+ * key means vruntime << V, so eligible.
+ */
+ if (check_mul_overflow(key, load, &rhs))
+ return key <= 0;
+
+ return avg >= rhs;
+#endif
+#else /* 32bit */
+ return avg >= key * load;
+#endif
}
int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
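For reference, a standalone sketch (not part of the patch; the kernel's
check_mul_overflow() is replaced here by the underlying builtin) showing
why the overflow fallback returns 'key <= 0': with load > 0 the exact
product has the sign of key, so when it overflows s64 the comparison is
decided by sign(key) alone, regardless of avg:

#include <stdint.h>
#include <stdio.h>

static int eligible(int64_t avg, int64_t key, long load)
{
        int64_t rhs;

        if (__builtin_mul_overflow(key, load, &rhs))
                return key <= 0;
        return avg >= rhs;
}

int main(void)
{
        /* Huge positive key overflows: vruntime far right of V, not eligible. */
        printf("%d\n", eligible(0, INT64_MAX / 2, 1L << 20));   /* prints 0 */
        /* Huge negative key overflows: vruntime far left of V, eligible. */
        printf("%d\n", eligible(0, INT64_MIN / 2, 1L << 20));   /* prints 1 */
        return 0;
}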
^ permalink raw reply [flat|nested] 17+ messages in thread
* [tip: sched/urgent] sched/fair: Fix overflow in vruntime_eligible()
2026-05-05 10:31 ` Peter Zijlstra
@ 2026-05-05 10:50 ` tip-bot2 for Zhan Xusheng
2026-05-05 14:13 ` tip-bot2 for Zhan Xusheng
1 sibling, 0 replies; 17+ messages in thread
From: tip-bot2 for Zhan Xusheng @ 2026-05-05 10:50 UTC (permalink / raw)
To: linux-tip-commits
Cc: Zhan Xusheng, Zhan Xusheng, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: 416dd8fc352b6027210b42de5889ee280e4bc40c
Gitweb: https://git.kernel.org/tip/416dd8fc352b6027210b42de5889ee280e4bc40c
Author: Zhan Xusheng <zhanxusheng@xiaomi.com>
AuthorDate: Fri, 01 May 2026 12:40:06 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 05 May 2026 12:44:24 +02:00
sched/fair: Fix overflow in vruntime_eligible()
Zhan Xusheng reported running into a sporadic s64 mult overflow in
vruntime_eligible().
When constructing a worst case scenario:
If you have cgroups, then you can have an entity of weight 2 (per
calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC)
* NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
The other extreme is 100*NICE_0_LOAD, thus you get:
{key, weight}[] := {
puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2 },
max: { 0, 100*NICE_0_LOAD },
}
The avg_vruntime() would end up being very close to 0 (which is
zero_vruntime), so no real help making that more accurate.
vruntime_eligible(puny) ends up with:
avg = 2 * puny.key (+ 0)
load = 2 + 100 * NICE_0_LOAD
avg >= puny.key * load
And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will
overflow s64.
Zhan suggested using __builtin_mul_overflow(); however, after staring at
compiler output for various architectures on godbolt, it seems that using an
__int128 multiplication often results in better code.
Specifically, a number of architectures already compute the __int128 product to
determine the overflow. E.g. arm64 already uses the 'smulh' instruction. By
explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern,
which modern cores can fuse (armv8-a clang-22.1.0). x86_64 gets fewer branches
(no OF handling).
Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the
__builtin_mul_overflow() variant as a fallback.
[peterz: Changelog and __int128 bits]
Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")
Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com>
Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260505103155.GN3102924%40noisy.programming.kicks-ass.net
---
kernel/sched/fair.c | 39 ++++++++++++++++++++++++++++++++++-----
1 file changed, 34 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7289658..b91c8b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,11 +882,11 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
*
* lag_i >= 0 -> V >= v_i
*
- * \Sum (v_i - v)*w_i
- * V = ------------------ + v
+ * \Sum (v_i - v0)*w_i
+ * V = ------------------- + v0
* \Sum w_i
*
- * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ * lag_i >= 0 -> \Sum (v_i - v0)*w_i >= (v_i - v0)*(\Sum w_i)
*
* Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
* to the loss in precision caused by the division.
@@ -894,7 +894,7 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
{
struct sched_entity *curr = cfs_rq->curr;
- s64 avg = cfs_rq->sum_w_vruntime;
+ s64 key, avg = cfs_rq->sum_w_vruntime;
long load = cfs_rq->sum_weight;
if (curr && curr->on_rq) {
@@ -904,7 +904,36 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
load += weight;
}
- return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+ key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+ /*
+ * The worst case term for @key includes 'TICK_NSEC * NICE_0_LOAD'
+ * and @load obviously includes NICE_0_LOAD. TICK_NSEC is around 24
+ * bits, while NICE_0_LOAD is 20 on 64bit and 10 otherwise.
+ *
+ * This means that on 64bit the product can be a full 64 bits, which
+ * overflows s64, while on 32bit it will only be 44 bits and should fit
+ * comfortably.
+ */
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+ /* This often results in simpler code than __builtin_mul_overflow(). */
+ return avg >= (__int128)key * load;
+#else
+ s64 rhs;
+ /*
+ * On overflow, the sign of key tells us the correct answer: a large
+ * positive key means vruntime >> V, so not eligible; a large negative
+ * key means vruntime << V, so eligible.
+ */
+ if (check_mul_overflow(key, load, &rhs))
+ return key <= 0;
+
+ return avg >= rhs;
+#endif
+#else /* 32bit */
+ return avg >= key * load;
+#endif
}
int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [tip: sched/urgent] sched/fair: Fix overflow in vruntime_eligible()
2026-05-05 10:31 ` Peter Zijlstra
2026-05-05 10:50 ` [tip: sched/urgent] " tip-bot2 for Zhan Xusheng
@ 2026-05-05 14:13 ` tip-bot2 for Zhan Xusheng
1 sibling, 0 replies; 17+ messages in thread
From: tip-bot2 for Zhan Xusheng @ 2026-05-05 14:13 UTC (permalink / raw)
To: linux-tip-commits
Cc: Zhan Xusheng, Zhan Xusheng, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: 13abe4ce230f22a522facdaba87d1c137054c3a6
Gitweb: https://git.kernel.org/tip/13abe4ce230f22a522facdaba87d1c137054c3a6
Author: Zhan Xusheng <zhanxusheng@xiaomi.com>
AuthorDate: Fri, 01 May 2026 12:40:06 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 05 May 2026 16:03:12 +02:00
sched/fair: Fix overflow in vruntime_eligible()
Zhan Xusheng reported running into a sporadic s64 mult overflow in
vruntime_eligible().
When constructing a worst case scenario:
If you have cgroups, then you can have an entity of weight 2 (per
calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC)
* NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
The other extreme is 100*NICE_0_LOAD, thus you get:
{key, weight}[] := {
puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2 },
max: { 0, 100*NICE_0_LOAD },
}
The avg_vruntime() would end up being very close to 0 (which is
zero_vruntime), so no real help making that more accurate.
vruntime_eligible(puny) ends up with:
avg = 2 * puny.key (+ 0)
load = 2 + 100 * NICE_0_LOAD
avg >= puny.key * load
And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will
overflow s64.
Zhan suggested using __builtin_mul_overflow(); however, after staring at
compiler output for various architectures on godbolt, it seems that using an
__int128 multiplication often results in better code.
Specifically, a number of architectures already compute the __int128 product to
determine the overflow. E.g. arm64 already uses the 'smulh' instruction. By
explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern,
which modern cores can fuse (armv8-a clang-22.1.0). x86_64 gets fewer branches
(no OF handling).
Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the
__builtin_mul_overflow() variant as a fallback.
[peterz: Changelog and __int128 bits]
Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")
Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com>
Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260505103155.GN3102924%40noisy.programming.kicks-ass.net
---
kernel/sched/fair.c | 39 ++++++++++++++++++++++++++++++++++-----
1 file changed, 34 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7289658..b91c8b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,11 +882,11 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
*
* lag_i >= 0 -> V >= v_i
*
- * \Sum (v_i - v)*w_i
- * V = ------------------ + v
+ * \Sum (v_i - v0)*w_i
+ * V = ------------------- + v0
* \Sum w_i
*
- * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ * lag_i >= 0 -> \Sum (v_i - v0)*w_i >= (v_i - v0)*(\Sum w_i)
*
* Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
* to the loss in precision caused by the division.
@@ -894,7 +894,7 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
{
struct sched_entity *curr = cfs_rq->curr;
- s64 avg = cfs_rq->sum_w_vruntime;
+ s64 key, avg = cfs_rq->sum_w_vruntime;
long load = cfs_rq->sum_weight;
if (curr && curr->on_rq) {
@@ -904,7 +904,36 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
load += weight;
}
- return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+ key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+ /*
+ * The worst case term for @key includes 'TICK_NSEC * NICE_0_LOAD'
+ * and @load obviously includes NICE_0_LOAD. TICK_NSEC is around 24
+ * bits, while NICE_0_LOAD is 20 on 64bit and 10 otherwise.
+ *
+ * This means that on 64bit the product can be a full 64 bits, which
+ * overflows s64, while on 32bit it will only be 44 bits and should fit
+ * comfortably.
+ */
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+ /* This often results in simpler code than __builtin_mul_overflow(). */
+ return avg >= (__int128)key * load;
+#else
+ s64 rhs;
+ /*
+ * On overflow, the sign of key tells us the correct answer: a large
+ * positive key means vruntime >> V, so not eligible; a large negative
+ * key means vruntime << V, so eligible.
+ */
+ if (check_mul_overflow(key, load, &rhs))
+ return key <= 0;
+
+ return avg >= rhs;
+#endif
+#else /* 32bit */
+ return avg >= key * load;
+#endif
}
int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
^ permalink raw reply related [flat|nested] 17+ messages in thread