Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Peter Zijlstra <peterz@infradead.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Zhan Xusheng <zhanxusheng1024@gmail.com>,
	Ingo Molnar <mingo@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	linux-kernel@vger.kernel.org,
	Zhan Xusheng <zhanxusheng@xiaomi.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Ben Segall <bsegall@google.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mel Gorman <mgorman@suse.de>,
	hca@linux.ibm.com, gor@linux.ibm.com, agordeev@linux.ibm.com,
	borntraeger@linux.ibm.com, svens@linux.ibm.com,
	maddy@linux.ibm.com, mpe@ellerman.id.au, chleroy@kernel.org
Subject: Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
Date: Tue, 5 May 2026 12:31:55 +0200	[thread overview]
Message-ID: <20260505103155.GN3102924@noisy.programming.kicks-ass.net> (raw)
In-Reply-To: <20260504112239.GA1174357@noisy.programming.kicks-ass.net>

On Mon, May 04, 2026 at 01:22:39PM +0200, Peter Zijlstra wrote:

> +#ifdef CONFIG_64BIT
> +#ifdef CONFIG_ARCH_SUPPORTS_INT128
> +	/* An __int128 mult should be cheaper than a division. */
> +	return avg >= (__int128)key * load;
> +#else
> +	/*
> +	 * Since the divisor is @load, which is guaranteed positive, the
> +	 * inequality: avg >= key * load, can be rewritten into a division
> +	 * like: avg/load > key || (avg/load == key && avg%load >= 0).
> +	 */
> +	s64 div = avg / load;
> +	if (div > key)
> +		return true;
> +	return div == key && avg % load >= 0;

Bah, I could of course have put the __builtin_mul_overflow() thing in
here. That probably generates better code than this for those
architectures not doing SUPPORT_INT128.

Just got focussed on keeping the thing exact, but that isn't needed.
That overflow thing is good enough.

---
Subject: sched/fair: Fix overflow in vruntime_eligible()
From: Zhan Xusheng <zhanxusheng@xiaomi.com>
Date: Fri, 1 May 2026 12:40:06 +0200

Zhan Xusheng reported running into sporadic a s64 mult overflow in
vruntime_eligible().

When constructing a worst case scenario:

If you have cgroups, then you can have an entity of weight 2 (per
calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC)
* NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().

The other extreme is 100*NICE_0_LOAD, thus you get:

{key, weight}[] := {
  puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2               },
  max:  { 0,                                 100*NICE_0_LOAD },
}

The avg_vruntime() would end up being very close to 0 (which is
zero_vruntime), so no real help making that more accurate.

vruntime_eligible(puny) ends up with:

 avg  = 2 * puny.key (+ 0)
 load = 2 + 100 * NICE_0_LOAD

 avg >= puny.key * load

And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will
overflow s64.

Zhan suggested using __builtin_mul_overflow(), however after staring at
compiler output for various architectures using godbolt, it seems that using an
__int128 multiplication often results in better code.

Specifically, a number of architectures already compute the __int128 product to
determine the overflow. Eg. arm64 already has the 'smulh' instruction used. By
explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern,
which modern cores can fuse (armv8-a clang-22.1.0). x86_64 has less branches
(no OF handling).

Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the
__builtin_mul_overflow() variant as a fallback.

[peterz: Changelog and __int128 bits]
Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")
Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com>
Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,11 +882,11 @@ bool update_entity_lag(struct cfs_rq *cf
  *
  * lag_i >= 0 -> V >= v_i
  *
- *     \Sum (v_i - v)*w_i
- * V = ------------------ + v
+ *     \Sum (v_i - v0)*w_i
+ * V = ------------------- + v0
  *          \Sum w_i
  *
- * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ * lag_i >= 0 -> \Sum (v_i - v0)*w_i >= (v_i - v0)*(\Sum w_i)
  *
  * Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
  *       to the loss in precision caused by the division.
@@ -894,7 +894,7 @@ bool update_entity_lag(struct cfs_rq *cf
 static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	s64 avg = cfs_rq->sum_w_vruntime;
+	s64 key, avg = cfs_rq->sum_w_vruntime;
 	long load = cfs_rq->sum_weight;
 
 	if (curr && curr->on_rq) {
@@ -904,7 +904,36 @@ static int vruntime_eligible(struct cfs_
 		load += weight;
 	}
 
-	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+	key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+	/*
+	 * The worst case term for @key includes 'NSEC_TICK * NICE_0_LOAD'
+	 * and @load obviously includes NICE_0_LOAD. NSEC_TICK is around 24
+	 * bits, while NICE_0_LOAD is 20 on 64bit and 10 otherwise.
+	 *
+	 * This gives that on 64bit the product will be at least 64bit which
+	 * overflows s64, while on 32bit it will only be 44bits and should fit
+	 * comfortably.
+	 */
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+	/* This often results in simpler code than __builtin_mul_overflow(). */
+	return avg >= (__int128)key * load;
+#else
+	s64 rhs;
+	/*
+	 * On overflow, the sign of key tells us the correct answer: a large
+	 * positive key means vruntime >> V, so not eligible; a large negative
+	 * key means vruntime << V, so eligible.
+	 */
+	if (check_mul_overflow(key, load, &rhs))
+		return key <= 0;
+
+	return avg >= rhs;
+#endif
+#else /* 32bit */
+	return avg >= key * load;
+#endif
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)

next prev parent reply	other threads:[~2026-05-05 10:32 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-15 14:57 [PATCH] sched/fair: Fix overflow in vruntime_eligible() Zhan Xusheng
2026-04-28 14:49 ` [PATCH RESEND] " Zhan Xusheng
2026-04-28 16:17   ` K Prateek Nayak
2026-04-28 17:35     ` Peter Zijlstra
2026-04-29  5:03       ` K Prateek Nayak
2026-04-29  7:31         ` Zhan Xusheng
2026-05-01 10:40         ` Peter Zijlstra
2026-05-01 10:48           ` Peter Zijlstra
2026-05-01 13:05           ` David Laight
2026-05-01 13:43             ` Peter Zijlstra
2026-05-04 11:22           ` Peter Zijlstra
2026-05-04 13:16             ` Heiko Carstens
2026-05-04 14:57               ` Peter Zijlstra
2026-05-04 17:40               ` Stefan Schulze Frielinghaus
2026-05-05 10:31             ` Peter Zijlstra [this message]
2026-05-05 10:50               ` [tip: sched/urgent] " tip-bot2 for Zhan Xusheng
2026-05-05 14:13               ` tip-bot2 for Zhan Xusheng
2026-05-06  3:47               ` [PATCH RESEND] " Zhan Xusheng
2026-05-06 15:51               ` [tip: sched/urgent] " tip-bot2 for Zhan Xusheng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260505103155.GN3102924@noisy.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=agordeev@linux.ibm.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=bsegall@google.com \
    --cc=chleroy@kernel.org \
    --cc=dietmar.eggemann@arm.com \
    --cc=gor@linux.ibm.com \
    --cc=hca@linux.ibm.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maddy@linux.ibm.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=mpe@ellerman.id.au \
    --cc=rostedt@goodmis.org \
    --cc=svens@linux.ibm.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=zhanxusheng1024@gmail.com \
    --cc=zhanxusheng@xiaomi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.