From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 4 May 2026 13:22:39 +0200
From: Peter Zijlstra
To: K Prateek Nayak
Cc: Zhan Xusheng, Ingo Molnar, Vincent Guittot, Juri Lelli,
	linux-kernel@vger.kernel.org, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Steven Rostedt, Mel Gorman, hca@linux.ibm.com,
	gor@linux.ibm.com, agordeev@linux.ibm.com, borntraeger@linux.ibm.com,
	svens@linux.ibm.com, maddy@linux.ibm.com, mpe@ellerman.id.au,
	chleroy@kernel.org
Subject: Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
Message-ID: <20260504112239.GA1174357@noisy.programming.kicks-ass.net>
References: <20260415145742.10359-1-zhanxusheng@xiaomi.com>
 <20260428144951.121765-1-zhanxusheng@xiaomi.com>
 <57e94d7e-b8fb-49bc-a914-8a7401e43af3@amd.com>
 <20260428173521.GK3126523@noisy.programming.kicks-ass.net>
 <4870ea19-78d4-44b2-9f18-14c3f8782726@amd.com>
 <20260501104006.GA3102624@noisy.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20260501104006.GA3102624@noisy.programming.kicks-ass.net>

On Fri, May 01, 2026 at 12:40:06PM +0200, Peter
Zijlstra wrote:
> Anyway, I had a poke around with godbolt, and the below seems to
> generate the best code for things like x86_64 and arm64.
>
> Specifically, the __builtin_mul_overflow() already has to compute the
> 128 bit product anyway for most architectures, so using that directly
> then leads to saner asm and easier to understand code.
>
> AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> __int128 and will thus be demoted to doing what we do on 32bit.

I forgot we had ARCH_SUPPORTS_INT128, and I suppose this had better
check that.

Now, s390 is a bit weird and excludes GCC even though that definitely
supports __int128. Supposedly there was an issue, but perhaps modern GCC
has this fixed? Also, PowerPC does not seem to set this at all.

Anyway, I've then ended up with this; it keeps all 64bit architectures
on the extra precision, but incurs an extra division for all that do not
have __int128 selected.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..01405f6011b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,11 +882,11 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
  *
  * lag_i >= 0 -> V >= v_i
  *
- *     \Sum (v_i - v)*w_i
- * V = ------------------ + v
+ *     \Sum (v_i - v0)*w_i
+ * V = ------------------- + v0
  *          \Sum w_i
  *
- * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ * lag_i >= 0 -> \Sum (v_i - v0)*w_i >= (v_i - v0)*(\Sum w_i)
  *
  * Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
  * to the loss in precision caused by the division.
@@ -894,7 +894,7 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	s64 avg = cfs_rq->sum_w_vruntime;
+	s64 key, avg = cfs_rq->sum_w_vruntime;
 	long load = cfs_rq->sum_weight;
 
 	if (curr && curr->on_rq) {
@@ -904,7 +904,35 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+	key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+	/*
+	 * The worst case term for @key includes 'NSEC_TICK * NICE_0_LOAD'
+	 * and @load obviously includes NICE_0_LOAD. NSEC_TICK is around 24
+	 * bits, while NICE_0_LOAD is 20 on 64bit and 10 otherwise.
+	 *
+	 * This gives that on 64bit the product will be at least 64bit which
+	 * overflows s64, while on 32bit it will only be 44bits and should fit
+	 * comfortably.
+	 */
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+	/* An __int128 mult should be cheaper than a division. */
+	return avg >= (__int128)key * load;
+#else
+	/*
+	 * Since the divisor is @load, which is guaranteed positive, the
+	 * inequality: avg >= key * load, can be rewritten into a division
+	 * like: avg/load > key || (avg/load == key && avg%load >= 0).
+	 */
+	s64 div = avg / load;
+	if (div > key)
+		return true;
+	return div == key && avg % load >= 0;
+#endif
+#else /* 32bit */
+	return avg >= key * load;
+#endif
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)