From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 1 May 2026 12:40:06 +0200
From: Peter Zijlstra
To: K Prateek Nayak
Cc: Zhan Xusheng, Ingo Molnar, Vincent Guittot, Juri Lelli,
	linux-kernel@vger.kernel.org, Zhan Xusheng, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman
Subject: Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
Message-ID: <20260501104006.GA3102624@noisy.programming.kicks-ass.net>
References: <20260415145742.10359-1-zhanxusheng@xiaomi.com>
 <20260428144951.121765-1-zhanxusheng@xiaomi.com>
 <57e94d7e-b8fb-49bc-a914-8a7401e43af3@amd.com>
 <20260428173521.GK3126523@noisy.programming.kicks-ass.net>
 <4870ea19-78d4-44b2-9f18-14c3f8782726@amd.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4870ea19-78d4-44b2-9f18-14c3f8782726@amd.com>

On Wed, Apr 29, 2026 at 10:33:48AM +0530, K Prateek Nayak wrote:
> > Anyway... let me construct a worst case.
> >
> > So if you have this cgroup crap on, then you can have an entity of
> > weight 2, and vlag should then be bounded by: (slice+TICK_NSEC) *
> > NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
> >
> > The other end is 100*NICE_0_LOAD, so let's wake that, then you get:
> >
> >   {key, weight}[] := {
> >     puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2 },
> >     max:  { 0, 100*NICE_0_LOAD },
> >   }
>
> So MAX_SHARES is (1UL << 18) and we then scale it up so the weight can
> go up to (1 << 28) I suppose in a UP setting.
>
> So let us assume there is a single task with the largest custom slice
> on each cgroup, making the min_slice equal to 100ms. Things start off
> in such a way that puny, having a vruntime of say -1, is picked over
> max, which had a vruntime of 0, for the sake of simplicity.
>
> Deadline then becomes:
>
>   puny->deadline = -1 + __calc_delta(100ms, NICE_0_LOAD, 2)
>
> which is 52428800000000 - 1.
>
> At 10HZ tick + RUN_TO_PARITY, we get 10ms error so that puts the
> entity_key() at 57671680000000 (46 bits)
>
> >
> > The avg_vruntime() would end up being very close to 0 (which is
> > zero_vruntime), so no real help making that more accurate.
> >
> > vruntime_eligible(puny) ends up with:
> >
> >   avg = 2 * puny.key (+ 0)
> >   weight = 2 + 100 * NICE_0_LOAD
> >
> >   avg >= puny.key * weight
> >
> > And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100
> > and yes, I suppose that can exceed 64bit :-(
>
> Then the worst case right side would be:
>
>   57671680000000 * load /* load = ((1 << 28) + 2) */
>
> and yes that goes beyond 64 bits at 15481123719086080000000.
>
> I'm running with this configuration:
>
>   echo 2 > /sys/fs/cgroup/cg0/cpu.weight
>   echo 10000 > /sys/fs/cgroup/cg1/cpu.weight
>
> A weight of 2 actually gets scaled up to 20480 in the UP scenario so I
> try to make it even smaller with SMP and wait until it gets picked
> again.
Make sure all of the actual activity of that cgroup happens 'elsewhere'
and you can get it down to an actual 2 per calc_group_shares().

> without any splat so I'm not sure if there is something that prevents
> a possible crash since a weight of 104857600 should have definitely
> made that entity_key() overflow.

Right.

Anyway, I had a poke around with godbolt, and the below seems to
generate the best code for things like x86_64 and arm64. Specifically,
__builtin_mul_overflow() already has to compute the 128-bit product
anyway on most architectures, so using that directly leads to saner asm
and easier to understand code.

AFAICT HPPA64 is the only 64-bit architecture that doesn't implement
__int128 and will thus be demoted to doing what we do on 32-bit.

Now all I need to do is write a coherent Changelog to go with it ...
*sigh*

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..214fe9d99834 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -904,7 +904,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= (s64)(vruntime - cfs_rq->zero_vruntime) * load;
+	return avg >= (sched_double_long_t)(vruntime - cfs_rq->zero_vruntime) * load;
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d..f15f4468efe5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -146,7 +146,7 @@ extern struct list_head asym_cap_list;
  * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
  * increase coverage and consistency always enable it on 64-bit platforms.
  */
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) && defined(__SIZEOF_INT128__)
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w)	\
@@ -157,10 +157,12 @@ extern struct list_head asym_cap_list;
 	__w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT);	\
 	__w;						\
 })
+typedef __int128 sched_double_long_t;
 #else
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
+typedef s64 sched_double_long_t;
 #endif
 
 /*