From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from casper.infradead.org (casper.infradead.org [90.155.50.34])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0515E3B27EC
	for <linux-kernel@vger.kernel.org>; Tue,  7 Apr 2026 12:01:09 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.50.34
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1775563272; cv=none; b=s88F9z4dW5MTlCga7Ke37Ylesy0Q1LJb0MQ8g78fzEvAoxFbT8yxY8cTsUQBoH6tYZIuhBtRj+lS5DaEDPHZzQPwCEe7Ohq2EPC2BB0H1NBQ9G9V1QJv9Q7OmWM860QsQufQ52H1cBMyM5gwVWYf+qz2tR9QhGR2h8tMMTj+OAQ=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1775563272; c=relaxed/simple;
	bh=6iYpI+Y3RMVOjokJs1x978y1Z8+y/eUD2XEfG6Hs/Lg=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=ieb7tTuZOVw/kPracHei9ZPyWUkgAOAJQ8xy4JC96E27lmZifz5tqUYj8k/QIcVKhiD9nAePya9e3llzREfbPkZyxr0YsJKMZV5e+ZJNViKE5Hoz8qu5fgDVJ9NN/574wx9RXeVq43b+XLBiWTzEy1hRGBNYgtHVqsO6QVrqVjA=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=miqh0J5N; arc=none smtp.client-ip=90.155.50.34
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=infradead.org
Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=infradead.org
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="miqh0J5N"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version:
	References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To:
	Content-Transfer-Encoding:Content-ID:Content-Description;
	bh=T6FmB+W/Qsg63RvgEzlbBnMjn82vRxpziL4P0ly3Gdc=; b=miqh0J5NWMSsIORUor+N0JzKBj
	2OyRhlel8AZGptCAERN77Sm/1P/Grb4lEDVGzaKRbJzr1Zi9rXRTgCkgokr6P1AYV3XxjhJmNjVqA
	TIXtpCROiZsshzW3hYfmEpugAwJQQtA3qbLgbu67mrvlRRezkcXuB/Pt2NAazTz0+sYctyPHAHY9p
	FdOHodCuL5N9ptXvn/fExI6WjDFbZX7UENeMNPFlCm6W1HryqQRMMdRRebsljnvEqQMKrFtniFLK/
	XuwRagQXaXm8w0bHdibxIUchDIGGmJ2MyJx1y9PIcpkE42l4Sf1D+tTOEm1BDRlWwOyKvjQhGmEpo
	POPCWPlg==;
Received: from 77-249-17-252.cable.dynamic.v4.ziggo.nl ([77.249.17.252] helo=noisy.programming.kicks-ass.net)
	by casper.infradead.org with esmtpsa (Exim 4.98.2 #2 (Red Hat Linux))
	id 1wA56z-00000003Xmt-3G1W;
	Tue, 07 Apr 2026 12:00:53 +0000
Received: by noisy.programming.kicks-ass.net (Postfix, from userid 1000)
	id 93C933005E5; Tue, 07 Apr 2026 14:00:52 +0200 (CEST)
Date: Tue, 7 Apr 2026 14:00:52 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>, mingo@kernel.org,
	juri.lelli@redhat.com, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, linux-kernel@vger.kernel.org,
	wangtao554@huawei.com, quzicheng@huawei.com, dsmythies@telus.net,
	shubhang@os.amperecomputing.com
Subject: Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
Message-ID: <20260407120052.GG3738010@noisy.programming.kicks-ass.net>
References: <20260219075840.162631716@infradead.org>
 <20260219080624.942813440@infradead.org>
 <CAKfTPtAH3eT3nKHPCLQwYZsgtpANsd4qJRj985u0hXJ4b-dSrw@mail.gmail.com>
 <20260223115100.GI2995752@noisy.programming.kicks-ass.net>
 <0d3680c3-3e17-47b8-8fdb-0cc1f97ffce0@amd.com>
 <99fa12f9-71d3-4766-8742-a3adc9ce4271@amd.com>
 <20260402102215.GT3738010@noisy.programming.kicks-ass.net>
 <e9add3d3-fd43-410b-9dd0-2fb3388bb519@amd.com>
 <b004fa56-b4b8-4cd9-9431-a576f629f31d@amd.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <b004fa56-b4b8-4cd9-9431-a576f629f31d@amd.com>

On Fri, Apr 03, 2026 at 09:32:22AM +0530, K Prateek Nayak wrote:
> On 4/2/2026 4:26 PM, K Prateek Nayak wrote:
> >> That is, something like the below... But with a comment ofc :-)
> >>
> >> Does that make sense?
> > 
> > Let me go queue an overnight test to see if I trip that warning or
> > not.
> 
> Didn't trip any warning and the machine is still up and running
> after 15 Hours so feel free to include:
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> Perhaps the comment can read something like:
> 
> 		/*
> 		 * A heavy entity can pull the avg_vruntime close to its
> 		 * vruntime post enqueue but the zero_vruntime point is
> 		 * only updated at the next update_deadline() / enqueue
> 		 * / dequeue.
> 		 *
> 		 * Until then, the sum_w_vruntime grow quadratically,
> 		 * proportional to the entity's weight (w_i) as:
> 		 *
> 		 *   sum_w_vruntime -= (lag_i * (W + w_i) / W) * w_i
> 		 *
> 		 * If w_i > W, it is beneficial to pull the
> 		 * zero_vruntime towards the entity's vruntime (V_i)
> 		 * since the sum_w_vruntime would only grow  by
> 		 * (lag_i * W) which consumes lesser bits than leaving
> 		 * the zero_vruntime at the pre-enqueue avg_vruntime.
> 		 */
> 		if (weight > load)
> 			update_zero = true;
> 
> Feel free to reword as you see fit :-)

I've made it like so. You did all the hard work after all. Thanks!

---
Subject: sched/fair: Avoid overflow in enqueue_entity()
From: K Prateek Nayak <kprateek.nayak@amd.com>
Date: Tue Apr  7 13:36:17 CEST 2026

Here is one scenario which was triggered when running:

    stress-ng --yield=32 -t 10000000s&
    while true; do perf bench sched messaging -p -t -l 100000 -g 16; done

on a 256-CPU machine, about an hour into the run:

    __enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
    cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
    cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)

The above comes from __enqueue_entity() after a place_entity(). Breaking
this down:

    vlag_initial = 57498
    vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754

    vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
    entity_key(se, cfs_rq) = -141,245,081,754

Now, multiplying the entity_key by its own weight results in
5,608,800,059,305,154,560 (same as what overflow_mul() suggests), but
in Python, without overflow, this would be -12,837,944,014,404,397,056,
which is below the signed 64-bit minimum.
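
The arithmetic above can be replayed outside the kernel. The sketch below
is only an illustration built from the values in the log; wrap_s64() is a
hypothetical helper that emulates truncation to a signed 64-bit value,
which is assumed here to match what the 64-bit multiplication ends up
storing.

```python
# Values taken from the debug output above.
ZERO_VRUNTIME = 3809707759657809
VLAG_INITIAL = 57498
LOAD = 37              # weight of cfs_rq->curr
WEIGHT = 90891264      # weight of the entity being placed

S64_MIN, S64_MAX = -(1 << 63), (1 << 63) - 1


def wrap_s64(value: int) -> int:
    """Hypothetical helper: truncate to 64 bits, reinterpret as signed."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value > S64_MAX else value


# place_entity() scales the lag by (load + weight) / load.
vlag = VLAG_INITIAL * (LOAD + WEIGHT) // LOAD
vruntime = ZERO_VRUNTIME - vlag
entity_key = vruntime - ZERO_VRUNTIME

exact = entity_key * WEIGHT      # arbitrary precision, no overflow
wrapped = wrap_s64(exact)        # what a signed 64-bit product would hold

print(vlag)             # 141245081754
print(entity_key)       # -141245081754
print(exact)            # -12837944014404397056
print(exact < S64_MIN)  # True
print(wrapped)          # 5608800059305154560
```

The wrapped value matches the overflow_mul() figure in the log, and the
exact value matches the Python result quoted above.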

Avoid the overflow (without doing the division for avg_vruntime()) by
moving zero_vruntime to the new entity when it is heavier than the load
already on the runqueue (weight > load).
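
To see why moving zero_vruntime helps, compare the weighted keys for the
two candidate reference points. This is a rough sketch that reuses the
numbers from the log and, purely for illustration, treats curr as if it
were part of the sum (the log shows sum_weight(0), so this is a
simplification); sum_w_vruntime() below is a stand-in written for this
example, not the kernel code.

```python
S64_MAX = (1 << 63) - 1

# (vruntime, weight) pairs taken from the log above.
light = (3809707759657809, 37)          # cfs_rq->curr
heavy = (3809566514576055, 90891264)    # newly placed entity
entities = [light, heavy]


def sum_w_vruntime(v0: int) -> int:
    """Illustrative stand-in: sum of (v_i - v0) * w_i, exact integers."""
    return sum((v - v0) * w for v, w in entities)


near_light = sum_w_vruntime(light[0])   # v0 at the light entity
near_heavy = sum_w_vruntime(heavy[0])   # v0 at the heavy entity

print(abs(near_light) > S64_MAX)   # True: does not fit in an s64
print(near_heavy)                  # 5226068024898: fits easily

# Both reference points describe the same weighted average.
total_weight = light[1] + heavy[1]
avg_from_light = light[0] + near_light // total_weight
avg_from_heavy = heavy[0] + near_heavy // total_weight
print(avg_from_light == avg_from_heavy)  # True
print(avg_from_heavy - heavy[0])         # 57498
```

With the reference point at the heavy entity, the large weight multiplies
a zero difference and the large difference multiplies the small weight,
so the sum stays many orders of magnitude below the 64-bit limit while
the resulting average is unchanged.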

Fixes: 4823725d9d1d ("sched/fair: Increase weight bits for avg_vruntime")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
[peterz: suggested 'weight > load' condition]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5352,6 +5352,7 @@ static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 vslice, vruntime = avg_vruntime(cfs_rq);
+	bool update_zero = false;
 	s64 lag = 0;
 
 	if (!se->custom_slice)
@@ -5368,7 +5369,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	 */
 	if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
-		long load;
+		long load, weight;
 
 		lag = se->vlag;
 
@@ -5428,14 +5429,41 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		if (curr && curr->on_rq)
 			load += avg_vruntime_weight(cfs_rq, curr->load.weight);
 
-		lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
+		weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+		lag *= load + weight;
 		if (WARN_ON_ONCE(!load))
 			load = 1;
 		lag = div64_long(lag, load);
+
+		/*
+		 * A heavy entity (relative to the tree) will pull the
+		 * avg_vruntime close to its vruntime position on enqueue. But
+		 * the zero_vruntime point is only updated at the next
+		 * update_deadline()/place_entity()/update_entity_lag().
+		 *
+		 * Specifically (see the comment near avg_vruntime_weight()):
+		 *
+		 *   sum_w_vruntime = \Sum (v_i - v0) * w_i
+		 *
+		 * Note that if v0 is near a light entity, both terms will be
+		 * small for the light entity, while in that case both terms
+		 * are large for the heavy entity, leading to risk of
+		 * overflow.
+		 *
+		 * OTOH if v0 is near the heavy entity, then the difference is
+		 * larger for the light entity, but the factor is small, while
+		 * for the heavy entity the difference is small but the factor
+		 * is large, which avoids the multiplication overflow.
+		 */
+		if (weight > load)
+			update_zero = true;
 	}
 
 	se->vruntime = vruntime - lag;
 
+	if (update_zero)
+		update_zero_vruntime(cfs_rq, -lag);
+
 	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
 		se->deadline += se->vruntime;
 		se->rel_deadline = 0;