Date: Tue, 7 Apr 2026 14:00:52 +0200
From: Peter Zijlstra
To: K Prateek Nayak
Cc: Vincent Guittot, mingo@kernel.org, juri.lelli@redhat.com,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
    mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org,
    wangtao554@huawei.com, quzicheng@huawei.com, dsmythies@telus.net,
    shubhang@os.amperecomputing.com
Subject: Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
Message-ID: <20260407120052.GG3738010@noisy.programming.kicks-ass.net>
References: <20260219075840.162631716@infradead.org>
    <20260219080624.942813440@infradead.org>
    <20260223115100.GI2995752@noisy.programming.kicks-ass.net>
    <0d3680c3-3e17-47b8-8fdb-0cc1f97ffce0@amd.com>
    <99fa12f9-71d3-4766-8742-a3adc9ce4271@amd.com>
    <20260402102215.GT3738010@noisy.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Apr 03, 2026 at 09:32:22AM +0530, K Prateek Nayak wrote:
> On 4/2/2026 4:26 PM, K Prateek Nayak wrote:
> >> That is, something like the below...
> >> But with a comment ofc :-)
> >>
> >> Does that make sense?
> >
> > Let me go queue an overnight test to see if I trip that warning or
> > not.
>
> Didn't trip any warning and the machine is still up and running
> after 15 Hours so feel free to include:
>
> Tested-by: K Prateek Nayak
>
> Perhaps the comment can read something like:
>
> 	/*
> 	 * A heavy entity can pull the avg_vruntime close to its
> 	 * vruntime post enqueue but the zero_vruntime point is
> 	 * only updated at the next update_deadline() / enqueue
> 	 * / dequeue.
> 	 *
> 	 * Until then, the sum_w_vruntime grow quadratically,
> 	 * proportional to the entity's weight (w_i) as:
> 	 *
> 	 *   sum_w_vruntime -= (lag_i * (W + w_i) / W) * w_i
> 	 *
> 	 * If w_i > W, it is beneficial to pull the
> 	 * zero_vruntime towards the entity's vruntime (V_i)
> 	 * since the sum_w_vruntime would only grow by
> 	 * (lag_i * W) which consumes lesser bits than leaving
> 	 * the zero_vruntime at the pre-enqueue avg_vruntime.
> 	 */
> 	if (weight > load)
> 		update_zero = true;
>
> Feel free to reword as you see fit :-)

I've made it like so. You did all the hard work after all. Thanks!

---
Subject: sched/fair: Avoid overflow in enqueue_entity()
From: K Prateek Nayak
Date: Tue Apr 7 13:36:17 CEST 2026

Here is one scenario which was triggered when running:

    stress-ng --yield=32 -t 10000000s&
    while true; do perf bench sched messaging -p -t -l 100000 -g 16; done

on a 256-CPU machine about an hour into the run:

    __enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
    cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
    cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)

The above comes from __enqueue_entity() after a place_entity().
Breaking this down:

    vlag_initial = 57498
    vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754

    vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
    entity_key(se, cfs_rq) = -141,245,081,754

Now, multiplying the entity_key by its own weight results in
5,608,800,059,305,154,560 (same as what overflow_mul() suggests), but
in Python, without overflow, this would be:

    -12,837,944,014,404,397,056

Avoid the overflow (without doing the division for avg_vruntime()) by
moving zero_vruntime to the new entity when it is heavier.

Fixes: 4823725d9d1d ("sched/fair: Increase weight bits for avg_vruntime")
Signed-off-by: K Prateek Nayak
[peterz: suggested 'weight > load' condition]
Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5352,6 +5352,7 @@ static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 vslice, vruntime = avg_vruntime(cfs_rq);
+	bool update_zero = false;
 	s64 lag = 0;

 	if (!se->custom_slice)
@@ -5368,7 +5369,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	 */
 	if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
-		long load;
+		long load, weight;

 		lag = se->vlag;

@@ -5428,14 +5429,41 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		if (curr && curr->on_rq)
 			load += avg_vruntime_weight(cfs_rq, curr->load.weight);

-		lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
+		weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+		lag *= load + weight;
 		if (WARN_ON_ONCE(!load))
 			load = 1;
 		lag = div64_long(lag, load);
+
+		/*
+		 * A heavy entity (relative to the tree) will pull the
+		 * avg_vruntime close to its vruntime position on enqueue. But
+		 * the zero_vruntime point is only updated at the next
+		 * update_deadline()/place_entity()/update_entity_lag().
+		 *
+		 * Specifically (see the comment near avg_vruntime_weight()):
+		 *
+		 *   sum_w_vruntime = \Sum (v_i - v0) * w_i
+		 *
+		 * Note that if v0 is near a light entity, both terms will be
+		 * small for the light entity, while in that case both terms
+		 * are large for the heavy entity, leading to risk of
+		 * overflow.
+		 *
+		 * OTOH if v0 is near the heavy entity, then the difference is
+		 * larger for the light entity, but the factor is small, while
+		 * for the heavy entity the difference is small but the factor
+		 * is large. Avoiding the multiplication overflow.
+		 */
+		if (weight > load)
+			update_zero = true;
 	}

 	se->vruntime = vruntime - lag;

+	if (update_zero)
+		update_zero_vruntime(cfs_rq, -lag);
+
 	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
 		se->deadline += se->vruntime;
 		se->rel_deadline = 0;
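
For anyone who wants to replay the numbers from the changelog, here is a rough Python sketch (Python because the changelog already uses it for the non-wrapping product). The helper wrap_s64() and all variable names are illustrative, not kernel identifiers. The last part assumes that update_zero_vruntime(cfs_rq, -lag) simply shifts the zero point by -lag so that it lands on the newly placed entity; that is one reading of the patch, not something checked against the tree.

```python
# Replay of the arithmetic from the trace, using Python's unbounded ints.
S64_MIN = -(1 << 63)
S64_MAX = (1 << 63) - 1


def wrap_s64(x):
    """Wrap an arbitrary int to a two's-complement signed 64-bit value."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x > S64_MAX else x


# Values copied from the trace in the changelog.
vlag_initial = 57498
load = 37                          # weight of cfs_rq->curr
weight = 90891264                  # weight of the entity being placed
zero_vruntime = 3809707759657809   # also curr's vruntime in the trace

# place_entity(): scale the lag by (W + w_i) / W.
lag = vlag_initial * (load + weight) // load
vruntime = zero_vruntime - lag
key = vruntime - zero_vruntime     # entity_key(se, cfs_rq)

exact = key * weight               # what the product should be
wrapped = wrap_s64(exact)          # what a 64-bit multiply yields

assert key == -141245081754                  # entity_key in the trace
assert wrapped == 5608800059305154560        # overflow_mul() in the trace
assert not (S64_MIN <= exact <= S64_MAX)     # the exact product does not fit

# Assumed effect of the fix: zero point moved onto the heavy entity.
new_zero = zero_vruntime - lag
key_se = vruntime - new_zero       # heavy entity now sits at the zero point
key_curr = zero_vruntime - new_zero
sum_after = key_se * weight + key_curr * load

print("exact product :", exact)
print("wrapped (s64) :", wrapped)
print("sum after move:", sum_after, "fits in s64:", S64_MIN <= sum_after <= S64_MAX)
```

With the zero point on the heavy entity, the only non-zero term is the light entity's key times its small weight, which is the situation the "OTOH" paragraph of the code comment describes.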