Date: Thu, 2 Apr 2026 12:22:15 +0200
From: Peter Zijlstra
To: K Prateek Nayak
Cc: Vincent Guittot, mingo@kernel.org, juri.lelli@redhat.com,
	dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
	mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org,
	wangtao554@huawei.com, quzicheng@huawei.com, dsmythies@telus.net,
	shubhang@os.amperecomputing.com
Subject: Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
Message-ID: <20260402102215.GT3738010@noisy.programming.kicks-ass.net>
References: <20260219075840.162631716@infradead.org>
	<20260219080624.942813440@infradead.org>
	<20260223115100.GI2995752@noisy.programming.kicks-ass.net>
	<0d3680c3-3e17-47b8-8fdb-0cc1f97ffce0@amd.com>
	<99fa12f9-71d3-4766-8742-a3adc9ce4271@amd.com>
In-Reply-To: <99fa12f9-71d3-4766-8742-a3adc9ce4271@amd.com>

On Thu, Apr 02, 2026 at 10:58:18AM +0530, K Prateek Nayak wrote:
> On 3/30/2026 1:25 PM, K Prateek Nayak wrote:
> > ------------[ cut here ]------------
> > (w_vruntime >> 63) != (w_vruntime >> 62)
> > WARNING: kernel/sched/fair.c:692 at __enqueue_entity+0x382/0x3a0, CPU#5: stress-ng/5062
>
> Back to this: I still see this with the latest set of changes on
> queue:sched/urgent, but it doesn't go kaboom. Nonetheless, it suggests we
> are closing in on the s64 limitations of "sum_w_vruntime", which isn't
> very comforting.

Yeah, we are pushing 64bit pretty hard :/

And if all we cared about was x86_64 I'd have long since used the fact
that imul has a 128bit result and idiv actually divides 128bit. But even
among 64bit architectures that is somewhat rare :/

> Here is one scenario where it was triggered when running:
>
>   stress-ng --yield=32 -t 10000000s&
>   while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
>
> on a 256-CPU machine, about an hour into the run:
>
>   __enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
>   cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
>   cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)
>
> The above comes from __enqueue_entity() after a place_entity(). Breaking
> this down:
>
>   vlag_initial = 57498
>   vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754
>
>   vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
>   entity_key(se, cfs_rq) = -141,245,081,754
>
> Now, multiplying the entity_key with its own weight results in
> 5,608,800,059,305,154,560 (same as what overflow_mul() suggests), but
> in Python, without overflow, this would be: -12,837,944,014,404,397,056

Oh gawd, this is a 'fun' case.

> One way to avoid the warning entirely would be to pull the zero_vruntime
> close to avg_vruntime if we are enqueuing a very heavy entity.
>
> The correct way to do this would be to compute the actual avg_vruntime()
> and move the zero_vruntime to that point (but that requires at least one
> multiply + divide + update_zero_vruntime()).
>
> One seemingly cheap way by which I've been able to avoid the warning is
> with:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 226509231e67..bc708bb8b5d0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5329,6 +5329,7 @@ static void
>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>  	u64 vslice, vruntime = avg_vruntime(cfs_rq);
> +	bool update_zero = false;
>  	s64 lag = 0;
>  
>  	if (!se->custom_slice)
> @@ -5406,6 +5407,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  			load += avg_vruntime_weight(cfs_rq, curr->load.weight);
>  
>  		lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
> +		/*
> +		 * If the entity_key() * sum_weight of all the enqueued entities
> +		 * is more than the sum_w_vruntime, move the zero_vruntime
> +		 * point to the vruntime of the entity, which prevents using
> +		 * more bits than necessary for sum_w_vruntime until the
> +		 * next avg_vruntime().
> +		 *
> +		 * XXX: Cheap enough check?
> +		 */
> +		if (abs(lag) > abs(cfs_rq->sum_w_vruntime))
> +			update_zero = true;
>  		if (WARN_ON_ONCE(!load))
>  			load = 1;
>  		lag = div64_long(lag, load);
> @@ -5413,6 +5425,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  
>  	se->vruntime = vruntime - lag;
>  
> +	if (update_zero)
> +		update_zero_vruntime(cfs_rq, -lag);
> +
>  	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
>  		se->deadline += se->vruntime;
>  		se->rel_deadline = 0;
> ---
>
> But I'm sure it'll make people nervous since we basically move the
> zero_vruntime to se->vruntime.
> It isn't too bad if:
>
>   abs(sum_w_vruntime - (lag * load)) < abs(lag * se->load.weight)
>
> but we already know that the latter overflows, so is there any other
> cheaper indicator that we can use to detect the necessity to adjust the
> avg_vruntime beforehand at place_entity()?

So in general I think it would be fine to move zero_vruntime to the
heaviest entity in the tree. And if there are multiple entities of equal
heaviest weight, any one of them should be fine.

Per necessity heavy entities are more tightly clustered -- the lag is
inversely proportional to weight, and the spread is proportional to the
lag bound.

I suspect something simple like comparing the entity weight against the
sum_weight might be enough. If the pre-existing tree is, in aggregate,
heavier than the new element, the avg will not move very drastically.
However, if the new element is (significantly) heavier than the tree,
the avg will move significantly (as demonstrated here).

That is, something like the below... But with a comment ofc :-)

Does that make sense?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9298f49f842c..7fbd9538fe30 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5329,6 +5329,7 @@ static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 vslice, vruntime = avg_vruntime(cfs_rq);
+	bool update_zero = false;
 	s64 lag = 0;
 
 	if (!se->custom_slice)
@@ -5345,7 +5346,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
-		long load;
+		long load, weight;
 
 		lag = se->vlag;
 
@@ -5405,14 +5406,21 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		if (curr && curr->on_rq)
 			load += avg_vruntime_weight(cfs_rq, curr->load.weight);
 
-		lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
+		weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+		lag *= load + weight;
 		if (WARN_ON_ONCE(!load))
 			load = 1;
 		lag = div64_long(lag, load);
+
+		if (weight > load)
+			update_zero = true;
 	}
 
 	se->vruntime = vruntime - lag;
 
+	if (update_zero)
+		update_zero_vruntime(cfs_rq, -lag);
+
 	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
 		se->deadline += se->vruntime;
 		se->rel_deadline = 0;