Date: Mon, 26 Jan 2026 22:23:47 +0100
From: Peter Zijlstra
To: Zicheng Qu
Cc: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
	mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org,
	tanghui20@huawei.com
Subject: Re: [PATCH] sched/fair: Fix vruntime drift by preventing double lag scaling during reweight
Message-ID: <20260126212347.GV166857@noisy.programming.kicks-ass.net>
References: <20251226001731.3730586-1-quzicheng@huawei.com>
In-Reply-To: <20251226001731.3730586-1-quzicheng@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Dec 26, 2025 at 12:17:31AM +0000, Zicheng Qu wrote:
> In reweight_entity(), when reweighting a currently running entity (se ==
> cfs_rq->curr), the entity remains on the runqueue context without
> undergoing a full dequeue/enqueue cycle.

I am horribly confused by this statement.
In case current is on_rq (most often the case) then it is dequeued just
as much as any other on_rq task being reweighted.

[edit: ... after much puzzling ... Ooh, are you perhaps alluding to the
fact that avg_vruntime() always adds cfs_rq->curr in?]

> This means avg_vruntime() remains constant throughout the reweight
> operation.

Now this might have a point; but I don't see how it would be specific to
current in any way. Consider the tree entities (vruntime,weight):

  (-40,10)
  ( 30,10)
  (-20,10)

This gives:

          -40*10 + 30*10 + -20*10   -300
  w_avg = ----------------------- = ---- = -10
                    30               30

Now, let's randomly pick (30,10) and do reweight to 5, then we have:

  vlag  = w_avg - vruntime = -10 - 30 = -40
  vlag' = vlag * weight / weight' = -40*10/5 = -80

Then placing that at -10 would give:

  (-40,10)
  ( 70,5 )
  (-20,10)

          -40*10 + 70*5 + -20*10   -250
  w_avg = ---------------------- = ---- = -10
                   25               25

However, if we pick any other entity, let's say (-40,10), and do the
same: 30*10/5 = 60, placing at -10 gives (-70,5), giving the w_avg:

          -70*5 + 30*10 + -20*10   -250
  w_avg = ---------------------- = ---- = -10
                   25               25

Or rather, we can say that w_avg is/should-be invariant under the
reweight transform. Current or no current, it doesn't matter.

Which is more or less what we had prior to:

  6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")

> However, the current implementation calls place_entity(..., 0) at the
> end of reweight_entity(). Under EEVDF, place_entity() is designed to
> handle entities entering the runqueue and calculates the virtual lag
> (vlag) to account for the change in the weighted average vruntime (V)
> using the formula:
>
>     vlag' = vlag * (W + w_i) / W
>
> Where 'W' is the current aggregate weight (including
> cfs_rq->curr->load.weight) and 'w_i' is the weight of the entity being
> enqueued (in this case, the se is exactly the cfs_rq->curr).
>
> This leads to a "double scaling" logic for running entities:
> 1.
>    reweight_entity() already rescales se->vlag based on the new weight
>    ratio.
> 2. place_entity() then mistakenly applies the (W + w_i)/W scaling again,
>    treating the reweight as a fresh enqueue into a new total weight
>    pool.
>
> This can cause the entity's vlag to be amplified (if positive) or
> suppressed (if negative) incorrectly during the reweight process.

Again, nothing specific to current here.

[edit: except for the fact that avg_vruntime() actually does include
current when re-adding current -- but that still doesn't explain why it
would be a good idea to have two different ways of doing reweight]

This all suggests we should revert 6d71a9c61604 ("sched/fair: Fix EEVDF
entity placement bug causing scheduling lag"), except we first need an
explanation for the problem described there.

At the time I observed reweight cycles 1048576 -> 2 -> 1048576 inflating
the lag, which obviously should not be. I might have made a mistake, but
we should ensure we're not re-introducing that problem, so let me see if
I can figure out where that came from, if not from this.

Now, poking at all this, I did find some numerical stability issues, but
rather than blowing up like I saw for 6d71a9c61604, these cause lag to
evaporate (also not good).
Observe the below proglet; when using w_avg_t (for truncate, like
sum_w_vruntime with use of scale_load_down()):

  vruntime: 40960 (-51200) -> 40960 (-51200)
  avg: -10240 -- -10240
  vruntime: 40960 (-51200) -> 26843535360 (-26843545600)
  avg: -35840 -- -35840
  vruntime: 26843535360 (-26843571200) -> 15360 (-51200)
  avg: -18773 -- -18773
  vruntime: 15360 (-34133) -> 17895503531 (-17895522304)
  avg: -35840 -- -35840
  vruntime: 17895503531 (-17895539371) -> -1707 (-34133)
  avg: -24462 -- -24462
  vruntime: -1707 (-22755) -> 11930148978 (-11930173440)
  avg: -35840 -- -35840
  vruntime: 11930148978 (-11930184818) -> -13085 (-22755)
  avg: -28255 -- -28255

Whereas, if we run it with w_avg_n:

  vruntime: 40960 (-51200) -> 40960 (-51200)
  avg: -10240 -- -10240
  vruntime: 40960 (-51200) -> 26843535360 (-26843545600)
  avg: -10240 -- -35840
  vruntime: 26843535360 (-26843545600) -> 40960 (-51200)
  avg: -10240 -- -10240
  vruntime: 40960 (-51200) -> 26843535360 (-26843545600)
  avg: -10240 -- -35840
  vruntime: 26843535360 (-26843545600) -> 40960 (-51200)
  avg: -10240 -- -10240
  vruntime: 40960 (-51200) -> 26843535360 (-26843545600)
  avg: -10240 -- -35840
  vruntime: 26843535360 (-26843545600) -> 40960 (-51200)
  avg: -10240 -- -10240

Specifically, the problem is that the weight of 2 is below the smallest
representable scaled-down value, causing accuracy issues for
avg_vruntime() which then propagate.

Luckily, we recently replaced min_vruntime with zero_vruntime, which
tracks avg_vruntime far better, resulting in smaller 'keys', which in
turn -- as it happens -- allows us to get rid of that
scale_load_down().

Could you please try queue.git/sched/reweight? I've only confirmed it
boots on my machine. I'll prod a little more at it tomorrow.

---

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define min(a,b) \
({ __typeof__ (a) _a = (a); \
   __typeof__ (b) _b = (b); \
   _a < _b ? _a : _b; })

struct e {
	int64_t vr;
	int64_t w;
};

const int scale = 1024;

struct e es[] = {
	{  40*scale, 1024*1024 },
	{ -20*scale, 1024*1024 },
	{ -50*scale, 1024*1024 },
};

const int n = sizeof(es)/sizeof(es[0]);

int64_t w_avg_n(void)
{
	int64_t ws = 0, w = 0;

	for (int i=0; i<n; i++) {
		w  += es[i].w;
		ws += es[i].vr * es[i].w;
	}

	return ws/w;
}

int64_t w_avg_t(void)
{
	int64_t ws = 0, w = 0;

	for (int i=0; i<n; i++) {
		int64_t t = (es[i].w >> 10);
		w  += t;
		ws += es[i].vr * t;
	}

	return ws/w;
}

typedef int64_t (*avg_f)(void);
avg_f w_avg = &w_avg_t; // w_avg_n for good, w_avg_t for broken

void reweight(int i, int64_t w)
{
	int64_t a = w_avg();
	int64_t t, vl = a - es[i].vr;

	t = (vl * es[i].w) / w;

	printf("vruntime: %Ld (%Ld) -> %Ld (%Ld)\n", es[i].vr, vl, a-t, t);

	es[i].vr = a - t;
	es[i].w = w;
}

int main(int argc, char **argv)
{
	int i = 0;

	if (argc > 1)
		i = atoi(argv[1]);

	reweight(i, 1024*1024);
	printf("avg: %Ld -- %Ld\n", w_avg(), w_avg_t());

	for (int j=0; j<3; j++) {
		reweight(i, 2);
		printf("avg: %Ld -- %Ld\n", w_avg(), w_avg_t());
		reweight(i, 1024*1024);
		printf("avg: %Ld -- %Ld\n", w_avg(), w_avg_t());
	}

	return 0;
}