From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 3 Feb 2026 10:38:04 -0500
From: Johannes Weiner
To: Peter Zijlstra
Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann,
	John Stultz, linux-kernel@vger.kernel.org
Subject: Re: [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator
Message-ID:
References: <20260114154317.1815429-1-hannes@cmpxchg.org>
 <20260128103514.GX171111@noisy.programming.kicks-ass.net>
 <20260128201743.GW166857@noisy.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20260128201743.GW166857@noisy.programming.kicks-ass.net>

On Wed, Jan 28, 2026 at 09:17:43PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 28, 2026 at 01:56:20PM -0500, Johannes Weiner wrote:
> > That mutual understanding of "now" is expensive, so I'm trying to
> > remove it. This necessarily opens a race: the reader could sample a
> > live state to a time beyond what the scheduler will use to conclude
> > that state. The result is that one reader run can oversample slightly,
> > and the next run will see a negative delta.
>
> Right, exactly the problem from:
>
>   3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")

Yep.

> > Here is how I'm proposing to handle it:
> >
> > > > +	/*
> > > > +	 * This snooping ahead can obviously race with the
> > > > +	 * state concluding on the cpu. If we previously
> > > > +	 * snooped to a time past where the state concludes,
> > > > +	 * times[s] can now be behind times_prev[s].
> > > > +	 *
> > > > +	 * time_after32() would be the obvious choice, but
> > > > +	 * S32_MAX is right around two seconds, which is the
> > > > +	 * aggregation interval; if the aggregator gets
> > > > +	 * delayed, there would be a risk of dismissing
> > > > +	 * genuinely large samples. Use a larger margin.
> > > > +	 */
> > > >  	delta = times[s] - groupc->times_prev[aggregator][s];
> > > > +	if (delta > psi_period + (psi_period >> 1))
> > > > +		delta = 0;
> > > > +
> > >
> > > This seems to check if times_prev is larger than times; confused again.
> >
> > In the last round, the reader oversampled. That means times_prev[]
> > slipped ahead of the scheduler reality (times[]).
> >
> > The actual accounting error caused by this should be acceptable, owed
> > to the slightly different clock reads of the two racing threads.
> >
> > Certainly, there is no meaningful additional sample time in *this*
> > round, so we fix the delta to 0.
>
> OK, so you don't care about the fact that consistent races can inflate
> the number beyond actuality. You just want to get rid of backward
> motion?

Yes. The backward motion is what caused the multi-second errors that
immediately shift even the long-term averages. Those are blatantly
noticeable and cause practical issues. And notably, these were the
ONLY issue anyone reported against 3840cbe24cf0 in three years.

Given the decaying averages, and the recency windows anyone really
cares about, a smaller fudge factor is much less of a concern. It's
largely self-correcting too: if you have ongoing/recurring pressure
periods, oversampling of live state is corrected by subsequent smaller
samples against the "actual" time the scheduler ended up recording.

The actual underflows were quite rare.
The reproducer had to run the aggregation every microsecond for
extended periods to trigger a race where the oversampled error is
less than state time in the subsequent sampling period:

             v detect live state
             |        v now() -> record N + error
             |        |              v underflow
aggregator   |        |              |
scheduler    |--------------|
                            ^ now() -> record N

We *could* look into remembering oversampling debt accurately and
incorporate deficits into the samples that go into total=. But we
couldn't do it for the avg= because that also encodes recency. And
that discrepancy between the two would likely be worse interface-wise
than a fudge factor affecting both the same.

> The argument being something like: Since it's a decaying average, and
> the per-time accumulated error is relatively minor, it doesn't affect
> the over-all outcome much.
>
> Because if we just consider it as free-running counters, the
> accumulation error is unbound. But since it is time averaged, the error
> becomes insignificant.

Yes.

> > > >  	groupc->times_prev[aggregator][s] = times[s];
> > >
> > > It updates times_prev irrespectively. Storing a potentially larger
> > > value.
> >
> > Right, this is on purpose. Once the delta is extracted and processed,
> > we need to update the reader to where the scheduler is, as per
> > usual. But there is now a second case:
> >
> > 1) Existing case. The scheduler accumulated state time the reader
> >    hasn't seen yet, so times_prev[] is behind times[]. After delta
> >    extraction, the assignment catches the reader up to the scheduler.
> >
> > 2) New case. The previous reader run used a `now' too far in the
> >    future and oversampled. times_prev[] is *ahead* of times[]. The
> >    assignment *rewinds* the reader back to the scheduler's reality.
>
> Well, not quite, that too large delta has already 'escaped' and been
> accumulated. This way you ensure a second race will again result in a
> too large delta being accumulated, rather than the next state being
> accounted slightly short -- which would compensate for the previous one
> being accounted slightly larger.
>
> That is afaict 2) ensures you are consistently oversampling but never
> undersampling.

True. But because of the importance of recency in that data, there is
a limit to how much we could backcorrect later samples.

> > > >  	times[s] = delta;
> > >
> > > And stores the delta, which can be larger than it should be?
> >
> > We filtered out bogus deltas, so it should be valid or 0.
>
> Semantically a too large value is equally bogus to a negative value.
>
> It's just that negative numbers are not expected and wreck things down
> the chain.

Yes.

> > > For all these we call psi_group_change(), which calls record_times()
> > > which then sets ->state_start to a smaller value. Resulting in times
> > > above to be larger still.
> >
> > I think there are two considerations.
> >
> > We use that clock value to both start and stop the clock on a
> > state. So while we use a timestamp from slightly earlier in the
> > scheduling event, we do it on both ends. This should not affect
> > long-term accuracy.
>
> If we only consider these timestamps, sure. But due to the whole
> accumulation mess in between you get unbounded drift (on the
> accumulator -- pre-averaging).
>
> > Then there is reader coherency. Before, state_start would match the
> > exact time at which the reader could see the state go live inside the
> > state_mask. After the patch, the reader could see a state whose
> > state_start is slightly in the past. But that's the common case for
> > the reader anyway? Since it rarely runs in the same nanosecond in
> > which the state change occurs.
> >
> > Am I missing something?
>
> So currently, with things being inside the locks, if we sample early we
> miss a bit. If we sample late, we see the exact time.
>
> With the update time being early, we go back to 3840cbe24cf0, and can
> see a too long period in both cases.
>
> But because you're then also using a late clock on read, the error is
> larger still.
>
> If you are measuring time from a start to 'now', and the actual period
> is (10 characters) like so:
>
>   .x.|--------|.y.
>
> Then, if you move the start to x (earlier), your period becomes longer
> (12 characters). If you then also move now to y (later) you get an
> even larger error (14 characters).
>
> I mean, it all probably doesn't matter, but it's there.

Right. I think we can live with it, but need it documented in the code
comments to maintain a proper mental picture of the implementation
tradeoffs and their consequences.

> > > So this inflates delta and leaves me utterly confused.
> > >
> > > Not only does the Changelog here not explain anything, this also very
> > > much needs comments in the code, because the next time someone is going
> > > to be reading this, they'll break their WTF'o'meter and probably the
> > > next one they get too.
> >
> > Fair enough, it's tricky.
> >
> > I'll do my best to capture all the above into the changelog and code
> > comments. But let's try to get on the same page first. That should
> > also help identify which parts exactly need the most help.
>
> Mostly I wasn't sure on which errors you care about and which you don't.

Ok, fair enough.

> As I understand it now, you *only* care about not having negative values
> in the accumulation chain because that's otherwise unsigned and
> negatives show up as large numbers and things go 'funny'.
>
> You do not care about long term drift in the pure accumulator -- since
> it's all time averaged?

Yep. Thanks for taking a look.