From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 3 Feb 2026 10:38:04 -0500
From: Johannes Weiner
To: Peter Zijlstra
Cc: Suren Baghdasaryan, Ingo Molnar, Chengming Zhou, Dietmar Eggemann,
	John Stultz, linux-kernel@vger.kernel.org
Subject: Re: [PATCH resend 1/2] sched: psi: loosen clock sync between scheduler and aggregator
Message-ID:
References: <20260114154317.1815429-1-hannes@cmpxchg.org>
 <20260128103514.GX171111@noisy.programming.kicks-ass.net>
 <20260128201743.GW166857@noisy.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20260128201743.GW166857@noisy.programming.kicks-ass.net>

On Wed, Jan 28, 2026 at 09:17:43PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 28, 2026 at 01:56:20PM -0500, Johannes Weiner wrote:
> > That mutual understanding of "now" is expensive, so I'm trying to
> > remove it. This necessarily opens a race: the reader could sample a
> > live state to a time beyond what the scheduler will use to conclude
> > that state. The result is that one reader run can oversample slightly,
> > and the next run will see a negative delta.
>
> Right, exactly the problem from:
>
>   3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")

Yep.

> > Here is how I'm proposing to handle it:
> >
> > > > +	/*
> > > > +	 * This snooping ahead can obviously race with the
> > > > +	 * state concluding on the cpu. If we previously
> > > > +	 * snooped to a time past where the state concludes,
> > > > +	 * times[s] can now be behind times_prev[s].
> > > > +	 *
> > > > +	 * time_after32() would be the obvious choice, but
> > > > +	 * S32_MAX is right around two seconds, which is the
> > > > +	 * aggregation interval; if the aggregator gets
> > > > +	 * delayed, there would be a risk of dismissing
> > > > +	 * genuinely large samples. Use a larger margin.
> > > > +	 */
> > > >  	delta = times[s] - groupc->times_prev[aggregator][s];
> > > > +	if (delta > psi_period + (psi_period >> 1))
> > > > +		delta = 0;
> > > > +
> > >
> > > This seems to check if times_prev is larger than times; confused again.
> >
> > In the last round, the reader oversampled. That means times_prev[]
> > slipped ahead of the scheduler reality (times[]).
> >
> > The actual accounting error caused by this should be acceptable, owed
> > to the slightly different clock reads of the two racing threads.
> >
> > Certainly, there is no meaningful additional sample time in *this*
> > round, so we fix the delta to 0.
>
> OK, so you don't care about the fact that consistent races can inflate
> the number beyond actuality. You just want to get rid of backward
> motion?

Yes. The backward motion is what caused the multi-second errors that
immediately shift even the long-term averages. Those are blatantly
noticeable and cause practical issues. And notably, these were the
ONLY issue anyone reported against 3840cbe24cf0 in three years.

Given the decaying averages, and the recency windows anyone really
cares about, a smaller fudge factor is much less of a concern. It's
largely self-correcting too: if you have ongoing/recurring pressure
periods, oversampling of live state is corrected by subsequent smaller
samples against the "actual" time the scheduler ended up recording.

The actual underflows were quite rare.
The reproducer had to run the aggregation every microsecond for
extended periods to trigger a race where the oversampled error is
less than state time in the subsequent sampling period:

             v detect live state
             |        v now() -> record N + error
             |        |              v underflow
aggregator   |        |              |
scheduler    |--------------|
                            ^ now() -> record N

We *could* look into remembering oversampling debt accurately and
incorporate deficits into the samples that go into total=. But we
couldn't do it for the avg= because that also encodes recency. And
that discrepancy between the two would likely be worse interface-wise
than a fudge factor affecting both the same.

> The argument being something like: Since it's a decaying average, and
> the per-time accumulated error is relatively minor, it doesn't affect
> the over-all outcome much.
>
> Because if we just consider it as free-running counters, the
> accumulation error is unbound. But since it is time averaged, the error
> becomes insignificant.

Yes.

> > > >  	groupc->times_prev[aggregator][s] = times[s];
> > >
> > > It updates times_prev irrespectively. Storing a potentially larger
> > > value.
> >
> > Right, this is on purpose. Once the delta is extracted and processed,
> > we need to update the reader to where the scheduler is, as per
> > usual. But there is now a second case:
> >
> > 1) Existing case. The scheduler accumulated state time the reader
> >    hasn't seen yet, so times_prev[] is behind times[]. After delta
> >    extraction, the assignment catches the reader up to the scheduler.
> >
> > 2) New case. The previous reader run used a `now' too far in the
> >    future and oversampled. times_prev[] is *ahead* of times[]. The
> >    assignment *rewinds* the reader back to the scheduler's reality.
>
> Well, not quite, that too large delta has already 'escaped' and been
> accumulated. This way you ensure a second race will again result in a
> too large delta being accumulated, rather than the next state being
> accounted slightly short -- which would compensate for the previous one
> being accounted slightly larger.
>
> That is afaict 2) ensures you are consistently oversampling but never
> undersampling.

True. But because of the importance of recency in that data, there is
a limit to how much we could backcorrect later samples.

> > > >  	times[s] = delta;
> > >
> > > And stores the delta, which can be larger than it should be?
> >
> > We filtered out bogus deltas, so it should be valid or 0.
>
> Semantically a too large value is equally bogus to a negative value.
>
> It's just that negative numbers are not expected and wreck things down
> the chain.

Yes.

> > > For all these we call psi_group_change(), which calls record_times()
> > > which then sets ->state_start to a smaller value. Resulting in times
> > > above to be larger still.
> >
> > I think there are two considerations.
> >
> > We use that clock value to both start and stop the clock on a
> > state. So while we use a timestamp from slightly earlier in the
> > scheduling event, we do it on both ends. This should not affect
> > long-term accuracy.
>
> If we only consider these timestamps, sure. But due to the whole
> accumulation mess in between you get unbounded drift (on the
> accumulator -- pre-averaging).
>
> > Then there is reader coherency. Before, state_start would match the
> > exact time at which the reader could see the state go live inside the
> > state_mask. After the patch, the reader could see a state whose
> > state_start is slightly in the past. But that's the common case for
> > the reader anyway? Since it rarely runs in the same nanosecond in
> > which the state change occurs.
> >
> > Am I missing something?
>
> So currently, with things being inside the locks, if we sample early we
> miss a bit. If we sample late, we see the exact time.
>
> With the update time being early, we go back to 3840cbe24cf0, and can
> see a too long period in both cases.
>
> But because you're then also using a late clock on read, the error is
> larger still.
>
> If you are measuring time from a start to 'now', and the actual period
> is (10 characters) like so:
>
>   .x.|--------|.y.
>
> Then, if you move the start to x (earlier), your period becomes longer
> (12 characters). If you then also move now to y (later) you get an
> even larger error (14 characters).
>
> I mean, it all probably doesn't matter, but it's there.

Right. I think we can live with it, but need it documented in the code
comments to maintain a proper mental picture of the implementation
tradeoffs and their consequences.

> > > So this inflates delta and leaves me utterly confused.
> > >
> > > Not only does the Changelog here not explain anything, this also very
> > > much needs comments in the code, because the next time someone is going
> > > to be reading this, they'll break their WTF'o'meter and probably the
> > > next one they get too.
> >
> > Fair enough, it's tricky.
> >
> > I'll do my best to capture all the above into the changelog and code
> > comments. But let's try to get on the same page first. That should
> > also help identify which parts exactly need the most help.
>
> Mostly I wasn't sure on which errors you care about and which you don't.

Ok, fair enough.

> As I understand it now, you *only* care about not having negative values
> in the accumulation chain because that's otherwise unsigned and
> negatives show up as large numbers and things go 'funny'.
>
> You do not care about long term drift in the pure accumulator -- since
> it's all time averaged?

Yep. Thanks for taking a look.