From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 82A3E1FC101
	for <linux-kernel@vger.kernel.org>; Fri, 19 Dec 2025 10:13:09 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1766139191; cv=none; b=EqwnKeZY6u1lHIwuC+lgkVLCziqAKnn5YJmswQtPeV9roKLJACner2Td1V/hrCsBzhzV09bGCTAPJopsVlzsR/PRL4mp0MkH4jRqlw4+ulsUiizaaY1deSsiZemFgJXA0azBEsPL+ERu2EHdGv9oOdJ0kibTBST9JLU35z9yW4o=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1766139191; c=relaxed/simple;
	bh=wYjSJuWGuEiihEi4MOyuQFjuI/o8KD14FC0zk4V6k+w=;
	h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID:
	 MIME-Version:Content-Type; b=k6Me/D+aij/qLy+DhwiG/W+E2g3paYxw0PAvBBll4Agwu1ON4TJEV3Dt3Ora4HRlgxzntHOWLXl0XuujdDgNulHFULVCcO5CrGRv5IJwjBaiIhAkxlCx9zyKzDZfp6JFga0B7hQBqAM8Dx+dHqBmxmZEgsC62GkwudFOgJIE6jM=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=qG7QTnhd; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=S1NAQ/Cq; arc=none smtp.client-ip=193.142.43.55
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="qG7QTnhd";
	dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="S1NAQ/Cq"
From: Thomas Gleixner <tglx@linutronix.de>
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020; t=1766139187;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=pA6Aptgo8hlwrAzZZZ0eafGwe1kEPRGVPYulftvv3hs=;
	b=qG7QTnhdR/nZdHJaGZfSP8Nu33bnt/G2fydnnKMeLrDO7ToF38Npqj6QDLrhwZxJfEpYMO
	CTKCVucXGYiPJvhPi6uwJD+GuoYQI7MvVdTuY4W5HECeT7/fxU8dWYpsqDFGTLUrHUaj8D
	5+9Uvl7qmnAJTnN2L/khdIyi71wUiEywLRs5qpF3UgsLvbPW1etvScIVNC+V2q3D79TDZC
	qB5Coe5obQbUEVyPwIWt7c34t7ogJcjUMsxGKIayu76g5ubLbk0h5KW6l64FNR+dVU7V3k
	QPLMCl9MJmhyT64z8oSoLU6dh7o2X5IeVNQKjMm7jn/+J7C1vO6piX5Pe281nA==
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020e; t=1766139187;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=pA6Aptgo8hlwrAzZZZ0eafGwe1kEPRGVPYulftvv3hs=;
	b=S1NAQ/CqzKQxNrygiscw/DtAMHIUAcfzgjBaRomChENgiuA1zXqZPvldRAnF3w7pv9l5Ac
	aE1dOdCr7mN6JXCw==
To: paulmck@kernel.org
Cc: LKML <linux-kernel@vger.kernel.org>, Daniel J Blueman
 <daniel@quora.org>, John Stultz <jstultz@google.com>, Waiman Long
 <longman@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Dave Hansen
 <dave.hansen@linux.intel.com>, Tony Luck <tony.luck@intel.com>, Borislav
 Petkov <bp@alien8.de>, Stephen Boyd <sboyd@kernel.org>, Scott Hamilton
 <scott.hamilton@eviden.com>
Subject: Re: clocksource: Reduce watchdog readout delay limit to prevent
 false positives
In-Reply-To: <90dce4d0-bd90-4321-95d8-c8dd1c121de0@paulmck-laptop>
References: <87bjjxc9dq.ffs@tglx>
 <90dce4d0-bd90-4321-95d8-c8dd1c121de0@paulmck-laptop>
Date: Fri, 19 Dec 2025 11:13:05 +0100
Message-ID: <878qeybwzy.ffs@tglx>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain

On Wed, Dec 17 2025 at 16:48, Paul E. McKenney wrote:
> On Wed, Dec 17, 2025 at 06:21:05PM +0100, Thomas Gleixner wrote:
>> The "valid" readout delay between the two reads of the watchdog is larger
>> than the valid delta between the resulting watchdog and clocksource
>> intervals, which results in false positive watchdog results.
>> 
>> Assume TSC is the clocksource and HPET is the watchdog and both have a
>> uncertainty margin of 250us (default). The watchdog readout does:
>> 
>>   1) wdnow = read(HPET);
>>   2) csnow = read(TSC);
>>   3) wdend = read(HPET);
>     4) wd_end2 = read(HPET);

That's completely irrelevant for the problem at hand.

>> The valid window for the delta between #1 and #3 is calculated by the
>> uncertainty margins of the watchdog and the clocksource:
>> 
>>    m = 2 * watchdog.uncertainty_margin + cs.uncertainty margin;
>> 
>> which results in 750us for the TSC/HPET case.
>
> Yes, because this interval includes two watchdog reads (#1 and #3 above)
> and one clocksource read (#2 above).  We therefore need to allow two
> watchdog uncertainties and one clocksource uncertainty.

That's a made up and broken theory based on ill defined heuristics.

The uncertainty of a clocksource is defined by:

    1) The frequency it runs with, which affects the accuracy of the
       time readout.

    2) The access time

Your implementation defines that the uncertainty is at least 50us,
with a default of 125us and an upper limit of 1ms. That's just made up
numbers pulled of of thin air which have nothing to do with reality.

Declaring that TSC access time of up to 1ms and at least 50us is
hillarious.

Adding up these made up margins and then double the watchdog margin to
validate the readout is just a hack to make it "work for me" and thereby
breaking the whole machinery for existing problematic systems. See below.

>> The actual interval comparison uses a smaller margin:
>> 
>>    m = watchdog.uncertainty_margin + cs.uncertainty margin;
>> 
>> which results in 500us for the TSC/HPET case.
>
> This is the (wd_seq_delay > md) comparison, righr?  If so, the reason
> for this is because it is measuring only a pair of watchdog reads (#3
> and #4).  There is no clocksource read on the latency recheck, so we do
> not include the cs->uncertainty_margin value, only the pair of watchdog
> uncertainty values.

No. This is the check which does:

	int64_t md = 2 * watchdog->uncertainty_margin;
        ...

                *wdnow = watchdog->read(watchdog);
		*csnow = cs->read(cs);
		wd_end = watchdog->read(watchdog);
                ...
                
		wd_delay = cycles_to_nsec_safe(watchdog, *wdnow, wd_end);
		if (wd_delay <= md + cs->uncertainty_margin) {
                          ...
                          return WR_READ_SUCCESS;
                }

It has nothing to do with the wd_seq_delay check.

> If this check fails, that indicates that the watchdog clocksource is much
> slower than expected (for example, due to memory-system overload affecting
> HPET on multicore systems), so we skip this measurement interval.
>
>> That means the following scenario will trigger the watchdog:
>> 
>>  Watchdog cycle N:
>> 
>>  1)       wdnow[N] = read(HPET);
>>  2)       csnow[N] = read(TSC);
>>  3)       wdend[N] = read(HPET);
>> 
>> Assume the delay between #1 and #2 is 100us and the delay between #1 and
>> #3 is within the 750us margin, i.e. the readout is considered valid.
>
> Yes.  We expect at most 250us for #1, another 250us for #2, and yet
> another 250us for #3.
>
>>  Watchdog cycle N + 1:
>> 
>>  4)       wdnow[N + 1] = read(HPET);
>>  5)       csnow[N + 1] = read(TSC);
>>  6)       wdend[N + 1] = read(HPET);
>> 
>> If the delay between #4 and #6 is within the 750us margin then any delay
>> between #4 and #5 which is larger than 600us will fail the interval check
>> and mark the TSC unstable because the intervals are calculated against the
>> previous value:
>> 
>>     wd_int = wdnow[N + 1] - wdnow[N];
>>     cs_int = csnow[N + 1] - csnow[N];
>
> Except that getting 600us latency between #4 and #5 is not consistent
> with a 250us uncertainty.  If that is happening, the uncertainty should
> instead be at least 300us.

That's utter nonsense.

>> Putting the above delays in place this results in:
>> 
>>     cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us);
>>  -> cs_int = wd_int + 510us;
>> 
>> which is obviously larger than the allowed 500us margin and results in
>> marking TSC unstable.
>
> Agreed, but due to the ->uncertainty_margin values being too small.

You seriously fail to understand basic math. Let me draw you a picture:

    H  = HPET read
    T  = TSC read
    RW = read valid window
    IW = interval valid window

    RW-------------------------------------------RW
    IW-------------------------IW

    HT                                       H          <- Read 1 valid
    H                                  TH               <- Read 2 valid
     |---------------------------------|                <- Interval too large

Q: How is increasing the uncertainty values fixing the underlying math
   problem of RW > IW?

A: Not at all. It just papers over it and makes the false positive case
   more unlikely by further weakening the accuracy.

>> Fix this by using the same margin as the interval comparison. If the delay
>> between two watchdog reads is larger than that, then the readout was either
>> disturbed by interconnect congestion, NMIs or SMIs. 
>> 
>> Fixes: 4ac1dd3245b9 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin")
>> Reported-by: Daniel J Blueman <daniel@quora.org>
>
> If this is happening in real life, we have a couple of choices:
>
> 1.	Increase the ->uncertainty_margin values to match the objective
> 	universe.

Which universe? The universe of made up math? See above.

> 2.	In clocksource_watchdog(), replace "(abs(cs_nsec - wd_nsec) > md)"
> 	with "(abs(cs_nsec - wd_nsec) > 2 * md)".
>
> 	The rationale here is that the ->uncertainty_margin values are
> 	two-tailed, as in the clocksource might report a value that is
> 	->uncertainty_margin and ->uncertainty_margin too late.  When I
> 	was coding this, I instead assumed that ->uncertainty_margin
> 	covered the full range, centered on the correct time value.

So you propose the opposite of what I'm doing, which weakens the
watchdog even further.

> 	You would know better than would I.
>
> My concern is that the patch below would force needless cs_watchdog_read()
> retries.

That's not the end of the world and way better than degrading the
watchdog further.

It's already useless for the original purpose to detect even slow skew
of TSC between CPUs because it is relative and the uncertainty margins
are insanely big.

I just booted an old machine where the BIOS "hides" SMI time by saving
the TSC value on entry and restoring it on exit. The original watchdog
implementation caught that. Now it happily continues and I can observe
time going backwards between CPUs in user space. After an hour the
difference between the two CPUs is > 1sec and the system still claims
that everything is fine. And no, TSC_ADJUST does not catch that on this
machine because the CPU does not support it.

IOW, this whole big machine scalability hackery broke basic
functionality for existing systems.

We really need to go back to the drawing board and rethink this
machinery from scratch.

There are two main aspects to the watchdog:

  1) Detect general frequency skew by comparing it against a known
     (assumed to be) stable clocksource

     This addresses the problems on older machines where the BIOS
     fiddles with the CPU frequency behind the kernels back.

     That's a non-issue on modern CPUs because the TSC is constant
     frequency.

  2) Detect inter CPU skew

     This addresses the problems where

     A) the BIOS fiddles with the TSC behind the kernels back (see
        above) and the modification cannot be detected by the TSC_ADJUST
        MSR due to its non-existance

        Not an issue on modern CPUs

     B) the sockets are not synchronized and the TSC frequency on them
        drifts apart

     C) hardware failure (something you mentioned back then when you
        hacked this uncertainty magic up)

As I just validated on that old machine the whole "scalability" hackery
broke #2A and #2B unless the modifications or drifts are massive. But if
they are not it leaves both user space and kernel with inconsistent
time.

That means we really have to look at this in two ways:

  I)  On old hardware where TSC is not constant frequency, the validation
      against HPET is required.

  II) On new hardware with constant frequency TSC and TSC_ADJUST_MSR the
      HPET comparison is not really useful anymore as it does not detect
      low drift issues between unsynchronized sockets.

What can be done instead?

  - Make CPU0 the watchdog supervisor, which runs the timer and
    orchestrates the machinery.

  - Limit the HPET comparison to CPUs which lack constant frequency,
    which makes the whole legacy I/O issue on large systems go away.

    Run this comparision only on CPU0

  - Instead of moving the timer around CPUs, utilize a SMP function call
    similar to what the TSC synchronization mechanism does, i.e.

    CPU0         CPUN
    sync()       sync()
    t1 = rdtsc()    
    handover()
                 t2 = rdtsc()
                 handover()
    t3 = rdtsc()
    ....

    Each step validates that tN <= TN+1, which detects all issues of #2
    above for both old and contemporary machines.

    Of course this mechanism is also affected by interconnect
    congestion, but the only side effect of interconnect congestion is
    that the handover becomes slow, which means that the accuracy of
    detecting time going backwards suffers temporarily.
    
With that all these made up uncertainty heuristics go completely away or
can be turned into something which actually reflects the reality of the
universe, i.e. the accuracy of the clocks and their access time.

I doubt it's needed because the original implementation worked just fine
and it's only relevant for the actual HPET/TSC comparison case. That is
then limited to museum pieces which are not really affected by the
scalability issues of todays machines. Especially not as the HPET/TSC
check is limited to CPU0, which is the one closest to the legacy
peripherals which contain the HPET or in real old machines the ACPI_PM
timer.

I'll go and utilize my copious spare time to implement this, but don't
expect it to materialize before 2026 :)

Thanks,

        tglx