public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Jiri Wiesner <jwiesner@suse.de>
To: Thomas Gleixner <tglx@kernel.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	John Stultz <jstultz@google.com>,
	Waiman Long <longman@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Daniel Lezcano <daniel.lezcano@linaro.org>,
	Stephen Boyd <sboyd@kernel.org>,
	x86@kernel.org, "Gautham R. Shenoy" <gautham.shenoy@amd.com>,
	Daniel J Blueman <daniel@quora.org>,
	Scott Hamilton <scott.hamilton@eviden.com>,
	Helge Deller <deller@gmx.de>,
	linux-parisc@vger.kernel.org,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	linux-mips@vger.kernel.org
Subject: Re: [patch 5/5] clocksource: Rewrite watchdog code completely
Date: Wed, 25 Feb 2026 19:13:22 +0100	[thread overview]
Message-ID: <aZ87wpdHJ5vajYoL@incl> (raw)
In-Reply-To: <20260123231521.926490888@kernel.org>

On Sat, Jan 24, 2026 at 12:18:01AM +0100, Thomas Gleixner wrote:
> To address this and bring back sanity to the watchdog, rewrite the code
> completely with a different approach:
> 
>   1) Restrict the validation against a reference clocksource to the boot
>      CPU, which is usually the CPU/Socket closest to the legacy block which
>      contains the reference source (HPET/ACPI-PM timer).

The UEFI picks the boot CPU so the kernel does not have control over that. On the other hand, I think the CPU that is connected to the southbridge chip (by DMI or PCIe) will be selected in the majority of UEFI implementations. Even if the boot CPU had to use the inter-processor link the readout latency should often pass the 50 microsecond threshold. This is a histogram of the hpet-tsc-hpet readout latency (in nanoseconds) as measured by the old clocksource watchdog (reads carried out from all CPUs on the machine):

wd_delay Duration Distribution
wd_delay Duration Average:  7822 +-  9413 (min 2875, max 77916)
        Range      Count
    0 -  5000       2766 (73.10%)
 5000 - 10000        383 (10.12%)
20000 - 25000        402 (10.62%)
25000 - 30000         94 ( 2.48%)
30000 - 35000         49 ( 1.29%)
35000 - 40000         35 ( 0.92%)
40000 - 45000         21 ( 0.55%)
45000 - 50000         14 ( 0.37%)
50000 - 55000          7 ( 0.18%)
55000 - 60000          3 ( 0.08%)
60000 - 65000          4 ( 0.11%)
65000 - 70000          2 ( 0.05%)
70000 - 75000          1 ( 0.03%)
75000 - 80000          3 ( 0.08%)
Total count: 3,784

The machine has 8 NUMA nodes with 960x Intel Xeon Platinum 8490H. The machine was running:
stress-ng -t 30m --cpu 480 --switch 520
It definitely does not represent effect of any arbitrary workload on the inter-processor link but it is a data point.

There is one issue: What if the reference clocksource itself experiences time skew? I have seen a case like this with the sgi_rtc clocksource. I created a debugging kernel with the HPET as a second watchdog (not affecting the decisions by the watchdog) and got this result:
> clocksource: timekeeping watchdog on CPU118: Marking clocksource 'tsc' as unstable because the skew is too large:
> clocksource: 'sgi_rtc' wd_nsec: 511302794 wd_now: 1cb50e4c4b wd_last: 1ca7097111 mask: ffffffffffffff
> clocksource: 'hpet' wd2_nsec: 512005960 wd2_now: 65892719 wd2_last: 64c5d684 mask: ffffffff
> clocksource: 'tsc' cs_nsec: 512006458 cs_now: 86b5982cb1 cs_last: 867581bbab mask: ffffffffffffffff
> clocksource: 'tsc' skewed 703664 ns (0 ms) over watchdog 'sgi_rtc' interval of 511302794 ns (511 ms)
> clocksource: 'tsc' is current clocksource.
> tsc: Marking TSC unstable due to clocksource watchdog
> clocksource: Checking clocksource tsc synchronization from CPU 610 to CPUs 0-609,611-767.
> clocksource: Switched to clocksource sgi_rtc

The intervals measured by the TSC and the HPET match very well; the sgi_rtc is off. Even the new implementation of the clocksource watchdog would be susceptible to the reference clocksource experiencing time skew. I think the clocksource watchdog needs to make the assumption that the reference clocksource is right, and the onus should be on hardware developers to make sure the reference clocksource is accurate. In reality, one has to resort to disabling the reference clocksource experiencing time skew or, at least, decreasing the rating of that clocksource.

> +static bool watchdog_check_freq(struct clocksource *cs, bool reset_pending)
> +{
> +		/*
> +		 * Calculate and validate the skew against the allowed PPM
> +		 * value of the maximum delta plus the watchdog readout
> +		 * time.
> +		 */
> +		if (abs(wd_delta - cs_delta) < (max_delta >> ppm_shift) + wd_seq)
> +			return true;

Making the threshold proportional to the length of the interval resolves the issue with the (previously) fixed threshold and the interval being stretched on account of the timer running later than when it was meant to expire.

> +static void watchdog_check_result(struct clocksource *cs)
>  {
> -	struct clocksource *cs;
> +	switch (watchdog_data.result) {
> +	case WD_SUCCESS:
> +		clocksource_tick_stable(cs);
> +		clocksource_enable_highres(cs);
> +		return;
>  
> -	list_for_each_entry(cs, &watchdog_list, wd_list)
> +	case WD_FREQ_TIMEOUT:
> +		watchdog_print_freq_timeout(cs);
> +		/* Try again later and invalidate the reference timestamps. */
>  		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> -}
> +		return;

I like that the new clocksource watchdog is far less punishing. A clocksource may be marked unstable only when the readout latency is below 50 us (and there is time skew or unsynchronized CPU sockets). There is no need for skipping watchdog checks to mitigate the clocksource being marked unstable on account of quite possibly unrelated readout latency, SMIs or vCPU preemption.

-- 
Jiri Wiesner
SUSE Labs

  parent reply	other threads:[~2026-02-25 18:13 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-23 23:17 [patch 0/5] clocksource: Rewrite clocksource watchdog and related cleanups Thomas Gleixner
2026-01-23 23:17 ` [patch 1/5] parisc: Remove unused clocksource flags Thomas Gleixner
2026-01-24  8:40   ` Helge Deller
2026-03-12 11:25   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2026-01-23 23:17 ` [patch 2/5] MIPS: Dont select CLOCKSOURCE_WATCHDOG Thomas Gleixner
2026-01-24 22:28   ` Maciej W. Rozycki
2026-01-26  9:10     ` Thomas Gleixner
2026-03-12 11:25   ` [tip: timers/core] MIPS: Don't " tip-bot2 for Thomas Gleixner
2026-01-23 23:17 ` [patch 3/5] x86/tsc: Handle CLOCK_SOURCE_VALID_FOR_HRES correctly Thomas Gleixner
2026-03-12 11:25   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2026-01-23 23:17 ` [patch 4/5] clocksource: Dont use non-continuous clocksources as watchdog Thomas Gleixner
2026-03-12 11:25   ` [tip: timers/core] clocksource: Don't " tip-bot2 for Thomas Gleixner
2026-01-23 23:18 ` [patch 5/5] clocksource: Rewrite watchdog code completely Thomas Gleixner
2026-02-02  6:45   ` Daniel J Blueman
2026-02-02 11:27     ` Thomas Gleixner
2026-02-15 12:18       ` Daniel J Blueman
2026-02-23 13:53         ` Thomas Gleixner
2026-03-08  9:53           ` Thomas Gleixner
2026-03-15 14:59           ` Daniel J Blueman
2026-03-17  9:01             ` Thomas Gleixner
2026-03-18 14:10               ` Daniel J Blueman
2026-03-19 20:31                 ` Thomas Gleixner
2026-03-20  2:21                   ` Daniel J Blueman
2026-03-20  8:26               ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2026-03-20 12:42               ` tip-bot2 for Thomas Gleixner
2026-02-25 18:13   ` Jiri Wiesner [this message]
2026-03-08 10:05     ` [patch 5/5] " Thomas Gleixner
2026-03-11 13:12       ` Jiri Wiesner
2026-03-09 15:43   ` Borislav Petkov
2026-03-11  7:58     ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aZ87wpdHJ5vajYoL@incl \
    --to=jwiesner@suse.de \
    --cc=daniel.lezcano@linaro.org \
    --cc=daniel@quora.org \
    --cc=deller@gmx.de \
    --cc=gautham.shenoy@amd.com \
    --cc=jstultz@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=linux-parisc@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=sboyd@kernel.org \
    --cc=scott.hamilton@eviden.com \
    --cc=tglx@kernel.org \
    --cc=tsbogend@alpha.franken.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox