public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@kernel.org>
To: Daniel J Blueman <daniel@quora.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	John Stultz <jstultz@google.com>,
	Waiman Long <longman@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Daniel Lezcano <daniel.lezcano@linaro.org>,
	Stephen Boyd <sboyd@kernel.org>,
	x86@kernel.org, "Gautham R. Shenoy" <gautham.shenoy@amd.com>,
	Jiri Wiesner <jwiesner@suse.de>,
	Scott Hamilton <scott.hamilton@eviden.com>,
	Helge Deller <deller@gmx.de>,
	linux-parisc@vger.kernel.org,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	linux-mips@vger.kernel.org
Subject: Re: [patch 5/5] clocksource: Rewrite watchdog code completely
Date: Mon, 02 Feb 2026 12:27:19 +0100	[thread overview]
Message-ID: <87jywvfkrs.ffs@tglx> (raw)
In-Reply-To: <CAMVG2ssXZKmw-YTKB5=CvhEofKeyEfaBCDZbyzfUcm2+P5rQsQ@mail.gmail.com>

On Mon, Feb 02 2026 at 14:45, Daniel J. Blueman wrote:
> Great work Thomas!

Thank you!

> On Sat, 24 Jan 2026 at 07:18, Thomas Gleixner <tglx@kernel.org> wrote:
>>   2) Compare the TSCs of the other CPUs in a round robin fashion against
>>      the boot CPU in the same way the TSC synchronization on CPU hotplug
>>      works. This still can suffer from delayed reaction of the remote CPU
>>      to the SMP function call and the latency of the control variable cache
>>      line. But this latency is not affecting correctness. It only affects
>>      the accuracy. With low contention the readout latency is in the low
>>      nanoseconds range, which detects even slight skews between CPUs. Under
>>      high contention this becomes obviously less accurate, but still
>>      detects slow skews reliably as it solely relies on subsequent readouts
>>      being monotonically increasing. It just can take slightly longer to
>>      detect the issue.
>
> On x86, I agree iterating at a per-thread level is needed rather than
> one thread per NUMA node, since the TSC_ADJUST architectural MSR is
> per-core and we want detection completeness.

It's per thread not per core.

But that aside the TSC_ADJUST integrity is already self monitored
independent of the watchdog. See tsc_verify_tsc_adjust(). So we might
get away with a per socket check as all threads of a socket are fed by
the same ART (Always Running Timer) and the main concern is that the
ARTs of sockets drift apart especially on systems with more than four
sockets.

> On other architectures, completeness could be traded off for lower
> overhead if it is guaranteed each processor thread uses the same clock
> value, though this is actually is what the clocksource watchdog seeks
> to validate, so agreed on the current approach there too.

x86 is the only one which actually utilizes the watchdog.

>> +/* Maximum time between two watchdog readouts */
>> +#define WATCHDOG_READOUT_MAX_NS                (50 * NSEC_PER_USEC)

> At 1920 threads, the default timeout threshold of 20us triggers
> continuous warnings at idle, however 1000us causes none under an 8
> hour adverse workload [1]; no HPET fallback was seen. A 500us
> threshold causes a low rate of timeouts [2] (overhead amplified due to
> retries), thus 1000us adds margin and should prevent retries.

Right. Idle is definitely an issue when the remote CPU is in a deep
C-state.

My concern is that the control CPU might spin there for a millisecond
with interrupts disabled, which is not really desired especially not on
RT systems.

Something like the untested below delta patch should work.

Thanks,

        tglx
---
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -7,6 +7,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/delay.h>
 #include <linux/device.h>
 #include <linux/clocksource.h>
 #include <linux/init.h>
@@ -124,7 +125,8 @@ static atomic_t watchdog_reset_pending;
 #define WATCHDOG_INTERVAL_NS		(WATCHDOG_INTERVAL * (NSEC_PER_SEC / HZ))
 
 /* Maximum time between two watchdog readouts */
-#define WATCHDOG_READOUT_MAX_NS		(50 * NSEC_PER_USEC)
+#define WATCHDOG_READOUT_MAX_US		50
+#define WATCHDOG_READOUT_MAX_NS		(WATCHDOG_READOUT_MAX_US * NSEC_PER_USEC)
 
 /* Shift values to calculate the approximate $N ppm of a given delta. */
 #define SHIFT_500PPM			11
@@ -136,6 +138,9 @@ static atomic_t watchdog_reset_pending;
 /* Five reads local and remote for inter CPU skew detection */
 #define WATCHDOG_REMOTE_MAX_SEQ		10
 
+/* Number of attempts to synchronize with a remote CPU */
+#define WATCHDOG_REMOTE_RETRIES		10
+
 static inline void clocksource_watchdog_lock(unsigned long *flags)
 {
 	spin_lock_irqsave(&watchdog_lock, *flags);
@@ -336,22 +341,17 @@ static void watchdog_check_skew_remote(v
 	atomic_dec(&wd->remote_inprogress);
 }
 
-static void watchdog_check_cpu_skew(struct clocksource *cs)
+static inline bool wd_csd_locked(struct watchdog_cpu_data *wd)
 {
-	unsigned int cpu = cpumask_next_wrap(watchdog_data.curr_cpu, cpu_online_mask);
-	struct watchdog_cpu_data *wd;
-
-	watchdog_data.curr_cpu = cpu;
-	/* Skip the current CPU. Handles num_online_cpus() == 1 as well */
-	if (cpu == smp_processor_id())
-		return;
+	return READ_ONCE(wd->csd.node.u_flags) & CSD_FLAG_LOCK;
+}
 
-	/* Don't interfere with the test mechanics */
-	if ((cs->flags & CLOCK_SOURCE_WDTEST) && !(cs->flags & CLOCK_SOURCE_WDTEST_PERCPU))
-		return;
+static void __watchdog_check_cpu_skew(struct clocksource *cs, unsigned int cpu)
+{
+	struct watchdog_cpu_data *wd;
 
 	wd = per_cpu_ptr(&watchdog_cpu_data, cpu);
-	if (atomic_read(&wd->remote_inprogress)) {
+	if (atomic_read(&wd->remote_inprogress) || wd_csd_locked(wd)) {
 		watchdog_data.result = WD_CPU_TIMEOUT;
 		return;
 	}
@@ -377,6 +377,29 @@ static void watchdog_check_cpu_skew(stru
 	}
 }
 
+static void watchdog_check_cpu_skew(struct clocksource *cs)
+{
+	unsigned int cpu = cpumask_next_wrap(watchdog_data.curr_cpu, cpu_online_mask);
+
+	watchdog_data.curr_cpu = cpu;
+	/* Skip the current CPU. Handles num_online_cpus() == 1 as well */
+	if (cpu == smp_processor_id())
+		return;
+
+	/* Don't interfere with the test mechanics */
+	if ((cs->flags & CLOCK_SOURCE_WDTEST) && !(cs->flags & CLOCK_SOURCE_WDTEST_PERCPU))
+		return;
+
+	for (int i = 0; i < WATCHDOG_REMOTE_RETRIES; i++) {
+		__watchdog_check_cpu_skew(cs, cpu);
+
+		if (watchdog_data.result != WD_CPU_TIMEOUT)
+			return;
+
+		udelay(WATCHDOG_READOUT_MAX_US);
+	}
+}
+
 static bool watchdog_check_freq(struct clocksource *cs, bool reset_pending)
 {
 	unsigned int ppm_shift = SHIFT_4000PPM;

  reply	other threads:[~2026-02-02 11:27 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-23 23:17 [patch 0/5] clocksource: Rewrite clocksource watchdog and related cleanups Thomas Gleixner
2026-01-23 23:17 ` [patch 1/5] parisc: Remove unused clocksource flags Thomas Gleixner
2026-01-24  8:40   ` Helge Deller
2026-03-12 11:25   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2026-01-23 23:17 ` [patch 2/5] MIPS: Dont select CLOCKSOURCE_WATCHDOG Thomas Gleixner
2026-01-24 22:28   ` Maciej W. Rozycki
2026-01-26  9:10     ` Thomas Gleixner
2026-03-12 11:25   ` [tip: timers/core] MIPS: Don't " tip-bot2 for Thomas Gleixner
2026-01-23 23:17 ` [patch 3/5] x86/tsc: Handle CLOCK_SOURCE_VALID_FOR_HRES correctly Thomas Gleixner
2026-03-12 11:25   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2026-01-23 23:17 ` [patch 4/5] clocksource: Dont use non-continuous clocksources as watchdog Thomas Gleixner
2026-03-12 11:25   ` [tip: timers/core] clocksource: Don't " tip-bot2 for Thomas Gleixner
2026-01-23 23:18 ` [patch 5/5] clocksource: Rewrite watchdog code completely Thomas Gleixner
2026-02-02  6:45   ` Daniel J Blueman
2026-02-02 11:27     ` Thomas Gleixner [this message]
2026-02-15 12:18       ` Daniel J Blueman
2026-02-23 13:53         ` Thomas Gleixner
2026-03-08  9:53           ` Thomas Gleixner
2026-03-15 14:59           ` Daniel J Blueman
2026-03-17  9:01             ` Thomas Gleixner
2026-03-18 14:10               ` Daniel J Blueman
2026-03-19 20:31                 ` Thomas Gleixner
2026-03-20  2:21                   ` Daniel J Blueman
2026-03-20  8:26               ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2026-03-20 12:42               ` tip-bot2 for Thomas Gleixner
2026-02-25 18:13   ` [patch 5/5] " Jiri Wiesner
2026-03-08 10:05     ` Thomas Gleixner
2026-03-11 13:12       ` Jiri Wiesner
2026-03-09 15:43   ` Borislav Petkov
2026-03-11  7:58     ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87jywvfkrs.ffs@tglx \
    --to=tglx@kernel.org \
    --cc=daniel.lezcano@linaro.org \
    --cc=daniel@quora.org \
    --cc=deller@gmx.de \
    --cc=gautham.shenoy@amd.com \
    --cc=jstultz@google.com \
    --cc=jwiesner@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=linux-parisc@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=sboyd@kernel.org \
    --cc=scott.hamilton@eviden.com \
    --cc=tsbogend@alpha.franken.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox