From: Thomas Gleixner <tglx@kernel.org>
To: Daniel J Blueman <daniel@quora.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	John Stultz <jstultz@google.com>,
	Waiman Long <longman@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Daniel Lezcano <daniel.lezcano@linaro.org>,
	Stephen Boyd <sboyd@kernel.org>,
	x86@kernel.org, "Gautham R. Shenoy" <gautham.shenoy@amd.com>,
	Jiri Wiesner <jwiesner@suse.de>,
	Scott Hamilton <scott.hamilton@eviden.com>,
	Helge Deller <deller@gmx.de>,
	linux-parisc@vger.kernel.org,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	linux-mips@vger.kernel.org
Subject: Re: [patch 5/5] clocksource: Rewrite watchdog code completely
Date: Mon, 02 Feb 2026 12:27:19 +0100	[thread overview]
Message-ID: <87jywvfkrs.ffs@tglx> (raw)
In-Reply-To: <CAMVG2ssXZKmw-YTKB5=CvhEofKeyEfaBCDZbyzfUcm2+P5rQsQ@mail.gmail.com>

On Mon, Feb 02 2026 at 14:45, Daniel J. Blueman wrote:
> Great work Thomas!

Thank you!

> On Sat, 24 Jan 2026 at 07:18, Thomas Gleixner <tglx@kernel.org> wrote:
>>   2) Compare the TSCs of the other CPUs in a round robin fashion against
>>      the boot CPU in the same way the TSC synchronization on CPU hotplug
>>      works. This still can suffer from delayed reaction of the remote CPU
>>      to the SMP function call and the latency of the control variable cache
>>      line. But this latency is not affecting correctness. It only affects
>>      the accuracy. With low contention the readout latency is in the low
>>      nanoseconds range, which detects even slight skews between CPUs. Under
>>      high contention this becomes obviously less accurate, but still
>>      detects slow skews reliably as it solely relies on subsequent readouts
>>      being monotonically increasing. It just can take slightly longer to
>>      detect the issue.
>
> On x86, I agree iterating at a per-thread level is needed rather than
> one thread per NUMA node, since the TSC_ADJUST architectural MSR is
> per-core and we want detection completeness.

It's per thread, not per core.

But that aside, the TSC_ADJUST integrity is already self-monitored
independently of the watchdog. See tsc_verify_tsc_adjust(). So we might
get away with a per-socket check, as all threads of a socket are fed by
the same ART (Always Running Timer) and the main concern is that the
ARTs of different sockets drift apart, especially on systems with more
than four sockets.

> On other architectures, completeness could be traded off for lower
> overhead if it were guaranteed that each processor thread uses the
> same clock value, though that is exactly what the clocksource watchdog
> seeks to validate, so agreed on the current approach there too.

x86 is the only one which actually utilizes the watchdog.

>> +/* Maximum time between two watchdog readouts */
>> +#define WATCHDOG_READOUT_MAX_NS                (50 * NSEC_PER_USEC)

> At 1920 threads, the default timeout threshold of 20us triggers
> continuous warnings at idle; however, 1000us causes none under an
> 8-hour adverse workload [1], and no HPET fallback was seen. A 500us
> threshold causes a low rate of timeouts [2] (overhead amplified due to
> retries), so 1000us adds margin and should prevent retries.

Right. Idle is definitely an issue when the remote CPU is in a deep
C-state.

My concern is that the control CPU might spin there for a millisecond
with interrupts disabled, which is not really desirable, especially on
RT systems.

Something like the untested below delta patch should work.

Thanks,

        tglx
---
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -7,6 +7,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/delay.h>
 #include <linux/device.h>
 #include <linux/clocksource.h>
 #include <linux/init.h>
@@ -124,7 +125,8 @@ static atomic_t watchdog_reset_pending;
 #define WATCHDOG_INTERVAL_NS		(WATCHDOG_INTERVAL * (NSEC_PER_SEC / HZ))
 
 /* Maximum time between two watchdog readouts */
-#define WATCHDOG_READOUT_MAX_NS		(50 * NSEC_PER_USEC)
+#define WATCHDOG_READOUT_MAX_US		50
+#define WATCHDOG_READOUT_MAX_NS		(WATCHDOG_READOUT_MAX_US * NSEC_PER_USEC)
 
 /* Shift values to calculate the approximate $N ppm of a given delta. */
 #define SHIFT_500PPM			11
@@ -136,6 +138,9 @@ static atomic_t watchdog_reset_pending;
 /* Five reads local and remote for inter CPU skew detection */
 #define WATCHDOG_REMOTE_MAX_SEQ		10
 
+/* Number of attempts to synchronize with a remote CPU */
+#define WATCHDOG_REMOTE_RETRIES		10
+
 static inline void clocksource_watchdog_lock(unsigned long *flags)
 {
 	spin_lock_irqsave(&watchdog_lock, *flags);
@@ -336,22 +341,17 @@ static void watchdog_check_skew_remote(v
 	atomic_dec(&wd->remote_inprogress);
 }
 
-static void watchdog_check_cpu_skew(struct clocksource *cs)
+static inline bool wd_csd_locked(struct watchdog_cpu_data *wd)
 {
-	unsigned int cpu = cpumask_next_wrap(watchdog_data.curr_cpu, cpu_online_mask);
-	struct watchdog_cpu_data *wd;
-
-	watchdog_data.curr_cpu = cpu;
-	/* Skip the current CPU. Handles num_online_cpus() == 1 as well */
-	if (cpu == smp_processor_id())
-		return;
+	return READ_ONCE(wd->csd.node.u_flags) & CSD_FLAG_LOCK;
+}
 
-	/* Don't interfere with the test mechanics */
-	if ((cs->flags & CLOCK_SOURCE_WDTEST) && !(cs->flags & CLOCK_SOURCE_WDTEST_PERCPU))
-		return;
+static void __watchdog_check_cpu_skew(struct clocksource *cs, unsigned int cpu)
+{
+	struct watchdog_cpu_data *wd;
 
 	wd = per_cpu_ptr(&watchdog_cpu_data, cpu);
-	if (atomic_read(&wd->remote_inprogress)) {
+	if (atomic_read(&wd->remote_inprogress) || wd_csd_locked(wd)) {
 		watchdog_data.result = WD_CPU_TIMEOUT;
 		return;
 	}
@@ -377,6 +377,29 @@ static void watchdog_check_cpu_skew(stru
 	}
 }
 
+static void watchdog_check_cpu_skew(struct clocksource *cs)
+{
+	unsigned int cpu = cpumask_next_wrap(watchdog_data.curr_cpu, cpu_online_mask);
+
+	watchdog_data.curr_cpu = cpu;
+	/* Skip the current CPU. Handles num_online_cpus() == 1 as well */
+	if (cpu == smp_processor_id())
+		return;
+
+	/* Don't interfere with the test mechanics */
+	if ((cs->flags & CLOCK_SOURCE_WDTEST) && !(cs->flags & CLOCK_SOURCE_WDTEST_PERCPU))
+		return;
+
+	for (int i = 0; i < WATCHDOG_REMOTE_RETRIES; i++) {
+		__watchdog_check_cpu_skew(cs, cpu);
+
+		if (watchdog_data.result != WD_CPU_TIMEOUT)
+			return;
+
+		udelay(WATCHDOG_READOUT_MAX_US);
+	}
+}
+
 static bool watchdog_check_freq(struct clocksource *cs, bool reset_pending)
 {
 	unsigned int ppm_shift = SHIFT_4000PPM;
