From: Thomas Gleixner <tglx@kernel.org>
To: Daniel J Blueman <daniel@quora.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
"Paul E. McKenney" <paulmck@kernel.org>,
John Stultz <jstultz@google.com>,
Waiman Long <longman@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Daniel Lezcano <daniel.lezcano@linaro.org>,
Stephen Boyd <sboyd@kernel.org>,
x86@kernel.org, "Gautham R. Shenoy" <gautham.shenoy@amd.com>,
Jiri Wiesner <jwiesner@suse.de>,
Scott Hamilton <scott.hamilton@eviden.com>,
Helge Deller <deller@gmx.de>,
linux-parisc@vger.kernel.org,
Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
linux-mips@vger.kernel.org
Subject: Re: [patch 5/5] clocksource: Rewrite watchdog code completely
Date: Mon, 02 Feb 2026 12:27:19 +0100 [thread overview]
Message-ID: <87jywvfkrs.ffs@tglx> (raw)
In-Reply-To: <CAMVG2ssXZKmw-YTKB5=CvhEofKeyEfaBCDZbyzfUcm2+P5rQsQ@mail.gmail.com>
On Mon, Feb 02 2026 at 14:45, Daniel J. Blueman wrote:
> Great work Thomas!
Thank you!
> On Sat, 24 Jan 2026 at 07:18, Thomas Gleixner <tglx@kernel.org> wrote:
>> 2) Compare the TSCs of the other CPUs in a round robin fashion against
>> the boot CPU in the same way the TSC synchronization on CPU hotplug
>> works. This still can suffer from delayed reaction of the remote CPU
>> to the SMP function call and the latency of the control variable cache
>> line. But this latency is not affecting correctness. It only affects
>> the accuracy. With low contention the readout latency is in the low
>> nanoseconds range, which detects even slight skews between CPUs. Under
>> high contention this becomes obviously less accurate, but still
>> detects slow skews reliably as it solely relies on subsequent readouts
>> being monotonically increasing. It just can take slightly longer to
>> detect the issue.
>
> On x86, I agree iterating at a per-thread level is needed rather than
> one thread per NUMA node, since the TSC_ADJUST architectural MSR is
> per-core and we want detection completeness.
It's per thread, not per core.
But that aside, the TSC_ADJUST integrity is already self-monitored
independently of the watchdog. See tsc_verify_tsc_adjust(). So we might
get away with a per-socket check, as all threads of a socket are fed by
the same ART (Always Running Timer) and the main concern is that the
ARTs of different sockets drift apart, especially on systems with more
than four sockets.
> On other architectures, completeness could be traded off for lower
> overhead if it is guaranteed each processor thread uses the same clock
> value, though this actually is what the clocksource watchdog seeks
> to validate, so agreed on the current approach there too.
x86 is the only one which actually utilizes the watchdog.
>> +/* Maximum time between two watchdog readouts */
>> +#define WATCHDOG_READOUT_MAX_NS (50 * NSEC_PER_USEC)
> At 1920 threads, the default timeout threshold of 20us triggers
> continuous warnings at idle, however 1000us causes none under an 8
> hour adverse workload [1]; no HPET fallback was seen. A 500us
> threshold causes a low rate of timeouts [2] (overhead amplified due to
> retries), thus 1000us adds margin and should prevent retries.
Right. Idle is definitely an issue when the remote CPU is in a deep
C-state.
My concern is that the control CPU might spin there for a millisecond
with interrupts disabled, which is not really desirable, especially on
RT systems.
Something like the untested below delta patch should work.
Thanks,
tglx
---
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -7,6 +7,7 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/delay.h>
#include <linux/device.h>
#include <linux/clocksource.h>
#include <linux/init.h>
@@ -124,7 +125,8 @@ static atomic_t watchdog_reset_pending;
#define WATCHDOG_INTERVAL_NS (WATCHDOG_INTERVAL * (NSEC_PER_SEC / HZ))
/* Maximum time between two watchdog readouts */
-#define WATCHDOG_READOUT_MAX_NS (50 * NSEC_PER_USEC)
+#define WATCHDOG_READOUT_MAX_US 50
+#define WATCHDOG_READOUT_MAX_NS (WATCHDOG_READOUT_MAX_US * NSEC_PER_USEC)
/* Shift values to calculate the approximate $N ppm of a given delta. */
#define SHIFT_500PPM 11
@@ -136,6 +138,9 @@ static atomic_t watchdog_reset_pending;
/* Five reads local and remote for inter CPU skew detection */
#define WATCHDOG_REMOTE_MAX_SEQ 10
+/* Number of attempts to synchronize with a remote CPU */
+#define WATCHDOG_REMOTE_RETRIES 10
+
static inline void clocksource_watchdog_lock(unsigned long *flags)
{
spin_lock_irqsave(&watchdog_lock, *flags);
@@ -336,22 +341,17 @@ static void watchdog_check_skew_remote(v
atomic_dec(&wd->remote_inprogress);
}
-static void watchdog_check_cpu_skew(struct clocksource *cs)
+static inline bool wd_csd_locked(struct watchdog_cpu_data *wd)
{
- unsigned int cpu = cpumask_next_wrap(watchdog_data.curr_cpu, cpu_online_mask);
- struct watchdog_cpu_data *wd;
-
- watchdog_data.curr_cpu = cpu;
- /* Skip the current CPU. Handles num_online_cpus() == 1 as well */
- if (cpu == smp_processor_id())
- return;
+ return READ_ONCE(wd->csd.node.u_flags) & CSD_FLAG_LOCK;
+}
- /* Don't interfere with the test mechanics */
- if ((cs->flags & CLOCK_SOURCE_WDTEST) && !(cs->flags & CLOCK_SOURCE_WDTEST_PERCPU))
- return;
+static void __watchdog_check_cpu_skew(struct clocksource *cs, unsigned int cpu)
+{
+ struct watchdog_cpu_data *wd;
wd = per_cpu_ptr(&watchdog_cpu_data, cpu);
- if (atomic_read(&wd->remote_inprogress)) {
+ if (atomic_read(&wd->remote_inprogress) || wd_csd_locked(wd)) {
watchdog_data.result = WD_CPU_TIMEOUT;
return;
}
@@ -377,6 +377,29 @@ static void watchdog_check_cpu_skew(stru
}
}
+static void watchdog_check_cpu_skew(struct clocksource *cs)
+{
+ unsigned int cpu = cpumask_next_wrap(watchdog_data.curr_cpu, cpu_online_mask);
+
+ watchdog_data.curr_cpu = cpu;
+ /* Skip the current CPU. Handles num_online_cpus() == 1 as well */
+ if (cpu == smp_processor_id())
+ return;
+
+ /* Don't interfere with the test mechanics */
+ if ((cs->flags & CLOCK_SOURCE_WDTEST) && !(cs->flags & CLOCK_SOURCE_WDTEST_PERCPU))
+ return;
+
+ for (int i = 0; i < WATCHDOG_REMOTE_RETRIES; i++) {
+ __watchdog_check_cpu_skew(cs, cpu);
+
+ if (watchdog_data.result != WD_CPU_TIMEOUT)
+ return;
+
+ udelay(WATCHDOG_READOUT_MAX_US);
+ }
+}
+
static bool watchdog_check_freq(struct clocksource *cs, bool reset_pending)
{
unsigned int ppm_shift = SHIFT_4000PPM;