All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3] clocksource: Scale the max retry number of watchdog read according to CPU numbers
@ 2024-01-29 13:45 Feng Tang
  2024-01-30 13:55 ` Paul E. McKenney
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Feng Tang @ 2024-01-29 13:45 UTC (permalink / raw)
  To: John Stultz, Thomas Gleixner, Stephen Boyd, paulmck, Waiman Long,
	Peter Zijlstra, linux-kernel
  Cc: Feng Tang, Jin Wang

There was a bug on one 8-socket server that the TSC is wrongly marked as
'unstable' and disabled during boot time (reproduce rate is about every
120 rounds of reboot tests), with log:

    clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns,
    wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable
    tsc: Marking TSC unstable due to clocksource watchdog
    TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
    sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152)
    clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896.
    clocksource: Switched to clocksource hpet

The reason is for platform with lots of CPU, there are sporadic big or huge
read latency of read watchog/clocksource during boot or when system is under
stress work load, and the frequency and maximum value of the latency goes up
with the increasing of CPU numbers. Current code already has logic to detect
and filter such high latency case by reading 3 times of watchdog, and check
the 2 deltas. Due to the randomness of the latency, there is a low possibility
situation that the first delta (latency) is big, but the second delta is small
and looks valid, which can escape from the check, and there is a
'max_cswd_read_retries' for retrying that check covering this case, whose
default value is only 2 and may be not enough for machines with huge number
of CPUs.

So scale and enlarge the max retry number according to CPU number to better
filter those latency noise for large systems, which has been verified fine
in 4 days and 670 rounds of reboot test on the 8-socket machine.

Also add sanity check for user input value for 'max_cswd_read_retries', make
it self-adaptive by default, and provide a general helper for getting this
max retry number as suggested by Paul and Waiman.

Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Tested-by: Jin Wang <jin1.wang@intel.com>
---
Changelog:

    since v2:
      * Fix the unexported symbol of helper function being used by
        kernel module issue (Waiman)

    since v1:
      * Add santity check for user input value of 'max_cswd_read_retries'
        and a helper function for getting max retry nubmer (Paul)
      * Apply the same logic to watchdog test code (Waiman)

 include/linux/clocksource.h      | 18 +++++++++++++++++-
 kernel/time/clocksource-wdtest.c | 12 +++++++-----
 kernel/time/clocksource.c        | 10 ++++++----
 3 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 1d42d4b17327..0483f7dd66a3 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -291,7 +291,23 @@ static inline void timer_probe(void) {}
 #define TIMER_ACPI_DECLARE(name, table_id, fn)		\
 	ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
 
-extern ulong max_cswd_read_retries;
+extern long max_cswd_read_retries;
+
+static inline long clocksource_max_watchdog_read_retries(void)
+{
+	long max_retries = max_cswd_read_retries;
+
+	if (max_cswd_read_retries <= 0) {
+		/* santity check for user input value */
+		if (max_cswd_read_retries != -1)
+			pr_warn_once("max_cswd_read_retries was set with an invalid number: %ld\n",
+				max_cswd_read_retries);
+
+		max_retries = ilog2(num_online_cpus()) + 1;
+	}
+	return max_retries;
+}
+
 void clocksource_verify_percpu(struct clocksource *cs);
 
 #endif /* _LINUX_CLOCKSOURCE_H */
diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
index df922f49d171..c70cea3c44a1 100644
--- a/kernel/time/clocksource-wdtest.c
+++ b/kernel/time/clocksource-wdtest.c
@@ -106,6 +106,7 @@ static int wdtest_func(void *arg)
 	unsigned long j1, j2;
 	char *s;
 	int i;
+	long max_retries;
 
 	schedule_timeout_uninterruptible(holdoff * HZ);
 
@@ -139,18 +140,19 @@ static int wdtest_func(void *arg)
 	WARN_ON_ONCE(time_before(j2, j1 + NSEC_PER_USEC));
 
 	/* Verify tsc-like stability with various numbers of errors injected. */
-	for (i = 0; i <= max_cswd_read_retries + 1; i++) {
-		if (i <= 1 && i < max_cswd_read_retries)
+	max_retries = clocksource_max_watchdog_read_retries();
+	for (i = 0; i <= max_retries + 1; i++) {
+		if (i <= 1 && i < max_retries)
 			s = "";
-		else if (i <= max_cswd_read_retries)
+		else if (i <= max_retries)
 			s = ", expect message";
 		else
 			s = ", expect clock skew";
-		pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
+		pr_info("--- Watchdog with %dx error injection, %ld retries%s.\n", i, max_retries, s);
 		WRITE_ONCE(wdtest_ktime_read_ndelays, i);
 		schedule_timeout_uninterruptible(2 * HZ);
 		WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
-		WARN_ON_ONCE((i <= max_cswd_read_retries) !=
+		WARN_ON_ONCE((i <= max_retries) !=
 			     !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
 		wdtest_ktime_clocksource_reset();
 	}
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c108ed8a9804..2e5a1d6c6712 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -208,8 +208,8 @@ void clocksource_mark_unstable(struct clocksource *cs)
 	spin_unlock_irqrestore(&watchdog_lock, flags);
 }
 
-ulong max_cswd_read_retries = 2;
-module_param(max_cswd_read_retries, ulong, 0644);
+long max_cswd_read_retries = -1;
+module_param(max_cswd_read_retries, long, 0644);
 EXPORT_SYMBOL_GPL(max_cswd_read_retries);
 static int verify_n_cpus = 8;
 module_param(verify_n_cpus, int, 0644);
@@ -225,8 +225,10 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
 	unsigned int nretries;
 	u64 wd_end, wd_end2, wd_delta;
 	int64_t wd_delay, wd_seq_delay;
+	long max_retries;
 
-	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
+	max_retries = clocksource_max_watchdog_read_retries();
+	for (nretries = 0; nretries <= max_retries; nretries++) {
 		local_irq_disable();
 		*wdnow = watchdog->read(watchdog);
 		*csnow = cs->read(cs);
@@ -238,7 +240,7 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
 		wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
 					      watchdog->shift);
 		if (wd_delay <= WATCHDOG_MAX_SKEW) {
-			if (nretries > 1 || nretries >= max_cswd_read_retries) {
+			if (nretries > 1 || nretries >= max_retries) {
 				pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
 					smp_processor_id(), watchdog->name, nretries);
 			}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-02-20 17:09 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-29 13:45 [PATCH v3] clocksource: Scale the max retry number of watchdog read according to CPU numbers Feng Tang
2024-01-30 13:55 ` Paul E. McKenney
2024-01-31  0:53 ` Waiman Long
2024-02-19 11:32 ` Thomas Gleixner
2024-02-19 14:37   ` Feng Tang
2024-02-20  2:20     ` Waiman Long
2024-02-20 15:24       ` Paul E. McKenney
2024-02-20 15:27         ` Feng Tang
2024-02-20 17:09           ` Paul E. McKenney

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.