* [PATCH] x86: UV RTC: Add parameter to disable RTC clocksource
From: Jiri Wiesner @ 2025-07-17 15:44 UTC
  To: Linux Kernel Mailing List
  Cc: Jonathan Corbet, Steve Wahl, Justin Ernst, Kyle Meyer,
	Dimitri Sivanich, Russ Anderson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin

Booting up an 8 NUMA node machine that has a UV RTC clocksource may
result in the TSC being marked unstable by the clocksource watchdog due to
time skew. The failures to verify the TSC happen soon after the current
clocksource is switched to the TSC (usually the watchdog runs twice).
Delaying the checks carried out by the clocksource watchdog until after
the system has booted does not make a difference.

The clocksource watchdog compares two clocksources and it is assumed that
it is always the clocksource being verified that has caused the time skew
measured by the clocksource watchdog. To check the validity of this
assumption, a debugging kernel was used, in which the HPET was added as a
third clocksource. The messages reported by the debugging kernel
indicate that the time skew between the TSC and the HPET was only 22
nanoseconds while the time skew between the TSC and sgi_rtc was 591659
nanoseconds:

clocksource: timekeeping watchdog on CPU176: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource: 'sgi_rtc' wd_nsec: 479339803 wd_now: 1fab695e5a wd_last: 1f9e44dca0 mask: ffffffffffffff
clocksource: 'hpet' wd2_nsec: 479931440 wd2_now: 90a1af85 wd2_last: 8fea9b37 mask: ffffffff
clocksource: 'tsc' cs_nsec: 479931462 cs_now: 944e1c227d cs_last: 9412097879 mask: ffffffffffffffff
clocksource: Clocksource 'tsc' skewed 591659 ns (0 ms) over watchdog 'sgi_rtc' interval of 479339803 ns (479 ms)
clocksource: 'tsc' is current clocksource.
tsc: Marking TSC unstable due to clocksource watchdog
TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
sched_clock: Marking unstable (90731283360, -1108605523)<-(95136368481, -5513690634)
clocksource: Checking clocksource tsc synchronization from CPU 501 to CPUs 0-500,502-767.
clocksource: CPU 501 check durations 1446ns - 32908ns for clocksource tsc.
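
The two skew figures can be reproduced from the interval lengths in the
log above. A minimal sketch in C, where interval_skew_ns() is a
hypothetical helper written for this illustration and not a kernel
function:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the kernel): absolute difference
 * between the interval measured by the clocksource under test and the
 * interval measured by a reference clocksource, both in nanoseconds.
 */
static int64_t interval_skew_ns(int64_t cs_nsec, int64_t ref_nsec)
{
	int64_t delta = cs_nsec - ref_nsec;

	return delta < 0 ? -delta : delta;
}
```

With the logged values, interval_skew_ns(479931462, 479339803) gives the
591659 ns reported against sgi_rtc, and interval_skew_ns(479931462,
479931440) gives the 22 ns measured against the HPET.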

This happened on CPU 176, which resides on NUMA node 3. The interval was
computed from timestamps from CPU 176 and from CPU 175, which also resides
on NUMA node 3. Since the time skew was reported between CPUs residing on
the same NUMA node, it is unlikely that the TSC would experience time skew.

The debugging kernel printed out the last message in
clocksource_verify_percpu() unconditionally, and all CPUs were checked.
None of the CPUs was reported as being behind or ahead of CPU 501. The
last message provides a worst case estimate. The value of 2 * cs_nsec_max
(2 * 32908 ns) is the maximum possible time skew between the TSCs of any
two CPUs on the system, as measured by the TSC sync check. The cs_nsec_max
value itself is an estimate because it includes delays incurred by
executing and servicing an inter-processor interrupt synchronously, which
has a non-negligible cost. The maximum possible time skew (of the TSC) of
66 microseconds does not even approach the size of the time skew measured
by the clocksource watchdog.
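
The worst case estimate above is simple arithmetic. A minimal sketch in C,
assuming a hypothetical helper tsc_sync_bound_ns() that exists only for
this illustration:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the kernel): upper bound on the TSC
 * skew between any two CPUs, taken as twice the longest check duration
 * reported by the TSC sync check.
 */
static uint64_t tsc_sync_bound_ns(uint64_t cs_nsec_max)
{
	return 2 * cs_nsec_max;
}
```

For cs_nsec_max = 32908 ns the bound is 65816 ns, roughly one ninth of the
591659 ns skew reported by the clocksource watchdog.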

Testing has shown that the HPET is more stable than sgi_rtc, so the HPET
is a better choice for verifying the TSC. Disabling the sgi_rtc
clocksource was implemented as a workaround. The name of the parameter was
inspired by 581f202bcd60 ("x86: UV RTC: Always enable RTC clocksource")
and the fact that there also is a nohpet parameter and a notsc parameter.
This patch also documents the existing uvrtcevt parameter.

Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
---
 Documentation/admin-guide/kernel-parameters.txt |  4 ++++
 arch/x86/platform/uv/uv_time.c                  | 11 ++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 07e22ba5bfe3..9839257181e3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4302,6 +4302,8 @@
 			This is required for the Braillex ib80-piezo Braille
 			reader made by F.H. Papenmeier (Germany).
 
+	nouvrtc		[X86] Disable the UV RTC clocksource (SGI RTC clock).
+
 	nosgx		[X86-64,SGX,EARLY] Disables Intel SGX kernel support.
 
 	nosmap		[PPC,EARLY]
@@ -7839,6 +7841,8 @@
 				16 - SIGBUS faults
 			Example: user_debug=31
 
+	uvrtcevt	[X86] Use UV RTC clock events (SGI RTC clock) for timers.
+
 	vdso=		[X86,SH,SPARC]
 			On X86_32, this is an alias for vdso32=.  Otherwise:
 
diff --git a/arch/x86/platform/uv/uv_time.c b/arch/x86/platform/uv/uv_time.c
index 3712afc3534d..03d59b87c371 100644
--- a/arch/x86/platform/uv/uv_time.c
+++ b/arch/x86/platform/uv/uv_time.c
@@ -61,6 +61,7 @@ struct uv_rtc_timer_head {
  */
 static struct uv_rtc_timer_head		**blade_info __read_mostly;
 
+static int				uv_rtc_enable = 1;
 static int				uv_rtc_evt_enable;
 
 /*
@@ -321,6 +322,14 @@ static void uv_rtc_interrupt(void)
 	ced->event_handler(ced);
 }
 
+static int __init uv_disable_rtc(char *str)
+{
+	uv_rtc_enable = 0;
+
+	return 1;
+}
+__setup("nouvrtc", uv_disable_rtc);
+
 static int __init uv_enable_evt_rtc(char *str)
 {
 	uv_rtc_evt_enable = 1;
@@ -342,7 +351,7 @@ static __init int uv_rtc_setup_clock(void)
 {
 	int rc;
 
-	if (!is_uv_system())
+	if (!uv_rtc_enable || !is_uv_system())
 		return -ENODEV;
 
 	rc = clocksource_register_hz(&clocksource_uv, sn_rtc_cycles_per_second);
-- 
2.43.0


-- 
Jiri Wiesner
SUSE Labs


* Re: [PATCH] x86: UV RTC: Add parameter to disable RTC clocksource
From: Dimitri Sivanich @ 2025-07-17 20:50 UTC
  To: Jiri Wiesner
  Cc: Linux Kernel Mailing List, Jonathan Corbet, Steve Wahl,
	Justin Ernst, Kyle Meyer, Dimitri Sivanich, Russ Anderson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin


On the face of it, the patch you're proposing looks OK to me, and continues the
precedent shown in other clocksources.

However, while the HPET may seem like a viable backup clocksource for purposes
of watchdog checking, it won't scale when assigned as an actual clocksource.
The UV RTC when used as an actual clocksource is more scalable than the HPET,
but it does have higher access latency than the TSC.  TSC provides the low
access latency clocksource needed by many applications.

HPE UV hardware is designed to have a reliable and synchronized TSC mechanism.
Comparing the TSC against these secondary clocksources can result in false
positives due to variable access latency caused by system traffic.  The best
course of action against these false positives has been found to be simply
disabling watchdog checking of the TSC.  Currently we recommend that customers
add 'tsc=nowatchdog' to the kernel command line.  Note that this has been
enforced in the kernel for other platforms with the following commits:

commit b50db7095fe002fa3e16605546cba66bf1b68a3e
Author: Feng Tang <feng.79.tang@gmail.com>
Date:   Wed Nov 17 10:37:51 2021 +0800

    x86/tsc: Disable clocksource watchdog for TSC on qualified platorms

commit 233756a640be811efae33763db718fe29753b1e9
Author: Feng Tang <feng.79.tang@gmail.com>
Date:   Wed Jun 7 15:54:33 2023 +0800

    x86/tsc: Extend watchdog check exemption to 4-Sockets platform

commit b4bac279319d3082eb42f074799c7b18ba528c71
Author: Feng Tang <feng.79.tang@gmail.com>
Date:   Mon Jul 29 10:12:02 2024 +0800

    x86/tsc: Use topology_max_packages() to get package number


Going forward, we will likely submit a patch that disables clocksource watchdog
checking for newer UV systems in the kernel as well.



* Re: [PATCH] x86: UV RTC: Add parameter to disable RTC clocksource
From: Jiri Wiesner @ 2025-07-29  9:52 UTC
  To: Dimitri Sivanich
  Cc: Linux Kernel Mailing List, Jonathan Corbet, Steve Wahl,
	Justin Ernst, Kyle Meyer, Dimitri Sivanich, Russ Anderson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

I apologize for the lateness of my reply. I need to fix my mail filtering.

On Thu, Jul 17, 2025 at 03:50:11PM -0500, Dimitri Sivanich wrote:
> However, while the HPET may seem like a viable backup clocksource for purposes
> of watchdog checking, it won't scale when assigned as an actual clocksource.

Agreed.

> The UV RTC when used as an actual clocksource is more scalable than the HPET,
> but it does have higher access latency than the TSC. TSC provides the low
> access latency clocksource needed by many applications.

Agreed.

> HPE UV hardware is designed to have a reliable and synchronized TSC mechanism.  
> Comparing the TSC against these secondary clocksources can result in false
> positives due to variable access latency caused by system traffic.  The best
> course of action against these false positives has been found to simply disable
> watchdog checking of the TSC.  Currently we recommend that customers apply
> 'tsc=nowatchdog' to the kernel command line.

This is what we (SUSE) have instructed our customer to do in this case. But I thought we could do slightly better by disabling the UV RTC and allowing the checks to continue. The HPET is certainly not without flaws and we have reports of cases where the TSC was incorrectly marked unstable when the HPET was being used for verification. Recently, changes were merged that relaxed the thresholds substantially (our reports predate these changes):
> v6.8-rc5-23-g2ed08e4bc532 clocksource: Scale the watchdog read retries automatically
> v6.11-rc1-5-g4ac1dd3245b9 clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin
The thresholds used to be 125,000 ns for the hpet-tsc-hpet read-back delay and 62,500 ns for the hpet-hpet read-back delay. Now they are 750,000 ns and 500,000 ns, respectively. The relaxed thresholds will make most of the false positives (TSC marked unstable) disappear. But I think fixed thresholds are not optimal. I would rather see the thresholds derived from previous measurements, e.g. the threshold value could be derived from the maximum hpet-hpet read-back delay that has been measured by the clocksource watchdog.
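
To illustrate the idea of a measurement-derived threshold, here is a rough sketch in C. It is not existing kernel code: struct wd_delay_stats, both functions, the floor value and the margin factor are assumptions made up for this example.

```c
#include <stdint.h>

/* Hypothetical bookkeeping: largest hpet-hpet read-back delay seen so far. */
struct wd_delay_stats {
	uint64_t max_delay_ns;
};

/* Record one measured read-back delay, keeping only the maximum. */
static void wd_delay_record(struct wd_delay_stats *stats, uint64_t delay_ns)
{
	if (delay_ns > stats->max_delay_ns)
		stats->max_delay_ns = delay_ns;
}

/*
 * Derive the threshold from the recorded maximum, scaled by an assumed
 * safety margin and never dropping below an assumed floor.
 */
static uint64_t wd_delay_threshold(const struct wd_delay_stats *stats,
				   uint64_t floor_ns, unsigned int margin)
{
	uint64_t threshold = stats->max_delay_ns * margin;

	return threshold < floor_ns ? floor_ns : threshold;
}
```

With a floor of 62,500 ns and a margin of 2, the threshold would stay at 62,500 ns until a read-back delay above 31,250 ns has been recorded, and would then follow the measured maximum instead of a fixed constant.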

> commit b50db7095fe002fa3e16605546cba66bf1b68a3e
> Author: Feng Tang <feng.79.tang@gmail.com>
> Date:   Wed Nov 17 10:37:51 2021 +0800
> 
>     x86/tsc: Disable clocksource watchdog for TSC on qualified platorms
> 
> commit 233756a640be811efae33763db718fe29753b1e9
> Author: Feng Tang <feng.79.tang@gmail.com>
> Date:   Wed Jun 7 15:54:33 2023 +0800
> 
>     x86/tsc: Extend watchdog check exemption to 4-Sockets platform

These two patches have been backported to all SLES releases that include the tightened thresholds for clocksource watchdog checks:
> v5.13-rc4-23-gdb3a34e17433 clocksource: Retry clock read if long delays detected
> v5.13-rc4-26-g2e27e793e280 clocksource: Reduce clocksource-skew threshold
> v5.17-rc1-2-gfc153c1c58cb clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
> v6.2-rc1-2-gc37e85c135ce clocksource: Loosen clocksource watchdog constraints

> commit b4bac279319d3082eb42f074799c7b18ba528c71
> Author: Feng Tang <feng.79.tang@gmail.com>
> Date:   Mon Jul 29 10:12:02 2024 +0800
> 
>     x86/tsc: Use topology_max_packages() to get package number

We could not backport this patch because the older SLES releases do not contain T. Gleixner's patchset that made topology_max_packages() provide correct package count during early boot.

> Going forward, we will likely submit a patch that disables clocksource watchdog
> checking for newer UV systems in the kernel as well.

My impression is that the clocksource watchdog has mostly outlived its usefulness. I am aware of three occasions where the switch to the HPET caused by the clocksource watchdog notified customers of a serious issue on their system. The first occasion was a hardware issue involving the CPU not executing instructions for hundreds of microseconds but the counters reflecting the passage of time were still incrementing (as if the CPU "stuttered"). The other two occasions are still under investigation.

If I understand correctly, the point is that it would be more valuable to have the UV RTC available as a clocksource and avoid it being used by the clocksource watchdog for verifying the TSC. If the UV RTC were used as a clocksource, its time skew might become problematic. The largest time skew observed on the 8 socket UV machine was around 700 microseconds per 0.5 second, which is beyond what NTP can correct:
> clocksource: timekeeping watchdog on CPU118: Marking clocksource 'tsc' as unstable because the skew is too large:
> clocksource: 'sgi_rtc' wd_nsec: 511302794 wd_now: 1cb50e4c4b wd_last: 1ca7097111 mask: ffffffffffffff
> clocksource: 'hpet' wd2_nsec: 512005960 wd2_now: 65892719 wd2_last: 64c5d684 mask: ffffffff
> clocksource: 'tsc' cs_nsec: 512006458 cs_now: 86b5982cb1 cs_last: 867581bbab mask: ffffffffffffffff
> clocksource: Clocksource 'tsc' skewed 703664 ns (0 ms) over watchdog 'sgi_rtc' interval of 511302794 ns (511 ms)

> clocksource: timekeeping watchdog on CPU118: Marking clocksource 'tsc' as unstable because the skew is too large:
> clocksource: 'sgi_rtc' wd_nsec: 511302198 wd_now: 1b1cdebaa0 wd_last: 1b0ed9e078 mask: ffffffffffffff
> clocksource: 'tsc' cs_nsec: 512005312 cs_now: 7f746eabdd cs_last: 7f34584009 mask: ffffffffffffffff
> clocksource: Clocksource 'tsc' skewed 703114 ns (0 ms) over watchdog 'sgi_rtc' interval of 511302198 ns (511 ms)
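
Expressed as a rate, the skew in these logs can be compared with the frequency error that NTP is able to correct, assuming the commonly cited limit of 500 ppm. A minimal sketch in C, where skew_ppm() is a hypothetical helper and not a kernel function:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the kernel): skew expressed in parts
 * per million of the measured interval, rounded down.
 */
static uint64_t skew_ppm(uint64_t skew_ns, uint64_t interval_ns)
{
	return skew_ns * 1000000ULL / interval_ns;
}
```

For the first log above, skew_ppm(703664, 511302794) evaluates to 1376 ppm, well above the assumed 500 ppm limit.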

If the clocksource watchdog were disabled by default on newer UV systems, it would resolve the issue for us.
-- 
Jiri Wiesner
SUSE Labs
