* [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform
@ 2022-10-13 13:12 Feng Tang
2022-10-13 16:02 ` Dave Hansen
2022-10-19 9:18 ` Thomas Gleixner
0 siblings, 2 replies; 6+ messages in thread
From: Feng Tang @ 2022-10-13 13:12 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H . Peter Anvin, Peter Zijlstra, x86, linux-kernel
Cc: rui.zhang, tim.c.chen, Xiongfeng Wang, Feng Tang, Yu Liao
There is report again that the tsc clocksource on a 4 sockets x86
Skylake server was wrongly judged as 'unstable' by 'jiffies' watchdog,
and disabled [1].
Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC
on qualified platorms") was introduce to deal with these false
alarms of tsc unstable issues, covering qualified platforms for 2
sockets or smaller ones.
Extend the exemption to 4 sockets to fix the issue.
We also got similar reports on 8 sockets platform from internal test,
but as Peter pointed out, there was tsc sync issues for 8-sockets
platform, and it'd better be handled architecture by architecture,
instead of directly changing the threshold to 8 here.
Rui also proposed another way to disable 'jiffies' as clocksource
watchdog [2], which can also solve this specific problem in an
architecture independent way, with one limitation that some tsc false
alarms are reported by other watchdogs like HPET in post-boot time,
while 'jiffies' is mostly used in boot phase before hardware
clocksources are initialized.
[1]. https://lore.kernel.org/all/9d3bf570-3108-0336-9c52-9bee15767d29@huawei.com/
[2]. https://lore.kernel.org/all/bd5b97f89ab2887543fc262348d1c7cafcaae536.camel@intel.com/
Reported-by: Yu Liao <liaoyu15@huawei.com>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
Hi reviewers:
In the v1 review cycle, Dave raised the issue that 'nr_online_nodes'
is not accurate, and could have problem with fakenuma (numa=fake=4 in
cmdline) case and system with CPU-less HBM/PMEM nodes, which we have
discussed in https://lore.kernel.org/lkml/Y0UgeUIJSFNR4mQB@feng-clx/
and will post the solution as a separate RFC patch.
Thanks,
Feng
Changelog:
Since v1:
* Change the max socket number from 8 to 4, as Peter Zijlstra
pointed the 8S machine could have tsc sync issue, and should
not be exempted generally
arch/x86/kernel/tsc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cafacb2e58cc..1fa3fdf43159 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1209,7 +1209,7 @@ static void __init check_system_tsc_reliable(void)
* - TSC which does not stop in C-States
* - the TSC_ADJUST register which allows to detect even minimal
* modifications
- * - not more than two sockets. As the number of sockets cannot be
+ * - not more than four sockets. As the number of sockets cannot be
* evaluated at the early boot stage where this has to be
* invoked, check the number of online memory nodes as a
* fallback solution which is an reasonable estimate.
@@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
- nr_online_nodes <= 2)
+ nr_online_nodes <= 4)
tsc_disable_clocksource_watchdog();
}
--
2.34.1
^ permalink raw reply related [flat|nested] 6+ messages in thread* Re: [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform 2022-10-13 13:12 [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform Feng Tang @ 2022-10-13 16:02 ` Dave Hansen 2022-10-14 0:37 ` Feng Tang 2022-10-19 9:18 ` Thomas Gleixner 1 sibling, 1 reply; 6+ messages in thread From: Dave Hansen @ 2022-10-13 16:02 UTC (permalink / raw) To: Feng Tang, Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin, Peter Zijlstra, x86, linux-kernel Cc: rui.zhang, tim.c.chen, Xiongfeng Wang, Yu Liao On 10/13/22 06:12, Feng Tang wrote: > @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void) > if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && > boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && > boot_cpu_has(X86_FEATURE_TSC_ADJUST) && > - nr_online_nodes <= 2) > + nr_online_nodes <= 4) > tsc_disable_clocksource_watchdog(); I still don't think we should perpetuate this hack. This just plain doesn't work in numa=off numa=fake=... or presumably in cases where NUMA is disabled in the firmware and memory is interleaved across all sockets. It also presumably doesn't work on two-socket systems that have Cluster-on-Die or Sub-NUMA-Clustering where a single socket is chopped up into multiple nodes. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform 2022-10-13 16:02 ` Dave Hansen @ 2022-10-14 0:37 ` Feng Tang 2022-10-14 1:14 ` Feng Tang 0 siblings, 1 reply; 6+ messages in thread From: Feng Tang @ 2022-10-14 0:37 UTC (permalink / raw) To: Dave Hansen Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin, Peter Zijlstra, x86, linux-kernel, rui.zhang, tim.c.chen, Xiongfeng Wang, Yu Liao On Thu, Oct 13, 2022 at 09:02:43AM -0700, Dave Hansen wrote: > On 10/13/22 06:12, Feng Tang wrote: > > @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void) > > if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && > > boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && > > boot_cpu_has(X86_FEATURE_TSC_ADJUST) && > > - nr_online_nodes <= 2) > > + nr_online_nodes <= 4) > > tsc_disable_clocksource_watchdog(); > > I still don't think we should perpetuate this hack. > > This just plain doesn't work in numa=off numa=fake=... or presumably in > cases where NUMA is disabled in the firmware and memory is interleaved > across all sockets. > > It also presumably doesn't work on two-socket systems that have > Cluster-on-Die or Sub-NUMA-Clustering where a single socket is chopped > up into multiple nodes. Yes, after you raised the 'nr_online_nodes' issue, Peter, Rui and I have discussed the problem, and plan to post a RFC patch as in https://lore.kernel.org/lkml/Y0UgeUIJSFNR4mQB@feng-clx/ Which can cover: - numa=fake=... case - platform has DRAM nodes and cpu-less HBM/PMEM nodes and 'sub-numa-clustering' can't be covered, and the tsc will be watchdoged as before. For numa=off case, there is only one CPU up, and I think lifting this watchdog for tsc is fine. Thanks, Feng ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform 2022-10-14 0:37 ` Feng Tang @ 2022-10-14 1:14 ` Feng Tang 0 siblings, 0 replies; 6+ messages in thread From: Feng Tang @ 2022-10-14 1:14 UTC (permalink / raw) To: Dave Hansen Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin, Peter Zijlstra, x86, linux-kernel, rui.zhang, tim.c.chen, Xiongfeng Wang, Yu Liao On Fri, Oct 14, 2022 at 08:37:18AM +0800, Feng Tang wrote: > On Thu, Oct 13, 2022 at 09:02:43AM -0700, Dave Hansen wrote: > > On 10/13/22 06:12, Feng Tang wrote: > > > @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void) > > > if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && > > > boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && > > > boot_cpu_has(X86_FEATURE_TSC_ADJUST) && > > > - nr_online_nodes <= 2) > > > + nr_online_nodes <= 4) > > > tsc_disable_clocksource_watchdog(); > > > > I still don't think we should perpetuate this hack. > > > > This just plain doesn't work in numa=off numa=fake=... or presumably in > > cases where NUMA is disabled in the firmware and memory is interleaved > > across all sockets. > > > > It also presumably doesn't work on two-socket systems that have > > Cluster-on-Die or Sub-NUMA-Clustering where a single socket is chopped > > up into multiple nodes. > > Yes, after you raised the 'nr_online_nodes' issue, Peter, Rui and I > have discussed the problem, and plan to post a RFC patch as in > https://lore.kernel.org/lkml/Y0UgeUIJSFNR4mQB@feng-clx/ > > Which can cover: > - numa=fake=... case > - platform has DRAM nodes and cpu-less HBM/PMEM nodes > > and 'sub-numa-clustering' can't be covered, and the tsc will be > watchdoged as before. [...] > For numa=off case, there is only one CPU up, and I think lifting this > watchdog for tsc is fine. Sorry, I was wrong about this. 'numa=off' will still boot all CPUs up, but skip SRAT table init and only show one node. so this is another case that the fix patch can't cover. Thanks, Feng ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform 2022-10-13 13:12 [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform Feng Tang 2022-10-13 16:02 ` Dave Hansen @ 2022-10-19 9:18 ` Thomas Gleixner 2022-10-20 4:44 ` Feng Tang 1 sibling, 1 reply; 6+ messages in thread From: Thomas Gleixner @ 2022-10-19 9:18 UTC (permalink / raw) To: Feng Tang, Ingo Molnar, Borislav Petkov, Dave Hansen, H . Peter Anvin, Peter Zijlstra, x86, linux-kernel Cc: rui.zhang, tim.c.chen, Xiongfeng Wang, Feng Tang, Yu Liao On Thu, Oct 13 2022 at 21:12, Feng Tang wrote: > There is report again that the tsc clocksource on a 4 sockets x86 > Skylake server was wrongly judged as 'unstable' by 'jiffies' watchdog, > and disabled [1]. > > Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC > on qualified platorms") was introduce to deal with these false > alarms of tsc unstable issues, covering qualified platforms for 2 > sockets or smaller ones. > > Extend the exemption to 4 sockets to fix the issue. > > We also got similar reports on 8 sockets platform from internal test, > but as Peter pointed out, there was tsc sync issues for 8-sockets > platform, and it'd better be handled architecture by architecture, > instead of directly changing the threshold to 8 here. > > Rui also proposed another way to disable 'jiffies' as clocksource > watchdog [2], which can also solve this specific problem in an > architecture independent way, with one limitation that some tsc false > alarms are reported by other watchdogs like HPET in post-boot time, > while 'jiffies' is mostly used in boot phase before hardware > clocksources are initialized. HPET is initialized early, but if HPET is disabled or not advertised then the only other hardware clocksource is PMTIMER which is initialized late via fs_initcall. PMTIMER is initialized late due to broken Pentium era chipsets which are sorted with PCI quirks. For anything else we can initialize it early. Something like the below. I'm sure I said this more than once, but I'm happy to repeat myself forever: Instead of proliferating lousy hacks, can the X86 vendors finaly get their act together and provide some architected information whether the TSC is trustworthy or not? Thanks, tglx --- --- a/arch/x86/kernel/time.c +++ b/arch/x86/kernel/time.c @@ -10,6 +10,7 @@ * */ +#include <linux/acpi_pmtmr.h> #include <linux/clocksource.h> #include <linux/clockchips.h> #include <linux/interrupt.h> @@ -75,6 +76,14 @@ static void __init setup_default_timer_i void __init hpet_time_init(void) { if (!hpet_enable()) { + /* + * Some Pentium chipsets have broken HPETs and need + * PCI quirks to run before init. + */ + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL || + boot_cpu_data.family != 5) + init_acpi_pm_clocksource(); + if (!pit_timer_init()) return; } --- a/drivers/clocksource/acpi_pm.c +++ b/drivers/clocksource/acpi_pm.c @@ -30,6 +30,7 @@ * in arch/i386/kernel/acpi/boot.c */ u32 pmtmr_ioport __read_mostly; +static bool pmtmr_initialized __init_data; static inline u32 read_pmtmr(void) { @@ -142,7 +143,7 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_SE * Some boards have the PMTMR running way too fast. We check * the PMTMR rate against PIT channel 2 to catch these cases. */ -static int verify_pmtmr_rate(void) +static int __init verify_pmtmr_rate(void) { u64 value1, value2; unsigned long count, delta; @@ -172,14 +173,18 @@ static int verify_pmtmr_rate(void) /* Number of reads we try to get two different values */ #define ACPI_PM_READ_CHECKS 10000 -static int __init init_acpi_pm_clocksource(void) +int __init init_acpi_pm_clocksource(void) { u64 value1, value2; unsigned int i, j = 0; + int ret; if (!pmtmr_ioport) return -ENODEV; + if (pmtmr_initialized) + return 0; + /* "verify" this timing source: */ for (j = 0; j < ACPI_PM_MONOTONICITY_CHECKS; j++) { udelay(100 * j); @@ -210,10 +215,11 @@ static int __init init_acpi_pm_clocksour return -ENODEV; } - return clocksource_register_hz(&clocksource_acpi_pm, - PMTMR_TICKS_PER_SEC); + ret = clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC); + if (!ret) + pmtimer_initialized = true; + return ret; } - /* We use fs_initcall because we want the PCI fixups to have run * but we still need to load before device_initcall */ --- a/include/linux/acpi_pmtmr.h +++ b/include/linux/acpi_pmtmr.h @@ -13,6 +13,8 @@ /* Overrun value */ #define ACPI_PM_OVRRUN (1<<24) +extern int __init init_acpi_pm_clocksource(void); + #ifdef CONFIG_X86_PM_TIMER extern u32 acpi_pm_read_verified(void); ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform 2022-10-19 9:18 ` Thomas Gleixner @ 2022-10-20 4:44 ` Feng Tang 0 siblings, 0 replies; 6+ messages in thread From: Feng Tang @ 2022-10-20 4:44 UTC (permalink / raw) To: Thomas Gleixner Cc: Ingo Molnar, Borislav Petkov, Dave Hansen, H . Peter Anvin, Peter Zijlstra, x86, linux-kernel, rui.zhang, tim.c.chen, Xiongfeng Wang, Yu Liao On Wed, Oct 19, 2022 at 11:18:43AM +0200, Thomas Gleixner wrote: > On Thu, Oct 13 2022 at 21:12, Feng Tang wrote: > > There is report again that the tsc clocksource on a 4 sockets x86 > > Skylake server was wrongly judged as 'unstable' by 'jiffies' watchdog, > > and disabled [1]. > > > > Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC > > on qualified platorms") was introduce to deal with these false > > alarms of tsc unstable issues, covering qualified platforms for 2 > > sockets or smaller ones. > > > > Extend the exemption to 4 sockets to fix the issue. > > > > We also got similar reports on 8 sockets platform from internal test, > > but as Peter pointed out, there was tsc sync issues for 8-sockets > > platform, and it'd better be handled architecture by architecture, > > instead of directly changing the threshold to 8 here. > > > > Rui also proposed another way to disable 'jiffies' as clocksource > > watchdog [2], which can also solve this specific problem in an > > architecture independent way, with one limitation that some tsc false > > alarms are reported by other watchdogs like HPET in post-boot time, > > while 'jiffies' is mostly used in boot phase before hardware > > clocksources are initialized. > > HPET is initialized early, but if HPET is disabled or not advertised > then the only other hardware clocksource is PMTIMER which is initialized > late via fs_initcall. PMTIMER is initialized late due to broken Pentium > era chipsets which are sorted with PCI quirks. For anything else we can > initialize it early. Something like the below. Thanks for sharing the background and the code! It can reduce the time of 'jiffies' being a watchdog on client platforms whose HPET are disabled. And there were still false positive reports for HPET/PMTIMER as watchdogs, so I still vote to your suggestion of lifting the check for qualified platforms. For that, Dave raised the accuracy issue of 'nr_online_nodes' and we proposed new patch in https://lore.kernel.org/lkml/20221017132942.1646934-1-feng.tang@intel.com/ while the topology_max_packages() still has issue as providing socket number, and I plan to use 'logical_packages' instead. Do you think it's in the right direction? > I'm sure I said this more than once, but I'm happy to repeat myself > forever: > > Instead of proliferating lousy hacks, can the X86 vendors finaly get > their act together and provide some architected information whether > the TSC is trustworthy or not? Yes it will save us a lot of trouble. Maybe better in CPUID info, as if there is some bug in HW/BIOS, it may get fixed with microcode update. Thanks, Feng > Thanks, > > tglx > --- > > --- a/arch/x86/kernel/time.c > +++ b/arch/x86/kernel/time.c > @@ -10,6 +10,7 @@ > * > */ > > +#include <linux/acpi_pmtmr.h> > #include <linux/clocksource.h> > #include <linux/clockchips.h> > #include <linux/interrupt.h> > @@ -75,6 +76,14 @@ static void __init setup_default_timer_i > void __init hpet_time_init(void) > { > if (!hpet_enable()) { > + /* > + * Some Pentium chipsets have broken HPETs and need > + * PCI quirks to run before init. > + */ > + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL || > + boot_cpu_data.family != 5) > + init_acpi_pm_clocksource(); > + > if (!pit_timer_init()) > return; > } > --- a/drivers/clocksource/acpi_pm.c > +++ b/drivers/clocksource/acpi_pm.c > @@ -30,6 +30,7 @@ > * in arch/i386/kernel/acpi/boot.c > */ > u32 pmtmr_ioport __read_mostly; > +static bool pmtmr_initialized __init_data; > > static inline u32 read_pmtmr(void) > { > @@ -142,7 +143,7 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_SE > * Some boards have the PMTMR running way too fast. We check > * the PMTMR rate against PIT channel 2 to catch these cases. > */ > -static int verify_pmtmr_rate(void) > +static int __init verify_pmtmr_rate(void) > { > u64 value1, value2; > unsigned long count, delta; > @@ -172,14 +173,18 @@ static int verify_pmtmr_rate(void) > /* Number of reads we try to get two different values */ > #define ACPI_PM_READ_CHECKS 10000 > > -static int __init init_acpi_pm_clocksource(void) > +int __init init_acpi_pm_clocksource(void) > { > u64 value1, value2; > unsigned int i, j = 0; > + int ret; > > if (!pmtmr_ioport) > return -ENODEV; > > + if (pmtmr_initialized) > + return 0; > + > /* "verify" this timing source: */ > for (j = 0; j < ACPI_PM_MONOTONICITY_CHECKS; j++) { > udelay(100 * j); > @@ -210,10 +215,11 @@ static int __init init_acpi_pm_clocksour > return -ENODEV; > } > > - return clocksource_register_hz(&clocksource_acpi_pm, > - PMTMR_TICKS_PER_SEC); > + ret = clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC); > + if (!ret) > + pmtimer_initialized = true; > + return ret; > } > - > /* We use fs_initcall because we want the PCI fixups to have run > * but we still need to load before device_initcall > */ > --- a/include/linux/acpi_pmtmr.h > +++ b/include/linux/acpi_pmtmr.h > @@ -13,6 +13,8 @@ > /* Overrun value */ > #define ACPI_PM_OVRRUN (1<<24) > > +extern int __init init_acpi_pm_clocksource(void); > + > #ifdef CONFIG_X86_PM_TIMER > > extern u32 acpi_pm_read_verified(void); > > ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2022-10-20 4:45 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-10-13 13:12 [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform Feng Tang 2022-10-13 16:02 ` Dave Hansen 2022-10-14 0:37 ` Feng Tang 2022-10-14 1:14 ` Feng Tang 2022-10-19 9:18 ` Thomas Gleixner 2022-10-20 4:44 ` Feng Tang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox