From: Feng Tang <feng.tang@intel.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
"H . Peter Anvin" <hpa@zytor.com>, <x86@kernel.org>,
<linux-kernel@vger.kernel.org>, <rui.zhang@intel.com>,
<tim.c.chen@intel.com>,
Xiongfeng Wang <wangxiongfeng2@huawei.com>,
Yu Liao <liaoyu15@huawei.com>
Subject: Re: [PATCH] x86/tsc: Extend the watchdog check exemption to 4S/8S machine
Date: Tue, 11 Oct 2022 15:51:21 +0800
Message-ID: <Y0UgeUIJSFNR4mQB@feng-clx>
In-Reply-To: <Y0TCOKc7n38341eJ@feng-clx>
On Tue, Oct 11, 2022 at 09:09:12AM +0800, Feng Tang wrote:
> On Mon, Oct 10, 2022 at 07:23:10AM -0700, Dave Hansen wrote:
> > On 10/9/22 18:23, Feng Tang wrote:
> > >>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > >>> index cafacb2e58cc..b4ea79cb1d1a 100644
> > >>> --- a/arch/x86/kernel/tsc.c
> > >>> +++ b/arch/x86/kernel/tsc.c
> > >>> @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> > >>> if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > >>> boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > >>> boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> > >>> - nr_online_nodes <= 2)
> > >>> + nr_online_nodes <= 8)
> > >> So you're saying all 8 socket systems since Broadwell (?) are TSC
> > >> sync'ed ?
> > > No, I didn't mean that. I haven't had a chance to try any 8-socket
> > > machine, and I got a report last month that on one 8S machine,
> > > the TSC was judged 'unstable' with HPET as the watchdog.
> >
> > That's not a great check. Think about numa=fake=4U, for instance. Or a
> > single-socket system with persistent memory and high bandwidth memory.
> >
> > Basically 'nr_online_nodes' is a software construct. It's going to be
> > really hard to infer anything from it about what the _hardware_ is.
>
> You are right! Getting the socket count was indeed troublesome when
> I worked on commit b50db7095fe0; the problem is related to the
> initialization order. This TSC check needs to be done in tsc_init(),
> while node_stats[] gets initialized later, in smp_init().
>
> For the case you mentioned above, I dug out some old logs which showed
> its init order:
>
> numa=fake=4 on a SKL desktop
> ================
> [ 0.000066] [tsc_early_init()]: nr_online_nodes = 1
> [ 0.000068] [tsc_early_init()]: nr_cpu_nodes = 0
> [ 0.000070] [tsc_early_init()]: nr_mem_nodes = 0
> [ 0.104015] [tsc_init()]: nr_online_nodes = 4
> [ 0.104019] [tsc_init()]: nr_cpu_nodes = 0
> [ 0.104022] [tsc_init()]: nr_mem_nodes = 4
> [ 0.124778] smp: Brought up 4 nodes, 4 CPUs
> [ 0.760915] [init_tsc_clocksource()]: nr_online_nodes = 4
> [ 0.760919] [init_tsc_clocksource()]: nr_cpu_nodes = 4
> [ 0.760922] [init_tsc_clocksource()]: nr_mem_nodes = 4
>
> QEMU with 2 CPU-DRAM nodes + 2 Persistent memory nodes
> ========================================================
> [ 0.066651] [tsc_early_init()]: nr_online_nodes = 1
> [ 0.067494] [tsc_early_init()]: nr_cpu_nodes = 0
> [ 0.068288] [tsc_early_init()]: nr_mem_nodes = 0
> [ 0.677694] [tsc_init()]: nr_online_nodes = 4
> [ 0.678862] [tsc_init()]: nr_cpu_nodes = 0
> [ 0.679962] [tsc_init()]: nr_mem_nodes = 4
> [ 1.139240] [init_tsc_clocksource()]: nr_online_nodes = 4
> [ 1.140576] [init_tsc_clocksource()]: nr_cpu_nodes = 2
> [ 1.141823] [init_tsc_clocksource()]: nr_mem_nodes = 4
> [ 1.660100] [kernel_init()]: nr_online_nodes = 4
> [ 1.661234] [kernel_init()]: nr_cpu_nodes = 2
> [ 1.662300] [kernel_init()]: nr_mem_nodes = 4
>
> 'nr_online_nodes' was chosen in the hope that, in the worst case,
> the patch is just a nop and won't wrongly lift the check.
>
> One possible solution for this problem is to leverage the early SRAT
> table init, which is called before tsc_init() and can provide CPU
> node info. Will try this way.
The simple patch below adds a dedicated CPU nodemask and sets it during
early SRAT CPU parsing. It still has a problem when sub-NUMA is enabled
in the BIOS, since the SRAT table then contains more NUMA nodes. (Also,
I'm not sure the change to amdtopology.c is right.)
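
As a rough illustration of the idea, here is a user-space sketch in C. It is not kernel code: the names sketch_nodemask, sketch_srat_entry, sketch_weight(), sketch_parse() and sketch_watchdog_exempt() are all hypothetical stand-ins, a 64-bit integer replaces nodemask_t, and the CPU feature checks are left out. It only models keeping a separate mask for nodes that have CPUs and counting that mask instead of all online nodes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's nodemask_t: one bit per node ID. */
typedef uint64_t sketch_nodemask;

/* One parsed affinity entry, reduced to what this sketch needs. */
struct sketch_srat_entry {
	int node;	/* NUMA node ID, 0..63 in this sketch */
	int is_cpu;	/* 1 for a CPU affinity entry, 0 for memory-only */
};

/* Plays the role of nodes_weight(): count the set bits. */
static int sketch_weight(sketch_nodemask mask)
{
	int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Fill two masks: every entry marks its node as parsed, but only CPU
 * affinity entries also mark it as a CPU node (the new mask in the patch).
 */
static void sketch_parse(const struct sketch_srat_entry *e, size_t count,
			 sketch_nodemask *parsed, sketch_nodemask *cpu_nodes)
{
	size_t i;

	*parsed = 0;
	*cpu_nodes = 0;
	for (i = 0; i < count; i++) {
		if (e[i].node < 0 || e[i].node >= 64)
			continue;	/* out of range for this sketch */
		*parsed |= (sketch_nodemask)1 << e[i].node;
		if (e[i].is_cpu)
			*cpu_nodes |= (sketch_nodemask)1 << e[i].node;
	}
}

/* The node-count part of the exemption condition, feature checks omitted. */
static int sketch_watchdog_exempt(sketch_nodemask cpu_nodes)
{
	return sketch_weight(cpu_nodes) <= 2;
}
```

Fed the QEMU layout from the logs above (2 CPU-DRAM nodes plus 2 persistent memory nodes), sketch_parse() reports 4 parsed nodes but only 2 CPU nodes, so the exemption applies. Fed a hypothetical 2-socket machine whose BIOS splits each socket into two sub-NUMA nodes, all 4 nodes carry CPUs and the exemption is lost, which is the remaining problem mentioned above.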
Thanks,
Feng
diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0d..e745053a5f9a 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -31,6 +31,7 @@ extern int numa_off;
*/
extern s16 __apicid_to_node[MAX_LOCAL_APIC];
extern nodemask_t numa_nodes_parsed __initdata;
+extern nodemask_t numa_cpu_nodes __initdata;
extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
extern void __init numa_set_distance(int from, int to, int distance);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 179e0b1ba5cc..a2a7fc5aa15c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -29,6 +29,7 @@
#include <asm/intel-family.h>
#include <asm/i8259.h>
#include <asm/uv/uv.h>
+#include <asm/numa.h>
unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
EXPORT_SYMBOL(cpu_khz);
@@ -1218,7 +1219,7 @@ static void __init check_system_tsc_reliable(void)
if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
- nr_online_nodes <= 2)
+ nodes_weight(numa_cpu_nodes) <= 2)
tsc_disable_clocksource_watchdog();
}
diff --git a/arch/x86/mm/amdtopology.c b/arch/x86/mm/amdtopology.c
index b3ca7d23e4b0..6b982a16cc38 100644
--- a/arch/x86/mm/amdtopology.c
+++ b/arch/x86/mm/amdtopology.c
@@ -152,6 +152,7 @@ int __init amd_numa_init(void)
prevbase = base;
numa_add_memblk(nodeid, base, limit);
node_set(nodeid, numa_nodes_parsed);
+ node_set(nodeid, numa_cpu_nodes);
}
if (nodes_empty(numa_nodes_parsed))
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 090125b3ee1f..82798fee97a2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -21,6 +21,7 @@
int numa_off;
nodemask_t numa_nodes_parsed __initdata;
+nodemask_t numa_cpu_nodes __initdata;
struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
EXPORT_SYMBOL(node_data);
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 7688117ac2f4..11b08b317306 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -59,6 +59,7 @@ acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa)
}
set_apicid_to_node(apic_id, node);
node_set(node, numa_nodes_parsed);
+ node_set(node, numa_cpu_nodes);
printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%04x -> Node %u\n",
@@ -106,6 +107,7 @@ acpi_numa_processor_affinity_init(struct acpi_srat_cpu_affinity *pa)
set_apicid_to_node(apic_id, node);
node_set(node, numa_nodes_parsed);
+ node_set(node, numa_cpu_nodes);
printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%02x -> Node %u\n",