From: Feng Tang <feng.tang@intel.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	"H . Peter Anvin" <hpa@zytor.com>, <x86@kernel.org>,
	<linux-kernel@vger.kernel.org>, <rui.zhang@intel.com>,
	<tim.c.chen@intel.com>,
	Xiongfeng Wang <wangxiongfeng2@huawei.com>,
	Yu Liao <liaoyu15@huawei.com>
Subject: Re: [PATCH] x86/tsc: Extend the watchdog check exemption to 4S/8S machine
Date: Tue, 11 Oct 2022 15:51:21 +0800
Message-ID: <Y0UgeUIJSFNR4mQB@feng-clx>
In-Reply-To: <Y0TCOKc7n38341eJ@feng-clx>

On Tue, Oct 11, 2022 at 09:09:12AM +0800, Feng Tang wrote:
> On Mon, Oct 10, 2022 at 07:23:10AM -0700, Dave Hansen wrote:
> > On 10/9/22 18:23, Feng Tang wrote:
> > >>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > >>> index cafacb2e58cc..b4ea79cb1d1a 100644
> > >>> --- a/arch/x86/kernel/tsc.c
> > >>> +++ b/arch/x86/kernel/tsc.c
> > >>> @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> > >>>  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > >>>  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > >>>  	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> > >>> -	    nr_online_nodes <= 2)
> > >>> +	    nr_online_nodes <= 8)
> > >> So you're saying all 8 socket systems since Broadwell (?) are TSC
> > >> sync'ed ?
> > > No, I didn't mean that. I haven't got chance to any 8 sockets
> > > machine, and I got a report last month that on one 8S machine,
> > > the TSC was judged 'unstable' by HPET as watchdog.
> > 
> > That's not a great check.  Think about numa=fake=4U, for instance.  Or a
> > single-socket system with persistent memory and high bandwidth memory.
> > 
> > Basically 'nr_online_nodes' is a software construct.  It's going to be
> > really hard to infer anything from it about what the _hardware_ is.
> 
> You are right! Getting the socket count was indeed a problem when
> I worked on commit b50db7095fe0; the difficulty is the
> initialization order. This TSC check needs to be done in tsc_init(),
> while node_stats[] only gets initialized later, in smp_init().
> 
> For the case you mentioned above, I dug out some old logs which showed
> its init order:
> 
>   numa=fake=4 on a SKL desktop
>   ================
>   [    0.000066] [tsc_early_init()]: nr_online_nodes = 1
>   [    0.000068] [tsc_early_init()]: nr_cpu_nodes = 0
>   [    0.000070] [tsc_early_init()]: nr_mem_nodes = 0
>   [    0.104015] [tsc_init()]: nr_online_nodes = 4
>   [    0.104019] [tsc_init()]: nr_cpu_nodes = 0
>   [    0.104022] [tsc_init()]: nr_mem_nodes = 4
>   [    0.124778] smp: Brought up 4 nodes, 4 CPUs
>   [    0.760915] [init_tsc_clocksource()]: nr_online_nodes = 4
>   [    0.760919] [init_tsc_clocksource()]: nr_cpu_nodes = 4
>   [    0.760922] [init_tsc_clocksource()]: nr_mem_nodes = 4
>   
>   QEMU with 2 CPU-DRAM nodes + 2 Persistent memory nodes 
>   ========================================================
>   [    0.066651] [tsc_early_init()]: nr_online_nodes = 1
>   [    0.067494] [tsc_early_init()]: nr_cpu_nodes = 0
>   [    0.068288] [tsc_early_init()]: nr_mem_nodes = 0
>   [    0.677694] [tsc_init()]: nr_online_nodes = 4
>   [    0.678862] [tsc_init()]: nr_cpu_nodes = 0
>   [    0.679962] [tsc_init()]: nr_mem_nodes = 4
>   [    1.139240] [init_tsc_clocksource()]: nr_online_nodes = 4
>   [    1.140576] [init_tsc_clocksource()]: nr_cpu_nodes = 2
>   [    1.141823] [init_tsc_clocksource()]: nr_mem_nodes = 4
>   [    1.660100] [kernel_init()]: nr_online_nodes = 4
>   [    1.661234] [kernel_init()]: nr_cpu_nodes = 2
>   [    1.662300] [kernel_init()]: nr_mem_nodes = 4
> 
> 'nr_online_nodes' was chosen in the hope that, in the worst case,
> the patch is just a nop and won't wrongly lift the check.
> 
> One possible solution for this problem is to leverage the early SRAT
> table init, which is called before tsc_init() and can provide CPU
> node info. I will try this approach.

The simple patch below adds a dedicated CPU nodemask and sets it during
early SRAT CPU parsing. It still has a problem when sub-NUMA is enabled
in the BIOS, where there are more NUMA nodes in the SRAT table. (Also,
I'm not sure the change to amdtopology.c is right.)
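
For illustration only, the idea of tracking CPU nodes separately from
all parsed nodes can be sketched in userspace C. This is not kernel
code: the nodemask struct, the srat_entry layout, and every helper name
here are hypothetical stand-ins for nodemask_t, the SRAT affinity
structures, node_set() and nodes_weight(). The <= 2 threshold is taken
from the check in tsc.c.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 64

/* Hypothetical stand-in for nodemask_t: one bit per NUMA node. */
typedef struct { uint64_t bits; } nodemask;

/* Hypothetical SRAT-like entry: a CPU or a memory range tied to a node. */
enum srat_kind { SRAT_CPU, SRAT_MEMORY };
struct srat_entry { enum srat_kind kind; unsigned int node; };

/* The two masks: every node seen, and only nodes that have CPUs. */
struct parsed { nodemask all_nodes; nodemask cpu_nodes; };

/* Mark a node in the mask (plays the role of node_set()). */
static void node_set_bit(unsigned int node, nodemask *mask)
{
	if (node < MAX_NODES)
		mask->bits |= (uint64_t)1 << node;
}

/* Count the set bits (plays the role of nodes_weight()). */
static unsigned int nodes_count(const nodemask *mask)
{
	uint64_t b = mask->bits;
	unsigned int n = 0;

	while (b) {
		b &= b - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Walk the entries; only CPU entries contribute to cpu_nodes. */
static struct parsed parse_srat(const struct srat_entry *e, size_t n)
{
	struct parsed p = { { 0 }, { 0 } };
	size_t i;

	for (i = 0; i < n; i++) {
		node_set_bit(e[i].node, &p.all_nodes);
		if (e[i].kind == SRAT_CPU)
			node_set_bit(e[i].node, &p.cpu_nodes);
	}
	return p;
}

/* Same threshold as the tsc.c check, applied to CPU nodes only. */
static int watchdog_exempt(const struct parsed *p)
{
	return nodes_count(&p->cpu_nodes) <= 2;
}
```

Fed a layout like the QEMU case quoted above (2 CPU+DRAM nodes plus 2
memory-only nodes), this yields 4 parsed nodes but only 2 CPU nodes, so
memory-only nodes no longer affect the check. It does not help with the
sub-NUMA case, where the extra nodes do contain CPUs.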

Thanks,
Feng

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0d..e745053a5f9a 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -31,6 +31,7 @@ extern int numa_off;
  */
 extern s16 __apicid_to_node[MAX_LOCAL_APIC];
 extern nodemask_t numa_nodes_parsed __initdata;
+extern nodemask_t numa_cpu_nodes __initdata;
 
 extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 extern void __init numa_set_distance(int from, int to, int distance);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 179e0b1ba5cc..a2a7fc5aa15c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -29,6 +29,7 @@
 #include <asm/intel-family.h>
 #include <asm/i8259.h>
 #include <asm/uv/uv.h>
+#include <asm/numa.h>
 
 unsigned int __read_mostly cpu_khz;	/* TSC clocks / usec, not used here */
 EXPORT_SYMBOL(cpu_khz);
@@ -1218,7 +1219,7 @@ first_dump();
 	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
 	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
 	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
-	    nr_online_nodes <= 2)
+	    nodes_weight(numa_cpu_nodes) <= 2)
 		tsc_disable_clocksource_watchdog();
 }
 
diff --git a/arch/x86/mm/amdtopology.c b/arch/x86/mm/amdtopology.c
index b3ca7d23e4b0..6b982a16cc38 100644
--- a/arch/x86/mm/amdtopology.c
+++ b/arch/x86/mm/amdtopology.c
@@ -152,6 +152,7 @@ int __init amd_numa_init(void)
 		prevbase = base;
 		numa_add_memblk(nodeid, base, limit);
 		node_set(nodeid, numa_nodes_parsed);
+		node_set(nodeid, numa_cpu_nodes);
 	}
 
 	if (nodes_empty(numa_nodes_parsed))
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 090125b3ee1f..82798fee97a2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -21,6 +21,7 @@
 
 int numa_off;
 nodemask_t numa_nodes_parsed __initdata;
+nodemask_t numa_cpu_nodes __initdata;
 
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 7688117ac2f4..11b08b317306 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -59,6 +59,7 @@ acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa)
 	}
 	set_apicid_to_node(apic_id, node);
 	node_set(node, numa_nodes_parsed);
+	node_set(node, numa_cpu_nodes);
 
 	printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%04x -> Node %u\n",
@@ -106,6 +107,7 @@ acpi_numa_processor_affinity_init(struct acpi_srat_cpu_affinity *pa)
 
 	set_apicid_to_node(apic_id, node);
 	node_set(node, numa_nodes_parsed);
+	node_set(node, numa_cpu_nodes);
 
 	printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%02x -> Node %u\n",
