* [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X
@ 2025-08-22 20:14 Tim Chen
  2025-08-22 20:14 ` [PATCH 1/2] sched: topology: Fix topology validation error Tim Chen
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Tim Chen @ 2025-08-22 20:14 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
	Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes

While testing Granite Rapids X (GNR-X) and Clearwater Forest X (CWF-X) in
SNC-3 mode, we encountered sched domain build errors reported in dmesg.
Asymmetric node distances from the local node to nodes in the remote package
were not expected by the scheduler domain code and also led to an excessive
number of sched domain hierarchy levels.

Fix the missing NUMA domain level setting in the topology_span_sane() check
and also simplify the distances to nodes in the remote package to retain
distance symmetry and keep the NUMA topology sane for GNR-X and CWF-X.

Tim Chen (1):
  sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode

Vinicius Costa Gomes (1):
  sched: topology: Fix topology validation error

 arch/x86/kernel/smpboot.c      | 28 ++++++++++++++++++++++++++++
 include/linux/sched/topology.h |  1 +
 kernel/sched/topology.c        | 33 +++++++++++++++++++++++++++------
 3 files changed, 56 insertions(+), 6 deletions(-)

-- 
2.32.0



* [PATCH 1/2] sched: topology: Fix topology validation error
  2025-08-22 20:14 [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X Tim Chen
@ 2025-08-22 20:14 ` Tim Chen
  2025-08-25  3:18   ` K Prateek Nayak
  2025-08-25  7:25   ` Peter Zijlstra
  2025-08-22 20:14 ` [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode Tim Chen
  2025-08-25  4:18 ` [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X K Prateek Nayak
  2 siblings, 2 replies; 14+ messages in thread
From: Tim Chen @ 2025-08-22 20:14 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Vinicius Costa Gomes, Juri Lelli, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Len Brown, linux-kernel, Chen Yu,
	K Prateek Nayak, Gautham R . Shenoy, Zhao Liu, Tim Chen

From: Vinicius Costa Gomes <vinicius.gomes@intel.com>

As sd_numa_mask() (the function behind tl->mask() for the NUMA levels
of the topology) depends on the value of sched_domains_curr_level, it is
possible to be iterating over one level while sd_numa_mask() thinks we
are in another, causing the topology validation to fail for valid cases.

Set sched_domains_curr_level to the current topology level while
iterating.
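
For reference, sd_numa_mask() in kernel/sched/topology.c is roughly the
following, which is why it silently depends on sched_domains_curr_level
rather than on the topology level actually being iterated:

	static const struct cpumask *sd_numa_mask(int cpu)
	{
		return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
	}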

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/topology.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 977e133bb8a4..9a7ac67e3d63 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2394,6 +2394,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
 	for_each_sd_topology(tl) {
 		int tl_common_flags = 0;
 
+#ifdef CONFIG_NUMA
+		/*
+		 * sd_numa_mask() (one of the possible values of
+		 * tl->mask()) depends on the current level to work
+		 * correctly.
+		 */
+		sched_domains_curr_level = tl->numa_level;
+#endif
 		if (tl->sd_flags)
 			tl_common_flags = (*tl->sd_flags)();
 
-- 
2.32.0



* [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode
  2025-08-22 20:14 [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X Tim Chen
  2025-08-22 20:14 ` [PATCH 1/2] sched: topology: Fix topology validation error Tim Chen
@ 2025-08-22 20:14 ` Tim Chen
  2025-08-25  5:08   ` Chen, Yu C
  2025-08-25  4:18 ` [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X K Prateek Nayak
  2 siblings, 1 reply; 14+ messages in thread
From: Tim Chen @ 2025-08-22 20:14 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
	Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes

It is possible for Granite Rapids X (GNR) and Clearwater Forest X
(CWF) to have up to 3 dies per package. When sub-NUMA clustering (SNC-3)
is enabled, each die becomes a separate NUMA node in the package, with
different distances between dies within the same package.

For example, on GNR-X, we see the following NUMA distances for a 2-socket
system with 3 dies per socket:

        package 1       package2
            ----------------
            |               |
        ---------       ---------
        |   0   |       |   3   |
        ---------       ---------
            |               |
        ---------       ---------
        |   1   |       |   4   |
        ---------       ---------
            |               |
        ---------       ---------
        |   2   |       |   5   |
        ---------       ---------
            |               |
            ----------------

node distances:
node     0    1    2    3    4    5
   0:   10   15   17   21   28   26
   1:   15   10   15   23   26   23
   2:   17   15   10   26   23   21
   3:   21   28   26   10   15   17
   4:   23   26   23   15   10   15
   5:   26   23   21   17   15   10

The node distances above led to 2 problems:

1. Asymmetric routes taken between nodes in different packages led to
an asymmetric scheduler domain perspective depending on which node you
are on (for example, node 0 sees node 4 at distance 28 while node 4
sees node 0 at distance 23).  The current scheduler code fails to
build domains properly with asymmetric distances.

2. Multiple remote distances to the respective tiles in the remote package
create too many levels of domain hierarchy, grouping different nodes
between remote packages.

For example, the above GNR-X topology leads to the NUMA domains below:

Sched domains from the perspective of a CPU in node 0, where the numbers
in brackets represent node numbers:

NUMA-level 1	[0,1] [2]
NUMA-level 2	[0,1,2] [3]
NUMA-level 3	[0,1,2,3] [5]
NUMA-level 4	[0,1,2,3,5] [4]

Sched domains from the perspective of a CPU in node 4
NUMA-level 1	[4] [3,5]
NUMA-level 2	[3,4,5] [0,2]
NUMA-level 3	[0,2,3,4,5] [1]

Scheduler group peers for load balancing from the perspective of a CPU
in node 0 and a CPU in node 4 are different.  An improper task could be
chosen for load balancing between groups such as [0,2,3,4,5] [1].
Ideally, tasks in nodes 0 or 2, which are in the same package as node 1,
should be chosen first.  Instead, tasks in the remote package nodes 3, 4
and 5 could be chosen with equal chance, which can lead to excessive
remote package migrations and load imbalance between packages.  We
should not group partial remote nodes and local nodes together.

Simplify the remote distances for CWF-X and GNR-X for the purpose of
building sched domains, which maintains symmetry and leads to a more
reasonable load balance hierarchy.

The sched domains from the perspective of a CPU in node 0 are now:
NUMA-level 1	[0,1] [2]
NUMA-level 2	[0,1,2] [3,4,5]

The sched domains from the perspective of a CPU in node 4 are now:
NUMA-level 1	[4] [3,5]
NUMA-level 2	[3,4,5] [0,1,2]

We have the same balancing perspective from node 0 or node 4.  Loads are
now balanced equally between packages.
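
For illustration, applying the (d / 10) * 10 trimming from the patch
below to the example matrix (a derived table, not output from a real
system) gives the effective distances used when building the sched
domains:

node     0    1    2    3    4    5
   0:   10   15   17   20   20   20
   1:   15   10   15   20   20   20
   2:   17   15   10   20   20   20
   3:   20   20   20   10   15   17
   4:   20   20   20   15   10   15
   5:   20   20   20   17   15   10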

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
---
 arch/x86/kernel/smpboot.c      | 28 ++++++++++++++++++++++++++++
 include/linux/sched/topology.h |  1 +
 kernel/sched/topology.c        | 25 +++++++++++++++++++------
 3 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 33e166f6ab12..c425e84c88b5 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
 	set_sched_topology(topology);
 }
 
+int sched_node_distance(int from, int to)
+{
+	int d = node_distance(from, to);
+
+	if (!x86_has_numa_in_package)
+		return d;
+
+	switch (boot_cpu_data.x86_vfm) {
+	case INTEL_GRANITERAPIDS_X:
+	case INTEL_ATOM_DARKMONT_X:
+		if (d < REMOTE_DISTANCE)
+			return d;
+
+		/*
+		 * Trim finer distance tuning for nodes in remote package
+		 * for the purpose of building sched domains.
+		 * Put NUMA nodes in each remote package in a single sched group.
+		 * Simplify NUMA domains and avoid extra NUMA levels including different
+		 * NUMA nodes in remote packages.
+		 *
+		 * GNR-x and CWF-X has GLUELESS-MESH topology with SNC
+		 * turned on.
+		 */
+		d = (d / 10) * 10;
+	}
+	return d;
+}
+
 void set_cpu_sibling_map(int cpu)
 {
 	bool has_smt = __max_threads_per_core > 1;
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 5263746b63e8..3b62226394af 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -59,6 +59,7 @@ static inline int cpu_numa_flags(void)
 #endif
 
 extern int arch_asym_cpu_priority(int cpu);
+extern int sched_node_distance(int from, int to);
 
 struct sched_domain_attr {
 	int relax_domain_level;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9a7ac67e3d63..3f485da994a7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1804,7 +1804,7 @@ bool find_numa_distance(int distance)
 	bool found = false;
 	int i, *distances;
 
-	if (distance == node_distance(0, 0))
+	if (distance == sched_node_distance(0, 0))
 		return true;
 
 	rcu_read_lock();
@@ -1887,6 +1887,15 @@ static void init_numa_topology_type(int offline_node)
 
 #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
 
+/*
+ * Architecture could simplify NUMA distance, to avoid
+ * creating too many NUMA levels when SNC is turned on.
+ */
+int __weak sched_node_distance(int from, int to)
+{
+	return node_distance(from, to);
+}
+
 void sched_init_numa(int offline_node)
 {
 	struct sched_domain_topology_level *tl;
@@ -1894,6 +1903,7 @@ void sched_init_numa(int offline_node)
 	int nr_levels = 0;
 	int i, j;
 	int *distances;
+	int max_dist = 0;
 	struct cpumask ***masks;
 
 	/*
@@ -1907,7 +1917,10 @@ void sched_init_numa(int offline_node)
 	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
 	for_each_cpu_node_but(i, offline_node) {
 		for_each_cpu_node_but(j, offline_node) {
-			int distance = node_distance(i, j);
+			int distance = sched_node_distance(i, j);
+
+			if (node_distance(i,j) > max_dist)
+				max_dist = node_distance(i,j);
 
 			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
 				sched_numa_warn("Invalid distance value range");
@@ -1979,10 +1992,10 @@ void sched_init_numa(int offline_node)
 			masks[i][j] = mask;
 
 			for_each_cpu_node_but(k, offline_node) {
-				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
+				if (sched_debug() && (sched_node_distance(j, k) != sched_node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
 
-				if (node_distance(j, k) > sched_domains_numa_distance[i])
+				if (sched_node_distance(j, k) > sched_domains_numa_distance[i])
 					continue;
 
 				cpumask_or(mask, mask, cpumask_of_node(k));
@@ -2022,7 +2035,7 @@ void sched_init_numa(int offline_node)
 	sched_domain_topology = tl;
 
 	sched_domains_numa_levels = nr_levels;
-	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
+	WRITE_ONCE(sched_max_numa_distance, max_dist);
 
 	init_numa_topology_type(offline_node);
 }
@@ -2092,7 +2105,7 @@ void sched_domains_numa_masks_set(unsigned int cpu)
 				continue;
 
 			/* Set ourselves in the remote node's masks */
-			if (node_distance(j, node) <= sched_domains_numa_distance[i])
+			if (sched_node_distance(j, node) <= sched_domains_numa_distance[i])
 				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
 		}
 	}
-- 
2.32.0



* Re: [PATCH 1/2] sched: topology: Fix topology validation error
  2025-08-22 20:14 ` [PATCH 1/2] sched: topology: Fix topology validation error Tim Chen
@ 2025-08-25  3:18   ` K Prateek Nayak
  2025-08-25  7:58     ` Peter Zijlstra
  2025-08-25  7:25   ` Peter Zijlstra
  1 sibling, 1 reply; 14+ messages in thread
From: K Prateek Nayak @ 2025-08-25  3:18 UTC (permalink / raw)
  To: Tim Chen, Vinicius Costa Gomes, Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu

Hello Tim, Vinicius,

On 8/23/2025 1:44 AM, Tim Chen wrote:
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2394,6 +2394,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
>  	for_each_sd_topology(tl) {
>  		int tl_common_flags = 0;
>  
> +#ifdef CONFIG_NUMA
> +		/*
> +		 * sd_numa_mask() (one of the possible values of
> +		 * tl->mask()) depends on the current level to work
> +		 * correctly.
> +		 */
> +		sched_domains_curr_level = tl->numa_level;
> +#endif

A similar solution was proposed in [1] and, after a few iterations, we
arrived at [2] as a potential solution to this issue. Now that the merge
window is behind us, Peter, would it be possible to pick one of these up?

P.S. Leon has confirmed that an earlier version [3] also solved the
splat on their deployments.

[1] https://lore.kernel.org/lkml/20250624041235.1589-1-kprateek.nayak@amd.com/
[2] https://lore.kernel.org/lkml/20250715040824.893-1-kprateek.nayak@amd.com/
[3] https://lore.kernel.org/lkml/20250720104136.GI402218@unreal/

>  		if (tl->sd_flags)
>  			tl_common_flags = (*tl->sd_flags)();
>  

-- 
Thanks and Regards,
Prateek



* Re: [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X
  2025-08-22 20:14 [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X Tim Chen
  2025-08-22 20:14 ` [PATCH 1/2] sched: topology: Fix topology validation error Tim Chen
  2025-08-22 20:14 ` [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode Tim Chen
@ 2025-08-25  4:18 ` K Prateek Nayak
  2025-08-25 21:38   ` Tim Chen
  2 siblings, 1 reply; 14+ messages in thread
From: K Prateek Nayak @ 2025-08-25  4:18 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar
  Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu,
	Vinicius Costa Gomes

Hello Tim,

On 8/23/2025 1:44 AM, Tim Chen wrote:
> While testing Granite Rapids X (GNR-X) and Clearwater Forest X (CWF-X) in
> SNc-3 mode, we encountered sched domain build errors reported in dmesg.
> Asymmetric node distances from local node to to nodes in remote package
> was not expected by the scheduler domain code and also led to excessive
> number of sched domain hierachy levels.
> 
> Fix the missing NUMA domain level set in topology_span_sane() check and
> also simplify the distance to nodes in remote package to retain distance
> symmetry and make the NUMA topology sane for GNR-X and CWF-X.

I did some sanity testing on an EPYC platform in NPS2/4 mode and didn't
see any changes to the sched domain layout or to the sched_node_distance()
values used when constructing them with this series.

Feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek



* Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode
  2025-08-22 20:14 ` [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode Tim Chen
@ 2025-08-25  5:08   ` Chen, Yu C
  2025-08-25  7:56     ` Peter Zijlstra
  2025-08-25 20:05     ` Tim Chen
  0 siblings, 2 replies; 14+ messages in thread
From: Chen, Yu C @ 2025-08-25  5:08 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar
  Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy,
	Zhao Liu, Vinicius Costa Gomes, Chen Yu

On 8/23/2025 4:14 AM, Tim Chen wrote:
> It is possible for Granite Rapids X (GNR) and Clearwater Forest X
> (CWF) to have up to 3 dies per package. When sub-numa cluster (SNC-3)
> is enabled, each die will become a separate NUMA node in the package
> with different distances between dies within the same package.
> 
> For example, on GNR-X, we see the following numa distances for a 2 socket
> system with 3 dies per socket:
> 
>          package 1       package2
>              ----------------
>              |               |
>          ---------       ---------
>          |   0   |       |   3   |
>          ---------       ---------
>              |               |
>          ---------       ---------
>          |   1   |       |   4   |
>          ---------       ---------
>              |               |
>          ---------       ---------
>          |   2   |       |   5   |
>          ---------       ---------
>              |               |
>              ----------------
> 
> node distances:
> node     0    1    2    3    4    5
>     0:   10   15   17   21   28   26
>     1:   15   10   15   23   26   23
>     2:   17   15   10   26   23   21
>     3:   21   28   26   10   15   17
>     4:   23   26   23   15   10   15
>     5:   26   23   21   17   15   10
> 
> The node distances above led to 2 problems:
> 
> 1. Asymmetric routes taken between nodes in different packages led to
> asymmetric scheduler domain perspective depending on which node you
> are on.  Current scheduler code failed to build domains properly with
> asymmetric distances.
> 
> 2. Multiple remote distances to respective tiles on remote package create
> too many levels of domain hierarchies grouping different nodes between
> remote packages.
> 
> For example, the above GNR-X topology lead to NUMA domains below:
> 
> Sched domains from the perspective of a CPU in node 0, where the number
> in bracket represent node number.
> 
> NUMA-level 1	[0,1] [2]
> NUMA-level 2	[0,1,2] [3]
> NUMA-level 3	[0,1,2,3] [5]
> NUMA-level 4	[0,1,2,3,5] [4]
> 
> Sched domains from the perspective of a CPU in node 4
> NUMA-level 1	[4] [3,5]
> NUMA-level 2	[3,4,5] [0,2]
> NUMA-level 3	[0,2,3,4,5] [1]
> 
> Scheduler group peers for load balancing from the perspective of CPU 0
> and 4 are different.  Improper task could be chosen for load balancing
> between groups such as [0,2,3,4,5] [1].  Ideally you should choose nodes
> in 0 or 2 that are in same package as node 1 first.  But instead tasks
> in the remote package node 3, 4, 5 could be chosen with an equal chance
> and could lead to excessive remote package migrations and imbalance of
> load between packages.  We should not group partial remote nodes and
> local nodes together.
> 
> Simplify the remote distances for CWF-X and GNR-X for the purpose of
> sched domains building, which maintains symmetry and leads to a more
> reasonable load balance hierarchy.
> 
> The sched domains from the perspective of a CPU in node 0 NUMA-level 1
> is now
> NUMA-level 1	[0,1] [2]
> NUMA-level 2	[0,1,2] [3,4,5]
> 
> The sched domains from the perspective of a CPU in node 4 NUMA-level 1
> is now
> NUMA-level 1	[4] [3,5]
> NUMA-level 2	[3,4,5] [0,1,2]
> 
> We have the same balancing perspective from node 0 or node 4.  Loads are
> now balanced equally between packages.
> 
> Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> Tested-by: Zhao Liu <zhao1.liu@intel.com>
> ---
>   arch/x86/kernel/smpboot.c      | 28 ++++++++++++++++++++++++++++
>   include/linux/sched/topology.h |  1 +
>   kernel/sched/topology.c        | 25 +++++++++++++++++++------
>   3 files changed, 48 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 33e166f6ab12..c425e84c88b5 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
>   	set_sched_topology(topology);
>   }
>   
> +int sched_node_distance(int from, int to)
> +{
> +	int d = node_distance(from, to);
> +
> +	if (!x86_has_numa_in_package)
> +		return d;
> +
> +	switch (boot_cpu_data.x86_vfm) {
> +	case INTEL_GRANITERAPIDS_X:
> +	case INTEL_ATOM_DARKMONT_X:
> +		if (d < REMOTE_DISTANCE)
> +			return d;
> +
> +		/*
> +		 * Trim finer distance tuning for nodes in remote package
> +		 * for the purpose of building sched domains.
> +		 * Put NUMA nodes in each remote package in a single sched group.
> +		 * Simplify NUMA domains and avoid extra NUMA levels including different
> +		 * NUMA nodes in remote packages.
> +		 *
> +		 * GNR-x and CWF-X has GLUELESS-MESH topology with SNC
> +		 * turned on.
> +		 */
> +		d = (d / 10) * 10;

Does the '10' here mean that each socket hop adds 10 to the distance in
the SLIT table? For example, from socket 0's point of view, the distance
from socket 1 to socket 0 is within [20, 29), the distance from socket 2
to socket 0 is within [30, 39), and so on. If this is the case, maybe
add a comment above for future reference.

> +	}
> +	return d;
> +}
> +
>   void set_cpu_sibling_map(int cpu)
>   {
>   	bool has_smt = __max_threads_per_core > 1;
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 5263746b63e8..3b62226394af 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -59,6 +59,7 @@ static inline int cpu_numa_flags(void)
>   #endif
>   
>   extern int arch_asym_cpu_priority(int cpu);
> +extern int sched_node_distance(int from, int to);
>   
>   struct sched_domain_attr {
>   	int relax_domain_level;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9a7ac67e3d63..3f485da994a7 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1804,7 +1804,7 @@ bool find_numa_distance(int distance)
>   	bool found = false;
>   	int i, *distances;
>   
> -	if (distance == node_distance(0, 0))
> +	if (distance == sched_node_distance(0, 0))
>   		return true;
>

If I understand correctly, this patch is trying to fix the sched
domain issue during load balancing, and the NUMA balancing logic
should not be changed because NUMA balancing is not based on
sched domains?

That is to say, since find_numa_distance() is only used by
NUMA balancing, should we keep find_numa_distance() using
node_distance()?

>   	rcu_read_lock();
> @@ -1887,6 +1887,15 @@ static void init_numa_topology_type(int offline_node)
>   
>   #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
>   
> +/*
> + * Architecture could simplify NUMA distance, to avoid
> + * creating too many NUMA levels when SNC is turned on.
> + */
> +int __weak sched_node_distance(int from, int to)
> +{
> +	return node_distance(from, to);
> +}
> +
>   void sched_init_numa(int offline_node)
>   {
>   	struct sched_domain_topology_level *tl;
> @@ -1894,6 +1903,7 @@ void sched_init_numa(int offline_node)
>   	int nr_levels = 0;
>   	int i, j;
>   	int *distances;
> +	int max_dist = 0;
>   	struct cpumask ***masks;
>   
>   	/*
> @@ -1907,7 +1917,10 @@ void sched_init_numa(int offline_node)
>   	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
>   	for_each_cpu_node_but(i, offline_node) {
>   		for_each_cpu_node_but(j, offline_node) {
> -			int distance = node_distance(i, j);
> +			int distance = sched_node_distance(i, j);
> +
> +			if (node_distance(i,j) > max_dist)
> +				max_dist = node_distance(i,j);
>   
>   			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
>   				sched_numa_warn("Invalid distance value range");
> @@ -1979,10 +1992,10 @@ void sched_init_numa(int offline_node)
>   			masks[i][j] = mask;
>   
>   			for_each_cpu_node_but(k, offline_node) {
> -				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
> +				if (sched_debug() && (sched_node_distance(j, k) != sched_node_distance(k, j)))
>   					sched_numa_warn("Node-distance not symmetric");
>   
> -				if (node_distance(j, k) > sched_domains_numa_distance[i])
> +				if (sched_node_distance(j, k) > sched_domains_numa_distance[i])
>   					continue;
>   
>   				cpumask_or(mask, mask, cpumask_of_node(k));
> @@ -2022,7 +2035,7 @@ void sched_init_numa(int offline_node)
>   	sched_domain_topology = tl;
>   
>   	sched_domains_numa_levels = nr_levels;
> -	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
> +	WRITE_ONCE(sched_max_numa_distance, max_dist);

The above change uses the original node_distance() rather than
sched_node_distance() for sched_max_numa_distance, and
sched_max_numa_distance is only used by NUMA balancing to figure out
the NUMA topology type as well as to scale the NUMA fault statistics
for remote nodes.

So I think we might want to keep it aligned by using node_distance()
in find_numa_distance().

thanks,
Chenyu
>   
>   	init_numa_topology_type(offline_node);
>   }
> @@ -2092,7 +2105,7 @@ void sched_domains_numa_masks_set(unsigned int cpu)
>   				continue;
>   
>   			/* Set ourselves in the remote node's masks */
> -			if (node_distance(j, node) <= sched_domains_numa_distance[i])
> +			if (sched_node_distance(j, node) <= sched_domains_numa_distance[i])
>   				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
>   		}
>   	}


* Re: [PATCH 1/2] sched: topology: Fix topology validation error
  2025-08-22 20:14 ` [PATCH 1/2] sched: topology: Fix topology validation error Tim Chen
  2025-08-25  3:18   ` K Prateek Nayak
@ 2025-08-25  7:25   ` Peter Zijlstra
  2025-08-25 21:09     ` Tim Chen
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2025-08-25  7:25 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Vinicius Costa Gomes, Juri Lelli, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel,
	Chen Yu, K Prateek Nayak, Gautham R . Shenoy, Zhao Liu

On Fri, Aug 22, 2025 at 01:14:14PM -0700, Tim Chen wrote:
> From: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> 
> As sd_numa_mask() (the function behind tl->mask() for the NUMA levels
> of the topology) depends on the value of sched_domains_curr_level,
> it's possible to be iterating over a level while, sd_numa_mask()
> thinks we are in another, causing the topology validation to fail (for
> valid cases).
> 
> Set sched_domains_curr_level to the current topology level while
> iterating.
> 
> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/topology.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 977e133bb8a4..9a7ac67e3d63 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2394,6 +2394,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
>  	for_each_sd_topology(tl) {
>  		int tl_common_flags = 0;
>  
> +#ifdef CONFIG_NUMA
> +		/*
> +		 * sd_numa_mask() (one of the possible values of
> +		 * tl->mask()) depends on the current level to work
> +		 * correctly.
> +		 */

This is propagating that ugly hack from sd_init(), isn't it? Except it's
pretending like it's sane code... And for what?

> +		sched_domains_curr_level = tl->numa_level;
> +#endif
>  		if (tl->sd_flags)
>  			tl_common_flags = (*tl->sd_flags)();
>  
		if (tl_common_flags & SD_NUMA)
			continue;

So how does this make any difference?

We should never get to calling tl->mask() for NUMA.


* Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode
  2025-08-25  5:08   ` Chen, Yu C
@ 2025-08-25  7:56     ` Peter Zijlstra
  2025-08-25 21:36       ` Tim Chen
  2025-08-25 20:05     ` Tim Chen
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2025-08-25  7:56 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Tim Chen, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Len Brown, linux-kernel, K Prateek Nayak,
	Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes, Chen Yu

On Mon, Aug 25, 2025 at 01:08:39PM +0800, Chen, Yu C wrote:
> On 8/23/2025 4:14 AM, Tim Chen wrote:
> > It is possible for Granite Rapids X (GNR) and Clearwater Forest X
> > (CWF) to have up to 3 dies per package. When sub-numa cluster (SNC-3)
> > is enabled, each die will become a separate NUMA node in the package
> > with different distances between dies within the same package.
> > 
> > For example, on GNR-X, we see the following numa distances for a 2 socket
> > system with 3 dies per socket:
> > 
> >          package 1       package2
> >              ----------------
> >              |               |
> >          ---------       ---------
> >          |   0   |       |   3   |
> >          ---------       ---------
> >              |               |
> >          ---------       ---------
> >          |   1   |       |   4   |
> >          ---------       ---------
> >              |               |
> >          ---------       ---------
> >          |   2   |       |   5   |
> >          ---------       ---------
> >              |               |
> >              ----------------
> > 
> > node distances:
> > node     0    1    2    3    4    5
> >     0:   10   15   17   21   28   26
> >     1:   15   10   15   23   26   23
> >     2:   17   15   10   26   23   21
> >     3:   21   28   26   10   15   17
> >     4:   23   26   23   15   10   15
> >     5:   26   23   21   17   15   10
> > 

> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 33e166f6ab12..c425e84c88b5 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
> >   	set_sched_topology(topology);
> >   }
> > +int sched_node_distance(int from, int to)
> > +{
> > +	int d = node_distance(from, to);
> > +
> > +	if (!x86_has_numa_in_package)
> > +		return d;
> > +
> > +	switch (boot_cpu_data.x86_vfm) {
> > +	case INTEL_GRANITERAPIDS_X:
> > +	case INTEL_ATOM_DARKMONT_X:
> > +		if (d < REMOTE_DISTANCE)
> > +			return d;
> > +
> > +		/*
> > +		 * Trim finer distance tuning for nodes in remote package
> > +		 * for the purpose of building sched domains.
> > +		 * Put NUMA nodes in each remote package in a single sched group.
> > +		 * Simplify NUMA domains and avoid extra NUMA levels including different
> > +		 * NUMA nodes in remote packages.
> > +		 *
> > +		 * GNR-x and CWF-X has GLUELESS-MESH topology with SNC
> > +		 * turned on.
> > +		 */
> > +		d = (d / 10) * 10;
> 
> Does the '10' here mean that, the distance of the hierarchy socket
> is 10 from SLIT table? For example, from a socket0 point of view,
> the distance of socket1 to socket0 is within [20, 29), the distance
> of socket2 to socket0 is [30,39), and so on. If this is the case,
> maybe add a comment above for future reference.

This is all because of the ACPI SLIT distance definitions I suppose, 10
for local and 20 for remote (which IMO is actively wrong, since it
mandates distances that are not relative performance).

Additionally, the table above magically has all the remote distances in
the range of [20,29], and so stripping the ones digit works.

The problem of course is that the SLIT table is fully under the control
of the BIOS, and a random BIOS monkey could cause this to not be so,
making the above code not work as intended. E.g. if the remote distances
end up being in the range of [20,35] or whatever, then it all goes
sideways.

(There is a history of manipulating the SLIT table to influence the
scheduler behaviour of the OS of choice :-/ )

Similarly, when doing a 4-node system, it is possible a 2-hop distance
doesn't align nicely with the 10s and we're up a creek again.

This is all very fragile. A much better way would be to allocate a new
SLIT table, identify the (local) clusters and replace all remote
instances with an average.

Eg. since (21+28+26+23+26+23+26+23+21)/9 ~ 24, you end up with:

 node     0    1    2    3    4    5
     0:   10   15   17   24   24   24
     1:   15   10   15   24   24   24
     2:   17   15   10   24   24   24
     3:   24   24   24   10   15   17
     4:   24   24   24   15   10   15
     5:   24   24   24   17   15   10
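
A minimal user-space sketch of that averaging (an illustration only,
not kernel code), assuming the example's mapping of 3 dies per package
so that nodes 0-2 form package 0 and nodes 3-5 form package 1:

	#include <stdio.h>

	#define N 6
	/* Assumption from the example: 3 consecutive nodes per package. */
	#define SAME_PACKAGE(a, b)	((a) / 3 == (b) / 3)

	int main(void)
	{
		int dist[N][N] = {
			{ 10, 15, 17, 21, 28, 26 },
			{ 15, 10, 15, 23, 26, 23 },
			{ 17, 15, 10, 26, 23, 21 },
			{ 21, 28, 26, 10, 15, 17 },
			{ 23, 26, 23, 15, 10, 15 },
			{ 26, 23, 21, 17, 15, 10 },
		};
		int sum = 0, cnt = 0, avg;

		/* Average every cross-package distance ... */
		for (int i = 0; i < N; i++)
			for (int j = 0; j < N; j++)
				if (!SAME_PACKAGE(i, j)) {
					sum += dist[i][j];
					cnt++;
				}
		avg = sum / cnt;	/* ~24 for the table above */

		/* ... and print the simplified table, keeping local distances. */
		for (int i = 0; i < N; i++) {
			for (int j = 0; j < N; j++)
				printf("%5d", SAME_PACKAGE(i, j) ? dist[i][j] : avg);
			printf("\n");
		}
		return 0;
	}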




* Re: [PATCH 1/2] sched: topology: Fix topology validation error
  2025-08-25  3:18   ` K Prateek Nayak
@ 2025-08-25  7:58     ` Peter Zijlstra
  2025-08-25  9:23       ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2025-08-25  7:58 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Tim Chen, Vinicius Costa Gomes, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown,
	linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu

On Mon, Aug 25, 2025 at 08:48:29AM +0530, K Prateek Nayak wrote:
> Hello Tim, Vinicius,
> 
> On 8/23/2025 1:44 AM, Tim Chen wrote:
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2394,6 +2394,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
> >  	for_each_sd_topology(tl) {
> >  		int tl_common_flags = 0;
> >  
> > +#ifdef CONFIG_NUMA
> > +		/*
> > +		 * sd_numa_mask() (one of the possible values of
> > +		 * tl->mask()) depends on the current level to work
> > +		 * correctly.
> > +		 */
> > +		sched_domains_curr_level = tl->numa_level;
> > +#endif
> 
> A similar solution was proposed in [1] and after a few iterations, we
> arrived at [2] as a potential solution to this issue. Now that the merge
> window is behind us, Peter would it be possible to pick one of these up?
> 
> P.S. Leon has confirmed this solved the splat of their deployments too
> on an earlier version [3].
> 
> [1] https://lore.kernel.org/lkml/20250624041235.1589-1-kprateek.nayak@amd.com/
> [2] https://lore.kernel.org/lkml/20250715040824.893-1-kprateek.nayak@amd.com/
> [3] https://lore.kernel.org/lkml/20250720104136.GI402218@unreal/

I'm sure that's stuck somewhere in my holiday backlog ... Let me go try
and find it.


* Re: [PATCH 1/2] sched: topology: Fix topology validation error
  2025-08-25  7:58     ` Peter Zijlstra
@ 2025-08-25  9:23       ` Peter Zijlstra
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2025-08-25  9:23 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Tim Chen, Vinicius Costa Gomes, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown,
	linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu

On Mon, Aug 25, 2025 at 09:58:07AM +0200, Peter Zijlstra wrote:
> On Mon, Aug 25, 2025 at 08:48:29AM +0530, K Prateek Nayak wrote:
> > Hello Tim, Vinicius,
> > 
> > On 8/23/2025 1:44 AM, Tim Chen wrote:
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2394,6 +2394,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
> > >  	for_each_sd_topology(tl) {
> > >  		int tl_common_flags = 0;
> > >  
> > > +#ifdef CONFIG_NUMA
> > > +		/*
> > > +		 * sd_numa_mask() (one of the possible values of
> > > +		 * tl->mask()) depends on the current level to work
> > > +		 * correctly.
> > > +		 */
> > > +		sched_domains_curr_level = tl->numa_level;
> > > +#endif
> > 
> > A similar solution was proposed in [1] and after a few iterations, we
> > arrived at [2] as a potential solution to this issue. Now that the merge
> > window is behind us, Peter would it be possible to pick one of these up?
> > 
> > P.S. Leon has confirmed this solved the splat of their deployments too
> > on an earlier version [3].
> > 
> > [1] https://lore.kernel.org/lkml/20250624041235.1589-1-kprateek.nayak@amd.com/
> > [2] https://lore.kernel.org/lkml/20250715040824.893-1-kprateek.nayak@amd.com/
> > [3] https://lore.kernel.org/lkml/20250720104136.GI402218@unreal/
> 
> I'm sure that's stuck somewhere in my holiday backlog ... Let me go try
> and find it.

Replied there.

  https://lkml.kernel.org/r/20250825091910.GT3245006@noisy.programming.kicks-ass.net


* Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode
  2025-08-25  5:08   ` Chen, Yu C
  2025-08-25  7:56     ` Peter Zijlstra
@ 2025-08-25 20:05     ` Tim Chen
  1 sibling, 0 replies; 14+ messages in thread
From: Tim Chen @ 2025-08-25 20:05 UTC (permalink / raw)
  To: Chen, Yu C, Peter Zijlstra, Ingo Molnar
  Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy,
	Zhao Liu, Vinicius Costa Gomes, Chen Yu

On Mon, 2025-08-25 at 13:08 +0800, Chen, Yu C wrote:
> On 8/23/2025 4:14 AM, Tim Chen wrote:
> > 
... snip...
> > 
> > Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> > Tested-by: Zhao Liu <zhao1.liu@intel.com>
> > ---
> >   arch/x86/kernel/smpboot.c      | 28 ++++++++++++++++++++++++++++
> >   include/linux/sched/topology.h |  1 +
> >   kernel/sched/topology.c        | 25 +++++++++++++++++++------
> >   3 files changed, 48 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 33e166f6ab12..c425e84c88b5 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
> >   	set_sched_topology(topology);
> >   }
> >   
> > +int sched_node_distance(int from, int to)
> > +{
> > +	int d = node_distance(from, to);
> > +
> > +	if (!x86_has_numa_in_package)
> > +		return d;
> > +
> > +	switch (boot_cpu_data.x86_vfm) {
> > +	case INTEL_GRANITERAPIDS_X:
> > +	case INTEL_ATOM_DARKMONT_X:
> > +		if (d < REMOTE_DISTANCE)
> > +			return d;
> > +
> > +		/*
> > +		 * Trim finer distance tuning for nodes in remote package
> > +		 * for the purpose of building sched domains.
> > +		 * Put NUMA nodes in each remote package in a single sched group.
> > +		 * Simplify NUMA domains and avoid extra NUMA levels including different
> > +		 * NUMA nodes in remote packages.
> > +		 *
> > +		 * GNR-x and CWF-X has GLUELESS-MESH topology with SNC
> > +		 * turned on.
> > +		 */
> > +		d = (d / 10) * 10;
> 
> Does the '10' here mean that, the distance of the hierarchy socket
> is 10 from SLIT table? 
> 

Yes.

> For example, from a socket0 point of view,
> the distance of socket1 to socket0 is within [20, 29), the distance
> of socket2 to socket0 is [30,39), and so on. If this is the case,
> maybe add a comment above for future reference.
> 

We don't expect to have more than 2 sockets for GNR and CWF.
So the case of 2 hops like [30,39) should not happen.

> > +	}
> > +	return d;
> > +}
> > +
> >   void set_cpu_sibling_map(int cpu)
> >   {
> >   	bool has_smt = __max_threads_per_core > 1;
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 5263746b63e8..3b62226394af 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -59,6 +59,7 @@ static inline int cpu_numa_flags(void)
> >   #endif
> >   
> >   extern int arch_asym_cpu_priority(int cpu);
> > +extern int sched_node_distance(int from, int to);
> >   
> >   struct sched_domain_attr {
> >   	int relax_domain_level;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 9a7ac67e3d63..3f485da994a7 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1804,7 +1804,7 @@ bool find_numa_distance(int distance)
> >   	bool found = false;
> >   	int i, *distances;
> >   
> > -	if (distance == node_distance(0, 0))
> > +	if (distance == sched_node_distance(0, 0))
> >   		return true;
> > 
> 
> If I understand correct, this patch is trying to fix the sched
> domain issue during load balancing, and NUMA balance logic
> should not be changed because NUMA balancing is not based on
> sched domain?
> 
> That is to say, since the find_numa_distance() is only used by
> NUMA balance, should we keep find_numa_distance() to still use
> node_distance()?

The procedure here is using the distance matrix that's initialized
using sched_node_distance(), hence the change.

Otherwise we could keep a separate sched_distance matrix and use
only node_distance() here.  I did not do that, to minimize the change.

Tim

> 
> >   	rcu_read_lock();
> > @@ -1887,6 +1887,15 @@ static void init_numa_topology_type(int offline_node)
> >   
> >   #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
> >   
> > +/*
> > + * Architecture could simplify NUMA distance, to avoid
> > + * creating too many NUMA levels when SNC is turned on.
> > + */
> > +int __weak sched_node_distance(int from, int to)
> > +{
> > +	return node_distance(from, to);
> > +}
> > +
> >   void sched_init_numa(int offline_node)
> >   {
> >   	struct sched_domain_topology_level *tl;
> > @@ -1894,6 +1903,7 @@ void sched_init_numa(int offline_node)
> >   	int nr_levels = 0;
> >   	int i, j;
> >   	int *distances;
> > +	int max_dist = 0;
> >   	struct cpumask ***masks;
> >   
> >   	/*
> > @@ -1907,7 +1917,10 @@ void sched_init_numa(int offline_node)
> >   	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
> >   	for_each_cpu_node_but(i, offline_node) {
> >   		for_each_cpu_node_but(j, offline_node) {
> > -			int distance = node_distance(i, j);
> > +			int distance = sched_node_distance(i, j);
> > +
> > +			if (node_distance(i,j) > max_dist)
> > +				max_dist = node_distance(i,j);
> >   
> >   			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
> >   				sched_numa_warn("Invalid distance value range");
> > @@ -1979,10 +1992,10 @@ void sched_init_numa(int offline_node)
> >   			masks[i][j] = mask;
> >   
> >   			for_each_cpu_node_but(k, offline_node) {
> > -				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
> > +				if (sched_debug() && (sched_node_distance(j, k) != sched_node_distance(k, j)))
> >   					sched_numa_warn("Node-distance not symmetric");
> >   
> > -				if (node_distance(j, k) > sched_domains_numa_distance[i])
> > +				if (sched_node_distance(j, k) > sched_domains_numa_distance[i])
> >   					continue;
> >   
> >   				cpumask_or(mask, mask, cpumask_of_node(k));
> > @@ -2022,7 +2035,7 @@ void sched_init_numa(int offline_node)
> >   	sched_domain_topology = tl;
> >   
> >   	sched_domains_numa_levels = nr_levels;
> > -	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
> > +	WRITE_ONCE(sched_max_numa_distance, max_dist);
> 
> Above change is to use the original node_distance() rather than
> sched_node_distance() for sched_max_numa_distance, and
> sched_max_numa_distance is only used by NUMA balance to figure out
> the NUMA topology type as well as scaling the NUMA fault statistics
> for remote Nodes.
> 
> So I think we might want to keep it align by using node_distance()
> in find_numa_distance().
> 
> thanks,
> Chenyu
> >   
> >   	init_numa_topology_type(offline_node);
> >   }
> > @@ -2092,7 +2105,7 @@ void sched_domains_numa_masks_set(unsigned int cpu)
> >   				continue;
> >   
> >   			/* Set ourselves in the remote node's masks */
> > -			if (node_distance(j, node) <= sched_domains_numa_distance[i])
> > +			if (sched_node_distance(j, node) <= sched_domains_numa_distance[i])
> >   				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> >   		}
> >   	}



* Re: [PATCH 1/2] sched: topology: Fix topology validation error
  2025-08-25  7:25   ` Peter Zijlstra
@ 2025-08-25 21:09     ` Tim Chen
  0 siblings, 0 replies; 14+ messages in thread
From: Tim Chen @ 2025-08-25 21:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vinicius Costa Gomes, Juri Lelli, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel,
	Chen Yu, K Prateek Nayak, Gautham R . Shenoy, Zhao Liu

On Mon, 2025-08-25 at 09:25 +0200, Peter Zijlstra wrote:
> On Fri, Aug 22, 2025 at 01:14:14PM -0700, Tim Chen wrote:
> > From: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> > 
> > As sd_numa_mask() (the function behind tl->mask() for the NUMA levels
> > of the topology) depends on the value of sched_domains_curr_level,
> > it's possible to be iterating over a level while, sd_numa_mask()
> > thinks we are in another, causing the topology validation to fail (for
> > valid cases).
> > 
> > Set sched_domains_curr_level to the current topology level while
> > iterating.
> > 
> > Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >  kernel/sched/topology.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 977e133bb8a4..9a7ac67e3d63 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2394,6 +2394,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
> >  	for_each_sd_topology(tl) {
> >  		int tl_common_flags = 0;
> >  
> > +#ifdef CONFIG_NUMA
> > +		/*
> > +		 * sd_numa_mask() (one of the possible values of
> > +		 * tl->mask()) depends on the current level to work
> > +		 * correctly.
> > +		 */
> 
> This is propagating that ugly hack from sd_init(), isn't it. Except its
> pretending like its sane code... And for what?

How about the following fix for the CONFIG_NUMA case?  Will this be more sane? 

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b958fe48e020..a92457fed135 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1758,7 +1758,7 @@ static struct sched_domain_topology_level *sched_domain_topology =
 static struct sched_domain_topology_level *sched_domain_topology_saved;
 
 #define for_each_sd_topology(tl)                       \
-       for (tl = sched_domain_topology; tl->mask; tl++)
+       for (tl = sched_domain_topology; tl->mask; ++tl, sched_domains_curr_level = tl->numa_level)
 
 void __init set_sched_topology(struct sched_domain_topology_level *tl)
 {


> 
> > +		sched_domains_curr_level = tl->numa_level;
> > +#endif
> >  		if (tl->sd_flags)
> >  			tl_common_flags = (*tl->sd_flags)();
> >  
> 		if (tl_common_flags & SD_NUMA)
> 			continue;
> 
> So how does this make any difference ?
> 
> We should never get to calling tl->mask() for NUMA.
> 

True.  I think we were originally fixing the v6.16 case, which
wasn't checking for the SD_NUMA flag, and overlooked that when we
ported the fix.

That said, I think that the for_each_sd_topology() macro needs to have
sched_domains_curr_level updated to prevent future problems.

Tim




* Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode
  2025-08-25  7:56     ` Peter Zijlstra
@ 2025-08-25 21:36       ` Tim Chen
  0 siblings, 0 replies; 14+ messages in thread
From: Tim Chen @ 2025-08-25 21:36 UTC (permalink / raw)
  To: Peter Zijlstra, Chen, Yu C
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy,
	Zhao Liu, Vinicius Costa Gomes, Chen Yu, Arjan van de Ven

On Mon, 2025-08-25 at 09:56 +0200, Peter Zijlstra wrote:
> > 

... snip ...

> > > > > > +		/*
> > > > > > +		 * Trim finer distance tuning for nodes in remote package
> > > > > > +		 * for the purpose of building sched domains.
> > > > > > +		 * Put NUMA nodes in each remote package in a single sched group.
> > > > > > +		 * Simplify NUMA domains and avoid extra NUMA levels including different
> > > > > > +		 * NUMA nodes in remote packages.
> > > > > > +		 *
> > > > > > +		 * GNR-x and CWF-X has GLUELESS-MESH topology with SNC
> > > > > > +		 * turned on.
> > > > > > +		 */
> > > > > > +		d = (d / 10) * 10;
> > > > 
> > > > Does the '10' here mean that, the distance of the hierarchy socket
> > > > is 10 from SLIT table? For example, from a socket0 point of view,
> > > > the distance of socket1 to socket0 is within [20, 29), the distance
> > > > of socket2 to socket0 is [30,39), and so on. If this is the case,
> > > > maybe add a comment above for future reference.
> > 
> > This is all because of the ACPI SLIT distance definitions I suppose, 10
> > for local and 20 for remote (which IMO is actively wrong, since it
> > mandates distances that are not relative performance).
> > 
> > Additionally, the table above magically has all the remote distances in
> > the range of [20,29] and so the strip 1s thing works.
> > 
> > The problem of course is that the SLIT table is fully under control of
> > the BIOS and random BIOS monkey could cause this to not be so making the
> > above code not work as intended. Eg. if the remote distances ends up
> > being in the range of [20,35] or whatever, then it all goes sideways.
> > 
> > ( There is a history of manupulating the SLIT table to influence
> > scheduler behaviour of OS of choice :-/ )
> > 
> > Similarly, when doing a 4 node system, it is possible a 2 hop distances
> > doesn't align nicely with the 10s and we're up a creek again.

We don't expect 4-node systems for GNR or CWF, so hopefully we don't need to
worry about them.  Otherwise we would need additional code to check for 2 hops.

> > 
> > This is all very fragile. A much better way would be to allocate a new
> > SLIT table, identify the (local) clusters and replace all remote
> > instances with an average.

Are you suggesting having one SLIT distance table that's simplified for
the sched domain build and another for the true node distances?

> > 
> > Eg. since (21+28+26+23+26+23+26+23+21)/9 ~ 24, you end up with:
> > 
> >  node     0    1    2    3    4    5
> >      0:   10   15   17   24   24   24
> >      1:   15   10   15   24   24   24
> >      2:   17   15   10   24   24   24
> >      3:   24   24   24   10   15   17
> >      4:   24   24   24   15   10   15
> >      5:   24   24   24   17   15   10
> > 
> > 

I will take a closer look at using an average for nodes
in the remote package.

Tim


* Re: [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X
  2025-08-25  4:18 ` [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X K Prateek Nayak
@ 2025-08-25 21:38   ` Tim Chen
  0 siblings, 0 replies; 14+ messages in thread
From: Tim Chen @ 2025-08-25 21:38 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar
  Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu,
	Vinicius Costa Gomes

On Mon, 2025-08-25 at 09:48 +0530, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 8/23/2025 1:44 AM, Tim Chen wrote:
> > While testing Granite Rapids X (GNR-X) and Clearwater Forest X (CWF-X) in
> > SNc-3 mode, we encountered sched domain build errors reported in dmesg.
> > Asymmetric node distances from local node to to nodes in remote package
> > was not expected by the scheduler domain code and also led to excessive
> > number of sched domain hierachy levels.
> > 
> > Fix the missing NUMA domain level set in topology_span_sane() check and
> > also simplify the distance to nodes in remote package to retain distance
> > symmetry and make the NUMA topology sane for GNR-X and CWF-X.
> 
> I did some sanity testing on an EPYC platform  on NPS2/4 and didn't
> see any changes to the sched domain layout or the sched_node_distance()
> being used when constructing them with the series.
> 
> Feel free to include:
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 

Thanks for testing it.

Tim

