public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/2] Fix NUMA sched domain build errors for GNR and CWF
@ 2025-09-05 18:36 Tim Chen
  2025-09-05 18:36 ` [PATCH v2 1/2] sched: Create architecture specific sched domain distances Tim Chen
  2025-09-05 18:36 ` [PATCH v2 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode Tim Chen
  0 siblings, 2 replies; 5+ messages in thread
From: Tim Chen @ 2025-09-05 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
	Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes,
	Arjan Van De Ven

While testing Granite Rapids (GNR) and Clearwater Forest (CWF) in
SNC-3 mode, we encountered sched domain build errors reported in dmesg.
Asymmetric node distances from the local node to nodes in a remote
package were not expected by the scheduler domain code and also led to
an excessive number of sched domain hierarchy levels.

Simplify the remote node distances for the purpose of building sched
domains for GNR and CWF to fix the domain build errors and reduce
the number of NUMA sched domain levels.

Changes in v2:
- Allow the architecture to modify the NUMA distances used as the
  sched domain NUMA distances when building sched domains, in order
  to simplify NUMA domains.
  Maintain the sched domain NUMA distances separately from the
  actual NUMA distances.
- Use average remote node distance as the distance to nodes in remote
  packages for GNR and CWF.
- Remove the original fix for topology_span_sane() that's superseded
  by a better fix from Prateek:
  https://lore.kernel.org/lkml/175688671425.1920.13690753997160836570.tip-bot2@tip-bot2/
- Link to v1: https://lore.kernel.org/lkml/cover.1755893468.git.tim.c.chen@linux.intel.com/


Tim Chen (2):
  sched: Create architecture specific sched domain distances
  sched: Fix sched domain build error for GNR, CWF in SNC-3 mode

 arch/x86/kernel/smpboot.c      |  28 ++++++++
 include/linux/sched/topology.h |   2 +
 kernel/sched/topology.c        | 118 ++++++++++++++++++++++++++++-----
 3 files changed, 131 insertions(+), 17 deletions(-)

-- 
2.32.0



* [PATCH v2 1/2] sched: Create architecture specific sched domain distances
  2025-09-05 18:36 [PATCH v2 0/2] Fix NUMA sched domain build errors for GNR and CWF Tim Chen
@ 2025-09-05 18:36 ` Tim Chen
  2025-09-07 16:28   ` Chen, Yu C
  2025-09-05 18:36 ` [PATCH v2 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode Tim Chen
  1 sibling, 1 reply; 5+ messages in thread
From: Tim Chen @ 2025-09-05 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
	Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes,
	Arjan Van De Ven

Allow architecture specific sched domain NUMA distances that can be
derived from the NUMA node distances for the purpose of building NUMA
sched domains.

The actual NUMA distances are kept separately.  This allows the NUMA
domain levels to be modified when building sched domains for specific
architectures.

Consolidate the recording of unique NUMA distances in an array to
sched_record_numa_dist() so the function can be reused to record NUMA
distances when the NUMA distance metric is changed.

No functional change if no arch specific NUMA distances are defined.

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/sched/topology.h |   2 +
 kernel/sched/topology.c        | 118 ++++++++++++++++++++++++++++-----
 2 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 5263746b63e8..4f58e78ca52e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -59,6 +59,8 @@ static inline int cpu_numa_flags(void)
 #endif
 
 extern int arch_asym_cpu_priority(int cpu);
+extern int arch_sched_node_distance(int from, int to);
+extern int sched_avg_remote_numa_distance;
 
 struct sched_domain_attr {
 	int relax_domain_level;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 977e133bb8a4..1f08dfef2ea5 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1591,10 +1591,13 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 enum numa_topology_type sched_numa_topology_type;
 
 static int			sched_domains_numa_levels;
+static int			sched_numa_node_levels;
 static int			sched_domains_curr_level;
 
 int				sched_max_numa_distance;
+int				sched_avg_remote_numa_distance;
 static int			*sched_domains_numa_distance;
+static int			*sched_numa_node_distance;
 static struct cpumask		***sched_domains_numa_masks;
 #endif /* CONFIG_NUMA */
 
@@ -1808,10 +1811,10 @@ bool find_numa_distance(int distance)
 		return true;
 
 	rcu_read_lock();
-	distances = rcu_dereference(sched_domains_numa_distance);
+	distances = rcu_dereference(sched_numa_node_distance);
 	if (!distances)
 		goto unlock;
-	for (i = 0; i < sched_domains_numa_levels; i++) {
+	for (i = 0; i < sched_numa_node_levels; i++) {
 		if (distances[i] == distance) {
 			found = true;
 			break;
@@ -1887,14 +1890,29 @@ static void init_numa_topology_type(int offline_node)
 
 #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
 
-void sched_init_numa(int offline_node)
+/*
+ * Architecture could simplify NUMA distance, to avoid
+ * creating too many NUMA levels.
+ */
+int __weak arch_sched_node_distance(int from, int to)
+{
+	return node_distance(from, to);
+}
+
+static int numa_node_dist(int i, int j)
+{
+	return node_distance(i, j);
+}
+
+static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
+		int **dist, int *maximum_dist, int *levels)
+
 {
-	struct sched_domain_topology_level *tl;
 	unsigned long *distance_map;
 	int nr_levels = 0;
 	int i, j;
 	int *distances;
-	struct cpumask ***masks;
+	int max_dist = 0;
 
 	/*
 	 * O(nr_nodes^2) de-duplicating selection sort -- in order to find the
@@ -1902,17 +1920,20 @@ void sched_init_numa(int offline_node)
 	 */
 	distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
 	if (!distance_map)
-		return;
+		return -ENOMEM;
 
 	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
 	for_each_cpu_node_but(i, offline_node) {
 		for_each_cpu_node_but(j, offline_node) {
-			int distance = node_distance(i, j);
+			int distance = n_dist(i, j);
+
+			if (distance > max_dist)
+				max_dist = distance;
 
 			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
 				sched_numa_warn("Invalid distance value range");
 				bitmap_free(distance_map);
-				return;
+				return -EINVAL;
 			}
 
 			bitmap_set(distance_map, distance, 1);
@@ -1927,17 +1948,70 @@ void sched_init_numa(int offline_node)
 	distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
 	if (!distances) {
 		bitmap_free(distance_map);
-		return;
+		return -ENOMEM;
 	}
-
 	for (i = 0, j = 0; i < nr_levels; i++, j++) {
 		j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
 		distances[i] = j;
 	}
-	rcu_assign_pointer(sched_domains_numa_distance, distances);
+	*dist = distances;
+	if (levels)
+		*levels = nr_levels;
+
+	if (maximum_dist)
+		*maximum_dist = max_dist;
 
 	bitmap_free(distance_map);
 
+	return 0;
+}
+
+static int avg_remote_numa_distance(int offline_node)
+{
+	int i, j;
+	int distance, nr_remote = 0, total_distance = 0;
+
+	for_each_cpu_node_but(i, offline_node) {
+		for_each_cpu_node_but(j, offline_node) {
+			distance = node_distance(i, j);
+
+			if (distance >= REMOTE_DISTANCE) {
+				nr_remote++;
+				total_distance += distance;
+			}
+		}
+	}
+	if (nr_remote)
+		return total_distance / nr_remote;
+	else
+		return REMOTE_DISTANCE;
+}
+
+void sched_init_numa(int offline_node)
+{
+	struct sched_domain_topology_level *tl;
+	int nr_levels, nr_node_levels;
+	int i, j;
+	int *distances, *domain_distances;
+	int max_dist;
+	struct cpumask ***masks;
+
+	if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
+				   &max_dist, &nr_node_levels))
+		return;
+
+	WRITE_ONCE(sched_avg_remote_numa_distance,
+		   avg_remote_numa_distance(offline_node));
+
+	if (sched_record_numa_dist(offline_node,
+				   arch_sched_node_distance, &domain_distances,
+				   NULL, &nr_levels)) {
+		kfree(distances);
+		return;
+	}
+	rcu_assign_pointer(sched_numa_node_distance, distances);
+	WRITE_ONCE(sched_numa_node_levels, nr_node_levels);
+
 	/*
 	 * 'nr_levels' contains the number of unique distances
 	 *
@@ -1954,6 +2028,8 @@ void sched_init_numa(int offline_node)
 	 *
 	 * We reset it to 'nr_levels' at the end of this function.
 	 */
+	rcu_assign_pointer(sched_domains_numa_distance, domain_distances);
+
 	sched_domains_numa_levels = 0;
 
 	masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
@@ -1979,10 +2055,13 @@ void sched_init_numa(int offline_node)
 			masks[i][j] = mask;
 
 			for_each_cpu_node_but(k, offline_node) {
-				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
+				if (sched_debug() &&
+				    (arch_sched_node_distance(j, k) !=
+				     arch_sched_node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
 
-				if (node_distance(j, k) > sched_domains_numa_distance[i])
+				if (arch_sched_node_distance(j, k) >
+					sched_domains_numa_distance[i])
 					continue;
 
 				cpumask_or(mask, mask, cpumask_of_node(k));
@@ -2022,7 +2101,7 @@ void sched_init_numa(int offline_node)
 	sched_domain_topology = tl;
 
 	sched_domains_numa_levels = nr_levels;
-	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
+	WRITE_ONCE(sched_max_numa_distance, max_dist);
 
 	init_numa_topology_type(offline_node);
 }
@@ -2030,14 +2109,18 @@ void sched_init_numa(int offline_node)
 
 static void sched_reset_numa(void)
 {
-	int nr_levels, *distances;
+	int nr_levels, *distances, *dom_distances;
 	struct cpumask ***masks;
 
 	nr_levels = sched_domains_numa_levels;
+	sched_numa_node_levels = 0;
 	sched_domains_numa_levels = 0;
 	sched_max_numa_distance = 0;
+	sched_avg_remote_numa_distance = 0;
 	sched_numa_topology_type = NUMA_DIRECT;
-	distances = sched_domains_numa_distance;
+	distances = sched_numa_node_distance;
+	dom_distances = sched_domains_numa_distance;
+	rcu_assign_pointer(sched_numa_node_distance, NULL);
 	rcu_assign_pointer(sched_domains_numa_distance, NULL);
 	masks = sched_domains_numa_masks;
 	rcu_assign_pointer(sched_domains_numa_masks, NULL);
@@ -2054,6 +2137,7 @@ static void sched_reset_numa(void)
 			kfree(masks[i]);
 		}
 		kfree(masks);
+		kfree(dom_distances);
 	}
 	if (sched_domain_topology_saved) {
 		kfree(sched_domain_topology);
@@ -2092,7 +2176,7 @@ void sched_domains_numa_masks_set(unsigned int cpu)
 				continue;
 
 			/* Set ourselves in the remote node's masks */
-			if (node_distance(j, node) <= sched_domains_numa_distance[i])
+			if (arch_sched_node_distance(j, node) <= sched_domains_numa_distance[i])
 				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
 		}
 	}
-- 
2.32.0



* [PATCH v2 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode
  2025-09-05 18:36 [PATCH v2 0/2] Fix NUMA sched domain build errors for GNR and CWF Tim Chen
  2025-09-05 18:36 ` [PATCH v2 1/2] sched: Create architecture specific sched domain distances Tim Chen
@ 2025-09-05 18:36 ` Tim Chen
  1 sibling, 0 replies; 5+ messages in thread
From: Tim Chen @ 2025-09-05 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
	Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes,
	Arjan Van De Ven

It is possible for Granite Rapids (GNR) and Clearwater Forest (CWF)
to have up to 3 dies per package. When Sub-NUMA Clustering (SNC-3)
is enabled, each die becomes a separate NUMA node in the package,
with different distances between dies within the same package.

For example, on GNR, we see the following NUMA distances for a
2-socket system with 3 dies per socket:

    package 1       package 2
	----------------
	|               |
    ---------       ---------
    |   0   |       |   3   |
    ---------       ---------
	|               |
    ---------       ---------
    |   1   |       |   4   |
    ---------       ---------
	|               |
    ---------       ---------
    |   2   |       |   5   |
    ---------       ---------
	|               |
	----------------

node distances:
node     0    1    2    3    4    5
0:   	10   15   17   21   28   26
1:   	15   10   15   23   26   23
2:   	17   15   10   26   23   21
3:   	21   28   26   10   15   17
4:   	23   26   23   15   10   15
5:   	26   23   21   17   15   10

The node distances above led to two problems:

1. Asymmetric routes taken between nodes in different packages led to
an asymmetric scheduler domain perspective depending on which node you
are on.  The current scheduler code failed to build domains properly
with asymmetric distances.

2. Multiple remote distances to the respective tiles on the remote
package created too many levels of domain hierarchy, grouping different
nodes between remote packages.

For example, the above GNR-X topology led to the NUMA domains below.

Sched domains from the perspective of a CPU in node 0, where the
numbers in brackets represent node numbers:

NUMA-level 1    [0,1] [2]
NUMA-level 2    [0,1,2] [3]
NUMA-level 3    [0,1,2,3] [5]
NUMA-level 4    [0,1,2,3,5] [4]

Sched domains from the perspective of a CPU in node 4
NUMA-level 1    [4] [3,5]
NUMA-level 2    [3,4,5] [0,2]
NUMA-level 3    [0,2,3,4,5] [1]

Scheduler group peers for load balancing from the perspective of a CPU
in node 0 and node 4 are different.  An improper task could be chosen
for load balancing between groups such as [0,2,3,4,5] [1].  Ideally,
nodes 0 or 2, which are in the same package as node 1, should be chosen
first.  Instead, tasks in the remote package nodes 3, 4 and 5 could be
chosen with an equal chance, which could lead to excessive remote
package migrations and an imbalance of load between packages.  We
should not group partial remote nodes and local nodes together.

Simplify the remote distances for CWF-X and GNR-X for the purpose of
sched domain building, which maintains symmetry and leads to a more
reasonable load balance hierarchy.

The sched domains from the perspective of a CPU in node 0 are now:
NUMA-level 1    [0,1] [2]
NUMA-level 2    [0,1,2] [3,4,5]

The sched domains from the perspective of a CPU in node 4 are now:
NUMA-level 1    [4] [3,5]
NUMA-level 2    [3,4,5] [0,1,2]

We have the same balancing perspective from node 0 or node 4.  Loads are
now balanced equally between packages.

Tested-by: Zhao Liu <zhao1.liu@intel.com>
Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 arch/x86/kernel/smpboot.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 33e166f6ab12..3f894c525e49 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
 	set_sched_topology(topology);
 }
 
+int arch_sched_node_distance(int from, int to)
+{
+	int d = node_distance(from, to);
+
+	if (!x86_has_numa_in_package)
+		return d;
+
+	switch (boot_cpu_data.x86_vfm) {
+	case INTEL_GRANITERAPIDS_X:
+	case INTEL_ATOM_DARKMONT_X:
+		if (d < REMOTE_DISTANCE)
+			return d;
+
+		/*
+		 * Trim finer distance tuning for nodes in remote package
+		 * for the purpose of building sched domains.  Put NUMA nodes
+		 * in each remote package in the same sched group.
+		 * Simplify NUMA domains and avoid extra NUMA levels including
+		 * different NUMA nodes in remote packages.
+		 *
+		 * GNR and CWF don't expect systems with more than 2 packages
+		 * and more than 2 hops between packages.
+		 */
+		d = sched_avg_remote_numa_distance;
+	}
+	return d;
+}
+
 void set_cpu_sibling_map(int cpu)
 {
 	bool has_smt = __max_threads_per_core > 1;
-- 
2.32.0



* Re: [PATCH v2 1/2] sched: Create architecture specific sched domain distances
  2025-09-05 18:36 ` [PATCH v2 1/2] sched: Create architecture specific sched domain distances Tim Chen
@ 2025-09-07 16:28   ` Chen, Yu C
  2025-09-08 18:23     ` Tim Chen
  0 siblings, 1 reply; 5+ messages in thread
From: Chen, Yu C @ 2025-09-07 16:28 UTC (permalink / raw)
  To: Tim Chen
  Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy,
	Zhao Liu, Vinicius Costa Gomes, Arjan Van De Ven, Peter Zijlstra,
	Ingo Molnar

On 9/6/2025 2:36 AM, Tim Chen wrote:
> Allow architecture specific sched domain NUMA distances that can be
> modified from NUMA node distances for the purpose of building NUMA
> sched domains.
> 
> The actual NUMA distances are kept separately.  This allows for NUMA
> domain levels modification when building sched domains for specific
> architectures.
> 
> Consolidate the recording of unique NUMA distances in an array to
> sched_record_numa_dist() so the function can be reused to record NUMA
> distances when the NUMA distance metric is changed.
> 
> No functional change if there's no arch specific NUMA distances
> are being defined.
> 
> Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>   include/linux/sched/topology.h |   2 +
>   kernel/sched/topology.c        | 118 ++++++++++++++++++++++++++++-----
>   2 files changed, 103 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 5263746b63e8..4f58e78ca52e 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -59,6 +59,8 @@ static inline int cpu_numa_flags(void)
>   #endif
>   
>   extern int arch_asym_cpu_priority(int cpu);
> +extern int arch_sched_node_distance(int from, int to);
> +extern int sched_avg_remote_numa_distance;
>   
>   struct sched_domain_attr {
>   	int relax_domain_level;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 977e133bb8a4..1f08dfef2ea5 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1591,10 +1591,13 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
>   enum numa_topology_type sched_numa_topology_type;
>   
>   static int			sched_domains_numa_levels;
> +static int			sched_numa_node_levels;
>   static int			sched_domains_curr_level;
>   
>   int				sched_max_numa_distance;
> +int				sched_avg_remote_numa_distance;
>   static int			*sched_domains_numa_distance;
> +static int			*sched_numa_node_distance;
>   static struct cpumask		***sched_domains_numa_masks;
>   #endif /* CONFIG_NUMA */
>   
> @@ -1808,10 +1811,10 @@ bool find_numa_distance(int distance)
>   		return true;
>   
>   	rcu_read_lock();
> -	distances = rcu_dereference(sched_domains_numa_distance);
> +	distances = rcu_dereference(sched_numa_node_distance);
>   	if (!distances)
>   		goto unlock;
> -	for (i = 0; i < sched_domains_numa_levels; i++) {
> +	for (i = 0; i < sched_numa_node_levels; i++) {
>   		if (distances[i] == distance) {
>   			found = true;
>   			break;
> @@ -1887,14 +1890,29 @@ static void init_numa_topology_type(int offline_node)
>   
>   #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
>   
> -void sched_init_numa(int offline_node)
> +/*
> + * Architecture could simplify NUMA distance, to avoid
> + * creating too many NUMA levels.
> + */
> +int __weak arch_sched_node_distance(int from, int to)
> +{
> +	return node_distance(from, to);
> +}
> +
> +static int numa_node_dist(int i, int j)
> +{
> +	return node_distance(i, j);
> +}
> +

numa_node_dist() seems to be used only once, by
sched_record_numa_dist().  Would it be possible to use node_distance()
directly:
sched_record_numa_dist(offline_node, node_distance, &distances,
				   &max_dist, &nr_node_levels))?

> +static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
> +		int **dist, int *maximum_dist, int *levels)
> +
>   {
> -	struct sched_domain_topology_level *tl;
>   	unsigned long *distance_map;
>   	int nr_levels = 0;
>   	int i, j;
>   	int *distances;
> -	struct cpumask ***masks;
> +	int max_dist = 0;
>   
>   	/*
>   	 * O(nr_nodes^2) de-duplicating selection sort -- in order to find the
> @@ -1902,17 +1920,20 @@ void sched_init_numa(int offline_node)
>   	 */
>   	distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
>   	if (!distance_map)
> -		return;
> +		return -ENOMEM;
>   
>   	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
>   	for_each_cpu_node_but(i, offline_node) {
>   		for_each_cpu_node_but(j, offline_node) {
> -			int distance = node_distance(i, j);
> +			int distance = n_dist(i, j);
> +
> +			if (distance > max_dist)
> +				max_dist = distance;
>   
>   			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
>   				sched_numa_warn("Invalid distance value range");
>   				bitmap_free(distance_map);
> -				return;
> +				return -EINVAL;
>   			}
>   
>   			bitmap_set(distance_map, distance, 1);
> @@ -1927,17 +1948,70 @@ void sched_init_numa(int offline_node)
>   	distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
>   	if (!distances) {
>   		bitmap_free(distance_map);
> -		return;
> +		return -ENOMEM;
>   	}
> -
>   	for (i = 0, j = 0; i < nr_levels; i++, j++) {
>   		j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
>   		distances[i] = j;
>   	}
> -	rcu_assign_pointer(sched_domains_numa_distance, distances);
> +	*dist = distances;
> +	if (levels)
> +		*levels = nr_levels;
> +
> +	if (maximum_dist)
> +		*maximum_dist = max_dist;
>   
>   	bitmap_free(distance_map);
>   
> +	return 0;
> +}
> +
> +static int avg_remote_numa_distance(int offline_node)
> +{
> +	int i, j;
> +	int distance, nr_remote = 0, total_distance = 0;
> +
> +	for_each_cpu_node_but(i, offline_node) {
> +		for_each_cpu_node_but(j, offline_node) {
> +			distance = node_distance(i, j);
> +
> +			if (distance >= REMOTE_DISTANCE) {
> +				nr_remote++;
> +				total_distance += distance;
> +			}
> +		}
> +	}
> +	if (nr_remote)
> +		return total_distance / nr_remote;
> +	else
> +		return REMOTE_DISTANCE;
> +}
> +
> +void sched_init_numa(int offline_node)
> +{
> +	struct sched_domain_topology_level *tl;
> +	int nr_levels, nr_node_levels;
> +	int i, j;
> +	int *distances, *domain_distances;
> +	int max_dist;
> +	struct cpumask ***masks;
> +
> +	if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
> +				   &max_dist, &nr_node_levels))
> +		return;
> +
> +	WRITE_ONCE(sched_avg_remote_numa_distance,
> +		   avg_remote_numa_distance(offline_node));
> +
> +	if (sched_record_numa_dist(offline_node,
> +				   arch_sched_node_distance, &domain_distances,
> +				   NULL, &nr_levels)) {
> +		kfree(distances);
> +		return;
> +	}
> +	rcu_assign_pointer(sched_numa_node_distance, distances);
> +	WRITE_ONCE(sched_numa_node_levels, nr_node_levels);
> +
>   	/*
>   	 * 'nr_levels' contains the number of unique distances
>   	 *
> @@ -1954,6 +2028,8 @@ void sched_init_numa(int offline_node)
>   	 *
>   	 * We reset it to 'nr_levels' at the end of this function.
>   	 */
> +	rcu_assign_pointer(sched_domains_numa_distance, domain_distances);
> +
>   	sched_domains_numa_levels = 0;
>   
>   	masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
> @@ -1979,10 +2055,13 @@ void sched_init_numa(int offline_node)
>   			masks[i][j] = mask;
>   
>   			for_each_cpu_node_but(k, offline_node) {
> -				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
> +				if (sched_debug() &&
> +				    (arch_sched_node_distance(j, k) !=
> +				     arch_sched_node_distance(k, j)))
>   					sched_numa_warn("Node-distance not symmetric");
>   
> -				if (node_distance(j, k) > sched_domains_numa_distance[i])
> +				if (arch_sched_node_distance(j, k) >
> +					sched_domains_numa_distance[i])
>   					continue;
>   
>   				cpumask_or(mask, mask, cpumask_of_node(k));
> @@ -2022,7 +2101,7 @@ void sched_init_numa(int offline_node)
>   	sched_domain_topology = tl;
>   
>   	sched_domains_numa_levels = nr_levels;
> -	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
> +	WRITE_ONCE(sched_max_numa_distance, max_dist);
>   

Would it be possible to use
WRITE_ONCE(sched_max_numa_distance, distance[nr_node_levels - 1]);
so we can simplify the code by removing the introduced 'max_dist'
both in sched_record_numa_dist() and sched_init_numa().

thanks,
Chenyu

>   	init_numa_topology_type(offline_node);
>   }
> @@ -2030,14 +2109,18 @@ void sched_init_numa(int offline_node)
>   
>   static void sched_reset_numa(void)
>   {
> -	int nr_levels, *distances;
> +	int nr_levels, *distances, *dom_distances;
>   	struct cpumask ***masks;
>   
>   	nr_levels = sched_domains_numa_levels;
> +	sched_numa_node_levels = 0;
>   	sched_domains_numa_levels = 0;
>   	sched_max_numa_distance = 0;
> +	sched_avg_remote_numa_distance = 0;
>   	sched_numa_topology_type = NUMA_DIRECT;
> -	distances = sched_domains_numa_distance;
> +	distances = sched_numa_node_distance;
> +	dom_distances = sched_domains_numa_distance;
> +	rcu_assign_pointer(sched_numa_node_distance, NULL);
>   	rcu_assign_pointer(sched_domains_numa_distance, NULL);
>   	masks = sched_domains_numa_masks;
>   	rcu_assign_pointer(sched_domains_numa_masks, NULL);
> @@ -2054,6 +2137,7 @@ static void sched_reset_numa(void)
>   			kfree(masks[i]);
>   		}
>   		kfree(masks);
> +		kfree(dom_distances);
>   	}
>   	if (sched_domain_topology_saved) {
>   		kfree(sched_domain_topology);
> @@ -2092,7 +2176,7 @@ void sched_domains_numa_masks_set(unsigned int cpu)
>   				continue;
>   
>   			/* Set ourselves in the remote node's masks */
> -			if (node_distance(j, node) <= sched_domains_numa_distance[i])
> +			if (arch_sched_node_distance(j, node) <= sched_domains_numa_distance[i])
>   				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
>   		}
>   	}


* Re: [PATCH v2 1/2] sched: Create architecture specific sched domain distances
  2025-09-07 16:28   ` Chen, Yu C
@ 2025-09-08 18:23     ` Tim Chen
  0 siblings, 0 replies; 5+ messages in thread
From: Tim Chen @ 2025-09-08 18:23 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy,
	Zhao Liu, Vinicius Costa Gomes, Arjan Van De Ven, Peter Zijlstra,
	Ingo Molnar

On Mon, 2025-09-08 at 00:28 +0800, Chen, Yu C wrote:
> On 9/6/2025 2:36 AM, Tim Chen wrote:

... snip ...
> > -void sched_init_numa(int offline_node)
> > +/*
> > + * Architecture could simplify NUMA distance, to avoid
> > + * creating too many NUMA levels.
> > + */
> > +int __weak arch_sched_node_distance(int from, int to)
> > +{
> > +	return node_distance(from, to);
> > +}
> > +
> > +static int numa_node_dist(int i, int j)
> > +{
> > +	return node_distance(i, j);
> > +}
> > +
> 
> numa_node_dist() seems to be used only once by
> sched_record_numa_dist(), would it be possible to
> use node_distance() directly
> sched_record_numa_dist(offline_node, node_distance, &distances,
> 				   &max_dist, &nr_node_levels))?

Otherwise I would need to pass a flag to sched_record_numa_dist() to
choose which distance to use.  I am okay either way, but chose the
current method since it keeps sched_record_numa_dist() simpler.


> 
> > +static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
> > +		int **dist, int *maximum_dist, int *levels)
> > +
> >   {
> > -	struct sched_domain_topology_level *tl;
> >   	unsigned long *distance_map;
> >   	int nr_levels = 0;
> >   	int i, j;
> >   	int *distances;
> > -	struct cpumask ***masks;
> > +	int max_dist = 0;
> >   
> > 
... snip ...

> > +static int avg_remote_numa_distance(int offline_node)
> > +{
> > +	int i, j;
> > +	int distance, nr_remote = 0, total_distance = 0;
> > +
> > +	for_each_cpu_node_but(i, offline_node) {
> > +		for_each_cpu_node_but(j, offline_node) {
> > +			distance = node_distance(i, j);
> > +
> > +			if (distance >= REMOTE_DISTANCE) {
> > +				nr_remote++;
> > +				total_distance += distance;
> > +			}
> > +		}
> > +	}
> > +	if (nr_remote)
> > +		return total_distance / nr_remote;
> > +	else
> > +		return REMOTE_DISTANCE;
> > +}
> > +
> > +void sched_init_numa(int offline_node)
> > +{
> > +	struct sched_domain_topology_level *tl;
> > +	int nr_levels, nr_node_levels;
> > +	int i, j;
> > +	int *distances, *domain_distances;
> > +	int max_dist;
> > +	struct cpumask ***masks;
> > +
> > +	if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
> > +				   &max_dist, &nr_node_levels))
> > +		return;
> > +
> > +	WRITE_ONCE(sched_avg_remote_numa_distance,
> > +		   avg_remote_numa_distance(offline_node));
> > +
> > +	if (sched_record_numa_dist(offline_node,
> > +				   arch_sched_node_distance, &domain_distances,
> > +				   NULL, &nr_levels)) {
> > +		kfree(distances);
> > +		return;
> > +	}
> > +	rcu_assign_pointer(sched_numa_node_distance, distances);
> > +	WRITE_ONCE(sched_numa_node_levels, nr_node_levels);
> > +
> >   	/*
> >   	 * 'nr_levels' contains the number of unique distances
> >   	 *
> > @@ -1954,6 +2028,8 @@ void sched_init_numa(int offline_node)
> >   	 *
> >   	 * We reset it to 'nr_levels' at the end of this function.
> >   	 */
> > +	rcu_assign_pointer(sched_domains_numa_distance, domain_distances);
> > +
> >   	sched_domains_numa_levels = 0;
> >   
> >   	masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
> > @@ -1979,10 +2055,13 @@ void sched_init_numa(int offline_node)
> >   			masks[i][j] = mask;
> >   
> >   			for_each_cpu_node_but(k, offline_node) {
> > -				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
> > +				if (sched_debug() &&
> > +				    (arch_sched_node_distance(j, k) !=
> > +				     arch_sched_node_distance(k, j)))
> >   					sched_numa_warn("Node-distance not symmetric");
> >   
> > -				if (node_distance(j, k) > sched_domains_numa_distance[i])
> > +				if (arch_sched_node_distance(j, k) >
> > +					sched_domains_numa_distance[i])
> >   					continue;
> >   
> >   				cpumask_or(mask, mask, cpumask_of_node(k));
> > @@ -2022,7 +2101,7 @@ void sched_init_numa(int offline_node)
> >   	sched_domain_topology = tl;
> >   
> >   	sched_domains_numa_levels = nr_levels;
> > -	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
> > +	WRITE_ONCE(sched_max_numa_distance, max_dist);
> >   
> 
> Would it be possible to use
> WRITE_ONCE(sched_max_numa_distance, distance[nr_node_levels - 1]);
> so we can simplify the code by removing the introduced 'max_dist'
> both in sched_record_numa_dist() and sched_init_numa().

Sure, I think that simplifies sched_record_numa_dist().


Tim
