* [PATCH v3 0/2] Fix NUMA sched domain build errors for GNR and CWF
@ 2025-09-11 18:30 Tim Chen
2025-09-11 18:30 ` [PATCH v3 1/2] sched: Create architecture specific sched domain distances Tim Chen
2025-09-11 18:30 ` [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode Tim Chen
0 siblings, 2 replies; 18+ messages in thread
From: Tim Chen @ 2025-09-11 18:30 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes,
Arjan Van De Ven
While testing Granite Rapids (GNR) and Clearwater Forest (CWF) in
SNC-3 mode, we encountered sched domain build errors reported in dmesg.
Asymmetric node distances from the local node to nodes in a remote package
were not expected by the scheduler domain code and also led to an excessive
number of sched domain hierarchy levels.
Simplify the remote node distances for the purpose of building sched
domains for GNR and CWF to fix the domain build errors and reduce
the number of NUMA sched domain levels.
Changes in v3:
- Simplify sched_record_numa_dist() by getting rid of max distance
computation.
- minor clean ups.
- Link to v2:
https://lore.kernel.org/lkml/61a6adbb845c148361101e16737307c8aa7ee362.1757097030.git.tim.c.chen@linux.intel.com/
Changes in v2:
- Allow modification of NUMA distances by architecture to be the
sched domain NUMA distances for building sched domains to
simplify NUMA domains.
Maintain separate NUMA distances for the purpose of building
sched domains from actual NUMA distances.
- Use average remote node distance as the distance to nodes in remote
packages for GNR and CWF.
- Remove the original fix for topology_span_sane() that's superseded
by a better fix from Prateek.
https://lore.kernel.org/lkml/175688671425.1920.13690753997160836570.tip-bot2@tip-bot2/.
- Link to v1: https://lore.kernel.org/lkml/cover.1755893468.git.tim.c.chen@linux.intel.com/
Tim Chen (2):
sched: Create architecture specific sched domain distances
sched: Fix sched domain build error for GNR, CWF in SNC-3 mode
arch/x86/kernel/smpboot.c | 28 ++++++++
include/linux/sched/topology.h | 2 +
kernel/sched/topology.c | 114 ++++++++++++++++++++++++++++-----
3 files changed, 127 insertions(+), 17 deletions(-)
--
2.32.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v3 1/2] sched: Create architecture specific sched domain distances
2025-09-11 18:30 [PATCH v3 0/2] Fix NUMA sched domain build errors for GNR and CWF Tim Chen
@ 2025-09-11 18:30 ` Tim Chen
2025-09-12 3:23 ` K Prateek Nayak
` (2 more replies)
2025-09-11 18:30 ` [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode Tim Chen
1 sibling, 3 replies; 18+ messages in thread
From: Tim Chen @ 2025-09-11 18:30 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes,
Arjan Van De Ven
Allow architecture specific sched domain NUMA distances that can be
derived from the NUMA node distances for the purpose of building NUMA
sched domains.
The actual NUMA distances are kept separately. This allows the NUMA
domain levels to be modified when building sched domains for specific
architectures.
Consolidate the recording of unique NUMA distances in an array to
sched_record_numa_dist() so the function can be reused to record NUMA
distances when the NUMA distance metric is changed.
No functional change if no arch specific NUMA distances are defined.
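The unique-distance recording that sched_record_numa_dist() consolidates can be sketched outside the kernel: mark each observed distance, then read the marks back in ascending order, yielding one NUMA level per unique distance. This is a standalone sketch with hypothetical names (record_unique), not the kernel implementation, which uses a bitmap with bitmap_set()/find_next_bit():

```c
#include <assert.h>

#define NR_DISTANCE_VALUES 256	/* 1 << DISTANCE_BITS in the kernel */

/*
 * Record the unique values of the n*n matrix dist[] into levels[],
 * sorted ascending, and return the number of levels. Mirrors the
 * de-duplicating walk in sched_record_numa_dist().
 */
static int record_unique(const int *dist, int n, int *levels)
{
	unsigned char seen[NR_DISTANCE_VALUES] = { 0 };
	int nr = 0;

	/* Mark every distance that occurs in the matrix. */
	for (int i = 0; i < n * n; i++)
		seen[dist[i]] = 1;

	/* Reading the marks in index order yields sorted unique values. */
	for (int d = 0; d < NR_DISTANCE_VALUES; d++)
		if (seen[d])
			levels[nr++] = d;
	return nr;
}
```

For the GNR SNC-3 distance table in patch 2 (values 10, 15, 17, 21, 23, 26, 28), this yields 7 unique distances, hence 7 candidate NUMA levels before any arch-specific simplification.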
Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
include/linux/sched/topology.h | 2 +
kernel/sched/topology.c | 114 ++++++++++++++++++++++++++++-----
2 files changed, 99 insertions(+), 17 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 5263746b63e8..4f58e78ca52e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -59,6 +59,8 @@ static inline int cpu_numa_flags(void)
#endif
extern int arch_asym_cpu_priority(int cpu);
+extern int arch_sched_node_distance(int from, int to);
+extern int sched_avg_remote_numa_distance;
struct sched_domain_attr {
int relax_domain_level;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 977e133bb8a4..6c0ff62322cb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1591,10 +1591,13 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
enum numa_topology_type sched_numa_topology_type;
static int sched_domains_numa_levels;
+static int sched_numa_node_levels;
static int sched_domains_curr_level;
int sched_max_numa_distance;
+int sched_avg_remote_numa_distance;
static int *sched_domains_numa_distance;
+static int *sched_numa_node_distance;
static struct cpumask ***sched_domains_numa_masks;
#endif /* CONFIG_NUMA */
@@ -1808,10 +1811,10 @@ bool find_numa_distance(int distance)
return true;
rcu_read_lock();
- distances = rcu_dereference(sched_domains_numa_distance);
+ distances = rcu_dereference(sched_numa_node_distance);
if (!distances)
goto unlock;
- for (i = 0; i < sched_domains_numa_levels; i++) {
+ for (i = 0; i < sched_numa_node_levels; i++) {
if (distances[i] == distance) {
found = true;
break;
@@ -1887,14 +1890,32 @@ static void init_numa_topology_type(int offline_node)
#define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
-void sched_init_numa(int offline_node)
+/*
+ * An architecture can modify the NUMA distance to change the
+ * grouping of NUMA nodes and the number of NUMA levels when creating
+ * NUMA level sched domains.
+ *
+ * One NUMA level is created for each unique
+ * arch_sched_node_distance.
+ */
+int __weak arch_sched_node_distance(int from, int to)
+{
+ return node_distance(from, to);
+}
+
+static int numa_node_dist(int i, int j)
+{
+ return node_distance(i, j);
+}
+
+static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
+ int **dist, int *levels)
+
{
- struct sched_domain_topology_level *tl;
unsigned long *distance_map;
int nr_levels = 0;
int i, j;
int *distances;
- struct cpumask ***masks;
/*
* O(nr_nodes^2) de-duplicating selection sort -- in order to find the
@@ -1902,17 +1923,17 @@ void sched_init_numa(int offline_node)
*/
distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
if (!distance_map)
- return;
+ return -ENOMEM;
bitmap_zero(distance_map, NR_DISTANCE_VALUES);
for_each_cpu_node_but(i, offline_node) {
for_each_cpu_node_but(j, offline_node) {
- int distance = node_distance(i, j);
+ int distance = n_dist(i, j);
if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
sched_numa_warn("Invalid distance value range");
bitmap_free(distance_map);
- return;
+ return -EINVAL;
}
bitmap_set(distance_map, distance, 1);
@@ -1927,17 +1948,66 @@ void sched_init_numa(int offline_node)
distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
if (!distances) {
bitmap_free(distance_map);
- return;
+ return -ENOMEM;
}
-
for (i = 0, j = 0; i < nr_levels; i++, j++) {
j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
distances[i] = j;
}
- rcu_assign_pointer(sched_domains_numa_distance, distances);
+ *dist = distances;
+ *levels = nr_levels;
bitmap_free(distance_map);
+ return 0;
+}
+
+static int avg_remote_numa_distance(int offline_node)
+{
+ int i, j;
+ int distance, nr_remote = 0, total_distance = 0;
+
+ for_each_cpu_node_but(i, offline_node) {
+ for_each_cpu_node_but(j, offline_node) {
+ distance = node_distance(i, j);
+
+ if (distance >= REMOTE_DISTANCE) {
+ nr_remote++;
+ total_distance += distance;
+ }
+ }
+ }
+ if (nr_remote)
+ return total_distance / nr_remote;
+ else
+ return REMOTE_DISTANCE;
+}
+
+void sched_init_numa(int offline_node)
+{
+ struct sched_domain_topology_level *tl;
+ int nr_levels, nr_node_levels;
+ int i, j;
+ int *distances, *domain_distances;
+ struct cpumask ***masks;
+
+ if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
+ &nr_node_levels))
+ return;
+
+ WRITE_ONCE(sched_avg_remote_numa_distance,
+ avg_remote_numa_distance(offline_node));
+
+ if (sched_record_numa_dist(offline_node,
+ arch_sched_node_distance, &domain_distances,
+ &nr_levels)) {
+ kfree(distances);
+ return;
+ }
+ rcu_assign_pointer(sched_numa_node_distance, distances);
+ WRITE_ONCE(sched_max_numa_distance, distances[nr_node_levels - 1]);
+ WRITE_ONCE(sched_numa_node_levels, nr_node_levels);
+
/*
* 'nr_levels' contains the number of unique distances
*
@@ -1954,6 +2024,8 @@ void sched_init_numa(int offline_node)
*
* We reset it to 'nr_levels' at the end of this function.
*/
+ rcu_assign_pointer(sched_domains_numa_distance, domain_distances);
+
sched_domains_numa_levels = 0;
masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
@@ -1979,10 +2051,13 @@ void sched_init_numa(int offline_node)
masks[i][j] = mask;
for_each_cpu_node_but(k, offline_node) {
- if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
+ if (sched_debug() &&
+ (arch_sched_node_distance(j, k) !=
+ arch_sched_node_distance(k, j)))
sched_numa_warn("Node-distance not symmetric");
- if (node_distance(j, k) > sched_domains_numa_distance[i])
+ if (arch_sched_node_distance(j, k) >
+ sched_domains_numa_distance[i])
continue;
cpumask_or(mask, mask, cpumask_of_node(k));
@@ -2022,7 +2097,6 @@ void sched_init_numa(int offline_node)
sched_domain_topology = tl;
sched_domains_numa_levels = nr_levels;
- WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
init_numa_topology_type(offline_node);
}
@@ -2030,14 +2104,18 @@ void sched_init_numa(int offline_node)
static void sched_reset_numa(void)
{
- int nr_levels, *distances;
+ int nr_levels, *distances, *dom_distances;
struct cpumask ***masks;
nr_levels = sched_domains_numa_levels;
+ sched_numa_node_levels = 0;
sched_domains_numa_levels = 0;
sched_max_numa_distance = 0;
+ sched_avg_remote_numa_distance = 0;
sched_numa_topology_type = NUMA_DIRECT;
- distances = sched_domains_numa_distance;
+ distances = sched_numa_node_distance;
+ dom_distances = sched_domains_numa_distance;
+ rcu_assign_pointer(sched_numa_node_distance, NULL);
rcu_assign_pointer(sched_domains_numa_distance, NULL);
masks = sched_domains_numa_masks;
rcu_assign_pointer(sched_domains_numa_masks, NULL);
@@ -2054,6 +2132,7 @@ static void sched_reset_numa(void)
kfree(masks[i]);
}
kfree(masks);
+ kfree(dom_distances);
}
if (sched_domain_topology_saved) {
kfree(sched_domain_topology);
@@ -2092,7 +2171,8 @@ void sched_domains_numa_masks_set(unsigned int cpu)
continue;
/* Set ourselves in the remote node's masks */
- if (node_distance(j, node) <= sched_domains_numa_distance[i])
+ if (arch_sched_node_distance(j, node) <=
+ sched_domains_numa_distance[i])
cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
}
}
--
2.32.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode
2025-09-11 18:30 [PATCH v3 0/2] Fix NUMA sched domain build errors for GNR and CWF Tim Chen
2025-09-11 18:30 ` [PATCH v3 1/2] sched: Create architecture specific sched domain distances Tim Chen
@ 2025-09-11 18:30 ` Tim Chen
2025-09-12 5:08 ` K Prateek Nayak
` (2 more replies)
1 sibling, 3 replies; 18+ messages in thread
From: Tim Chen @ 2025-09-11 18:30 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes,
Arjan Van De Ven
It is possible for Granite Rapids (GNR) and Clearwater Forest (CWF)
to have up to 3 dies per package. When sub-NUMA cluster (SNC-3) mode
is enabled, each die becomes a separate NUMA node, with different
distances between the dies within the same package.
For example, on GNR we see the following NUMA distances for a 2-socket
system with 3 dies per socket:
         package 1              package 2
    ----------------       ----------------
    |              |       |              |
    |  ---------   |       |  ---------   |
    |  |   0   |   |       |  |   3   |   |
    |  ---------   |       |  ---------   |
    |  ---------   |       |  ---------   |
    |  |   1   |   |       |  |   4   |   |
    |  ---------   |       |  ---------   |
    |  ---------   |       |  ---------   |
    |  |   2   |   |       |  |   5   |   |
    |  ---------   |       |  ---------   |
    ----------------       ----------------
node distances:
node 0 1 2 3 4 5
0: 10 15 17 21 28 26
1: 15 10 15 23 26 23
2: 17 15 10 26 23 21
3: 21 28 26 10 15 17
4: 23 26 23 15 10 15
5: 26 23 21 17 15 10
The node distances above led to two problems:
1. Asymmetric routes taken between nodes in different packages led to
an asymmetric scheduler domain perspective depending on which node you
are on. The current scheduler code failed to build domains properly
with asymmetric distances.
2. Multiple remote distances to the respective tiles on a remote package
created too many levels of domain hierarchy, grouping different nodes
between remote packages.
For example, the above GNR-X topology leads to the NUMA domains below.
Sched domains from the perspective of a CPU in node 0, where the numbers
in brackets represent node numbers:
NUMA-level 1 [0,1] [2]
NUMA-level 2 [0,1,2] [3]
NUMA-level 3 [0,1,2,3] [5]
NUMA-level 4 [0,1,2,3,5] [4]
Sched domains from the perspective of a CPU in node 4
NUMA-level 1 [4] [3,5]
NUMA-level 2 [3,4,5] [0,2]
NUMA-level 3 [0,2,3,4,5] [1]
Scheduler group peers for load balancing from the perspective of CPUs 0
and 4 are different. An improper task could be chosen for load balancing
between groups such as [0,2,3,4,5] and [1]. Ideally, nodes 0 or 2, which
are in the same package as node 1, should be chosen first. Instead, tasks
in the remote package nodes 3, 4 and 5 could be chosen with equal chance,
which could lead to excessive remote package migrations and an imbalance
of load between packages. We should not group partial remote nodes and
local nodes together.
Simplify the remote distances for CWF-X and GNR-X for the purpose of
building sched domains, which maintains symmetry and leads to a more
reasonable load balance hierarchy.
The sched domains from the perspective of a CPU in node 0 are now:
NUMA-level 1 [0,1] [2]
NUMA-level 2 [0,1,2] [3,4,5]
The sched domains from the perspective of a CPU in node 4 are now:
NUMA-level 1 [4] [3,5]
NUMA-level 2 [3,4,5] [0,1,2]
We have the same balancing perspective from node 0 or node 4. Loads are
now balanced equally between packages.
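The averaging done by avg_remote_numa_distance() in patch 1 can be checked against the distance table above with a small standalone sketch. The function name avg_remote_distance and the embedded matrix are local to this sketch; the REMOTE_DISTANCE threshold of 20 matches include/linux/topology.h:

```c
#include <assert.h>

#define REMOTE_DISTANCE	20	/* same threshold as include/linux/topology.h */
#define NR_NODES	6

/* Node distance matrix from the GNR SNC-3 example above. */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 15, 17, 21, 28, 26 },
	{ 15, 10, 15, 23, 26, 23 },
	{ 17, 15, 10, 26, 23, 21 },
	{ 21, 28, 26, 10, 15, 17 },
	{ 23, 26, 23, 15, 10, 15 },
	{ 26, 23, 21, 17, 15, 10 },
};

/*
 * Mirror of avg_remote_numa_distance(): average every distance that
 * crosses a package boundary (>= REMOTE_DISTANCE). Intra-package
 * distances (10/15/17 here) are left untouched by the arch hook.
 */
static int avg_remote_distance(void)
{
	int nr_remote = 0, total = 0;

	for (int i = 0; i < NR_NODES; i++)
		for (int j = 0; j < NR_NODES; j++)
			if (dist[i][j] >= REMOTE_DISTANCE) {
				nr_remote++;
				total += dist[i][j];
			}
	return nr_remote ? total / nr_remote : REMOTE_DISTANCE;
}
```

For this table, the 18 remote pairs total 434, giving an integer average of 24. Every node then sees the whole remote package at that single distance, so the four asymmetric remote values (21/23/26/28) collapse into one NUMA level spanning the remote package.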
Tested-by: Zhao Liu <zhao1.liu@intel.com>
Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
arch/x86/kernel/smpboot.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 33e166f6ab12..3f894c525e49 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
set_sched_topology(topology);
}
+int arch_sched_node_distance(int from, int to)
+{
+ int d = node_distance(from, to);
+
+ if (!x86_has_numa_in_package)
+ return d;
+
+ switch (boot_cpu_data.x86_vfm) {
+ case INTEL_GRANITERAPIDS_X:
+ case INTEL_ATOM_DARKMONT_X:
+ if (d < REMOTE_DISTANCE)
+ return d;
+
+ /*
+ * Trim finer distance tuning for nodes in remote package
+ * for the purpose of building sched domains. Put NUMA nodes
+ * in each remote package in the same sched group.
+ * Simplify NUMA domains and avoid extra NUMA levels including
+ * different NUMA nodes in remote packages.
+ *
+ * GNR and CWF don't expect systems with more than 2 packages
+ * or more than 2 hops between packages.
+ */
+ d = sched_avg_remote_numa_distance;
+ }
+ return d;
+}
+
void set_cpu_sibling_map(int cpu)
{
bool has_smt = __max_threads_per_core > 1;
--
2.32.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/2] sched: Create architecture specific sched domain distances
2025-09-11 18:30 ` [PATCH v3 1/2] sched: Create architecture specific sched domain distances Tim Chen
@ 2025-09-12 3:23 ` K Prateek Nayak
2025-09-15 16:44 ` Tim Chen
2025-09-12 5:24 ` Chen, Yu C
2025-09-15 12:37 ` Peter Zijlstra
2 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2025-09-12 3:23 UTC (permalink / raw)
To: Tim Chen, Peter Zijlstra, Ingo Molnar
Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu,
Vinicius Costa Gomes, Arjan Van De Ven
Hello Tim,
On 9/12/2025 12:00 AM, Tim Chen wrote:
> +static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
> + int **dist, int *levels)
> +
nit. Is the blank line above intentional?
Also personally I prefer breaking the two lines above as:
static int
sched_record_numa_dist(int offline_node, int (*n_dist)(int, int), int **dist, int *levels)
{
...
}
> {
> - struct sched_domain_topology_level *tl;
> unsigned long *distance_map;
Since we are breaking this out and adding return values, can we also
clean up the bitmap_free() before every return with __free(bitmap) like:
(Only build tested)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6c0ff62322cb..baa79e79ced8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1910,9 +1910,8 @@ static int numa_node_dist(int i, int j)
static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
int **dist, int *levels)
-
{
- unsigned long *distance_map;
+ unsigned long *distance_map __free(bitmap) = NULL;
int nr_levels = 0;
int i, j;
int *distances;
@@ -1932,7 +1931,6 @@ static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
sched_numa_warn("Invalid distance value range");
- bitmap_free(distance_map);
return -EINVAL;
}
@@ -1946,19 +1944,17 @@ static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);
distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
- if (!distances) {
- bitmap_free(distance_map);
+ if (!distances)
return -ENOMEM;
- }
+
for (i = 0, j = 0; i < nr_levels; i++, j++) {
j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
distances[i] = j;
}
+
*dist = distances;
*levels = nr_levels;
- bitmap_free(distance_map);
-
return 0;
}
---
> int nr_levels = 0;
> int i, j;
> int *distances;
> - struct cpumask ***masks;
>
> /*
> * O(nr_nodes^2) de-duplicating selection sort -- in order to find the
> @@ -1902,17 +1923,17 @@ void sched_init_numa(int offline_node)
> */
> distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
> if (!distance_map)
> - return;
> + return -ENOMEM;
>
> bitmap_zero(distance_map, NR_DISTANCE_VALUES);
> for_each_cpu_node_but(i, offline_node) {
> for_each_cpu_node_but(j, offline_node) {
> - int distance = node_distance(i, j);
> + int distance = n_dist(i, j);
>
> if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
> sched_numa_warn("Invalid distance value range");
> bitmap_free(distance_map);
> - return;
> + return -EINVAL;
> }
>
> bitmap_set(distance_map, distance, 1);
> @@ -1927,17 +1948,66 @@ void sched_init_numa(int offline_node)
> distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
> if (!distances) {
> bitmap_free(distance_map);
> - return;
> + return -ENOMEM;
> }
> -
> for (i = 0, j = 0; i < nr_levels; i++, j++) {
> j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
> distances[i] = j;
> }
> - rcu_assign_pointer(sched_domains_numa_distance, distances);
> + *dist = distances;
> + *levels = nr_levels;
>
> bitmap_free(distance_map);
>
> + return 0;
> +}
> +
> +static int avg_remote_numa_distance(int offline_node)
> +{
> + int i, j;
> + int distance, nr_remote = 0, total_distance = 0;
> +
> + for_each_cpu_node_but(i, offline_node) {
> + for_each_cpu_node_but(j, offline_node) {
> + distance = node_distance(i, j);
> +
> + if (distance >= REMOTE_DISTANCE) {
> + nr_remote++;
> + total_distance += distance;
> + }
> + }
> + }
> + if (nr_remote)
> + return total_distance / nr_remote;
> + else
> + return REMOTE_DISTANCE;
> +}
> +
> +void sched_init_numa(int offline_node)
> +{
> + struct sched_domain_topology_level *tl;
> + int nr_levels, nr_node_levels;
> + int i, j;
> + int *distances, *domain_distances;
> + struct cpumask ***masks;
> +
> + if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
> + &nr_node_levels))
> + return;
> +
> + WRITE_ONCE(sched_avg_remote_numa_distance,
> + avg_remote_numa_distance(offline_node));
nit.
Can add a small comment here saying arch_sched_node_distance() may
depend on sched_avg_remote_numa_distance and requires it to be
initialized correctly before computing domain_distances.
Apart from those nitpicks, the changes look good to me. Please feel free
to include:
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek
> +
> + if (sched_record_numa_dist(offline_node,
> + arch_sched_node_distance, &domain_distances,
> + &nr_levels)) {
> + kfree(distances);
> + return;
> + }
> + rcu_assign_pointer(sched_numa_node_distance, distances);
> + WRITE_ONCE(sched_max_numa_distance, distances[nr_node_levels - 1]);
> + WRITE_ONCE(sched_numa_node_levels, nr_node_levels);
> +
> /*
> * 'nr_levels' contains the number of unique distances
> *
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode
2025-09-11 18:30 ` [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode Tim Chen
@ 2025-09-12 5:08 ` K Prateek Nayak
2025-09-15 17:15 ` Tim Chen
2025-09-12 5:39 ` Chen, Yu C
2025-09-15 12:46 ` Peter Zijlstra
2 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2025-09-12 5:08 UTC (permalink / raw)
To: Tim Chen, Peter Zijlstra, Ingo Molnar
Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu,
Vinicius Costa Gomes, Arjan Van De Ven
Hello Tim,
On 9/12/2025 12:00 AM, Tim Chen wrote:
> It is possible for Granite Rapids (GNR) and Clearwater Forest (CWF)
> to have up to 3 dies per package. When sub-NUMA cluster (SNC-3) mode
> is enabled, each die becomes a separate NUMA node, with different
> distances between the dies within the same package.
>
> For example, on GNR we see the following NUMA distances for a 2-socket
> system with 3 dies per socket:
>
>          package 1              package 2
>     ----------------       ----------------
>     |              |       |              |
>     |  ---------   |       |  ---------   |
>     |  |   0   |   |       |  |   3   |   |
>     |  ---------   |       |  ---------   |
>     |  ---------   |       |  ---------   |
>     |  |   1   |   |       |  |   4   |   |
>     |  ---------   |       |  ---------   |
>     |  ---------   |       |  ---------   |
>     |  |   2   |   |       |  |   5   |   |
>     |  ---------   |       |  ---------   |
>     ----------------       ----------------
>
> node distances:
> node 0 1 2 3 4 5
> 0: 10 15 17 21 28 26
> 1: 15 10 15 23 26 23
> 2: 17 15 10 26 23 21
> 3: 21 28 26 10 15 17
> 4: 23 26 23 15 10 15
> 5: 26 23 21 17 15 10
>
> The node distances above led to two problems:
>
> 1. Asymmetric routes taken between nodes in different packages led to
> an asymmetric scheduler domain perspective depending on which node you
> are on. The current scheduler code failed to build domains properly
> with asymmetric distances.
>
> 2. Multiple remote distances to the respective tiles on a remote package
> created too many levels of domain hierarchy, grouping different nodes
> between remote packages.
>
> For example, the above GNR-X topology leads to the NUMA domains below.
>
> Sched domains from the perspective of a CPU in node 0, where the numbers
> in brackets represent node numbers:
>
> NUMA-level 1 [0,1] [2]
> NUMA-level 2 [0,1,2] [3]
> NUMA-level 3 [0,1,2,3] [5]
> NUMA-level 4 [0,1,2,3,5] [4]
>
> Sched domains from the perspective of a CPU in node 4
> NUMA-level 1 [4] [3,5]
> NUMA-level 2 [3,4,5] [0,2]
> NUMA-level 3 [0,2,3,4,5] [1]
>
> Scheduler group peers for load balancing from the perspective of CPUs 0
> and 4 are different. An improper task could be chosen for load balancing
> between groups such as [0,2,3,4,5] and [1]. Ideally, nodes 0 or 2, which
> are in the same package as node 1, should be chosen first. Instead, tasks
> in the remote package nodes 3, 4 and 5 could be chosen with equal chance,
> which could lead to excessive remote package migrations and an imbalance
> of load between packages. We should not group partial remote nodes and
> local nodes together.
> Simplify the remote distances for CWF-X and GNR-X for the purpose of
> building sched domains, which maintains symmetry and leads to a more
> reasonable load balance hierarchy.
>
> The sched domains from the perspective of a CPU in node 0 are now:
> NUMA-level 1 [0,1] [2]
> NUMA-level 2 [0,1,2] [3,4,5]
>
> The sched domains from the perspective of a CPU in node 4 are now:
> NUMA-level 1 [4] [3,5]
> NUMA-level 2 [3,4,5] [0,1,2]
>
> We have the same balancing perspective from node 0 or node 4. Loads are
> now balanced equally between packages.
>
> Tested-by: Zhao Liu <zhao1.liu@intel.com>
> Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Feel free to include:
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek
> ---
> arch/x86/kernel/smpboot.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 33e166f6ab12..3f894c525e49 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
> set_sched_topology(topology);
> }
>
> +int arch_sched_node_distance(int from, int to)
> +{
> + int d = node_distance(from, to);
> +
> + if (!x86_has_numa_in_package)
> + return d;
> +
> + switch (boot_cpu_data.x86_vfm) {
> + case INTEL_GRANITERAPIDS_X:
> + case INTEL_ATOM_DARKMONT_X:
> + if (d < REMOTE_DISTANCE)
> + return d;
> +
> + /*
> + * Trim finer distance tuning for nodes in remote package
> + * for the purpose of building sched domains. Put NUMA nodes
> + * in each remote package in the same sched group.
> + * Simplify NUMA domains and avoid extra NUMA levels including
> + * different NUMA nodes in remote packages.
> + *
> + * GNR and CWF don't expect systems with more than 2 packages
> + * or more than 2 hops between packages.
> + */
> + d = sched_avg_remote_numa_distance;
> + }
> + return d;
> +}
> +
> void set_cpu_sibling_map(int cpu)
> {
> bool has_smt = __max_threads_per_core > 1;
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/2] sched: Create architecture specific sched domain distances
2025-09-11 18:30 ` [PATCH v3 1/2] sched: Create architecture specific sched domain distances Tim Chen
2025-09-12 3:23 ` K Prateek Nayak
@ 2025-09-12 5:24 ` Chen, Yu C
2025-09-15 16:49 ` Tim Chen
2025-09-15 17:16 ` Tim Chen
2025-09-15 12:37 ` Peter Zijlstra
2 siblings, 2 replies; 18+ messages in thread
From: Chen, Yu C @ 2025-09-12 5:24 UTC (permalink / raw)
To: Tim Chen, Peter Zijlstra, Ingo Molnar
Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy,
Zhao Liu, Vinicius Costa Gomes, Arjan Van De Ven
On 9/12/2025 2:30 AM, Tim Chen wrote:
> Allow architecture specific sched domain NUMA distances that can be
> derived from the NUMA node distances for the purpose of building NUMA
> sched domains.
>
> The actual NUMA distances are kept separately. This allows the NUMA
> domain levels to be modified when building sched domains for specific
> architectures.
>
> Consolidate the recording of unique NUMA distances in an array to
> sched_record_numa_dist() so the function can be reused to record NUMA
> distances when the NUMA distance metric is changed.
>
> No functional change if no arch specific NUMA distances are defined.
>
[snip]
> +
> +void sched_init_numa(int offline_node)
> +{
> + struct sched_domain_topology_level *tl;
> + int nr_levels, nr_node_levels;
> + int i, j;
> + int *distances, *domain_distances;
> + struct cpumask ***masks;
> +
> + if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
> + &nr_node_levels))
> + return;
> +
> + WRITE_ONCE(sched_avg_remote_numa_distance,
> + avg_remote_numa_distance(offline_node));
> +
> + if (sched_record_numa_dist(offline_node,
> + arch_sched_node_distance, &domain_distances,
> + &nr_levels)) {
> + kfree(distances);
> + return;
> + }
> + rcu_assign_pointer(sched_numa_node_distance, distances);
> + WRITE_ONCE(sched_max_numa_distance, distances[nr_node_levels - 1]);
[snip]
> @@ -2022,7 +2097,6 @@ void sched_init_numa(int offline_node)
> sched_domain_topology = tl;
>
> sched_domains_numa_levels = nr_levels;
> - WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
>
Before this patch, sched_max_numa_distance was assigned a valid
value at the end of sched_init_numa(), after sched_domains_numa_masks
and sched_domain_topology_level had been successfully created or
appended, i.e. after the kzalloc() calls succeeded.
Now we assign sched_max_numa_distance earlier, without considering
the status of the NUMA sched domains. I think this is intended, because
sched domains are only for generic load balancing, while
sched_max_numa_distance is for NUMA load balancing; in theory, they
use different metrics in their strategies. Thus, this change should
not cause any issues.
From my understanding,
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
thanks,
Chenyu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode
2025-09-11 18:30 ` [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode Tim Chen
2025-09-12 5:08 ` K Prateek Nayak
@ 2025-09-12 5:39 ` Chen, Yu C
2025-09-12 9:23 ` K Prateek Nayak
2025-09-15 12:46 ` Peter Zijlstra
2 siblings, 1 reply; 18+ messages in thread
From: Chen, Yu C @ 2025-09-12 5:39 UTC (permalink / raw)
To: Tim Chen, Peter Zijlstra, Ingo Molnar
Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy,
Zhao Liu, Vinicius Costa Gomes, Arjan Van De Ven
On 9/12/2025 2:30 AM, Tim Chen wrote:
[snip]
>
> +int arch_sched_node_distance(int from, int to)
> +{
> + int d = node_distance(from, to);
> +
> + if (!x86_has_numa_in_package)
> + return d;
> +
> + switch (boot_cpu_data.x86_vfm) {
> + case INTEL_GRANITERAPIDS_X:
> + case INTEL_ATOM_DARKMONT_X:
> + if (d < REMOTE_DISTANCE)
> + return d;
> +
> + /*
> + * Trim finer distance tuning for nodes in remote package
> + * for the purpose of building sched domains. Put NUMA nodes
> + * in each remote package in the same sched group.
> + * Simplify NUMA domains and avoid extra NUMA levels including
> + * different NUMA nodes in remote packages.
> + *
> + * GNR and CWF don't expect systems with more than 2 packages
> + * or more than 2 hops between packages.
> + */
> + d = sched_avg_remote_numa_distance;
sched_avg_remote_numa_distance is defined in topology.c with
CONFIG_NUMA controlled, should we make arch_sched_node_distance()
be controlled under CONFIG_NUMA too?
thanks,
Chenyu
> + }
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode
2025-09-12 5:39 ` Chen, Yu C
@ 2025-09-12 9:23 ` K Prateek Nayak
2025-09-12 11:59 ` Chen, Yu C
0 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2025-09-12 9:23 UTC (permalink / raw)
To: Chen, Yu C, Tim Chen, Peter Zijlstra, Ingo Molnar
Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Gautham R . Shenoy, Zhao Liu,
Vinicius Costa Gomes, Arjan Van De Ven
On 9/12/2025 11:09 AM, Chen, Yu C wrote:
> sched_avg_remote_numa_distance is defined in topology.c under
> CONFIG_NUMA; should we make arch_sched_node_distance() be guarded
> by CONFIG_NUMA as well?
Good catch! Given node_distance() too is behind CONFIG_NUMA, I
think we can put this behind CONFIG_NUMA too (including those
declarations in include/linux/sched/topology.h)
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode
2025-09-12 9:23 ` K Prateek Nayak
@ 2025-09-12 11:59 ` Chen, Yu C
0 siblings, 0 replies; 18+ messages in thread
From: Chen, Yu C @ 2025-09-12 11:59 UTC (permalink / raw)
To: K Prateek Nayak, Tim Chen, Peter Zijlstra, Ingo Molnar
Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Gautham R . Shenoy, Zhao Liu,
Vinicius Costa Gomes, Arjan Van De Ven
On 9/12/2025 5:23 PM, K Prateek Nayak wrote:
> On 9/12/2025 11:09 AM, Chen, Yu C wrote:
> > sched_avg_remote_numa_distance is defined in topology.c under
> > CONFIG_NUMA; should we make arch_sched_node_distance() be guarded
> > by CONFIG_NUMA as well?
>
> Good catch! Given node_distance() too is behind CONFIG_NUMA, I
> think we can put this behind CONFIG_NUMA too (including those
> declarations in include/linux/sched/topology.h)
>
Exactly, only NUMA would use this function.
Thanks,
Chenyu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/2] sched: Create architecture specific sched domain distances
2025-09-11 18:30 ` [PATCH v3 1/2] sched: Create architecture specific sched domain distances Tim Chen
2025-09-12 3:23 ` K Prateek Nayak
2025-09-12 5:24 ` Chen, Yu C
@ 2025-09-15 12:37 ` Peter Zijlstra
2025-09-15 17:13 ` Tim Chen
2 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2025-09-15 12:37 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes,
Arjan Van De Ven
On Thu, Sep 11, 2025 at 11:30:56AM -0700, Tim Chen wrote:
> Allow architecture specific sched domain NUMA distances that can be
> modified from NUMA node distances for the purpose of building NUMA
> sched domains.
>
> The actual NUMA distances are kept separately. This allows for NUMA
> domain levels modification when building sched domains for specific
> architectures.
>
> Consolidate the recording of unique NUMA distances in an array to
> sched_record_numa_dist() so the function can be reused to record NUMA
> distances when the NUMA distance metric is changed.
>
> No functional change if no arch specific NUMA distances are defined.
Keeping both metrics side-by-side is confusing -- and not very well
justified by the above.
Is there any appreciable benefit to mixing the two like this?
>
> Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> include/linux/sched/topology.h | 2 +
> kernel/sched/topology.c | 114 ++++++++++++++++++++++++++++-----
> 2 files changed, 99 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 5263746b63e8..4f58e78ca52e 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -59,6 +59,8 @@ static inline int cpu_numa_flags(void)
> #endif
>
> extern int arch_asym_cpu_priority(int cpu);
> +extern int arch_sched_node_distance(int from, int to);
> +extern int sched_avg_remote_numa_distance;
>
> struct sched_domain_attr {
> int relax_domain_level;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 977e133bb8a4..6c0ff62322cb 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1591,10 +1591,13 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
> enum numa_topology_type sched_numa_topology_type;
>
> static int sched_domains_numa_levels;
> +static int sched_numa_node_levels;
> static int sched_domains_curr_level;
>
> int sched_max_numa_distance;
> +int sched_avg_remote_numa_distance;
> static int *sched_domains_numa_distance;
> +static int *sched_numa_node_distance;
> static struct cpumask ***sched_domains_numa_masks;
> #endif /* CONFIG_NUMA */
>
> @@ -1808,10 +1811,10 @@ bool find_numa_distance(int distance)
> return true;
>
> rcu_read_lock();
> - distances = rcu_dereference(sched_domains_numa_distance);
> + distances = rcu_dereference(sched_numa_node_distance);
> if (!distances)
> goto unlock;
> - for (i = 0; i < sched_domains_numa_levels; i++) {
> + for (i = 0; i < sched_numa_node_levels; i++) {
> if (distances[i] == distance) {
> found = true;
> break;
I'm assuming (because it's not actually stated anywhere) that
sched_numa_$FOO is based on the SLIT table, while sched_domain_$FOO is
the modified thing.
And you're saying it makes a significant difference to
preferred_group_nid()?
> +static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
> + int **dist, int *levels)
> +
That's a coding style fail; use cino=(0:0.
> {
> - struct sched_domain_topology_level *tl;
> unsigned long *distance_map;
> int nr_levels = 0;
> int i, j;
> int *distances;
> - struct cpumask ***masks;
>
> /*
> * O(nr_nodes^2) de-duplicating selection sort -- in order to find the
> @@ -1902,17 +1923,17 @@ void sched_init_numa(int offline_node)
> */
> distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
> if (!distance_map)
> - return;
> + return -ENOMEM;
>
> bitmap_zero(distance_map, NR_DISTANCE_VALUES);
> for_each_cpu_node_but(i, offline_node) {
> for_each_cpu_node_but(j, offline_node) {
> - int distance = node_distance(i, j);
> + int distance = n_dist(i, j);
>
> if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
> sched_numa_warn("Invalid distance value range");
> bitmap_free(distance_map);
> - return;
> + return -EINVAL;
> }
>
> bitmap_set(distance_map, distance, 1);
> @@ -1927,17 +1948,66 @@ void sched_init_numa(int offline_node)
> distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
> if (!distances) {
> bitmap_free(distance_map);
> - return;
> + return -ENOMEM;
> }
> -
> for (i = 0, j = 0; i < nr_levels; i++, j++) {
> j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
> distances[i] = j;
> }
> - rcu_assign_pointer(sched_domains_numa_distance, distances);
> + *dist = distances;
> + *levels = nr_levels;
>
> bitmap_free(distance_map);
>
> + return 0;
> +}
> +
> +static int avg_remote_numa_distance(int offline_node)
> +{
> + int i, j;
> + int distance, nr_remote = 0, total_distance = 0;
> +
> + for_each_cpu_node_but(i, offline_node) {
> + for_each_cpu_node_but(j, offline_node) {
> + distance = node_distance(i, j);
> +
> + if (distance >= REMOTE_DISTANCE) {
> + nr_remote++;
> + total_distance += distance;
> + }
> + }
> + }
> + if (nr_remote)
> + return total_distance / nr_remote;
> + else
> + return REMOTE_DISTANCE;
> +}
> +
> +void sched_init_numa(int offline_node)
> +{
> + struct sched_domain_topology_level *tl;
> + int nr_levels, nr_node_levels;
> + int i, j;
> + int *distances, *domain_distances;
> + struct cpumask ***masks;
> +
> + if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
> + &nr_node_levels))
> + return;
> +
> + WRITE_ONCE(sched_avg_remote_numa_distance,
> + avg_remote_numa_distance(offline_node));
What is the point of all this? sched_avg_remote_numa_distance isn't
actually used anywhere. I'm thinking it doesn't want to be in this patch
at the very least.
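For reference, the de-duplication that sched_record_numa_dist() performs in the hunks above boils down to the following simplified userspace sketch (a bool array standing in for the kernel's bitmap, no RCU, no offline-node handling):

```c
#include <assert.h>
#include <string.h>

#define NR_DISTANCE_VALUES 256
#define LOCAL_DISTANCE 10

/*
 * Mark every pairwise distance seen, then emit the unique values in
 * ascending order.  Returns the number of distance levels, or -1 if a
 * distance falls outside [LOCAL_DISTANCE, NR_DISTANCE_VALUES).
 */
static int record_numa_dist(const int *table, int nr_nodes, int *distances)
{
	_Bool seen[NR_DISTANCE_VALUES];
	int i, j, nr_levels = 0;

	memset(seen, 0, sizeof(seen));
	for (i = 0; i < nr_nodes; i++) {
		for (j = 0; j < nr_nodes; j++) {
			int d = table[i * nr_nodes + j];

			if (d < LOCAL_DISTANCE || d >= NR_DISTANCE_VALUES)
				return -1;	/* invalid distance value range */
			seen[d] = 1;
		}
	}
	for (i = 0; i < NR_DISTANCE_VALUES; i++)
		if (seen[i])
			distances[nr_levels++] = i;
	return nr_levels;
}
```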
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode
2025-09-11 18:30 ` [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode Tim Chen
2025-09-12 5:08 ` K Prateek Nayak
2025-09-12 5:39 ` Chen, Yu C
@ 2025-09-15 12:46 ` Peter Zijlstra
2 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2025-09-15 12:46 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes,
Arjan Van De Ven
On Thu, Sep 11, 2025 at 11:30:57AM -0700, Tim Chen wrote:
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 33e166f6ab12..3f894c525e49 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
> set_sched_topology(topology);
> }
>
> +int arch_sched_node_distance(int from, int to)
> +{
> + int d = node_distance(from, to);
> +
> + if (!x86_has_numa_in_package)
> + return d;
> +
> + switch (boot_cpu_data.x86_vfm) {
> + case INTEL_GRANITERAPIDS_X:
> + case INTEL_ATOM_DARKMONT_X:
> + if (d < REMOTE_DISTANCE)
> + return d;
> +
> + /*
> + * Trim finer distance tuning for nodes in remote package
> + * for the purpose of building sched domains. Put NUMA nodes
> + * in each remote package in the same sched group.
> + * Simplify NUMA domains and avoid extra NUMA levels including
> + * different NUMA nodes in remote packages.
> + *
> + * GNR and CWF don't expect systems with more than 2 packages
> + * and more than 2 hops between packages.
> + */
> + d = sched_avg_remote_numa_distance;
So all of that avg_remote crap should live here, and in this patch. It
really should not be in generic code.
You really need to assert this 'expectation', otherwise weird stuff will
happen. The whole 'avg_remote' thing hard relies on there being a single
remote package.
> + }
> + return d;
> +}
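A concrete illustration of why the averaging hard-relies on a single remote package, using hypothetical distances for a three-package line topology (A-B and B-C one hop apart at 21, A-C two hops at 41; the numbers are made up):

```c
#include <assert.h>

/*
 * Hypothetical three-package system: packages A, B, C in a line.
 * One-hop remote distance 21 (A<->B, B<->C), two-hop distance 41
 * (A<->C).  Averaging the remote entries, as
 * avg_remote_numa_distance() does, produces a single value that
 * corresponds to neither hop count.
 */
static int avg_remote(const int *remote, int n)
{
	int i, sum = 0;

	for (i = 0; i < n; i++)
		sum += remote[i];
	return n ? sum / n : 30;	/* REMOTE_DISTANCE fallback */
}
```

Flattening every remote distance to that average erases the 1-hop vs 2-hop distinction, so a 2-hop package would land in the same sched group as a 1-hop one.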
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/2] sched: Create architecture specific sched domain distances
2025-09-12 3:23 ` K Prateek Nayak
@ 2025-09-15 16:44 ` Tim Chen
2025-09-17 6:45 ` K Prateek Nayak
0 siblings, 1 reply; 18+ messages in thread
From: Tim Chen @ 2025-09-15 16:44 UTC (permalink / raw)
To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar
Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu,
Vinicius Costa Gomes, Arjan Van De Ven
On Fri, 2025-09-12 at 08:53 +0530, K Prateek Nayak wrote:
> Hello Tim,
>
> On 9/12/2025 12:00 AM, Tim Chen wrote:
> > +static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
> > + int **dist, int *levels)
> > +
>
> nit. Is the blank line above intentional?
>
> Also personally I prefer breaking the two lines above as:
>
> static int
> sched_record_numa_dist(int offline_node, int (*n_dist)(int, int), int **dist, int *levels)
That would exceed 80 characters. So we would still need to move some parameters to a different
line to keep within the limit.
> {
> ...
> }
>
> > {
> > - struct sched_domain_topology_level *tl;
> > unsigned long *distance_map;
>
> Since we are breaking this out and adding return values, can we also
> cleanup that bitmap_free() before every return with __free(bitmap) like:
>
> (Only build tested)
Yes, __free() will be better here.
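For context, __free() builds on the GCC/Clang scope-based cleanup attribute. A simplified userspace sketch of the pattern the suggested diff uses — note the real kernel __free() takes a name registered with DEFINE_FREE(), not a raw function, so this stand-in is only illustrative:

```c
#include <assert.h>
#include <stdlib.h>

static int nr_frees;

static void free_ptr(void *p)
{
	free(*(void **)p);
	nr_frees++;
}

/* simplified stand-in for the kernel's __free() cleanup annotation */
#define __free(fn) __attribute__((cleanup(fn)))

/* every return path releases 'map' automatically, as in the diff */
static int demo(int fail)
{
	void *map __free(free_ptr) = malloc(32);

	if (!map)
		return -12;	/* -ENOMEM */
	if (fail)
		return -22;	/* -EINVAL: early return, map still freed */
	return 0;
}
```

This is what lets the error paths drop their explicit bitmap_free() calls.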
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 6c0ff62322cb..baa79e79ced8 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1910,9 +1910,8 @@ static int numa_node_dist(int i, int j)
>
> static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
> int **dist, int *levels)
> -
> {
> - unsigned long *distance_map;
> + unsigned long *distance_map __free(bitmap) = NULL;
> int nr_levels = 0;
> int i, j;
> int *distances;
> @@ -1932,7 +1931,6 @@ static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
>
> if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
> sched_numa_warn("Invalid distance value range");
> - bitmap_free(distance_map);
> return -EINVAL;
> }
>
> @@ -1946,19 +1944,17 @@ static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
> nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);
>
> distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
> - if (!distances) {
> - bitmap_free(distance_map);
> + if (!distances)
> return -ENOMEM;
> - }
> +
> for (i = 0, j = 0; i < nr_levels; i++, j++) {
> j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
> distances[i] = j;
> }
> +
> *dist = distances;
> *levels = nr_levels;
>
> - bitmap_free(distance_map);
> -
> return 0;
> }
>
> ---
>
> > int nr_levels = 0;
> > int i, j;
> > int *distances;
> > - struct cpumask ***masks;
> >
> > /*
> > * O(nr_nodes^2) de-duplicating selection sort -- in order to find the
> > @@ -1902,17 +1923,17 @@ void sched_init_numa(int offline_node)
> > */
> > distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
> > if (!distance_map)
> > - return;
> > + return -ENOMEM;
> >
> > bitmap_zero(distance_map, NR_DISTANCE_VALUES);
> > for_each_cpu_node_but(i, offline_node) {
> > for_each_cpu_node_but(j, offline_node) {
> > - int distance = node_distance(i, j);
> > + int distance = n_dist(i, j);
> >
> > if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
> > sched_numa_warn("Invalid distance value range");
> > bitmap_free(distance_map);
> > - return;
> > + return -EINVAL;
> > }
> >
> > bitmap_set(distance_map, distance, 1);
> > @@ -1927,17 +1948,66 @@ void sched_init_numa(int offline_node)
> > distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
> > if (!distances) {
> > bitmap_free(distance_map);
> > - return;
> > + return -ENOMEM;
> > }
> > -
> > for (i = 0, j = 0; i < nr_levels; i++, j++) {
> > j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
> > distances[i] = j;
> > }
> > - rcu_assign_pointer(sched_domains_numa_distance, distances);
> > + *dist = distances;
> > + *levels = nr_levels;
> >
> > bitmap_free(distance_map);
> >
> > + return 0;
> > +}
> > +
> > +static int avg_remote_numa_distance(int offline_node)
> > +{
> > + int i, j;
> > + int distance, nr_remote = 0, total_distance = 0;
> > +
> > + for_each_cpu_node_but(i, offline_node) {
> > + for_each_cpu_node_but(j, offline_node) {
> > + distance = node_distance(i, j);
> > +
> > + if (distance >= REMOTE_DISTANCE) {
> > + nr_remote++;
> > + total_distance += distance;
> > + }
> > + }
> > + }
> > + if (nr_remote)
> > + return total_distance / nr_remote;
> > + else
> > + return REMOTE_DISTANCE;
> > +}
> > +
> > +void sched_init_numa(int offline_node)
> > +{
> > + struct sched_domain_topology_level *tl;
> > + int nr_levels, nr_node_levels;
> > + int i, j;
> > + int *distances, *domain_distances;
> > + struct cpumask ***masks;
> > +
> > + if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
> > + &nr_node_levels))
> > + return;
> > +
> > + WRITE_ONCE(sched_avg_remote_numa_distance,
> > + avg_remote_numa_distance(offline_node));
>
> nit.
>
> Can add a small comment here saying arch_sched_node_distance() may
> depend on sched_avg_remote_numa_distance and requires it to be
> initialized correctly before computing domain_distances.
Sure.
Thanks for the review.
Tim
>
> Apart from those nitpicks, the changes look good to me. Please feel free
> to include:
>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/2] sched: Create architecture specific sched domain distances
2025-09-12 5:24 ` Chen, Yu C
@ 2025-09-15 16:49 ` Tim Chen
2025-09-15 17:16 ` Tim Chen
1 sibling, 0 replies; 18+ messages in thread
From: Tim Chen @ 2025-09-15 16:49 UTC (permalink / raw)
To: Chen, Yu C, Peter Zijlstra, Ingo Molnar
Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy,
Zhao Liu, Vinicius Costa Gomes, Arjan Van De Ven
On Fri, 2025-09-12 at 13:24 +0800, Chen, Yu C wrote:
> On 9/12/2025 2:30 AM, Tim Chen wrote:
> > Allow architecture specific sched domain NUMA distances that can be
> > modified from NUMA node distances for the purpose of building NUMA
> > sched domains.
> >
> > The actual NUMA distances are kept separately. This allows for NUMA
> > domain levels modification when building sched domains for specific
> > architectures.
> >
> > Consolidate the recording of unique NUMA distances in an array to
> > sched_record_numa_dist() so the function can be reused to record NUMA
> > distances when the NUMA distance metric is changed.
> >
> > No functional change if no arch specific NUMA distances are defined.
> >
>
> [snip]
>
> > +
> > +void sched_init_numa(int offline_node)
> > +{
> > + struct sched_domain_topology_level *tl;
> > + int nr_levels, nr_node_levels;
> > + int i, j;
> > + int *distances, *domain_distances;
> > + struct cpumask ***masks;
> > +
> > + if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
> > + &nr_node_levels))
> > + return;
> > +
> > + WRITE_ONCE(sched_avg_remote_numa_distance,
> > + avg_remote_numa_distance(offline_node));
> > +
> > + if (sched_record_numa_dist(offline_node,
> > + arch_sched_node_distance, &domain_distances,
> > + &nr_levels)) {
> > + kfree(distances);
> > + return;
> > + }
> > + rcu_assign_pointer(sched_numa_node_distance, distances);
> > + WRITE_ONCE(sched_max_numa_distance, distances[nr_node_levels - 1]);
>
> [snip]
>
> > @@ -2022,7 +2097,6 @@ void sched_init_numa(int offline_node)
> > sched_domain_topology = tl;
> >
> > sched_domains_numa_levels = nr_levels;
> > - WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
> >
>
> Before this patch, sched_max_numa_distance was assigned a valid
> value at the end of sched_init_numa(), after sched_domains_numa_masks
> and sched_domain_topology_level were successfully created or appended,
> i.e. after the kzalloc() calls had succeeded.
>
> Now we assign sched_max_numa_distance earlier, without considering
> the status of NUMA sched domains. I think this is intended, because
> sched domains are only for generic load balancing, while
> sched_max_numa_distance is for NUMA load balancing; in theory, they
> use different metrics in their strategies. Thus, this change should
> not cause any issues.
Yes, now sched_max_numa_distance is used in conjunction with
sched_numa_node_distance. So putting them together is okay.
>
> From my understanding,
>
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Thanks.
Tim
>
> thanks,
> Chenyu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/2] sched: Create architecture specific sched domain distances
2025-09-15 12:37 ` Peter Zijlstra
@ 2025-09-15 17:13 ` Tim Chen
2025-09-15 20:04 ` Tim Chen
0 siblings, 1 reply; 18+ messages in thread
From: Tim Chen @ 2025-09-15 17:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes,
Arjan Van De Ven
On Mon, 2025-09-15 at 14:37 +0200, Peter Zijlstra wrote:
> On Thu, Sep 11, 2025 at 11:30:56AM -0700, Tim Chen wrote:
> > Allow architecture specific sched domain NUMA distances that can be
> > modified from NUMA node distances for the purpose of building NUMA
> > sched domains.
> >
> > The actual NUMA distances are kept separately. This allows for NUMA
> > domain levels modification when building sched domains for specific
> > architectures.
> >
> > Consolidate the recording of unique NUMA distances in an array to
> > sched_record_numa_dist() so the function can be reused to record NUMA
> > distances when the NUMA distance metric is changed.
> >
> > No functional change if no arch specific NUMA distances are defined.
>
> Keeping both metrics side-by-side is confusing -- and not very well
> justified by the above.
>
> Is there any appreciable benefit to mixing the two like this?
We could make sched_domains_numa_distance point to the same array as
sched_numa_node_distance if no arch specific NUMA distances are
defined. However, that would add some wrinkles to the record and
deallocation logic to detect whether the two arrays are shared.
I've opted to always keep two separate metrics to keep the code
logic simpler.
Let me know if you prefer otherwise.
>
> >
> > Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> > Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> > include/linux/sched/topology.h | 2 +
> > kernel/sched/topology.c | 114 ++++++++++++++++++++++++++++-----
> > 2 files changed, 99 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 5263746b63e8..4f58e78ca52e 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -59,6 +59,8 @@ static inline int cpu_numa_flags(void)
> > #endif
> >
> > extern int arch_asym_cpu_priority(int cpu);
> > +extern int arch_sched_node_distance(int from, int to);
> > +extern int sched_avg_remote_numa_distance;
> >
> > struct sched_domain_attr {
> > int relax_domain_level;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 977e133bb8a4..6c0ff62322cb 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1591,10 +1591,13 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
> > enum numa_topology_type sched_numa_topology_type;
> >
> > static int sched_domains_numa_levels;
> > +static int sched_numa_node_levels;
> > static int sched_domains_curr_level;
> >
> > int sched_max_numa_distance;
> > +int sched_avg_remote_numa_distance;
> > static int *sched_domains_numa_distance;
> > +static int *sched_numa_node_distance;
> > static struct cpumask ***sched_domains_numa_masks;
> > #endif /* CONFIG_NUMA */
> >
> > @@ -1808,10 +1811,10 @@ bool find_numa_distance(int distance)
> > return true;
> >
> > rcu_read_lock();
> > - distances = rcu_dereference(sched_domains_numa_distance);
> > + distances = rcu_dereference(sched_numa_node_distance);
> > if (!distances)
> > goto unlock;
> > - for (i = 0; i < sched_domains_numa_levels; i++) {
> > + for (i = 0; i < sched_numa_node_levels; i++) {
> > if (distances[i] == distance) {
> > found = true;
> > break;
>
> > I'm assuming (because it's not actually stated anywhere) that
> sched_numa_$FOO is based on the SLIT table, while sched_domain_$FOO is
> the modified thing.
>
> And you're saying it makes a significant difference to
> preferred_group_nid()?
>
> > +static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
> > + int **dist, int *levels)
> > +
>
> That's a coding style fail; use cino=(0:0.
>
> > {
> > - struct sched_domain_topology_level *tl;
> > unsigned long *distance_map;
> > int nr_levels = 0;
> > int i, j;
> > int *distances;
> > - struct cpumask ***masks;
> >
> > /*
> > * O(nr_nodes^2) de-duplicating selection sort -- in order to find the
> > @@ -1902,17 +1923,17 @@ void sched_init_numa(int offline_node)
> > */
> > distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
> > if (!distance_map)
> > - return;
> > + return -ENOMEM;
> >
> > bitmap_zero(distance_map, NR_DISTANCE_VALUES);
> > for_each_cpu_node_but(i, offline_node) {
> > for_each_cpu_node_but(j, offline_node) {
> > - int distance = node_distance(i, j);
> > + int distance = n_dist(i, j);
> >
> > if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
> > sched_numa_warn("Invalid distance value range");
> > bitmap_free(distance_map);
> > - return;
> > + return -EINVAL;
> > }
> >
> > bitmap_set(distance_map, distance, 1);
> > @@ -1927,17 +1948,66 @@ void sched_init_numa(int offline_node)
> > distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
> > if (!distances) {
> > bitmap_free(distance_map);
> > - return;
> > + return -ENOMEM;
> > }
> > -
> > for (i = 0, j = 0; i < nr_levels; i++, j++) {
> > j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
> > distances[i] = j;
> > }
> > - rcu_assign_pointer(sched_domains_numa_distance, distances);
> > + *dist = distances;
> > + *levels = nr_levels;
> >
> > bitmap_free(distance_map);
> >
> > + return 0;
> > +}
> > +
> > +static int avg_remote_numa_distance(int offline_node)
> > +{
> > + int i, j;
> > + int distance, nr_remote = 0, total_distance = 0;
> > +
> > + for_each_cpu_node_but(i, offline_node) {
> > + for_each_cpu_node_but(j, offline_node) {
> > + distance = node_distance(i, j);
> > +
> > + if (distance >= REMOTE_DISTANCE) {
> > + nr_remote++;
> > + total_distance += distance;
> > + }
> > + }
> > + }
> > + if (nr_remote)
> > + return total_distance / nr_remote;
> > + else
> > + return REMOTE_DISTANCE;
> > +}
> > +
> > +void sched_init_numa(int offline_node)
> > +{
> > + struct sched_domain_topology_level *tl;
> > + int nr_levels, nr_node_levels;
> > + int i, j;
> > + int *distances, *domain_distances;
> > + struct cpumask ***masks;
> > +
> > + if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
> > + &nr_node_levels))
> > + return;
> > +
> > + WRITE_ONCE(sched_avg_remote_numa_distance,
> > + avg_remote_numa_distance(offline_node));
>
> What is the point of all this? sched_avg_remote_numa_distance isn't
> actually used anywhere. I'm thinking it doesn't want to be in this patch
> at the very least.
sched_avg_remote_numa_distance actually could change when we offline/online a
node. I think arch_sched_node_distance(i, j) needs to be changed to
arch_sched_node_distance(i, j, offline_node) so it knows not to include
offline_node in its avg distance computation. I will do that then.
Thanks for reviewing the code.
Tim
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode
2025-09-12 5:08 ` K Prateek Nayak
@ 2025-09-15 17:15 ` Tim Chen
0 siblings, 0 replies; 18+ messages in thread
From: Tim Chen @ 2025-09-15 17:15 UTC (permalink / raw)
To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar
Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu,
Vinicius Costa Gomes, Arjan Van De Ven
On Fri, 2025-09-12 at 10:38 +0530, K Prateek Nayak wrote:
> Hello Tim,
>
> On 9/12/2025 12:00 AM, Tim Chen wrote:
> > It is possible for Granite Rapids (GNR) and Clearwater Forest
> > (CWF) to have up to 3 dies per package. When sub-numa cluster (SNC-3)
> > is enabled, each die will become a separate NUMA node in the package
> > with different distances between dies within the same package.
> >
> > For example, on GNR, we see the following numa distances for a 2 socket
> > system with 3 dies per socket:
> >
> >       package 1        package 2
> >           ----------------
> >           |              |
> >       ---------      ---------
> >       |   0   |      |   3   |
> >       ---------      ---------
> >           |              |
> >       ---------      ---------
> >       |   1   |      |   4   |
> >       ---------      ---------
> >           |              |
> >       ---------      ---------
> >       |   2   |      |   5   |
> >       ---------      ---------
> >           |              |
> >           ----------------
> >
> > node distances:
> > node 0 1 2 3 4 5
> > 0: 10 15 17 21 28 26
> > 1: 15 10 15 23 26 23
> > 2: 17 15 10 26 23 21
> > 3: 21 28 26 10 15 17
> > 4: 23 26 23 15 10 15
> > 5: 26 23 21 17 15 10
> >
> > The node distances above led to 2 problems:
> >
> > 1. Asymmetric routes taken between nodes in different packages led to
> > an asymmetric scheduler domain perspective depending on which node you
> > are on. The current scheduler code failed to build domains properly with
> > asymmetric distances.
> >
> > 2. Multiple remote distances to the respective tiles on the remote package
> > created too many levels of domain hierarchy, grouping different nodes
> > between remote packages.
> >
> > For example, the above GNR-X topology led to the NUMA domains below:
> >
> > Sched domains from the perspective of a CPU in node 0, where the number
> > in bracket represent node number.
> >
> > NUMA-level 1 [0,1] [2]
> > NUMA-level 2 [0,1,2] [3]
> > NUMA-level 3 [0,1,2,3] [5]
> > NUMA-level 4 [0,1,2,3,5] [4]
> >
> > Sched domains from the perspective of a CPU in node 4
> > NUMA-level 1 [4] [3,5]
> > NUMA-level 2 [3,4,5] [0,2]
> > NUMA-level 3 [0,2,3,4,5] [1]
> >
> > Scheduler group peers for load balancing from the perspective of CPUs
> > in nodes 0 and 4 are different. An improper task could be chosen for
> > load balancing between groups such as [0,2,3,4,5] [1]. Ideally, nodes
> > 0 or 2, which are in the same package as node 1, should be chosen
> > first. Instead, tasks in remote package nodes 3, 4 and 5 could be
> > chosen with equal chance, leading to excessive remote package
> > migrations and load imbalance between packages. We should not group
> > partial remote nodes and local nodes together.
> >
> > Simplify the remote distances for CWF-X and GNR-X for the purpose of
> > sched domain building, which maintains symmetry and leads to a more
> > reasonable load balance hierarchy.
> >
> > The sched domains from the perspective of a CPU in node 0 are now:
> > NUMA-level 1 [0,1] [2]
> > NUMA-level 2 [0,1,2] [3,4,5]
> >
> > The sched domains from the perspective of a CPU in node 4 are now:
> > NUMA-level 1 [4] [3,5]
> > NUMA-level 2 [3,4,5] [0,1,2]
> >
> > We have the same balancing perspective from node 0 or node 4. Loads are
> > now balanced equally between packages.
> >
> > Tested-by: Zhao Liu <zhao1.liu@intel.com>
> > Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> > Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>
> Feel free to include:
>
> Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Thanks for reviewing and testing.
Tim
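To make the fix concrete, the flattening can be checked against the distance table quoted above. A userspace sketch — the node-to-package mapping is taken from the diagram, and sched_dist() is an illustrative helper, not the patch's actual interface:

```c
#include <assert.h>

#define N 6

/*
 * The asymmetric distance table from the changelog: nodes 0-2 in
 * package 0, nodes 3-5 in package 1.  Note e.g. d[0][4] = 28 but
 * d[4][0] = 23.
 */
static const int d[N][N] = {
	{ 10, 15, 17, 21, 28, 26 },
	{ 15, 10, 15, 23, 26, 23 },
	{ 17, 15, 10, 26, 23, 21 },
	{ 21, 28, 26, 10, 15, 17 },
	{ 23, 26, 23, 15, 10, 15 },
	{ 26, 23, 21, 17, 15, 10 },
};

static int pkg(int n) { return n / 3; }

/* integer average of all cross-package distances, as in the patch */
static int avg_remote_distance(void)
{
	int i, j, sum = 0, nr = 0;

	for (i = 0; i < N; i++)
		for (j = 0; j < N; j++)
			if (pkg(i) != pkg(j)) {
				sum += d[i][j];
				nr++;
			}
	return sum / nr;
}

/* flattened metric: in-package distances kept, remote ones averaged */
static int sched_dist(int i, int j, int avg)
{
	return pkg(i) == pkg(j) ? d[i][j] : avg;
}
```

Substituting the average for every cross-package entry restores symmetry, which is what lets the domain build produce the same NUMA levels from either package's perspective.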
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/2] sched: Create architecture specific sched domain distances
2025-09-12 5:24 ` Chen, Yu C
2025-09-15 16:49 ` Tim Chen
@ 2025-09-15 17:16 ` Tim Chen
1 sibling, 0 replies; 18+ messages in thread
From: Tim Chen @ 2025-09-15 17:16 UTC (permalink / raw)
To: Chen, Yu C, Peter Zijlstra, Ingo Molnar
Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy,
Zhao Liu, Vinicius Costa Gomes, Arjan Van De Ven
On Fri, 2025-09-12 at 13:24 +0800, Chen, Yu C wrote:
> On 9/12/2025 2:30 AM, Tim Chen wrote:
> > Allow architecture specific sched domain NUMA distances that can be
> > modified from NUMA node distances for the purpose of building NUMA
> > sched domains.
> >
> > The actual NUMA distances are kept separately. This allows for NUMA
> > domain levels modification when building sched domains for specific
> > architectures.
> >
> > Consolidate the recording of unique NUMA distances in an array to
> > sched_record_numa_dist() so the function can be reused to record NUMA
> > distances when the NUMA distance metric is changed.
> >
> > No functional change if no arch specific NUMA distances are defined.
> >
>
> [snip]
>
> > +
> > +void sched_init_numa(int offline_node)
> > +{
> > + struct sched_domain_topology_level *tl;
> > + int nr_levels, nr_node_levels;
> > + int i, j;
> > + int *distances, *domain_distances;
> > + struct cpumask ***masks;
> > +
> > + if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
> > + &nr_node_levels))
> > + return;
> > +
> > + WRITE_ONCE(sched_avg_remote_numa_distance,
> > + avg_remote_numa_distance(offline_node));
> > +
> > + if (sched_record_numa_dist(offline_node,
> > + arch_sched_node_distance, &domain_distances,
> > + &nr_levels)) {
> > + kfree(distances);
> > + return;
> > + }
> > + rcu_assign_pointer(sched_numa_node_distance, distances);
> > + WRITE_ONCE(sched_max_numa_distance, distances[nr_node_levels - 1]);
>
> [snip]
>
> > @@ -2022,7 +2097,6 @@ void sched_init_numa(int offline_node)
> > sched_domain_topology = tl;
> >
> > sched_domains_numa_levels = nr_levels;
> > - WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
> >
>
> Before this patch, sched_max_numa_distance was assigned a valid
> value at the end of sched_init_numa(), only after
> sched_domains_numa_masks and sched_domain_topology_level had been
> successfully created or appended, i.e. after the kzalloc() calls had
> succeeded.
>
> Now we assign sched_max_numa_distance earlier, without considering
> the status of NUMA sched domains. I think this is intended, because
> sched domains are only for generic load balancing, while
> sched_max_numa_distance is for NUMA load balancing; in theory, they
> use different metrics in their strategies. Thus, this change should
> not cause any issues.
>
> From my understanding,
>
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
>
> thanks,
> Chenyu
* Re: [PATCH v3 1/2] sched: Create architecture specific sched domain distances
2025-09-15 17:13 ` Tim Chen
@ 2025-09-15 20:04 ` Tim Chen
0 siblings, 0 replies; 18+ messages in thread
From: Tim Chen @ 2025-09-15 20:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes,
Arjan Van De Ven
On Mon, 2025-09-15 at 10:13 -0700, Tim Chen wrote:
> On Mon, 2025-09-15 at 14:37 +0200, Peter Zijlstra wrote:
> > On Thu, Sep 11, 2025 at 11:30:56AM -0700, Tim Chen wrote:
<snip>
> > > +
> > > + if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
> > > + &nr_node_levels))
> > > + return;
> > > +
> > > + WRITE_ONCE(sched_avg_remote_numa_distance,
> > > + avg_remote_numa_distance(offline_node));
> >
> > What is the point of all this? sched_avg_remote_numa_distance isn't
> > actually used anywhere. I'm thinking it doesn't want to be in this patch
> > at the very least.
>
> sched_avg_remote_numa_distance actually could change when we offline/online a
> node. I think arch_sched_node_distance(i, j) needs to be changed to
> arch_sched_node_distance(i, j, offline_node) so it knows not to include
> offline_node in its avg distance computation. I will do that then.
>
On second thought, the GNR and CWF topologies only need an average
remote distance. Whether the offline node is included in the
computation does not change the resulting sched domain topology. So
passing the offline node is not necessary; I will update the code
accordingly and move the average distance computation to the x86
specific file.
Tim
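[Editor's note: the average-remote-distance computation discussed here can be sketched as follows. The distance table, the package threshold of 12, and `avg_remote_dist` are all hypothetical illustrations of `avg_remote_numa_distance()` for an SNC-3-style two-package system, not the x86 implementation itself.]

```c
#include <assert.h>

/* Hypothetical SNC-3-style table: three nodes per package, with the
 * asymmetric cross-package distances that triggered the build errors. */
#define NR_NODES 6
static const int node_dist[NR_NODES][NR_NODES] = {
	{ 10, 12, 12, 21, 23, 25 },
	{ 12, 10, 12, 23, 21, 23 },
	{ 12, 12, 10, 25, 23, 21 },
	{ 21, 23, 25, 10, 12, 12 },
	{ 23, 21, 23, 12, 10, 12 },
	{ 25, 23, 21, 12, 12, 10 },
};

/*
 * Average distance from @node to nodes outside its package, i.e. to
 * nodes whose distance exceeds the assumed same-package threshold.
 * Using this single value for every remote node flattens the
 * asymmetric cross-package distances into one NUMA level.
 */
static int avg_remote_dist(int node, int pkg_threshold)
{
	int sum = 0, n = 0;

	for (int j = 0; j < NR_NODES; j++) {
		if (node_dist[node][j] <= pkg_threshold)
			continue;
		sum += node_dist[node][j];
		n++;
	}
	return n ? sum / n : 0;
}
```

For node 0 the remote distances 21, 23, 25 average to 23; substituting 23 for all three removes the asymmetry that the sched domain build code could not handle.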
* Re: [PATCH v3 1/2] sched: Create architecture specific sched domain distances
2025-09-15 16:44 ` Tim Chen
@ 2025-09-17 6:45 ` K Prateek Nayak
0 siblings, 0 replies; 18+ messages in thread
From: K Prateek Nayak @ 2025-09-17 6:45 UTC (permalink / raw)
To: Tim Chen, Peter Zijlstra, Ingo Molnar
Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu,
Vinicius Costa Gomes, Arjan Van De Ven
Hello Tim,
On 9/15/2025 10:14 PM, Tim Chen wrote:
> On Fri, 2025-09-12 at 08:53 +0530, K Prateek Nayak wrote:
>> Hello Tim,
>>
>> On 9/12/2025 12:00 AM, Tim Chen wrote:
>>> +static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
>>> + int **dist, int *levels)
>>> +
>> nit. Is the blank line above intentional?
>>
>> Also personally I prefer breaking the two lines above as:
>>
>> static int
>> sched_record_numa_dist(int offline_node, int (*n_dist)(int, int), int **dist, int *levels)
> That would exceed 80 characters. So we would still need to move some parameters to a different
> line to keep within the limit.
Well, build_balance_mask() in the same file follows this pattern and
exceeds 80 characters. Maybe it is alright as long as it is under 100
characters :)
--
Thanks and Regards,
Prateek
end of thread, other threads:[~2025-09-17 6:45 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-11 18:30 [PATCH v3 0/2] Fix NUMA sched domain build errors for GNR and CWF Tim Chen
2025-09-11 18:30 ` [PATCH v3 1/2] sched: Create architecture specific sched domain distances Tim Chen
2025-09-12 3:23 ` K Prateek Nayak
2025-09-15 16:44 ` Tim Chen
2025-09-17 6:45 ` K Prateek Nayak
2025-09-12 5:24 ` Chen, Yu C
2025-09-15 16:49 ` Tim Chen
2025-09-15 17:16 ` Tim Chen
2025-09-15 12:37 ` Peter Zijlstra
2025-09-15 17:13 ` Tim Chen
2025-09-15 20:04 ` Tim Chen
2025-09-11 18:30 ` [PATCH v3 2/2] sched: Fix sched domain build error for GNR, CWF in SNC-3 mode Tim Chen
2025-09-12 5:08 ` K Prateek Nayak
2025-09-15 17:15 ` Tim Chen
2025-09-12 5:39 ` Chen, Yu C
2025-09-12 9:23 ` K Prateek Nayak
2025-09-12 11:59 ` Chen, Yu C
2025-09-15 12:46 ` Peter Zijlstra