* [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X
@ 2025-08-22 20:14 Tim Chen
2025-08-22 20:14 ` [PATCH 1/2] sched: topology: Fix topology validation error Tim Chen
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Tim Chen @ 2025-08-22 20:14 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
Len Brown, linux-kernel, Chen Yu, K Prateek Nayak,
Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes
While testing Granite Rapids X (GNR-X) and Clearwater Forest X (CWF-X) in
SNc-3 mode, we encountered sched domain build errors reported in dmesg.
Asymmetric node distances from local node to to nodes in remote package
was not expected by the scheduler domain code and also led to excessive
number of sched domain hierachy levels.
Fix the missing NUMA domain level set in topology_span_sane() check and
also simplify the distance to nodes in remote package to retain distance
symmetry and make the NUMA topology sane for GNR-X and CWF-X.
Tim Chen (1):
sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode
Vinicius Costa Gomes (1):
sched: topology: Fix topology validation error
arch/x86/kernel/smpboot.c | 28 ++++++++++++++++++++++++++++
include/linux/sched/topology.h | 1 +
kernel/sched/topology.c | 33 +++++++++++++++++++++++++++------
3 files changed, 56 insertions(+), 6 deletions(-)
--
2.32.0
^ permalink raw reply [flat|nested] 14+ messages in thread* [PATCH 1/2] sched: topology: Fix topology validation error 2025-08-22 20:14 [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X Tim Chen @ 2025-08-22 20:14 ` Tim Chen 2025-08-25 3:18 ` K Prateek Nayak 2025-08-25 7:25 ` Peter Zijlstra 2025-08-22 20:14 ` [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode Tim Chen 2025-08-25 4:18 ` [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X K Prateek Nayak 2 siblings, 2 replies; 14+ messages in thread From: Tim Chen @ 2025-08-22 20:14 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar Cc: Vinicius Costa Gomes, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, Chen Yu, K Prateek Nayak, Gautham R . Shenoy, Zhao Liu, Tim Chen From: Vinicius Costa Gomes <vinicius.gomes@intel.com> As sd_numa_mask() (the function behind tl->mask() for the NUMA levels of the topology) depends on the value of sched_domains_curr_level, it's possible to be iterating over a level while, sd_numa_mask() thinks we are in another, causing the topology validation to fail (for valid cases). Set sched_domains_curr_level to the current topology level while iterating. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> --- kernel/sched/topology.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 977e133bb8a4..9a7ac67e3d63 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -2394,6 +2394,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map) for_each_sd_topology(tl) { int tl_common_flags = 0; +#ifdef CONFIG_NUMA + /* + * sd_numa_mask() (one of the possible values of + * tl->mask()) depends on the current level to work + * correctly. + */ + sched_domains_curr_level = tl->numa_level; +#endif if (tl->sd_flags) tl_common_flags = (*tl->sd_flags)(); -- 2.32.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: topology: Fix topology validation error 2025-08-22 20:14 ` [PATCH 1/2] sched: topology: Fix topology validation error Tim Chen @ 2025-08-25 3:18 ` K Prateek Nayak 2025-08-25 7:58 ` Peter Zijlstra 2025-08-25 7:25 ` Peter Zijlstra 1 sibling, 1 reply; 14+ messages in thread From: K Prateek Nayak @ 2025-08-25 3:18 UTC (permalink / raw) To: Tim Chen, Vinicius Costa Gomes, Peter Zijlstra Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu Hello Tim, Vinicius, On 8/23/2025 1:44 AM, Tim Chen wrote: > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2394,6 +2394,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map) > for_each_sd_topology(tl) { > int tl_common_flags = 0; > > +#ifdef CONFIG_NUMA > + /* > + * sd_numa_mask() (one of the possible values of > + * tl->mask()) depends on the current level to work > + * correctly. > + */ > + sched_domains_curr_level = tl->numa_level; > +#endif A similar solution was proposed in [1] and after a few iterations, we arrived at [2] as a potential solution to this issue. Now that the merge window is behind us, Peter would it be possible to pick one of these up? P.S. Leon has confirmed this solved the splat of their deployments too on an earlier version [3]. [1] https://lore.kernel.org/lkml/20250624041235.1589-1-kprateek.nayak@amd.com/ [2] https://lore.kernel.org/lkml/20250715040824.893-1-kprateek.nayak@amd.com/ [3] https://lore.kernel.org/lkml/20250720104136.GI402218@unreal/ > if (tl->sd_flags) > tl_common_flags = (*tl->sd_flags)(); > -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: topology: Fix topology validation error 2025-08-25 3:18 ` K Prateek Nayak @ 2025-08-25 7:58 ` Peter Zijlstra 2025-08-25 9:23 ` Peter Zijlstra 0 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2025-08-25 7:58 UTC (permalink / raw) To: K Prateek Nayak Cc: Tim Chen, Vinicius Costa Gomes, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu On Mon, Aug 25, 2025 at 08:48:29AM +0530, K Prateek Nayak wrote: > Hello Tim, Vinicius, > > On 8/23/2025 1:44 AM, Tim Chen wrote: > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2394,6 +2394,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map) > > for_each_sd_topology(tl) { > > int tl_common_flags = 0; > > > > +#ifdef CONFIG_NUMA > > + /* > > + * sd_numa_mask() (one of the possible values of > > + * tl->mask()) depends on the current level to work > > + * correctly. > > + */ > > + sched_domains_curr_level = tl->numa_level; > > +#endif > > A similar solution was proposed in [1] and after a few iterations, we > arrived at [2] as a potential solution to this issue. Now that the merge > window is behind us, Peter would it be possible to pick one of these up? > > P.S. Leon has confirmed this solved the splat of their deployments too > on an earlier version [3]. > > [1] https://lore.kernel.org/lkml/20250624041235.1589-1-kprateek.nayak@amd.com/ > [2] https://lore.kernel.org/lkml/20250715040824.893-1-kprateek.nayak@amd.com/ > [3] https://lore.kernel.org/lkml/20250720104136.GI402218@unreal/ I'm sure that's stuck somewhere in my holiday backlog ... Let me go try and find it. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: topology: Fix topology validation error 2025-08-25 7:58 ` Peter Zijlstra @ 2025-08-25 9:23 ` Peter Zijlstra 0 siblings, 0 replies; 14+ messages in thread From: Peter Zijlstra @ 2025-08-25 9:23 UTC (permalink / raw) To: K Prateek Nayak Cc: Tim Chen, Vinicius Costa Gomes, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu On Mon, Aug 25, 2025 at 09:58:07AM +0200, Peter Zijlstra wrote: > On Mon, Aug 25, 2025 at 08:48:29AM +0530, K Prateek Nayak wrote: > > Hello Tim, Vinicius, > > > > On 8/23/2025 1:44 AM, Tim Chen wrote: > > > --- a/kernel/sched/topology.c > > > +++ b/kernel/sched/topology.c > > > @@ -2394,6 +2394,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map) > > > for_each_sd_topology(tl) { > > > int tl_common_flags = 0; > > > > > > +#ifdef CONFIG_NUMA > > > + /* > > > + * sd_numa_mask() (one of the possible values of > > > + * tl->mask()) depends on the current level to work > > > + * correctly. > > > + */ > > > + sched_domains_curr_level = tl->numa_level; > > > +#endif > > > > A similar solution was proposed in [1] and after a few iterations, we > > arrived at [2] as a potential solution to this issue. Now that the merge > > window is behind us, Peter would it be possible to pick one of these up? > > > > P.S. Leon has confirmed this solved the splat of their deployments too > > on an earlier version [3]. > > > > [1] https://lore.kernel.org/lkml/20250624041235.1589-1-kprateek.nayak@amd.com/ > > [2] https://lore.kernel.org/lkml/20250715040824.893-1-kprateek.nayak@amd.com/ > > [3] https://lore.kernel.org/lkml/20250720104136.GI402218@unreal/ > > I'm sure that's stuck somewhere in my holiday backlog ... Let me go try > and find it. Replied there. https://lkml.kernel.org/r/20250825091910.GT3245006@noisy.programming.kicks-ass.net ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: topology: Fix topology validation error 2025-08-22 20:14 ` [PATCH 1/2] sched: topology: Fix topology validation error Tim Chen 2025-08-25 3:18 ` K Prateek Nayak @ 2025-08-25 7:25 ` Peter Zijlstra 2025-08-25 21:09 ` Tim Chen 1 sibling, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2025-08-25 7:25 UTC (permalink / raw) To: Tim Chen Cc: Ingo Molnar, Vinicius Costa Gomes, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, Chen Yu, K Prateek Nayak, Gautham R . Shenoy, Zhao Liu On Fri, Aug 22, 2025 at 01:14:14PM -0700, Tim Chen wrote: > From: Vinicius Costa Gomes <vinicius.gomes@intel.com> > > As sd_numa_mask() (the function behind tl->mask() for the NUMA levels > of the topology) depends on the value of sched_domains_curr_level, > it's possible to be iterating over a level while, sd_numa_mask() > thinks we are in another, causing the topology validation to fail (for > valid cases). > > Set sched_domains_curr_level to the current topology level while > iterating. > > Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> > --- > kernel/sched/topology.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index 977e133bb8a4..9a7ac67e3d63 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2394,6 +2394,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map) > for_each_sd_topology(tl) { > int tl_common_flags = 0; > > +#ifdef CONFIG_NUMA > + /* > + * sd_numa_mask() (one of the possible values of > + * tl->mask()) depends on the current level to work > + * correctly. > + */ This is propagating that ugly hack from sd_init(), isn't it. Except its pretending like its sane code... And for what? > + sched_domains_curr_level = tl->numa_level; > +#endif > if (tl->sd_flags) > tl_common_flags = (*tl->sd_flags)(); > if (tl_common_flags & SD_NUMA) continue; So how does this make any difference ? We should never get to calling tl->mask() for NUMA. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: topology: Fix topology validation error 2025-08-25 7:25 ` Peter Zijlstra @ 2025-08-25 21:09 ` Tim Chen 0 siblings, 0 replies; 14+ messages in thread From: Tim Chen @ 2025-08-25 21:09 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vinicius Costa Gomes, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, Chen Yu, K Prateek Nayak, Gautham R . Shenoy, Zhao Liu On Mon, 2025-08-25 at 09:25 +0200, Peter Zijlstra wrote: > On Fri, Aug 22, 2025 at 01:14:14PM -0700, Tim Chen wrote: > > From: Vinicius Costa Gomes <vinicius.gomes@intel.com> > > > > As sd_numa_mask() (the function behind tl->mask() for the NUMA levels > > of the topology) depends on the value of sched_domains_curr_level, > > it's possible to be iterating over a level while, sd_numa_mask() > > thinks we are in another, causing the topology validation to fail (for > > valid cases). > > > > Set sched_domains_curr_level to the current topology level while > > iterating. > > > > Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> > > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> > > --- > > kernel/sched/topology.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index 977e133bb8a4..9a7ac67e3d63 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2394,6 +2394,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map) > > for_each_sd_topology(tl) { > > int tl_common_flags = 0; > > > > +#ifdef CONFIG_NUMA > > + /* > > + * sd_numa_mask() (one of the possible values of > > + * tl->mask()) depends on the current level to work > > + * correctly. > > + */ > > This is propagating that ugly hack from sd_init(), isn't it. Except its > pretending like its sane code... And for what? How about the following fix for the CONFIG_NUMA case? Will this be more sane? diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index b958fe48e020..a92457fed135 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1758,7 +1758,7 @@ static struct sched_domain_topology_level *sched_domain_topology = static struct sched_domain_topology_level *sched_domain_topology_saved; #define for_each_sd_topology(tl) \ - for (tl = sched_domain_topology; tl->mask; tl++) + for (tl = sched_domain_topology; tl->mask; ++tl, sched_domains_curr_level = tl->numa_level) void __init set_sched_topology(struct sched_domain_topology_level *tl) { > > > + sched_domains_curr_level = tl->numa_level; > > +#endif > > if (tl->sd_flags) > > tl_common_flags = (*tl->sd_flags)(); > > > if (tl_common_flags & SD_NUMA) > continue; > > So how does this make any difference ? > > We should never get to calling tl->mask() for NUMA. > True. I think we originally was fixing the v6.16 case which wasn't checking for the SD_NUMA flag. Overlooked that when we ported the fix. That said, I think that the for_each_sd_topology() macro needs to have sched_domains_curr_level updated to prevent future problems. Tim ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode 2025-08-22 20:14 [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X Tim Chen 2025-08-22 20:14 ` [PATCH 1/2] sched: topology: Fix topology validation error Tim Chen @ 2025-08-22 20:14 ` Tim Chen 2025-08-25 5:08 ` Chen, Yu C 2025-08-25 4:18 ` [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X K Prateek Nayak 2 siblings, 1 reply; 14+ messages in thread From: Tim Chen @ 2025-08-22 20:14 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, Chen Yu, K Prateek Nayak, Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes It is possible for Granite Rapids X (GNR) and Clearwater Forest X (CWF) to have up to 3 dies per package. When sub-numa cluster (SNC-3) is enabled, each die will become a separate NUMA node in the package with different distances between dies within the same package. For example, on GNR-X, we see the following numa distances for a 2 socket system with 3 dies per socket: package 1 package2 ---------------- | | --------- --------- | 0 | | 3 | --------- --------- | | --------- --------- | 1 | | 4 | --------- --------- | | --------- --------- | 2 | | 5 | --------- --------- | | ---------------- node distances: node 0 1 2 3 4 5 0: 10 15 17 21 28 26 1: 15 10 15 23 26 23 2: 17 15 10 26 23 21 3: 21 28 26 10 15 17 4: 23 26 23 15 10 15 5: 26 23 21 17 15 10 The node distances above led to 2 problems: 1. Asymmetric routes taken between nodes in different packages led to asymmetric scheduler domain perspective depending on which node you are on. Current scheduler code failed to build domains properly with asymmetric distances. 2. Multiple remote distances to respective tiles on remote package create too many levels of domain hierarchies grouping different nodes between remote packages. For example, the above GNR-X topology lead to NUMA domains below: Sched domains from the perspective of a CPU in node 0, where the number in bracket represent node number. NUMA-level 1 [0,1] [2] NUMA-level 2 [0,1,2] [3] NUMA-level 3 [0,1,2,3] [5] NUMA-level 4 [0,1,2,3,5] [4] Sched domains from the perspective of a CPU in node 4 NUMA-level 1 [4] [3,5] NUMA-level 2 [3,4,5] [0,2] NUMA-level 3 [0,2,3,4,5] [1] Scheduler group peers for load balancing from the perspective of CPU 0 and 4 are different. Improper task could be chosen for load balancing between groups such as [0,2,3,4,5] [1]. Ideally you should choose nodes in 0 or 2 that are in same package as node 1 first. But instead tasks in the remote package node 3, 4, 5 could be chosen with an equal chance and could lead to excessive remote package migrations and imbalance of load between packages. We should not group partial remote nodes and local nodes together. Simplify the remote distances for CWF-X and GNR-X for the purpose of sched domains building, which maintains symmetry and leads to a more reasonable load balance hierarchy. The sched domains from the perspective of a CPU in node 0 NUMA-level 1 is now NUMA-level 1 [0,1] [2] NUMA-level 2 [0,1,2] [3,4,5] The sched domains from the perspective of a CPU in node 4 NUMA-level 1 is now NUMA-level 1 [4] [3,5] NUMA-level 2 [3,4,5] [0,1,2] We have the same balancing perspective from node 0 or node 4. Loads are now balanced equally between packages. Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Tested-by: Zhao Liu <zhao1.liu@intel.com> --- arch/x86/kernel/smpboot.c | 28 ++++++++++++++++++++++++++++ include/linux/sched/topology.h | 1 + kernel/sched/topology.c | 25 +++++++++++++++++++------ 3 files changed, 48 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 33e166f6ab12..c425e84c88b5 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -515,6 +515,34 @@ static void __init build_sched_topology(void) set_sched_topology(topology); } +int sched_node_distance(int from, int to) +{ + int d = node_distance(from, to); + + if (!x86_has_numa_in_package) + return d; + + switch (boot_cpu_data.x86_vfm) { + case INTEL_GRANITERAPIDS_X: + case INTEL_ATOM_DARKMONT_X: + if (d < REMOTE_DISTANCE) + return d; + + /* + * Trim finer distance tuning for nodes in remote package + * for the purpose of building sched domains. + * Put NUMA nodes in each remote package in a single sched group. + * Simplify NUMA domains and avoid extra NUMA levels including different + * NUMA nodes in remote packages. + * + * GNR-x and CWF-X has GLUELESS-MESH topology with SNC + * turned on. + */ + d = (d / 10) * 10; + } + return d; +} + void set_cpu_sibling_map(int cpu) { bool has_smt = __max_threads_per_core > 1; diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 5263746b63e8..3b62226394af 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -59,6 +59,7 @@ static inline int cpu_numa_flags(void) #endif extern int arch_asym_cpu_priority(int cpu); +extern int sched_node_distance(int from, int to); struct sched_domain_attr { int relax_domain_level; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 9a7ac67e3d63..3f485da994a7 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1804,7 +1804,7 @@ bool find_numa_distance(int distance) bool found = false; int i, *distances; - if (distance == node_distance(0, 0)) + if (distance == sched_node_distance(0, 0)) return true; rcu_read_lock(); @@ -1887,6 +1887,15 @@ static void init_numa_topology_type(int offline_node) #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS) +/* + * Architecture could simplify NUMA distance, to avoid + * creating too many NUMA levels when SNC is turned on. + */ +int __weak sched_node_distance(int from, int to) +{ + return node_distance(from, to); +} + void sched_init_numa(int offline_node) { struct sched_domain_topology_level *tl; @@ -1894,6 +1903,7 @@ void sched_init_numa(int offline_node) int nr_levels = 0; int i, j; int *distances; + int max_dist = 0; struct cpumask ***masks; /* @@ -1907,7 +1917,10 @@ void sched_init_numa(int offline_node) bitmap_zero(distance_map, NR_DISTANCE_VALUES); for_each_cpu_node_but(i, offline_node) { for_each_cpu_node_but(j, offline_node) { - int distance = node_distance(i, j); + int distance = sched_node_distance(i, j); + + if (node_distance(i,j) > max_dist) + max_dist = node_distance(i,j); if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) { sched_numa_warn("Invalid distance value range"); @@ -1979,10 +1992,10 @@ void sched_init_numa(int offline_node) masks[i][j] = mask; for_each_cpu_node_but(k, offline_node) { - if (sched_debug() && (node_distance(j, k) != node_distance(k, j))) + if (sched_debug() && (sched_node_distance(j, k) != sched_node_distance(k, j))) sched_numa_warn("Node-distance not symmetric"); - if (node_distance(j, k) > sched_domains_numa_distance[i]) + if (sched_node_distance(j, k) > sched_domains_numa_distance[i]) continue; cpumask_or(mask, mask, cpumask_of_node(k)); @@ -2022,7 +2035,7 @@ void sched_init_numa(int offline_node) sched_domain_topology = tl; sched_domains_numa_levels = nr_levels; - WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]); + WRITE_ONCE(sched_max_numa_distance, max_dist); init_numa_topology_type(offline_node); } @@ -2092,7 +2105,7 @@ void sched_domains_numa_masks_set(unsigned int cpu) continue; /* Set ourselves in the remote node's masks */ - if (node_distance(j, node) <= sched_domains_numa_distance[i]) + if (sched_node_distance(j, node) <= sched_domains_numa_distance[i]) cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]); } } -- 2.32.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode 2025-08-22 20:14 ` [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode Tim Chen @ 2025-08-25 5:08 ` Chen, Yu C 2025-08-25 7:56 ` Peter Zijlstra 2025-08-25 20:05 ` Tim Chen 0 siblings, 2 replies; 14+ messages in thread From: Chen, Yu C @ 2025-08-25 5:08 UTC (permalink / raw) To: Tim Chen, Peter Zijlstra, Ingo Molnar Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes, Chen Yu On 8/23/2025 4:14 AM, Tim Chen wrote: > It is possible for Granite Rapids X (GNR) and Clearwater Forest X > (CWF) to have up to 3 dies per package. When sub-numa cluster (SNC-3) > is enabled, each die will become a separate NUMA node in the package > with different distances between dies within the same package. > > For example, on GNR-X, we see the following numa distances for a 2 socket > system with 3 dies per socket: > > package 1 package2 > ---------------- > | | > --------- --------- > | 0 | | 3 | > --------- --------- > | | > --------- --------- > | 1 | | 4 | > --------- --------- > | | > --------- --------- > | 2 | | 5 | > --------- --------- > | | > ---------------- > > node distances: > node 0 1 2 3 4 5 > 0: 10 15 17 21 28 26 > 1: 15 10 15 23 26 23 > 2: 17 15 10 26 23 21 > 3: 21 28 26 10 15 17 > 4: 23 26 23 15 10 15 > 5: 26 23 21 17 15 10 > > The node distances above led to 2 problems: > > 1. Asymmetric routes taken between nodes in different packages led to > asymmetric scheduler domain perspective depending on which node you > are on. Current scheduler code failed to build domains properly with > asymmetric distances. > > 2. Multiple remote distances to respective tiles on remote package create > too many levels of domain hierarchies grouping different nodes between > remote packages. > > For example, the above GNR-X topology lead to NUMA domains below: > > Sched domains from the perspective of a CPU in node 0, where the number > in bracket represent node number. > > NUMA-level 1 [0,1] [2] > NUMA-level 2 [0,1,2] [3] > NUMA-level 3 [0,1,2,3] [5] > NUMA-level 4 [0,1,2,3,5] [4] > > Sched domains from the perspective of a CPU in node 4 > NUMA-level 1 [4] [3,5] > NUMA-level 2 [3,4,5] [0,2] > NUMA-level 3 [0,2,3,4,5] [1] > > Scheduler group peers for load balancing from the perspective of CPU 0 > and 4 are different. Improper task could be chosen for load balancing > between groups such as [0,2,3,4,5] [1]. Ideally you should choose nodes > in 0 or 2 that are in same package as node 1 first. But instead tasks > in the remote package node 3, 4, 5 could be chosen with an equal chance > and could lead to excessive remote package migrations and imbalance of > load between packages. We should not group partial remote nodes and > local nodes together. > > Simplify the remote distances for CWF-X and GNR-X for the purpose of > sched domains building, which maintains symmetry and leads to a more > reasonable load balance hierarchy. > > The sched domains from the perspective of a CPU in node 0 NUMA-level 1 > is now > NUMA-level 1 [0,1] [2] > NUMA-level 2 [0,1,2] [3,4,5] > > The sched domains from the perspective of a CPU in node 4 NUMA-level 1 > is now > NUMA-level 1 [4] [3,5] > NUMA-level 2 [3,4,5] [0,1,2] > > We have the same balancing perspective from node 0 or node 4. Loads are > now balanced equally between packages. > > Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> > Tested-by: Zhao Liu <zhao1.liu@intel.com> > --- > arch/x86/kernel/smpboot.c | 28 ++++++++++++++++++++++++++++ > include/linux/sched/topology.h | 1 + > kernel/sched/topology.c | 25 +++++++++++++++++++------ > 3 files changed, 48 insertions(+), 6 deletions(-) > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > index 33e166f6ab12..c425e84c88b5 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c > @@ -515,6 +515,34 @@ static void __init build_sched_topology(void) > set_sched_topology(topology); > } > > +int sched_node_distance(int from, int to) > +{ > + int d = node_distance(from, to); > + > + if (!x86_has_numa_in_package) > + return d; > + > + switch (boot_cpu_data.x86_vfm) { > + case INTEL_GRANITERAPIDS_X: > + case INTEL_ATOM_DARKMONT_X: > + if (d < REMOTE_DISTANCE) > + return d; > + > + /* > + * Trim finer distance tuning for nodes in remote package > + * for the purpose of building sched domains. > + * Put NUMA nodes in each remote package in a single sched group. > + * Simplify NUMA domains and avoid extra NUMA levels including different > + * NUMA nodes in remote packages. > + * > + * GNR-x and CWF-X has GLUELESS-MESH topology with SNC > + * turned on. > + */ > + d = (d / 10) * 10; Does the '10' here mean that, the distance of the hierarchy socket is 10 from SLIT table? For example, from a socket0 point of view, the distance of socket1 to socket0 is within [20, 29), the distance of socket2 to socket0 is [30,39), and so on. If this is the case, maybe add a comment above for future reference. > + } > + return d; > +} > + > void set_cpu_sibling_map(int cpu) > { > bool has_smt = __max_threads_per_core > 1; > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > index 5263746b63e8..3b62226394af 100644 > --- a/include/linux/sched/topology.h > +++ b/include/linux/sched/topology.h > @@ -59,6 +59,7 @@ static inline int cpu_numa_flags(void) > #endif > > extern int arch_asym_cpu_priority(int cpu); > +extern int sched_node_distance(int from, int to); > > struct sched_domain_attr { > int relax_domain_level; > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index 9a7ac67e3d63..3f485da994a7 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -1804,7 +1804,7 @@ bool find_numa_distance(int distance) > bool found = false; > int i, *distances; > > - if (distance == node_distance(0, 0)) > + if (distance == sched_node_distance(0, 0)) > return true; > If I understand correct, this patch is trying to fix the sched domain issue during load balancing, and NUMA balance logic should not be changed because NUMA balancing is not based on sched domain? That is to say, since the find_numa_distance() is only used by NUMA balance, should we keep find_numa_distance() to still use node_distance()? > rcu_read_lock(); > @@ -1887,6 +1887,15 @@ static void init_numa_topology_type(int offline_node) > > #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS) > > +/* > + * Architecture could simplify NUMA distance, to avoid > + * creating too many NUMA levels when SNC is turned on. > + */ > +int __weak sched_node_distance(int from, int to) > +{ > + return node_distance(from, to); > +} > + > void sched_init_numa(int offline_node) > { > struct sched_domain_topology_level *tl; > @@ -1894,6 +1903,7 @@ void sched_init_numa(int offline_node) > int nr_levels = 0; > int i, j; > int *distances; > + int max_dist = 0; > struct cpumask ***masks; > > /* > @@ -1907,7 +1917,10 @@ void sched_init_numa(int offline_node) > bitmap_zero(distance_map, NR_DISTANCE_VALUES); > for_each_cpu_node_but(i, offline_node) { > for_each_cpu_node_but(j, offline_node) { > - int distance = node_distance(i, j); > + int distance = sched_node_distance(i, j); > + > + if (node_distance(i,j) > max_dist) > + max_dist = node_distance(i,j); > > if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) { > sched_numa_warn("Invalid distance value range"); > @@ -1979,10 +1992,10 @@ void sched_init_numa(int offline_node) > masks[i][j] = mask; > > for_each_cpu_node_but(k, offline_node) { > - if (sched_debug() && (node_distance(j, k) != node_distance(k, j))) > + if (sched_debug() && (sched_node_distance(j, k) != sched_node_distance(k, j))) > sched_numa_warn("Node-distance not symmetric"); > > - if (node_distance(j, k) > sched_domains_numa_distance[i]) > + if (sched_node_distance(j, k) > sched_domains_numa_distance[i]) > continue; > > cpumask_or(mask, mask, cpumask_of_node(k)); > @@ -2022,7 +2035,7 @@ void sched_init_numa(int offline_node) > sched_domain_topology = tl; > > sched_domains_numa_levels = nr_levels; > - WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]); > + WRITE_ONCE(sched_max_numa_distance, max_dist); Above change is to use the original node_distance() rather than sched_node_distance() for sched_max_numa_distance, and sched_max_numa_distance is only used by NUMA balance to figure out the NUMA topology type as well as scaling the NUMA fault statistics for remote Nodes. So I think we might want to keep it align by using node_distance() in find_numa_distance(). thanks, Chenyu > > init_numa_topology_type(offline_node); > } > @@ -2092,7 +2105,7 @@ void sched_domains_numa_masks_set(unsigned int cpu) > continue; > > /* Set ourselves in the remote node's masks */ > - if (node_distance(j, node) <= sched_domains_numa_distance[i]) > + if (sched_node_distance(j, node) <= sched_domains_numa_distance[i]) > cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]); > } > } ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode 2025-08-25 5:08 ` Chen, Yu C @ 2025-08-25 7:56 ` Peter Zijlstra 2025-08-25 21:36 ` Tim Chen 2025-08-25 20:05 ` Tim Chen 1 sibling, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2025-08-25 7:56 UTC (permalink / raw) To: Chen, Yu C Cc: Tim Chen, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes, Chen Yu On Mon, Aug 25, 2025 at 01:08:39PM +0800, Chen, Yu C wrote: > On 8/23/2025 4:14 AM, Tim Chen wrote: > > It is possible for Granite Rapids X (GNR) and Clearwater Forest X > > (CWF) to have up to 3 dies per package. When sub-numa cluster (SNC-3) > > is enabled, each die will become a separate NUMA node in the package > > with different distances between dies within the same package. > > > > For example, on GNR-X, we see the following numa distances for a 2 socket > > system with 3 dies per socket: > > > > package 1 package2 > > ---------------- > > | | > > --------- --------- > > | 0 | | 3 | > > --------- --------- > > | | > > --------- --------- > > | 1 | | 4 | > > --------- --------- > > | | > > --------- --------- > > | 2 | | 5 | > > --------- --------- > > | | > > ---------------- > > > > node distances: > > node 0 1 2 3 4 5 > > 0: 10 15 17 21 28 26 > > 1: 15 10 15 23 26 23 > > 2: 17 15 10 26 23 21 > > 3: 21 28 26 10 15 17 > > 4: 23 26 23 15 10 15 > > 5: 26 23 21 17 15 10 > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > index 33e166f6ab12..c425e84c88b5 100644 > > --- a/arch/x86/kernel/smpboot.c > > +++ b/arch/x86/kernel/smpboot.c > > @@ -515,6 +515,34 @@ static void __init build_sched_topology(void) > > set_sched_topology(topology); > > } > > +int sched_node_distance(int from, int to) > > +{ > > + int d = node_distance(from, to); > > + > > + if (!x86_has_numa_in_package) > > + return d; > > + > > + switch (boot_cpu_data.x86_vfm) { > > + case INTEL_GRANITERAPIDS_X: > > + case INTEL_ATOM_DARKMONT_X: > > + if (d < REMOTE_DISTANCE) > > + return d; > > + > > + /* > > + * Trim finer distance tuning for nodes in remote package > > + * for the purpose of building sched domains. > > + * Put NUMA nodes in each remote package in a single sched group. > > + * Simplify NUMA domains and avoid extra NUMA levels including different > > + * NUMA nodes in remote packages. > > + * > > + * GNR-x and CWF-X has GLUELESS-MESH topology with SNC > > + * turned on. > > + */ > > + d = (d / 10) * 10; > > Does the '10' here mean that, the distance of the hierarchy socket > is 10 from SLIT table? For example, from a socket0 point of view, > the distance of socket1 to socket0 is within [20, 29), the distance > of socket2 to socket0 is [30,39), and so on. If this is the case, > maybe add a comment above for future reference. This is all because of the ACPI SLIT distance definitions I suppose, 10 for local and 20 for remote (which IMO is actively wrong, since it mandates distances that are not relative performance). Additionally, the table above magically has all the remote distances in the range of [20,29] and so the strip 1s thing works. The problem of course is that the SLIT table is fully under control of the BIOS and random BIOS monkey could cause this to not be so making the above code not work as intended. Eg. if the remote distances ends up being in the range of [20,35] or whatever, then it all goes sideways. ( There is a history of manupulating the SLIT table to influence scheduler behaviour of OS of choice :-/ ) Similarly, when doing a 4 node system, it is possible a 2 hop distances doesn't align nicely with the 10s and we're up a creek again. This is all very fragile. A much better way would be to allocate a new SLIT table, identify the (local) clusters and replace all remote instances with an average. Eg. since (21+28+26+23+26+23+26+23+21)/9 ~ 24, you end up with: node 0 1 2 3 4 5 0: 10 15 17 24 24 24 1: 15 10 15 24 24 24 2: 17 15 10 24 24 24 3: 24 24 24 10 15 17 4: 24 24 24 15 10 15 5: 24 24 24 17 15 10 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode 2025-08-25 7:56 ` Peter Zijlstra @ 2025-08-25 21:36 ` Tim Chen 0 siblings, 0 replies; 14+ messages in thread From: Tim Chen @ 2025-08-25 21:36 UTC (permalink / raw) To: Peter Zijlstra, Chen, Yu C Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes, Chen Yu, Arjan van de Ven On Mon, 2025-08-25 at 09:56 +0200, Peter Zijlstra wrote: > > ... snip ... > > > > > > + /* > > > > > > + * Trim finer distance tuning for nodes in remote package > > > > > > + * for the purpose of building sched domains. > > > > > > + * Put NUMA nodes in each remote package in a single sched group. > > > > > > + * Simplify NUMA domains and avoid extra NUMA levels including different > > > > > > + * NUMA nodes in remote packages. > > > > > > + * > > > > > > + * GNR-x and CWF-X has GLUELESS-MESH topology with SNC > > > > > > + * turned on. > > > > > > + */ > > > > > > + d = (d / 10) * 10; > > > > > > > > Does the '10' here mean that, the distance of the hierarchy socket > > > > is 10 from SLIT table? For example, from a socket0 point of view, > > > > the distance of socket1 to socket0 is within [20, 29), the distance > > > > of socket2 to socket0 is [30,39), and so on. If this is the case, > > > > maybe add a comment above for future reference. > > > > This is all because of the ACPI SLIT distance definitions I suppose, 10 > > for local and 20 for remote (which IMO is actively wrong, since it > > mandates distances that are not relative performance). > > > > Additionally, the table above magically has all the remote distances in > > the range of [20,29] and so the strip 1s thing works. > > > > The problem of course is that the SLIT table is fully under control of > > the BIOS and random BIOS monkey could cause this to not be so making the > > above code not work as intended. Eg. if the remote distances ends up > > being in the range of [20,35] or whatever, then it all goes sideways. > > > > ( There is a history of manupulating the SLIT table to influence > > scheduler behaviour of OS of choice :-/ ) > > > > Similarly, when doing a 4 node system, it is possible a 2 hop distances > > doesn't align nicely with the 10s and we're up a creek again. We don't expect 4 node systems for GNR nor CWF. So hopefully we don't need to worry about them. Otherwise we may need additional code to check for 2 hops. > > > > This is all very fragile. A much better way would be to allocate a new > > SLIT table, identify the (local) clusters and replace all remote > > instances with an average. Are you suggesting to have one SLIT distance table that's simplified for scheduler domain build and another for true node distance? > > > > Eg. since (21+28+26+23+26+23+26+23+21)/9 ~ 24, you end up with: > > > > node 0 1 2 3 4 5 > > 0: 10 15 17 24 24 24 > > 1: 15 10 15 24 24 24 > > 2: 17 15 10 24 24 24 > > 3: 24 24 24 10 15 17 > > 4: 24 24 24 15 10 15 > > 5: 24 24 24 17 15 10 > > > > Will take a closer look to use average for nodes in remote package. Tim ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode 2025-08-25 5:08 ` Chen, Yu C 2025-08-25 7:56 ` Peter Zijlstra @ 2025-08-25 20:05 ` Tim Chen 1 sibling, 0 replies; 14+ messages in thread From: Tim Chen @ 2025-08-25 20:05 UTC (permalink / raw) To: Chen, Yu C, Peter Zijlstra, Ingo Molnar Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, K Prateek Nayak, Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes, Chen Yu On Mon, 2025-08-25 at 13:08 +0800, Chen, Yu C wrote: > On 8/23/2025 4:14 AM, Tim Chen wrote: > > ... snip... > > > > Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> > > Tested-by: Zhao Liu <zhao1.liu@intel.com> > > --- > > arch/x86/kernel/smpboot.c | 28 ++++++++++++++++++++++++++++ > > include/linux/sched/topology.h | 1 + > > kernel/sched/topology.c | 25 +++++++++++++++++++------ > > 3 files changed, 48 insertions(+), 6 deletions(-) > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > index 33e166f6ab12..c425e84c88b5 100644 > > --- a/arch/x86/kernel/smpboot.c > > +++ b/arch/x86/kernel/smpboot.c > > @@ -515,6 +515,34 @@ static void __init build_sched_topology(void) > > set_sched_topology(topology); > > } > > > > +int sched_node_distance(int from, int to) > > +{ > > + int d = node_distance(from, to); > > + > > + if (!x86_has_numa_in_package) > > + return d; > > + > > + switch (boot_cpu_data.x86_vfm) { > > + case INTEL_GRANITERAPIDS_X: > > + case INTEL_ATOM_DARKMONT_X: > > + if (d < REMOTE_DISTANCE) > > + return d; > > + > > + /* > > + * Trim finer distance tuning for nodes in remote package > > + * for the purpose of building sched domains. > > + * Put NUMA nodes in each remote package in a single sched group. > > + * Simplify NUMA domains and avoid extra NUMA levels including different > > + * NUMA nodes in remote packages. > > + * > > + * GNR-x and CWF-X has GLUELESS-MESH topology with SNC > > + * turned on. > > + */ > > + d = (d / 10) * 10; > > Does the '10' here mean that, the distance of the hierarchy socket > is 10 from SLIT table? > Yes. > For example, from a socket0 point of view, > the distance of socket1 to socket0 is within [20, 29), the distance > of socket2 to socket0 is [30,39), and so on. If this is the case, > maybe add a comment above for future reference. > We don't expect to have more than 2 sockets for GNR and CWF. So the case of 2 hops like [30,39) should not happen. > > + } > > + return d; > > +} > > + > > void set_cpu_sibling_map(int cpu) > > { > > bool has_smt = __max_threads_per_core > 1; > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > > index 5263746b63e8..3b62226394af 100644 > > --- a/include/linux/sched/topology.h > > +++ b/include/linux/sched/topology.h > > @@ -59,6 +59,7 @@ static inline int cpu_numa_flags(void) > > #endif > > > > extern int arch_asym_cpu_priority(int cpu); > > +extern int sched_node_distance(int from, int to); > > > > struct sched_domain_attr { > > int relax_domain_level; > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index 9a7ac67e3d63..3f485da994a7 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -1804,7 +1804,7 @@ bool find_numa_distance(int distance) > > bool found = false; > > int i, *distances; > > > > - if (distance == node_distance(0, 0)) > > + if (distance == sched_node_distance(0, 0)) > > return true; > > > > If I understand correct, this patch is trying to fix the sched > domain issue during load balancing, and NUMA balance logic > should not be changed because NUMA balancing is not based on > sched domain? > > That is to say, since the find_numa_distance() is only used by > NUMA balance, should we keep find_numa_distance() to still use > node_distance()? The procedure here is using the distance matrix that's initialized using sched_node_distance(). Hence the change. Otherwise we could keep a separate sched_distance matrix and uses only node_distance here. Did not do that to minimize the change. Tim > > > rcu_read_lock(); > > @@ -1887,6 +1887,15 @@ static void init_numa_topology_type(int offline_node) > > > > #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS) > > > > +/* > > + * Architecture could simplify NUMA distance, to avoid > > + * creating too many NUMA levels when SNC is turned on. > > + */ > > +int __weak sched_node_distance(int from, int to) > > +{ > > + return node_distance(from, to); > > +} > > + > > void sched_init_numa(int offline_node) > > { > > struct sched_domain_topology_level *tl; > > @@ -1894,6 +1903,7 @@ void sched_init_numa(int offline_node) > > int nr_levels = 0; > > int i, j; > > int *distances; > > + int max_dist = 0; > > struct cpumask ***masks; > > > > /* > > @@ -1907,7 +1917,10 @@ void sched_init_numa(int offline_node) > > bitmap_zero(distance_map, NR_DISTANCE_VALUES); > > for_each_cpu_node_but(i, offline_node) { > > for_each_cpu_node_but(j, offline_node) { > > - int distance = node_distance(i, j); > > + int distance = sched_node_distance(i, j); > > + > > + if (node_distance(i,j) > max_dist) > > + max_dist = node_distance(i,j); > > > > if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) { > > sched_numa_warn("Invalid distance value range"); > > @@ -1979,10 +1992,10 @@ void sched_init_numa(int offline_node) > > masks[i][j] = mask; > > > > for_each_cpu_node_but(k, offline_node) { > > - if (sched_debug() && (node_distance(j, k) != node_distance(k, j))) > > + if (sched_debug() && (sched_node_distance(j, k) != sched_node_distance(k, j))) > > sched_numa_warn("Node-distance not symmetric"); > > > > - if (node_distance(j, k) > sched_domains_numa_distance[i]) > > + if (sched_node_distance(j, k) > sched_domains_numa_distance[i]) > > continue; > > > > cpumask_or(mask, mask, cpumask_of_node(k)); > > @@ -2022,7 +2035,7 @@ void sched_init_numa(int offline_node) > > sched_domain_topology = tl; > > > > sched_domains_numa_levels = nr_levels; > > - WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]); > > + WRITE_ONCE(sched_max_numa_distance, max_dist); > > Above change is to use the original node_distance() rather than > sched_node_distance() for sched_max_numa_distance, and > sched_max_numa_distance is only used by NUMA balance to figure out > the NUMA topology type as well as scaling the NUMA fault statistics > for remote Nodes. > > So I think we might want to keep it align by using node_distance() > in find_numa_distance(). > > thanks, > Chenyu > > > > init_numa_topology_type(offline_node); > > } > > @@ -2092,7 +2105,7 @@ void sched_domains_numa_masks_set(unsigned int cpu) > > continue; > > > > /* Set ourselves in the remote node's masks */ > > - if (node_distance(j, node) <= sched_domains_numa_distance[i]) > > + if (sched_node_distance(j, node) <= sched_domains_numa_distance[i]) > > cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]); > > } > > } ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X 2025-08-22 20:14 [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X Tim Chen 2025-08-22 20:14 ` [PATCH 1/2] sched: topology: Fix topology validation error Tim Chen 2025-08-22 20:14 ` [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode Tim Chen @ 2025-08-25 4:18 ` K Prateek Nayak 2025-08-25 21:38 ` Tim Chen 2 siblings, 1 reply; 14+ messages in thread From: K Prateek Nayak @ 2025-08-25 4:18 UTC (permalink / raw) To: Tim Chen, Peter Zijlstra, Ingo Molnar Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes Hello Tim, On 8/23/2025 1:44 AM, Tim Chen wrote: > While testing Granite Rapids X (GNR-X) and Clearwater Forest X (CWF-X) in > SNc-3 mode, we encountered sched domain build errors reported in dmesg. > Asymmetric node distances from local node to to nodes in remote package > was not expected by the scheduler domain code and also led to excessive > number of sched domain hierachy levels. > > Fix the missing NUMA domain level set in topology_span_sane() check and > also simplify the distance to nodes in remote package to retain distance > symmetry and make the NUMA topology sane for GNR-X and CWF-X. I did some sanity testing on an EPYC platform on NPS2/4 and didn't see any changes to the sched domain layout or the sched_node_distance() being used when constructing them with the series. Feel free to include: Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X 2025-08-25 4:18 ` [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X K Prateek Nayak @ 2025-08-25 21:38 ` Tim Chen 0 siblings, 0 replies; 14+ messages in thread From: Tim Chen @ 2025-08-25 21:38 UTC (permalink / raw) To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar Cc: Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu, Len Brown, linux-kernel, Chen Yu, Gautham R . Shenoy, Zhao Liu, Vinicius Costa Gomes On Mon, 2025-08-25 at 09:48 +0530, K Prateek Nayak wrote: > Hello Tim, > > On 8/23/2025 1:44 AM, Tim Chen wrote: > > While testing Granite Rapids X (GNR-X) and Clearwater Forest X (CWF-X) in > > SNc-3 mode, we encountered sched domain build errors reported in dmesg. > > Asymmetric node distances from local node to to nodes in remote package > > was not expected by the scheduler domain code and also led to excessive > > number of sched domain hierachy levels. > > > > Fix the missing NUMA domain level set in topology_span_sane() check and > > also simplify the distance to nodes in remote package to retain distance > > symmetry and make the NUMA topology sane for GNR-X and CWF-X. > > I did some sanity testing on an EPYC platform on NPS2/4 and didn't > see any changes to the sched domain layout or the sched_node_distance() > being used when constructing them with the series. > > Feel free to include: > > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> > Thanks for testing it. Tim ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2025-08-25 21:38 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-08-22 20:14 [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X Tim Chen 2025-08-22 20:14 ` [PATCH 1/2] sched: topology: Fix topology validation error Tim Chen 2025-08-25 3:18 ` K Prateek Nayak 2025-08-25 7:58 ` Peter Zijlstra 2025-08-25 9:23 ` Peter Zijlstra 2025-08-25 7:25 ` Peter Zijlstra 2025-08-25 21:09 ` Tim Chen 2025-08-22 20:14 ` [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode Tim Chen 2025-08-25 5:08 ` Chen, Yu C 2025-08-25 7:56 ` Peter Zijlstra 2025-08-25 21:36 ` Tim Chen 2025-08-25 20:05 ` Tim Chen 2025-08-25 4:18 ` [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X K Prateek Nayak 2025-08-25 21:38 ` Tim Chen
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.