public inbox for linux-kernel@vger.kernel.org
* [PATCH v2] sched/topology: Check average distances to remote packages
@ 2026-02-05  0:24 Kyle Meyer
  2026-02-23 16:42 ` Kyle Meyer
  2026-02-23 17:03 ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Kyle Meyer @ 2026-02-05  0:24 UTC (permalink / raw)
  To: tim.c.chen, bp, dave.hansen, mingo, peterz, tglx, vinicius.gomes
  Cc: brgerst, hpa, kprateek.nayak, linux-kernel, patryk.wlazlyn,
	rafael.j.wysocki, russ.anderson, x86, yu.c.chen, zhao1.liu,
	kyle.meyer

Granite Rapids (GNR) and Clearwater Forest (CWF) average distances to
remote packages to fix scheduler domains; see [1] for more information.

A warning and backtrace are printed when sub-NUMA clustering (SNC) is
enabled and there are more than 2 packages because the average distances
to remote packages could be different, skewing the single average remote
distance.

This is unnecessary when the average distances to remote packages are
the same.

Support a single average remote distance on systems with more than 2
packages by checking whether the average distances to remote packages
are the same, preventing unnecessary warnings and backtraces.

[1] commit 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode").

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
---

The warning and backtrace were noticed on a 16-socket GNR system with SNC-2 enabled.

v1:
* https://lore.kernel.org/all/aXjvLjTCRe8d3UFD@hpe.com/

v1 -> v2:
* Initialize pkg_total_distance and pkg_nr_remote to NULL, as suggested by Tim.

---
 arch/x86/kernel/smpboot.c | 69 ++++++++++++++++++++++++++++-----------
 1 file changed, 50 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5cd6950ab672..dc8f15bd2e19 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -518,27 +518,69 @@ static int avg_remote_numa_distance(void)
 {
 	int i, j;
 	int distance, nr_remote, total_distance;
+	int max_pkgs = topology_max_packages();
+	int cpu, pkg, pkg_avg_distance;
+	int *pkg_total_distance = NULL, *pkg_nr_remote = NULL;
 
 	if (sched_avg_remote_distance > 0)
 		return sched_avg_remote_distance;
 
+	sched_avg_remote_distance = REMOTE_DISTANCE;
+
 	nr_remote = 0;
 	total_distance = 0;
+
+	pkg_total_distance = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
+	if (!pkg_total_distance)
+		goto cleanup;
+
+	pkg_nr_remote = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
+	if (!pkg_nr_remote)
+		goto cleanup;
+
 	for_each_node_state(i, N_CPU) {
 		for_each_node_state(j, N_CPU) {
 			distance = node_distance(i, j);
 
-			if (distance >= REMOTE_DISTANCE) {
-				nr_remote++;
-				total_distance += distance;
-			}
+			if (distance < REMOTE_DISTANCE)
+				continue;
+
+			nr_remote++;
+			total_distance += distance;
+
+			cpu = cpumask_first(cpumask_of_node(j));
+			if (cpu >= nr_cpu_ids)
+				continue;
+
+			pkg = topology_physical_package_id(cpu);
+			pkg_total_distance[pkg] += distance;
+			pkg_nr_remote[pkg]++;
 		}
 	}
-	if (nr_remote)
-		sched_avg_remote_distance = total_distance / nr_remote;
-	else
-		sched_avg_remote_distance = REMOTE_DISTANCE;
 
+	if (!nr_remote)
+		goto cleanup;
+
+	sched_avg_remote_distance = total_distance / nr_remote;
+
+	/*
+	 * Single average remote distance won't be appropriate if different
+	 * packages have different distances to remote packages.
+	 */
+	for (i = 0; i < max_pkgs; i++) {
+		if (!pkg_nr_remote[i])
+			continue;
+
+		pkg_avg_distance = pkg_total_distance[i] / pkg_nr_remote[i];
+
+		pr_debug("sched: Avg. distance to remote package %d: %d\n", i, pkg_avg_distance);
+
+		if (pkg_avg_distance != sched_avg_remote_distance)
+			WARN_ONCE(1, "sched: Avg. distances to remote packages are different\n");
+	}
+cleanup:
+	kfree(pkg_nr_remote);
+	kfree(pkg_total_distance);
 	return sched_avg_remote_distance;
 }
 
@@ -564,18 +606,7 @@ int arch_sched_node_distance(int from, int to)
 		 * in the remote package in the same sched group.
 		 * Simplify NUMA domains and avoid extra NUMA levels including
 		 * different remote NUMA nodes and local nodes.
-		 *
-		 * GNR and CWF don't expect systems with more than 2 packages
-		 * and more than 2 hops between packages. Single average remote
-		 * distance won't be appropriate if there are more than 2
-		 * packages as average distance to different remote packages
-		 * could be different.
 		 */
-		WARN_ONCE(topology_max_packages() > 2,
-			  "sched: Expect only up to 2 packages for GNR or CWF, "
-			  "but saw %d packages when building sched domains.",
-			  topology_max_packages());
-
 		d = avg_remote_numa_distance();
 	}
 	return d;
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-05  0:24 [PATCH v2] sched/topology: Check average distances to remote packages Kyle Meyer
@ 2026-02-23 16:42 ` Kyle Meyer
  2026-02-23 17:03 ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Kyle Meyer @ 2026-02-23 16:42 UTC (permalink / raw)
  To: tim.c.chen, bp, dave.hansen, mingo, peterz, tglx, vinicius.gomes
  Cc: brgerst, hpa, kprateek.nayak, linux-kernel, patryk.wlazlyn,
	rafael.j.wysocki, russ.anderson, x86, yu.c.chen, zhao1.liu

On Wed, Feb 04, 2026 at 06:24:31PM -0600, Kyle Meyer wrote:
> Granite Rapids (GNR) and Clearwater Forest (CWF) average distances to
> remote packages to fix scheduler domains, see [1] for more information.
> 
> A warning and backtrace are printed when sub-NUMA clustering (SNC) is
> enabled and there are more than 2 packages because the average distances
> to remote packages could be different, skewing the single average remote
> distance.
> 
> This is unnecessary when the average distances to remote packages are
> the same.
> 
> Support single average remote distance on systems with more than 2
> packages, preventing unnecessary warnings and backtraces, by checking if
> average distances to remote packages are the same.
> 
> [1] commit 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode").
> 
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
> ---
> 
> The warning and backtrace were noticed on a 16 socket GNR system with SNC-2 enabled.
> 
> v1:
> * https://lore.kernel.org/all/aXjvLjTCRe8d3UFD@hpe.com/
> 
> v1 -> v2:
> * Initialize pkg_total_distance and pkg_nr_remote to NULL, as suggested by Tim.
> 
> ---
>  arch/x86/kernel/smpboot.c | 69 ++++++++++++++++++++++++++++-----------
>  1 file changed, 50 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 5cd6950ab672..dc8f15bd2e19 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -518,27 +518,69 @@ static int avg_remote_numa_distance(void)
>  {
>  	int i, j;
>  	int distance, nr_remote, total_distance;
> +	int max_pkgs = topology_max_packages();
> +	int cpu, pkg, pkg_avg_distance;
> +	int *pkg_total_distance = NULL, *pkg_nr_remote = NULL;
>  
>  	if (sched_avg_remote_distance > 0)
>  		return sched_avg_remote_distance;
>  
> +	sched_avg_remote_distance = REMOTE_DISTANCE;
> +
>  	nr_remote = 0;
>  	total_distance = 0;
> +
> +	pkg_total_distance = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
> +	if (!pkg_total_distance)
> +		goto cleanup;
> +
> +	pkg_nr_remote = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
> +	if (!pkg_nr_remote)
> +		goto cleanup;
> +
>  	for_each_node_state(i, N_CPU) {
>  		for_each_node_state(j, N_CPU) {
>  			distance = node_distance(i, j);
>  
> -			if (distance >= REMOTE_DISTANCE) {
> -				nr_remote++;
> -				total_distance += distance;
> -			}
> +			if (distance < REMOTE_DISTANCE)
> +				continue;
> +
> +			nr_remote++;
> +			total_distance += distance;
> +
> +			cpu = cpumask_first(cpumask_of_node(j));
> +			if (cpu >= nr_cpu_ids)
> +				continue;
> +
> +			pkg = topology_physical_package_id(cpu);
> +			pkg_total_distance[pkg] += distance;
> +			pkg_nr_remote[pkg]++;
>  		}
>  	}
> -	if (nr_remote)
> -		sched_avg_remote_distance = total_distance / nr_remote;
> -	else
> -		sched_avg_remote_distance = REMOTE_DISTANCE;
>  
> +	if (!nr_remote)
> +		goto cleanup;
> +
> +	sched_avg_remote_distance = total_distance / nr_remote;
> +
> +	/*
> +	 * Single average remote distance won't be appropriate if different
> +	 * packages have different distances to remote packages.
> +	 */
> +	for (i = 0; i < max_pkgs; i++) {
> +		if (!pkg_nr_remote[i])
> +			continue;
> +
> +		pkg_avg_distance = pkg_total_distance[i] / pkg_nr_remote[i];
> +
> +		pr_debug("sched: Avg. distance to remote package %d: %d\n", i, pkg_avg_distance);
> +
> +		if (pkg_avg_distance != sched_avg_remote_distance)
> +			WARN_ONCE(1, "sched: Avg. distances to remote packages are different\n");
> +	}
> +cleanup:
> +	kfree(pkg_nr_remote);
> +	kfree(pkg_total_distance);
>  	return sched_avg_remote_distance;
>  }
>  
> @@ -564,18 +606,7 @@ int arch_sched_node_distance(int from, int to)
>  		 * in the remote package in the same sched group.
>  		 * Simplify NUMA domains and avoid extra NUMA levels including
>  		 * different remote NUMA nodes and local nodes.
> -		 *
> -		 * GNR and CWF don't expect systems with more than 2 packages
> -		 * and more than 2 hops between packages. Single average remote
> -		 * distance won't be appropriate if there are more than 2
> -		 * packages as average distance to different remote packages
> -		 * could be different.
>  		 */
> -		WARN_ONCE(topology_max_packages() > 2,
> -			  "sched: Expect only up to 2 packages for GNR or CWF, "
> -			  "but saw %d packages when building sched domains.",
> -			  topology_max_packages());
> -
>  		d = avg_remote_numa_distance();
>  	}
>  	return d;
> -- 
> 2.52.0
> 

Just a friendly ping.

Thanks,
Kyle Meyer


* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-05  0:24 [PATCH v2] sched/topology: Check average distances to remote packages Kyle Meyer
  2026-02-23 16:42 ` Kyle Meyer
@ 2026-02-23 17:03 ` Peter Zijlstra
  2026-02-25  1:43   ` Kyle Meyer
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2026-02-23 17:03 UTC (permalink / raw)
  To: Kyle Meyer
  Cc: tim.c.chen, bp, dave.hansen, mingo, tglx, vinicius.gomes, brgerst,
	hpa, kprateek.nayak, linux-kernel, patryk.wlazlyn,
	rafael.j.wysocki, russ.anderson, x86, yu.c.chen, zhao1.liu

On Wed, Feb 04, 2026 at 06:24:26PM -0600, Kyle Meyer wrote:
> Granite Rapids (GNR) and Clearwater Forest (CWF) average distances to
> remote packages to fix scheduler domains, see [1] for more information.
> 
> A warning and backtrace are printed when sub-NUMA clustering (SNC) is
> enabled and there are more than 2 packages because the average distances
> to remote packages could be different, skewing the single average remote
> distance.

But earlier Tim said these systems will not have more than 2 packages.
So what's what?

So what do these new systems look like?

> This is unnecessary when the average distances to remote packages are
> the same.
> 
> Support single average remote distance on systems with more than 2
> packages, preventing unnecessary warnings and backtraces, by checking if
> average distances to remote packages are the same.



> ---
>  arch/x86/kernel/smpboot.c | 69 ++++++++++++++++++++++++++++-----------
>  1 file changed, 50 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 5cd6950ab672..dc8f15bd2e19 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -518,27 +518,69 @@ static int avg_remote_numa_distance(void)
>  {
>  	int i, j;
>  	int distance, nr_remote, total_distance;
> +	int max_pkgs = topology_max_packages();
> +	int cpu, pkg, pkg_avg_distance;
> +	int *pkg_total_distance = NULL, *pkg_nr_remote = NULL;

Can you make that the normal reverse xmas thing?

>  	if (sched_avg_remote_distance > 0)
>  		return sched_avg_remote_distance;
>  
> +	sched_avg_remote_distance = REMOTE_DISTANCE;
> +
>  	nr_remote = 0;
>  	total_distance = 0;
> +
> +	pkg_total_distance = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
> +	if (!pkg_total_distance)
> +		goto cleanup;
> +
> +	pkg_nr_remote = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
> +	if (!pkg_nr_remote)
> +		goto cleanup;
> +
>  	for_each_node_state(i, N_CPU) {
>  		for_each_node_state(j, N_CPU) {
>  			distance = node_distance(i, j);
>  
> -			if (distance >= REMOTE_DISTANCE) {
> -				nr_remote++;
> -				total_distance += distance;
> -			}
> +			if (distance < REMOTE_DISTANCE)
> +				continue;
> +
> +			nr_remote++;
> +			total_distance += distance;
> +
> +			cpu = cpumask_first(cpumask_of_node(j));
> +			if (cpu >= nr_cpu_ids)
> +				continue;
> +
> +			pkg = topology_physical_package_id(cpu);
> +			pkg_total_distance[pkg] += distance;
> +			pkg_nr_remote[pkg]++;

This is broken, physical_package_id is not guaranteed to be dense.

>  		}
>  	}
> -	if (nr_remote)
> -		sched_avg_remote_distance = total_distance / nr_remote;
> -	else
> -		sched_avg_remote_distance = REMOTE_DISTANCE;
>  
> +	if (!nr_remote)
> +		goto cleanup;
> +
> +	sched_avg_remote_distance = total_distance / nr_remote;
> +
> +	/*
> +	 * Single average remote distance won't be appropriate if different
> +	 * packages have different distances to remote packages.
> +	 */
> +	for (i = 0; i < max_pkgs; i++) {
> +		if (!pkg_nr_remote[i])
> +			continue;
> +
> +		pkg_avg_distance = pkg_total_distance[i] / pkg_nr_remote[i];
> +
> +		pr_debug("sched: Avg. distance to remote package %d: %d\n", i, pkg_avg_distance);
> +
> +		if (pkg_avg_distance != sched_avg_remote_distance)
> +			WARN_ONCE(1, "sched: Avg. distances to remote packages are different\n");
> +	}

This is pretty yuck.

Also, what's with the pr_debug() stuff?

Anyway, that function was fairly magical, and now it is nearly
impenetrable. If we want this, it needs comments. Definitely more
comments, with nice pictures on.

> +cleanup:
> +	kfree(pkg_nr_remote);
> +	kfree(pkg_total_distance);
>  	return sched_avg_remote_distance;
>  }
>  
> @@ -564,18 +606,7 @@ int arch_sched_node_distance(int from, int to)
>  		 * in the remote package in the same sched group.
>  		 * Simplify NUMA domains and avoid extra NUMA levels including
>  		 * different remote NUMA nodes and local nodes.
> -		 *
> -		 * GNR and CWF don't expect systems with more than 2 packages
> -		 * and more than 2 hops between packages. Single average remote
> -		 * distance won't be appropriate if there are more than 2
> -		 * packages as average distance to different remote packages
> -		 * could be different.
>  		 */
> -		WARN_ONCE(topology_max_packages() > 2,
> -			  "sched: Expect only up to 2 packages for GNR or CWF, "
> -			  "but saw %d packages when building sched domains.",
> -			  topology_max_packages());
> -
>  		d = avg_remote_numa_distance();
>  	}
>  	return d;
> -- 
> 2.52.0
> 


* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-23 17:03 ` Peter Zijlstra
@ 2026-02-25  1:43   ` Kyle Meyer
  2026-02-25  9:05     ` Chen, Yu C
  2026-02-25 12:30     ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Kyle Meyer @ 2026-02-25  1:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tim.c.chen, bp, dave.hansen, mingo, tglx, vinicius.gomes, brgerst,
	hpa, kprateek.nayak, linux-kernel, patryk.wlazlyn,
	rafael.j.wysocki, russ.anderson, x86, yu.c.chen, zhao1.liu

On Mon, Feb 23, 2026 at 06:03:14PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 04, 2026 at 06:24:26PM -0600, Kyle Meyer wrote:
> > Granite Rapids (GNR) and Clearwater Forest (CWF) average distances to
> > remote packages to fix scheduler domains, see [1] for more information.
> > 
> > A warning and backtrace are printed when sub-NUMA clustering (SNC) is
> > enabled and there are more than 2 packages because the average distances
> > to remote packages could be different, skewing the single average remote
> > distance.
> 
> But earlier Tim said these systems will not have more than 2 packages.
> So what's what?

We have Intel customer reference boards with 2, 4, and 8 sockets.
 
> So what do these new systems look like?

Here's an 8 socket (2 chassis) HPE system with SNC enabled:

node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
  1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
  2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
  3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
  4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
  5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
  6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
  7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
  8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
  9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
 10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
 11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
 12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
 13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
 14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
 15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10

10 = Same chassis and socket
12 = Same chassis and socket (SNC)
16 = Same chassis and adjacent socket
18 = Same chassis and non-adjacent socket
40 = Different chassis

Each processor connects to an ASIC (XNC) that acts as a multiplexer, extending
the UPI interconnect across the entire system.

We don't experience the scheduler domain issue reported by Tim because our SLIT
provides symmetric distances to remote NUMA nodes, but we trigger the WARN_ONCE
because we exceed 2 packages.

> > This is unnecessary when the average distances to remote packages are
> > the same.
> > 
> > Support single average remote distance on systems with more than 2
> > packages, preventing unnecessary warnings and backtraces, by checking if
> > average distances to remote packages are the same.
> 
> 
> 
> > ---
> >  arch/x86/kernel/smpboot.c | 69 ++++++++++++++++++++++++++++-----------
> >  1 file changed, 50 insertions(+), 19 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 5cd6950ab672..dc8f15bd2e19 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -518,27 +518,69 @@ static int avg_remote_numa_distance(void)
> >  {
> >  	int i, j;
> >  	int distance, nr_remote, total_distance;
> > +	int max_pkgs = topology_max_packages();
> > +	int cpu, pkg, pkg_avg_distance;
> > +	int *pkg_total_distance = NULL, *pkg_nr_remote = NULL;
> 
> Can you make that the normal reverse xmas thing?

Yes.
 
> >  	if (sched_avg_remote_distance > 0)
> >  		return sched_avg_remote_distance;
> >  
> > +	sched_avg_remote_distance = REMOTE_DISTANCE;
> > +
> >  	nr_remote = 0;
> >  	total_distance = 0;
> > +
> > +	pkg_total_distance = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
> > +	if (!pkg_total_distance)
> > +		goto cleanup;
> > +
> > +	pkg_nr_remote = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
> > +	if (!pkg_nr_remote)
> > +		goto cleanup;
> > +
> >  	for_each_node_state(i, N_CPU) {
> >  		for_each_node_state(j, N_CPU) {
> >  			distance = node_distance(i, j);
> >  
> > -			if (distance >= REMOTE_DISTANCE) {
> > -				nr_remote++;
> > -				total_distance += distance;
> > -			}
> > +			if (distance < REMOTE_DISTANCE)
> > +				continue;
> > +
> > +			nr_remote++;
> > +			total_distance += distance;
> > +
> > +			cpu = cpumask_first(cpumask_of_node(j));
> > +			if (cpu >= nr_cpu_ids)
> > +				continue;
> > +
> > +			pkg = topology_physical_package_id(cpu);
> > +			pkg_total_distance[pkg] += distance;
> > +			pkg_nr_remote[pkg]++;
> 
> This is broken, physical_package_id is not guaranteed to be dense.

Thank you, I'll fix this.

> >  		}
> >  	}
> > -	if (nr_remote)
> > -		sched_avg_remote_distance = total_distance / nr_remote;
> > -	else
> > -		sched_avg_remote_distance = REMOTE_DISTANCE;
> >  
> > +	if (!nr_remote)
> > +		goto cleanup;
> > +
> > +	sched_avg_remote_distance = total_distance / nr_remote;
> > +
> > +	/*
> > +	 * Single average remote distance won't be appropriate if different
> > +	 * packages have different distances to remote packages.
> > +	 */
> > +	for (i = 0; i < max_pkgs; i++) {
> > +		if (!pkg_nr_remote[i])
> > +			continue;
> > +
> > +		pkg_avg_distance = pkg_total_distance[i] / pkg_nr_remote[i];
> > +
> > +		pr_debug("sched: Avg. distance to remote package %d: %d\n", i, pkg_avg_distance);
> > +
> > +		if (pkg_avg_distance != sched_avg_remote_distance)
> > +			WARN_ONCE(1, "sched: Avg. distances to remote packages are different\n");
> > +	}
> 
> This is pretty yuck.
> 
> Also, what's with the pr_debug() stuff?
> 
> Anyway, that function was fairly magical, and now it is nearly
> impenetrable. If we want this, it needs comments. Definitely more
> comments, with nice pictures on.

OK, thank you for the feedback, I'll work on a v3.

Thanks,
Kyle Meyer


* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25  1:43   ` Kyle Meyer
@ 2026-02-25  9:05     ` Chen, Yu C
  2026-02-25 12:30     ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Chen, Yu C @ 2026-02-25  9:05 UTC (permalink / raw)
  To: Kyle Meyer
  Cc: tim.c.chen, bp, dave.hansen, mingo, tglx, vinicius.gomes, brgerst,
	hpa, kprateek.nayak, linux-kernel, patryk.wlazlyn,
	rafael.j.wysocki, russ.anderson, x86, zhao1.liu, Peter Zijlstra,
	Aubrey Li

Hi Kyle,

On 2/25/2026 9:43 AM, Kyle Meyer wrote:
> On Mon, Feb 23, 2026 at 06:03:14PM +0100, Peter Zijlstra wrote:
>> On Wed, Feb 04, 2026 at 06:24:26PM -0600, Kyle Meyer wrote:
>>> Granite Rapids (GNR) and Clearwater Forest (CWF) average distances to
>>> remote packages to fix scheduler domains, see [1] for more information.
>>>
>>> A warning and backtrace are printed when sub-NUMA clustering (SNC) is
>>> enabled and there are more than 2 packages because the average distances
>>> to remote packages could be different, skewing the single average remote
>>> distance.
>>
>> But earlier Tim said these systems will not have more than 2 packages.
>> So what's what?
> 
> We have Intel customer reference boards with 2, 4, and 8 sockets.
>   

Thanks for the info. We were not previously aware that the stock GNR
platform would scale up to 8 sockets.

>> So what do these new systems look like?
> 
> Here's an 8 socket (2 chassis) HPE system with SNC enabled:
> 
> node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>    0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
>    1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
>    2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
>    3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
>    4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
>    5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
>    6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
>    7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
>    8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
>    9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
>   10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
>   11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
>   12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
>   13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
>   14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
>   15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10
> 
> 10 = Same chassis and socket
> 12 = Same chassis and socket (SNC)
> 16 = Same chassis and adjacent socket
> 18 = Same chassis and non-adjacent socket

Previously, I thought that REMOTE_DISTANCE represents the
distance between two nodes on different sockets, but that does
not appear to be the case here. I could not find any definition
of “20 or double” in the SLIT section of the ACPI specification.
Thus, I assume this value of 20 is an artificial threshold. In my
view, checking whether all distances above 20 are identical really
depends on the specific platform. For example, with a distance
value such as 22 (same chassis, non-adjacent socket), applying the
current patch would trigger a warning regardless.

That said, since 20 is an artificial threshold, I have a tentative
idea: we could normalize the SLIT distances by sorting the slit_dist
values, finding the 75th percentile value, keeping all slit_dist values
below the 75th percentile unchanged, and treating all slit_dist values
above the 75th percentile as remote, assigning them the average remote
distance. This way, we could eliminate the arbitrary value of 20. But
that might require a rewrite, so for now it is OK to keep 20.

> 40 = Different chassis
> 
> Each processor connects to an ASIC (XNC) that acts as a multiplexer, extending
> the UPI interconnect across the entire system.
> 
> We don't experience the scheduler domain issue reported by Tim because our SLIT
> provides symmetric distances to remote NUMA nodes, but we trigger the WARN_ONCE
> because we exceed 2 packages.
> 
>>> This is unnecessary when the average distances to remote packages are
>>> the same.
>>>
>>> Support single average remote distance on systems with more than 2
>>> packages, preventing unnecessary warnings and backtraces, by checking if
>>> average distances to remote packages are the same.
>>
>>
>>

[ ... ]

>>> +			pkg = topology_physical_package_id(cpu);
>>> +			pkg_total_distance[pkg] += distance;
>>> +			pkg_nr_remote[pkg]++;
>>
>> This is broken, physical_package_id is not guaranteed to be dense.
> 
> Thank you, I'll fix this.
> 

It seems that topology_logical_package_id() can work here.

thanks,
Chenyu


* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25  1:43   ` Kyle Meyer
  2026-02-25  9:05     ` Chen, Yu C
@ 2026-02-25 12:30     ` Peter Zijlstra
  2026-02-25 13:36       ` Peter Zijlstra
                         ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Peter Zijlstra @ 2026-02-25 12:30 UTC (permalink / raw)
  To: Kyle Meyer
  Cc: tim.c.chen, bp, dave.hansen, mingo, tglx, vinicius.gomes, brgerst,
	hpa, kprateek.nayak, linux-kernel, patryk.wlazlyn,
	rafael.j.wysocki, russ.anderson, x86, yu.c.chen, zhao1.liu

On Tue, Feb 24, 2026 at 07:43:10PM -0600, Kyle Meyer wrote:

> Here's an 8 socket (2 chassis) HPE system with SNC enabled:
> 
> node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>   0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
>   1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
>   2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
>   3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
>   4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
>   5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
>   6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
>   7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
>   8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
>   9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
>  10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
>  11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
>  12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
>  13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
>  14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
>  15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10
> 
> 10 = Same chassis and socket
> 12 = Same chassis and socket (SNC)
> 16 = Same chassis and adjacent socket
> 18 = Same chassis and non-adjacent socket
> 40 = Different chassis
> 
> Each processor connects to an ASIC (XNC) that acts as a multiplexer, extending
> the UPI interconnect across the entire system.
> 
> We don't experience the scheduler domain issue reported by Tim because our SLIT
> provides symmetric distances to remote NUMA nodes, but we trigger the WARN_ONCE
> because we exceed 2 packages.

The original case was for SNC-3; the above looks to be SNC-2. Does your
system also support SNC-3?

Anyway, yes your SLIT table looks sane (unlike that SNC-3 monster Tim
showed earlier).

And it also shows that using REMOTE_DISTANCE (20) was completely random
and 'wrong'.

So per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")

Tim's original crazy SNC-3 SLIT table was:

node distances:
node     0    1    2    3    4    5
    0:   10   15   17   21   28   26
    1:   15   10   15   23   26   23
    2:   17   15   10   26   23   21
    3:   21   28   26   10   15   17
    4:   23   26   23   15   10   15
    5:   26   23   21   17   15   10

And per:

  https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/

My suggestion was to average the off-trace clusters to restore sanity.

So how about we go about implementing that without reference to magical
numbers, something like so. This obviously needs a little TLC, but it
might just work.

Hmm?

---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5cd6950ab672..cba3e4b14250 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -513,33 +513,55 @@ static void __init build_sched_topology(void)
 }
 
 #ifdef CONFIG_NUMA
-static int sched_avg_remote_distance;
-static int avg_remote_numa_distance(void)
+
+/*
+ * Find the largest symmetric cluster in an attempt to identify the unit size.
+ *
+ * XXX doesn't respect N_CPU node classes and such.
+ */
+static int slit_cluster_size(void)
 {
-	int i, j;
-	int distance, nr_remote, total_distance;
+	int i, j, n, m = num_possible_nodes();
 
-	if (sched_avg_remote_distance > 0)
-		return sched_avg_remote_distance;
-
-	nr_remote = 0;
-	total_distance = 0;
-	for_each_node_state(i, N_CPU) {
-		for_each_node_state(j, N_CPU) {
-			distance = node_distance(i, j);
-
-			if (distance >= REMOTE_DISTANCE) {
-				nr_remote++;
-				total_distance += distance;
+	for (n = 2; n < m; n++) {
+		for (i = 0; i < n; i++) {
+			for (j = i; j < n; j++) {
+				if (node_distance(i, j) != node_distance(j, i))
+					return n - 1;
 			}
 		}
 	}
-	if (nr_remote)
-		sched_avg_remote_distance = total_distance / nr_remote;
-	else
-		sched_avg_remote_distance = REMOTE_DISTANCE;
 
-	return sched_avg_remote_distance;
+	return m;
+}
+
+static int slit_cluster_distance(int i, int j)
+{
+	static int u = 0;
+	long d = 0;
+	int x, y;
+
+	if (!u)
+		u = slit_cluster_size();
+
+	/*
+	 * Is this a unit cluster on the trace?
+	 */
+	if ((i / u) == (j / u))
+		return node_distance(i, j);
+
+	/*
+	 * Off-trace cluster, return average of the cluster to force symmetry.
+	 */
+	x = i - (i % u);
+	y = j - (j % u);
+
+	for (i = x; i < x + u; i++) {
+		for (j = y; j < y + u; j++)
+			d += node_distance(i, j);
+	}
+
+	return d / (u*u);
 }
 
 int arch_sched_node_distance(int from, int to)
@@ -550,8 +572,7 @@ int arch_sched_node_distance(int from, int to)
 	case INTEL_GRANITERAPIDS_X:
 	case INTEL_ATOM_DARKMONT_X:
 
-		if (!x86_has_numa_in_package || topology_max_packages() == 1 ||
-		    d < REMOTE_DISTANCE)
+		if (!x86_has_numa_in_package || topology_max_packages() == 1)
 			return d;
 
 		/*
@@ -571,12 +592,7 @@ int arch_sched_node_distance(int from, int to)
 		 * packages as average distance to different remote packages
 		 * could be different.
 		 */
-		WARN_ONCE(topology_max_packages() > 2,
-			  "sched: Expect only up to 2 packages for GNR or CWF, "
-			  "but saw %d packages when building sched domains.",
-			  topology_max_packages());
-
-		d = avg_remote_numa_distance();
+		return slit_cluster_distance(from, to);
 	}
 	return d;
 }

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 12:30     ` Peter Zijlstra
@ 2026-02-25 13:36       ` Peter Zijlstra
  2026-02-25 15:39       ` Chen, Yu C
  2026-02-25 16:41       ` Kyle Meyer
  2 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2026-02-25 13:36 UTC (permalink / raw)
  To: Kyle Meyer
  Cc: tim.c.chen, bp, dave.hansen, mingo, tglx, vinicius.gomes, brgerst,
	hpa, kprateek.nayak, linux-kernel, patryk.wlazlyn,
	rafael.j.wysocki, russ.anderson, x86, yu.c.chen, zhao1.liu

On Wed, Feb 25, 2026 at 01:30:52PM +0100, Peter Zijlstra wrote:
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 5cd6950ab672..cba3e4b14250 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -513,33 +513,55 @@ static void __init build_sched_topology(void)
>  }
>  
>  #ifdef CONFIG_NUMA
> +
> +/*
> + * Find the largest symmetric cluster in an attempt to identify the unit size.
> + *
> + * XXX doesn't respect N_CPU node classes and such.
> + */
> +static int slit_cluster_size(void)
>  {
> +	int i, j, n, m = num_possible_nodes();
>  
> +	for (n = 2; n < m; n++) {
> +		for (i = 0; i < n; i++) {
> +			for (j = i; j < n; j++) {
> +				if (node_distance(i, j) != node_distance(j, i))
> +					return n - 1;
>  			}
>  		}
>  	}
>  
> +	return m;
> +}

If we make x86_has_numa_in_package a counter of how many nodes in the
package, we could use that number, rather than trying to guesstimate it.

Similarly, if the system would enumerate the SNC mode anywhere, that too
could be used.

> +static int slit_cluster_distance(int i, int j)
> +{
> +	static int u = 0;
> +	long d = 0;
> +	int x, y;
> +
> +	if (!u)
> +		u = slit_cluster_size();
> +
> +	/*
> +	 * Is this a unit cluster on the trace?
> +	 */
> +	if ((i / u) == (j / u))
> +		return node_distance(i, j);
> +
> +	/*
> +	 * Off-trace cluster, return average of the cluster to force symmetry.
> +	 */
> +	x = i - (i % u);
> +	y = j - (j % u);
> +
> +	for (i = x; i < x + u; i++) {
> +		for (j = y; j < y + u; j++)
> +			d += node_distance(i, j);
> +	}
> +
> +	return d / (u*u);
>  }


* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 12:30     ` Peter Zijlstra
  2026-02-25 13:36       ` Peter Zijlstra
@ 2026-02-25 15:39       ` Chen, Yu C
  2026-02-25 15:44         ` Peter Zijlstra
  2026-02-25 16:41       ` Kyle Meyer
  2 siblings, 1 reply; 19+ messages in thread
From: Chen, Yu C @ 2026-02-25 15:39 UTC (permalink / raw)
  To: Peter Zijlstra, Kyle Meyer
  Cc: tim.c.chen, bp, dave.hansen, mingo, tglx, vinicius.gomes, brgerst,
	hpa, kprateek.nayak, linux-kernel, patryk.wlazlyn,
	rafael.j.wysocki, russ.anderson, x86, zhao1.liu

On 2/25/2026 8:30 PM, Peter Zijlstra wrote:
> On Tue, Feb 24, 2026 at 07:43:10PM -0600, Kyle Meyer wrote:
> 

[ ... ]

> 
> And it also shows that using REMOTE_DISTANCE (20) was completely random
> and 'wrong'.
> 
> So per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
> 
> Tim's original crazy SNC-3 SLIT table was:
> 
> node distances:
> node     0    1    2    3    4    5
>      0:   10   15   17   21   28   26
>      1:   15   10   15   23   26   23
>      2:   17   15   10   26   23   21
>      3:   21   28   26   10   15   17
>      4:   23   26   23   15   10   15
>      5:   26   23   21   17   15   10
> 
> And per:
> 
>    https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/
> 
> My suggestion was to average the off-trace clusters to restore sanity.
> 

In above example, the node distances become:
"Eg. since (21+28+26+23+26+23+26+23+21)/9 ~ 24, you end up with:

  node     0    1    2    3    4    5
      0:   10   15   17   24   24   24
      1:   15   10   15   24   24   24
      2:   17   15   10   24   24   24
      3:   24   24   24   10   15   17
      4:   24   24   24   15   10   15
      5:   24   24   24   17   15   10
"

> +
> +static int slit_cluster_distance(int i, int j)
> +{
> +	static int u = 0;
> +	long d = 0;
> +	int x, y;
> +
> +	if (!u)
> +		u = slit_cluster_size();
> +
> +	/*
> +	 * Is this a unit cluster on the trace?
> +	 */
> +	if ((i / u) == (j / u))
> +		return node_distance(i, j);

the u is 3 in above example, because slit_cluster_size()
found that node0, node1 and node2 are in the same biggest
symmetric cluster. Not sure if I understand it correctly,
here we will treat node4 and node5 as the same cluster,
but without checking whether node_distance(4, 5) and
node_distance(5,4) are the same. If node_dist(4,5)!=node_dist(5,4),
will we keep it as it is?

thanks,
Chenyu




* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 15:39       ` Chen, Yu C
@ 2026-02-25 15:44         ` Peter Zijlstra
  2026-02-25 16:32           ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2026-02-25 15:44 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Kyle Meyer, tim.c.chen, bp, dave.hansen, mingo, tglx,
	vinicius.gomes, brgerst, hpa, kprateek.nayak, linux-kernel,
	patryk.wlazlyn, rafael.j.wysocki, russ.anderson, x86, zhao1.liu

On Wed, Feb 25, 2026 at 11:39:54PM +0800, Chen, Yu C wrote:

> > +static int slit_cluster_distance(int i, int j)
> > +{
> > +	static int u = 0;
> > +	long d = 0;
> > +	int x, y;
> > +
> > +	if (!u)
> > +		u = slit_cluster_size();
> > +
> > +	/*
> > +	 * Is this a unit cluster on the trace?
> > +	 */
> > +	if ((i / u) == (j / u))
> > +		return node_distance(i, j);
> 
> the u is 3 in above example, because slit_cluster_size()
> found that node0, node1 and node2 are in the same biggest
> symmetric cluster.

> Not sure if I understand it correctly,
> here we will treat node4 and node5 as the same cluster,
> but without checking whether node_distance(4, 5) and
> node_distance(5,4) are the same. If node_dist(4,5)!=node_dist(5,4),
> will we keep it as it is?

Yes, so this assumes that all u sized clusters on the trace are similar
and 'sane' without verification.



* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 15:44         ` Peter Zijlstra
@ 2026-02-25 16:32           ` Peter Zijlstra
  2026-02-25 16:40             ` Peter Zijlstra
  2026-02-25 21:37             ` Tim Chen
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2026-02-25 16:32 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Kyle Meyer, tim.c.chen, bp, dave.hansen, mingo, tglx,
	vinicius.gomes, brgerst, hpa, kprateek.nayak, linux-kernel,
	patryk.wlazlyn, rafael.j.wysocki, russ.anderson, x86, zhao1.liu

On Wed, Feb 25, 2026 at 04:44:09PM +0100, Peter Zijlstra wrote:

> Yes, so this assumes that all u sized clusters on the trace are similar
> and 'sane' without verification.

That gave me an idea; how's this then?

---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5cd6950ab672..b1e464fd98c0 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -513,33 +513,99 @@ static void __init build_sched_topology(void)
 }
 
 #ifdef CONFIG_NUMA
-static int sched_avg_remote_distance;
-static int avg_remote_numa_distance(void)
+
+static bool slit_cluster_symmetric(int i, int j, int n)
 {
-	int i, j;
-	int distance, nr_remote, total_distance;
+	WARN_ON_ONCE((i % n) || (j % n));
 
-	if (sched_avg_remote_distance > 0)
-		return sched_avg_remote_distance;
-
-	nr_remote = 0;
-	total_distance = 0;
-	for_each_node_state(i, N_CPU) {
-		for_each_node_state(j, N_CPU) {
-			distance = node_distance(i, j);
-
-			if (distance >= REMOTE_DISTANCE) {
-				nr_remote++;
-				total_distance += distance;
-			}
+	for (int k = i; k < i + n; k++) {
+		for (int l = k; l < j + n; l++) {
+			if (node_distance(k, l) != node_distance(l, k))
+				return false;
 		}
 	}
-	if (nr_remote)
-		sched_avg_remote_distance = total_distance / nr_remote;
-	else
-		sched_avg_remote_distance = REMOTE_DISTANCE;
 
-	return sched_avg_remote_distance;
+	return true;
+}
+
+static bool slit_cluster_match(int i, int j, int x, int y, int n)
+{
+	WARN_ON_ONCE((i % n) || (j % n) || (x % n) || (y % n));
+
+	for (int k = 0; k < n; k++) {
+		for (int l = k; l < n; l++) {
+			if (node_distance(i + k, j + l) != node_distance(x + k, y + l))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Find the largest symmetric, repeating cluster in an attempt to identify the
+ * unit size.
+ */
+static int slit_cluster_size(void)
+{
+	int nodes = num_possible_nodes();
+
+	/*
+	 * There are at least 2 packages; so half-nodes is the largest
+	 * possible unit, go down from that.
+	 */
+	for (int u = nodes / 2; u; u--) {
+		/*
+		 * If u doesn't divide nodes, it can't be a unit.
+		 */
+		if (nodes % u)
+			continue;
+
+		/*
+		 * Unit must be symmetric,
+		 */
+		if (!slit_cluster_symmetric(0, 0, u))
+			continue;
+
+		/*
+		 * and repeating.
+		 */
+		if (slit_cluster_match(0, 0, u, u, u))
+			return u;
+	}
+
+	return nodes;
+}
+
+static int slit_cluster_distance(int i, int j)
+{
+	static int u = 0;
+	long d = 0;
+	int x, y;
+
+	if (!u)
+		u = slit_cluster_size();
+
+	/*
+	 * Is this a unit cluster on the trace?
+	 */
+	if ((i / u) == (j / u))
+		return node_distance(i, j);
+
+	/*
+	 * Off-trace cluster, return average of the cluster to force symmetry.
+	 */
+	x = i - (i % u);
+	y = j - (j % u);
+
+	for (i = x; i < x + u; i++) {
+		for (j = y; j < y + u; j++) {
+			d += node_distance(i, j);
+			d += node_distance(j, i);
+		}
+	}
+
+	return d / (2*u*u);
 }
 
 int arch_sched_node_distance(int from, int to)
@@ -550,8 +616,7 @@ int arch_sched_node_distance(int from, int to)
 	case INTEL_GRANITERAPIDS_X:
 	case INTEL_ATOM_DARKMONT_X:
 
-		if (!x86_has_numa_in_package || topology_max_packages() == 1 ||
-		    d < REMOTE_DISTANCE)
+		if (!x86_has_numa_in_package || topology_max_packages() == 1)
 			return d;
 
 		/*
@@ -564,19 +629,8 @@ int arch_sched_node_distance(int from, int to)
 		 * in the remote package in the same sched group.
 		 * Simplify NUMA domains and avoid extra NUMA levels including
 		 * different remote NUMA nodes and local nodes.
-		 *
-		 * GNR and CWF don't expect systems with more than 2 packages
-		 * and more than 2 hops between packages. Single average remote
-		 * distance won't be appropriate if there are more than 2
-		 * packages as average distance to different remote packages
-		 * could be different.
 		 */
-		WARN_ONCE(topology_max_packages() > 2,
-			  "sched: Expect only up to 2 packages for GNR or CWF, "
-			  "but saw %d packages when building sched domains.",
-			  topology_max_packages());
-
-		d = avg_remote_numa_distance();
+		return slit_cluster_distance(from, to);
 	}
 	return d;
 }


* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 16:32           ` Peter Zijlstra
@ 2026-02-25 16:40             ` Peter Zijlstra
  2026-02-25 21:37             ` Tim Chen
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2026-02-25 16:40 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Kyle Meyer, tim.c.chen, bp, dave.hansen, mingo, tglx,
	vinicius.gomes, brgerst, hpa, kprateek.nayak, linux-kernel,
	patryk.wlazlyn, rafael.j.wysocki, russ.anderson, x86, zhao1.liu

On Wed, Feb 25, 2026 at 05:32:46PM +0100, Peter Zijlstra wrote:
> +static bool slit_cluster_symmetric(int i, int j, int n)
>  {
> +	WARN_ON_ONCE((i % n) || (j % n));
>  
> +	for (int k = i; k < i + n; k++) {
> +		for (int l = k; l < j + n; l++) {
> +			if (node_distance(k, l) != node_distance(l, k))
> +				return false;
>  		}
>  	}
>  
> +	return true;
> +}
> +
> +static bool slit_cluster_match(int i, int j, int x, int y, int n)
> +{
> +	WARN_ON_ONCE((i % n) || (j % n) || (x % n) || (y % n));
> +
> +	for (int k = 0; k < n; k++) {
> +		for (int l = k; l < n; l++) {
> +			if (node_distance(i + k, j + l) != node_distance(x + k, y + l))
> +				return false;
> +		}
> +	}
> +
> +	return true;
> +}
> +
> +/*
> + * Find the largest symmetric, repeating cluster in an attempt to identify the
> + * unit size.
> + */
> +static int slit_cluster_size(void)
> +{
> +	int nodes = num_possible_nodes();
> +
> +	/*
> +	 * There are at least 2 packages; so half-nodes is the largest
> +	 * possible unit, go down from that.
> +	 */
> +	for (int u = nodes / 2; u; u--) {

nodes / topology_max_packages() might also do. And I worry about
num_possible_nodes(), if that includes CXL or other such nonsense then
we're up a creek.

But as stated before, ideally the hardware can actually just tell us the
right number.

> +		/*
> +		 * If u doesn't divide nodes, it can't be a unit.
> +		 */
> +		if (nodes % u)
> +			continue;
> +
> +		/*
> +		 * Unit must be symmetric,
> +		 */
> +		if (!slit_cluster_symmetric(0, 0, u))
> +			continue;
> +
> +		/*
> +		 * and repeating.
> +		 */
> +		if (slit_cluster_match(0, 0, u, u, u))
> +			return u;
> +	}
> +
> +	return nodes;
> +}
> +
> +static int slit_cluster_distance(int i, int j)
> +{
> +	static int u = 0;
> +	long d = 0;
> +	int x, y;
> +
> +	if (!u)
> +		u = slit_cluster_size();
> +
> +	/*
> +	 * Is this a unit cluster on the trace?
> +	 */
> +	if ((i / u) == (j / u))
> +		return node_distance(i, j);
> +
> +	/*
> +	 * Off-trace cluster, return average of the cluster to force symmetry.
> +	 */
> +	x = i - (i % u);
> +	y = j - (j % u);
> +
> +	for (i = x; i < x + u; i++) {
> +		for (j = y; j < y + u; j++) {
> +			d += node_distance(i, j);
> +			d += node_distance(j, i);
> +		}
> +	}
> +
> +	return d / (2*u*u);
>  }

Note that I changed this to average over the symmetric pair of off-trace
clusters. Because if some BIOS monkey is really taking bong-hits, there
is absolutely no guarantee that just the one cluster average will result
in a symmetric system in the end.


* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 12:30     ` Peter Zijlstra
  2026-02-25 13:36       ` Peter Zijlstra
  2026-02-25 15:39       ` Chen, Yu C
@ 2026-02-25 16:41       ` Kyle Meyer
  2026-02-25 16:49         ` Peter Zijlstra
  2 siblings, 1 reply; 19+ messages in thread
From: Kyle Meyer @ 2026-02-25 16:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tim.c.chen, bp, dave.hansen, mingo, tglx, vinicius.gomes, brgerst,
	hpa, kprateek.nayak, linux-kernel, patryk.wlazlyn,
	rafael.j.wysocki, russ.anderson, x86, yu.c.chen, zhao1.liu

On Wed, Feb 25, 2026 at 01:30:52PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 24, 2026 at 07:43:10PM -0600, Kyle Meyer wrote:
> 
> > Here's an 8 socket (2 chassis) HPE system with SNC enabled:
> > 
> > node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> >   0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
> >   1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
> >   2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
> >   3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
> >   4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
> >   5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
> >   6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
> >   7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
> >   8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
> >   9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
> >  10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
> >  11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
> >  12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
> >  13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
> >  14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
> >  15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10
> > 
> > 10 = Same chassis and socket
> > 12 = Same chassis and socket (SNC)
> > 16 = Same chassis and adjacent socket
> > 18 = Same chassis and non-adjacent socket
> > 40 = Different chassis
> > 
> > Each processor connects to an ASIC (XNC) that acts as a multiplexer, extending
> > the UPI interconnect across the entire system.
> > 
> > We don't experience the scheduler domain issue reported by Tim because our SLIT
> > provides symmetric distances to remote NUMA nodes, but we trigger the WARN_ONCE
> > because we exceed 2 packages.
> 
> The original case was for SNC-3, the above looks to be SNC-2. Does your
> system also support SNC-3?

We do not currently use SKUs that support SNC-3.

That distance would be set to 12:

node   0   1   2
  0:  10  12  12
  1:  12  10  12
  2:  12  12  10

That might be changed if there's actually a difference in distance.

Distances to adjacent sockets, non-adjacent sockets, and different chassis would
remain the same.
 
> Anyway, yes your SLIT table looks sane (unlike that SNC-3 monster Tim
> showed earlier).
> 
> And it also shows that using REMOTE_DISTANCE (20) was completely random
> and 'wrong'.
> 
> So per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
> 
> Tim's original crazy SNC-3 SLIT table was:
> 
> node distances:
> node     0    1    2    3    4    5
>     0:   10   15   17   21   28   26
>     1:   15   10   15   23   26   23
>     2:   17   15   10   26   23   21
>     3:   21   28   26   10   15   17
>     4:   23   26   23   15   10   15
>     5:   26   23   21   17   15   10
> 
> And per:
> 
>   https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/ 
> 
> My suggestion was to average the off-trace clusters to restore sanity.
> 
> So how about we go about implementing that without reference to magical
> numbers, something like so. This obviously needs a little TLC, but it
> might just work.
> 
> Hmm?
> 
> ---
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 5cd6950ab672..cba3e4b14250 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -513,33 +513,55 @@ static void __init build_sched_topology(void)
>  }
>  
>  #ifdef CONFIG_NUMA
> -static int sched_avg_remote_distance;
> -static int avg_remote_numa_distance(void)
> +
> +/*
> + * Find the largest symmetric cluster in an attempt to identify the unit size.
> + *
> + * XXX doesn't respect N_CPU node classes and such.
> + */
> +static int slit_cluster_size(void)
>  {
> -	int i, j;
> -	int distance, nr_remote, total_distance;
> +	int i, j, n, m = num_possible_nodes();
>  
> -	if (sched_avg_remote_distance > 0)
> -		return sched_avg_remote_distance;
> -
> -	nr_remote = 0;
> -	total_distance = 0;
> -	for_each_node_state(i, N_CPU) {
> -		for_each_node_state(j, N_CPU) {
> -			distance = node_distance(i, j);
> -
> -			if (distance >= REMOTE_DISTANCE) {
> -				nr_remote++;
> -				total_distance += distance;
> +	for (n = 2; n < m; n++) {
> +		for (i = 0; i < n; i++) {
> +			for (j = i; j < n; j++) {
> +				if (node_distance(i, j) != node_distance(j, i))
> +					return n - 1;
>  			}
>  		}
>  	}
> -	if (nr_remote)
> -		sched_avg_remote_distance = total_distance / nr_remote;
> -	else
> -		sched_avg_remote_distance = REMOTE_DISTANCE;
>  
> -	return sched_avg_remote_distance;
> +	return m;
> +}
> +
> +static int slit_cluster_distance(int i, int j)
> +{
> +	static int u = 0;
> +	long d = 0;
> +	int x, y;
> +
> +	if (!u)
> +		u = slit_cluster_size();
> +
> +	/*
> +	 * Is this a unit cluster on the trace?
> +	 */
> +	if ((i / u) == (j / u))
> +		return node_distance(i, j);
> +
> +	/*
> +	 * Off-trace cluster, return average of the cluster to force symmetry.
> +	 */
> +	x = i - (i % u);
> +	y = j - (j % u);
> +
> +	for (i = x; i < x + u; i++) {
> +		for (j = y; j < y + u; j++)
> +			d += node_distance(i, j);
> +	}
> +
> +	return d / (u*u);
>  }
>  
>  int arch_sched_node_distance(int from, int to)
> @@ -550,8 +572,7 @@ int arch_sched_node_distance(int from, int to)
>  	case INTEL_GRANITERAPIDS_X:
>  	case INTEL_ATOM_DARKMONT_X:
>  
> -		if (!x86_has_numa_in_package || topology_max_packages() == 1 ||
> -		    d < REMOTE_DISTANCE)
> +		if (!x86_has_numa_in_package || topology_max_packages() == 1)
>  			return d;
>  
>  		/*
> @@ -571,12 +592,7 @@ int arch_sched_node_distance(int from, int to)
>  		 * packages as average distance to different remote packages
>  		 * could be different.
>  		 */
> -		WARN_ONCE(topology_max_packages() > 2,
> -			  "sched: Expect only up to 2 packages for GNR or CWF, "
> -			  "but saw %d packages when building sched domains.",
> -			  topology_max_packages());
> -
> -		d = avg_remote_numa_distance();
> +		return slit_cluster_distance(from, to);
>  	}
>  	return d;
>  }


* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 16:41       ` Kyle Meyer
@ 2026-02-25 16:49         ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2026-02-25 16:49 UTC (permalink / raw)
  To: Kyle Meyer
  Cc: tim.c.chen, bp, dave.hansen, mingo, tglx, vinicius.gomes, brgerst,
	hpa, kprateek.nayak, linux-kernel, patryk.wlazlyn,
	rafael.j.wysocki, russ.anderson, x86, yu.c.chen, zhao1.liu

On Wed, Feb 25, 2026 at 10:41:38AM -0600, Kyle Meyer wrote:
> On Wed, Feb 25, 2026 at 01:30:52PM +0100, Peter Zijlstra wrote:

> > The original case was for SNC-3, the above looks to be SNC-2. Does your
> > system also support SNC-3?
> 
> We do not currently use SKUs that support SNC-3.
> 
> That distance would be set to 12:
> 
> node   0   1   2
>   0:  10  12  12
>   1:  12  10  12
>   2:  12  12  10
> 
> That might be changed if there's actually a difference in distance.
> 
> Distances to adjacent sockets, non-adjacent sockets, and different chassis would
> remain the same.

OK, excellent. So it's just the Intel reference systems that are crazy.


* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 16:32           ` Peter Zijlstra
  2026-02-25 16:40             ` Peter Zijlstra
@ 2026-02-25 21:37             ` Tim Chen
  2026-02-25 22:30               ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Tim Chen @ 2026-02-25 21:37 UTC (permalink / raw)
  To: Peter Zijlstra, Chen, Yu C
  Cc: Kyle Meyer, bp, dave.hansen, mingo, tglx, vinicius.gomes, brgerst,
	hpa, kprateek.nayak, linux-kernel, patryk.wlazlyn,
	rafael.j.wysocki, russ.anderson, x86, zhao1.liu

On Wed, 2026-02-25 at 17:32 +0100, Peter Zijlstra wrote:
> On Wed, Feb 25, 2026 at 04:44:09PM +0100, Peter Zijlstra wrote:
> 
> > Yes, so this assumes that all u sized clusters on the trace are similar
> > and 'sane' without verification.
> 
> That gave me an idea; how's this then?

Sorry I was sick for a few days.  Just catching up on this
thread here. I think your patch takes care of both GNR SNC-3 
with 3 compute dies (with non-symmetric remote
distances) and generic SNC-2 with 2 dies (symmetric
distances) very well.

Minor suggestion below for the patch.

Will ask the original GNR teams with the problem to try
it out.

> 
> ---
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 5cd6950ab672..b1e464fd98c0 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -513,33 +513,99 @@ static void __init build_sched_topology(void)
>  }
>  
>  #ifdef CONFIG_NUMA
> -static int sched_avg_remote_distance;
> -static int avg_remote_numa_distance(void)
> +
> +static bool slit_cluster_symmetric(int i, int j, int n)
>  {
> -	int i, j;
> -	int distance, nr_remote, total_distance;
> +	WARN_ON_ONCE((i % n) || (j % n));
>  
> -	if (sched_avg_remote_distance > 0)
> -		return sched_avg_remote_distance;
> -
> -	nr_remote = 0;
> -	total_distance = 0;
> -	for_each_node_state(i, N_CPU) {
> -		for_each_node_state(j, N_CPU) {
> -			distance = node_distance(i, j);
> -
> -			if (distance >= REMOTE_DISTANCE) {
> -				nr_remote++;
> -				total_distance += distance;
> -			}
> +	for (int k = i; k < i + n; k++) {
> +		for (int l = k; l < j + n; l++) {
> +			if (node_distance(k, l) != node_distance(l, k))
> +				return false;
>  		}
>  	}
> -	if (nr_remote)
> -		sched_avg_remote_distance = total_distance / nr_remote;
> -	else
> -		sched_avg_remote_distance = REMOTE_DISTANCE;
>  
> -	return sched_avg_remote_distance;
> +	return true;
> +}
> +
> +static bool slit_cluster_match(int i, int j, int x, int y, int n)

Seems like we only call this function with i==j and x==y (i.e. cluster
at i and cluster at x). Can we simplify?

Thanks.

Tim

> +{
> +	WARN_ON_ONCE((i % n) || (j % n) || (x % n) || (y % n));
> +
> +	for (int k = 0; k < n; k++) {
> +		for (int l = k; l < n; l++) {
> +			if (node_distance(i + k, j + l) != node_distance(x + k, y + l))
> +				return false;
> +		}
> +	}
> +
> +	return true;
> +}
> +
> +/*
> + * Find the largest symmetric, repeating cluster in an attempt to identify the
> + * unit size.
> + */
> +static int slit_cluster_size(void)
> +{
> +	int nodes = num_possible_nodes();
> +
> +	/*
> +	 * There are at least 2 packages; so half-nodes is the largest
> +	 * possible unit, go down from that.
> +	 */
> +	for (int u = nodes / 2; u; u--) {
> +		/*
> +		 * If u doesn't divide nodes, it can't be a unit.
> +		 */
> +		if (nodes % u)
> +			continue;
> +
> +		/*
> +		 * Unit must be symmetric,
> +		 */
> +		if (!slit_cluster_symmetric(0, 0, u))
> +			continue;
> +
> +		/*
> +		 * and repeating.
> +		 */
> +		if (slit_cluster_match(0, 0, u, u, u))
> +			return u;
> +	}
> +
> +	return nodes;
> +}
> +
> +static int slit_cluster_distance(int i, int j)
> +{
> +	static int u = 0;
> +	long d = 0;
> +	int x, y;
> +
> +	if (!u)
> +		u = slit_cluster_size();
> +
> +	/*
> +	 * Is this a unit cluster on the trace?
> +	 */
> +	if ((i / u) == (j / u))
> +		return node_distance(i, j);
> +
> +	/*
> +	 * Off-trace cluster, return average of the cluster to force symmetry.
> +	 */
> +	x = i - (i % u);
> +	y = j - (j % u);
> +
> +	for (i = x; i < x + u; i++) {
> +		for (j = y; j < y + u; j++) {
> +			d += node_distance(i, j);
> +			d += node_distance(j, i);
> +		}
> +	}
> +
> +	return d / (2*u*u);
>  }
>  
>  int arch_sched_node_distance(int from, int to)
> @@ -550,8 +616,7 @@ int arch_sched_node_distance(int from, int to)
>  	case INTEL_GRANITERAPIDS_X:
>  	case INTEL_ATOM_DARKMONT_X:
>  
> -		if (!x86_has_numa_in_package || topology_max_packages() == 1 ||
> -		    d < REMOTE_DISTANCE)
> +		if (!x86_has_numa_in_package || topology_max_packages() == 1)
>  			return d;
>  
>  		/*
> @@ -564,19 +629,8 @@ int arch_sched_node_distance(int from, int to)
>  		 * in the remote package in the same sched group.
>  		 * Simplify NUMA domains and avoid extra NUMA levels including
>  		 * different remote NUMA nodes and local nodes.
> -		 *
> -		 * GNR and CWF don't expect systems with more than 2 packages
> -		 * and more than 2 hops between packages. Single average remote
> -		 * distance won't be appropriate if there are more than 2
> -		 * packages as average distance to different remote packages
> -		 * could be different.
>  		 */
> -		WARN_ONCE(topology_max_packages() > 2,
> -			  "sched: Expect only up to 2 packages for GNR or CWF, "
> -			  "but saw %d packages when building sched domains.",
> -			  topology_max_packages());
> -
> -		d = avg_remote_numa_distance();
> +		return slit_cluster_distance(from, to);
>  	}
>  	return d;
>  }


* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 21:37             ` Tim Chen
@ 2026-02-25 22:30               ` Peter Zijlstra
  2026-02-25 22:54                 ` Peter Zijlstra
  2026-02-25 22:55                 ` Tim Chen
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2026-02-25 22:30 UTC (permalink / raw)
  To: Tim Chen
  Cc: Chen, Yu C, Kyle Meyer, bp, dave.hansen, mingo, tglx,
	vinicius.gomes, brgerst, hpa, kprateek.nayak, linux-kernel,
	patryk.wlazlyn, rafael.j.wysocki, russ.anderson, x86, zhao1.liu

On Wed, Feb 25, 2026 at 01:37:11PM -0800, Tim Chen wrote:
> On Wed, 2026-02-25 at 17:32 +0100, Peter Zijlstra wrote:
> > On Wed, Feb 25, 2026 at 04:44:09PM +0100, Peter Zijlstra wrote:
> > 
> > > Yes, so this assumes that all u sized clusters on the trace are similar
> > > and 'sane' without verification.
> > 
> > That gave me an idea; how's this then?
> 
> Sorry I was sick for a few days.  Just catching up on this
> thread here. I think your patch takes care of both GNR SNC-3 
> with 3 compute dies (with non-symmetric remote
> distances) and generic SNC-2 with 2 dies (symmetric
> distances) very well.
> 
> Minor suggestion below for the patch.
> 
> Will ask the original GNR teams with the problem to try
> it out.

Since HPE can obviously have a sane SLIT table, why can't we simply
claim the SLIT table they had is broken and needs fixing?

Also, is there really no enumeration of the SNC mode available; must we
really divinate?


* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 22:30               ` Peter Zijlstra
@ 2026-02-25 22:54                 ` Peter Zijlstra
  2026-02-25 22:55                 ` Tim Chen
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2026-02-25 22:54 UTC (permalink / raw)
  To: Tim Chen
  Cc: Chen, Yu C, Kyle Meyer, bp, dave.hansen, mingo, tglx,
	vinicius.gomes, brgerst, hpa, kprateek.nayak, linux-kernel,
	patryk.wlazlyn, rafael.j.wysocki, russ.anderson, x86, zhao1.liu

On Wed, Feb 25, 2026 at 11:30:24PM +0100, Peter Zijlstra wrote:

> Also, is there really no enumeration of the SNC mode available; must we
> really divinate?

There isn't; but we have existing divination in resctl instead of the
topology code :-(

I'll try and fix that tomorrow.


* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 22:30               ` Peter Zijlstra
  2026-02-25 22:54                 ` Peter Zijlstra
@ 2026-02-25 22:55                 ` Tim Chen
  2026-02-25 23:29                   ` Kyle Meyer
  1 sibling, 1 reply; 19+ messages in thread
From: Tim Chen @ 2026-02-25 22:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chen, Yu C, Kyle Meyer, bp, dave.hansen, mingo, tglx,
	vinicius.gomes, brgerst, hpa, kprateek.nayak, linux-kernel,
	patryk.wlazlyn, rafael.j.wysocki, russ.anderson, x86, zhao1.liu

On Wed, 2026-02-25 at 23:30 +0100, Peter Zijlstra wrote:
> On Wed, Feb 25, 2026 at 01:37:11PM -0800, Tim Chen wrote:
> > On Wed, 2026-02-25 at 17:32 +0100, Peter Zijlstra wrote:
> > > On Wed, Feb 25, 2026 at 04:44:09PM +0100, Peter Zijlstra wrote:
> > > 
> > > > Yes, so this assumes that all u sized clusters on the trace are similar
> > > > and 'sane' without verification.
> > > 
> > > That gave me an idea; how's this then?
> > 
> > Sorry I was sick for a few days.  Just catching up on this
> > thread here. I think your patch takes care of both GNR SNC-3 
> > with 3 compute dies (with non-symmetric remote
> > distances) and generic SNC-2 with 2 dies (symmetric
> > distances) very well.
> > 
> > Minor suggestion below for the patch.
> > 
> > Will ask the original GNR teams with the problem to try
> > it out.
> 
> Since HPE can obviously have a sane SLIT table; why can't we simply
> claim the SLIT table they had is broken and needs fixing?

From what I can see, HPE seems to use the SNC-2 variant of GNR, so the
SLIT is symmetric.

Unfortunately, in the topology for the 2-socket GNR that has 3 dies, there
are truly asymmetric paths between die A and die B across remote
sockets, from what I'm told.

> 
> Also, is there really no enumeration of the SNC mode available; must we
> really divinate?

Let me dig into that a bit. I was also thinking that, with that
information, the code would be a lot simpler.

Tim

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 22:55                 ` Tim Chen
@ 2026-02-25 23:29                   ` Kyle Meyer
  2026-02-26 18:14                     ` Tim Chen
  0 siblings, 1 reply; 19+ messages in thread
From: Kyle Meyer @ 2026-02-25 23:29 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Chen, Yu C, bp, dave.hansen, mingo, tglx,
	vinicius.gomes, brgerst, hpa, kprateek.nayak, linux-kernel,
	patryk.wlazlyn, rafael.j.wysocki, russ.anderson, x86, zhao1.liu

On Wed, Feb 25, 2026 at 02:55:58PM -0800, Tim Chen wrote:
> On Wed, 2026-02-25 at 23:30 +0100, Peter Zijlstra wrote:
> > On Wed, Feb 25, 2026 at 01:37:11PM -0800, Tim Chen wrote:
> > > On Wed, 2026-02-25 at 17:32 +0100, Peter Zijlstra wrote:
> > > > On Wed, Feb 25, 2026 at 04:44:09PM +0100, Peter Zijlstra wrote:
> > > > 
> > > > > Yes, so this assumes that all u sized clusters on the trace are similar
> > > > > and 'sane' without verification.
> > > > 
> > > > That gave me an idea; how's this then?
> > > 
> > > Sorry I was sick for a few days.  Just catching up on this
> > > thread here. I think your patch takes care of both GNR SNC-3 
> > > with 3 compute dies (with non-symmetric remote
> > > distances) and generic SNC-2 with 2 dies (symmetric
> > > distances) very well.
> > > 
> > > Minor suggestion below for the patch.
> > > 
> > > Will ask the original GNR teams with the problem to try
> > > it out.
> > 
> > Since HPE can obviously have a sane SLIT table; why can't we simply
> > claim the SLIT table they had is broken and needs fixing?
> 
> From what I can see, HPE seems to use the SNC-2 variant of GNR, so the
> SLIT is symmetric.

Yes, and the SKUs that don't support SNC.

The SKUs that support SNC-3 are limited to 2 packages.

> Unfortunately, in the topology for the 2-socket GNR that has 3 dies, there
> are truly asymmetric paths between die A and die B across remote
> sockets, from what I'm told.

What does MLC look like?

> > Also, is there really no enumeration of the SNC mode available; must we
> > really divinate?
> 
> Let me dig into that a bit. I was also thinking that, with that
> information, the code would be a lot simpler.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] sched/topology: Check average distances to remote packages
  2026-02-25 23:29                   ` Kyle Meyer
@ 2026-02-26 18:14                     ` Tim Chen
  0 siblings, 0 replies; 19+ messages in thread
From: Tim Chen @ 2026-02-26 18:14 UTC (permalink / raw)
  To: Kyle Meyer
  Cc: Peter Zijlstra, Chen, Yu C, bp, dave.hansen, mingo, tglx,
	vinicius.gomes, brgerst, hpa, kprateek.nayak, linux-kernel,
	patryk.wlazlyn, rafael.j.wysocki, russ.anderson, x86, zhao1.liu

On Wed, 2026-02-25 at 17:29 -0600, Kyle Meyer wrote:
> On Wed, Feb 25, 2026 at 02:55:58PM -0800, Tim Chen wrote:
> > On Wed, 2026-02-25 at 23:30 +0100, Peter Zijlstra wrote:
> > > On Wed, Feb 25, 2026 at 01:37:11PM -0800, Tim Chen wrote:
> > > > On Wed, 2026-02-25 at 17:32 +0100, Peter Zijlstra wrote:
> > > > > On Wed, Feb 25, 2026 at 04:44:09PM +0100, Peter Zijlstra wrote:
> > > > > 
> > > > > > Yes, so this assumes that all u sized clusters on the trace are similar
> > > > > > and 'sane' without verification.
> > > > > 
> > > > > That gave me an idea; how's this then?
> > > > 
> > > > Sorry I was sick for a few days.  Just catching up on this
> > > > thread here. I think your patch takes care of both GNR SNC-3 
> > > > with 3 compute dies (with non-symmetric remote
> > > > distances) and generic SNC-2 with 2 dies (symmetric
> > > > distances) very well.
> > > > 
> > > > Minor suggestion below for the patch.
> > > > 
> > > > Will ask the original GNR teams with the problem to try
> > > > it out.
> > > 
> > > Since HPE can obviously have a sane SLIT table; why can't we simply
> > > claim the SLIT table they had is broken and needs fixing?
> > 
> > From what I can see, HPE seems to use the SNC-2 variant of GNR, so the
> > SLIT is symmetric.
> 
> Yes, and the SKUs that don't support SNC.
> 
> The SKUs that support SNC-3 are limited to 2 packages.

Yes, I think there are only 2-package SNC-3 systems out there.

> 
> > Unfortunately, in the topology for the 2-socket GNR that has 3 dies, there
> > are truly asymmetric paths between die A and die B across remote
> > sockets, from what I'm told.
> 
> What does MLC look like?

I don't have access to one for measurements.  Will have to ask colleagues
to measure that.

> 
> > > Also, is there really no enumeration of the SNC mode available; must we
> > > really divinate?
> > 
> > Let me dig into that a bit. I was also thinking that, with that
> > information, the code would be a lot simpler.

There are truly no hardware bits to expose the SNC mode.  We'll have to
rely on snc_get_config() and use the ratio of the number of CPUs per node
vs per L3 to get an estimate.

Tim

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2026-02-26 18:14 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-05  0:24 [PATCH v2] sched/topology: Check average distances to remote packages Kyle Meyer
2026-02-23 16:42 ` Kyle Meyer
2026-02-23 17:03 ` Peter Zijlstra
2026-02-25  1:43   ` Kyle Meyer
2026-02-25  9:05     ` Chen, Yu C
2026-02-25 12:30     ` Peter Zijlstra
2026-02-25 13:36       ` Peter Zijlstra
2026-02-25 15:39       ` Chen, Yu C
2026-02-25 15:44         ` Peter Zijlstra
2026-02-25 16:32           ` Peter Zijlstra
2026-02-25 16:40             ` Peter Zijlstra
2026-02-25 21:37             ` Tim Chen
2026-02-25 22:30               ` Peter Zijlstra
2026-02-25 22:54                 ` Peter Zijlstra
2026-02-25 22:55                 ` Tim Chen
2026-02-25 23:29                   ` Kyle Meyer
2026-02-26 18:14                     ` Tim Chen
2026-02-25 16:41       ` Kyle Meyer
2026-02-25 16:49         ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox