From: Kyle Meyer <kyle.meyer@hpe.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: tim.c.chen@linux.intel.com, bp@alien8.de,
	dave.hansen@linux.intel.com, mingo@redhat.com, tglx@kernel.org,
	vinicius.gomes@intel.com, brgerst@gmail.com, hpa@zytor.com,
	kprateek.nayak@amd.com, linux-kernel@vger.kernel.org,
	patryk.wlazlyn@linux.intel.com, rafael.j.wysocki@intel.com,
	russ.anderson@hpe.com, x86@kernel.org, yu.c.chen@intel.com,
	zhao1.liu@intel.com
Subject: Re: [PATCH v2] sched/topology: Check average distances to remote packages
Date: Wed, 25 Feb 2026 10:41:38 -0600
Message-ID: <aZ8mQkRtNSgFGdVv@hpe.com>
In-Reply-To: <20260225123052.GN3016024@noisy.programming.kicks-ass.net>

On Wed, Feb 25, 2026 at 01:30:52PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 24, 2026 at 07:43:10PM -0600, Kyle Meyer wrote:
> 
> > Here's an 8 socket (2 chassis) HPE system with SNC enabled:
> > 
> > node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> >   0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
> >   1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
> >   2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
> >   3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
> >   4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
> >   5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
> >   6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
> >   7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
> >   8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
> >   9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
> >  10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
> >  11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
> >  12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
> >  13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
> >  14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
> >  15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10
> > 
> > 10 = Same chassis and socket
> > 12 = Same chassis and socket (SNC)
> > 16 = Same chassis and adjacent socket
> > 18 = Same chassis and non-adjacent socket
> > 40 = Different chassis
> > 
> > Each processor connects to an ASIC (XNC) that acts as a multiplexer, extending
> > the UPI interconnect across the entire system.
> > 
> > We don't experience the scheduler domain issue reported by Tim because our SLIT
> > provides symmetric distances to remote NUMA nodes, but we trigger the WARN_ONCE
> > because we exceed 2 packages.
> 
> The original case was for SNC-3, the above looks to be SNC-2. Does your
> system also support SNC-3?

We do not currently use SKUs that support SNC-3.

If we did, the distance between SNC nodes within a socket would be set to 12:

node   0   1   2
  0:  10  12  12
  1:  12  10  12
  2:  12  12  10

That could change if the distances between SNC nodes actually differ.

Distances to adjacent sockets, non-adjacent sockets, and different chassis would
remain the same.
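
As a rough illustration of the layout described above, here is a userspace
sketch that derives a distance from the legend values (10/12/16/18/40). The
node numbering (nodes consecutive within a socket, 4 sockets per chassis), the
adjacency rule read off the posted table, and the names model_distance() and
model_is_symmetric() are assumptions made for this example, not firmware or
kernel code.

```c
#include <assert.h>
#include <stdbool.h>

#define SOCKETS_PER_CHASSIS 4

/* Distance values taken from the legend under the posted SLIT. */
enum {
	DIST_LOCAL       = 10,
	DIST_SNC         = 12,
	DIST_ADJACENT    = 16,
	DIST_NONADJACENT = 18,
	DIST_CHASSIS     = 40,
};

/*
 * Hypothetical model: nodes are numbered consecutively within a socket and
 * sockets consecutively within a chassis. nodes_per_socket is 2 for SNC-2
 * and would be 3 for SNC-3.
 */
static int model_distance(int a, int b, int nodes_per_socket)
{
	int sa, sb;

	assert(a >= 0 && b >= 0 && nodes_per_socket > 0);

	if (a == b)
		return DIST_LOCAL;

	sa = a / nodes_per_socket;
	sb = b / nodes_per_socket;

	if (sa == sb)
		return DIST_SNC;
	if (sa / SOCKETS_PER_CHASSIS != sb / SOCKETS_PER_CHASSIS)
		return DIST_CHASSIS;

	/*
	 * In the posted table, sockets 0/3 and 1/2 of a chassis are the
	 * non-adjacent pairs, i.e. their in-chassis indices sum to 3.
	 */
	if ((sa % SOCKETS_PER_CHASSIS) + (sb % SOCKETS_PER_CHASSIS) == 3)
		return DIST_NONADJACENT;

	return DIST_ADJACENT;
}

/* Check that distance(i, j) == distance(j, i) for every pair of nodes. */
static bool model_is_symmetric(int nr_nodes, int nodes_per_socket)
{
	int i, j;

	for (i = 0; i < nr_nodes; i++) {
		for (j = i + 1; j < nr_nodes; j++) {
			if (model_distance(i, j, nodes_per_socket) !=
			    model_distance(j, i, nodes_per_socket))
				return false;
		}
	}
	return true;
}
```

With nodes_per_socket = 2 this reproduces the 16-node table above, and with
nodes_per_socket = 3 the first three nodes give the 10/12/12 table; only the
in-socket part changes between the two.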
 
> Anyway, yes your SLIT table looks sane (unlike that SNC-3 monster Tim
> showed earlier).
> 
> And it also shows that using REMOTE_DISTANCE (20) was completely random
> and 'wrong'.
> 
> So per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
> 
> Tim's original crazy SNC-3 SLIT table was:
> 
> node distances:
> node     0    1    2    3    4    5
>     0:   10   15   17   21   28   26
>     1:   15   10   15   23   26   23
>     2:   17   15   10   26   23   21
>     3:   21   28   26   10   15   17
>     4:   23   26   23   15   10   15
>     5:   26   23   21   17   15   10
> 
> And per:
> 
>   https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/ 
> 
> My suggestion was to average the off-trace clusters to restore sanity.
> 
> So how about we go about implementing that without reference to magical
> numbers, something like so. This obviously needs a little TLC, but it
> might just work.
> 
> Hmm?
> 
> ---
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 5cd6950ab672..cba3e4b14250 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -513,33 +513,55 @@ static void __init build_sched_topology(void)
>  }
>  
>  #ifdef CONFIG_NUMA
> -static int sched_avg_remote_distance;
> -static int avg_remote_numa_distance(void)
> +
> +/*
> + * Find the largest symmetric cluster in an attempt to identify the unit size.
> + *
> + * XXX doesn't respect N_CPU node classes and such.
> + */
> +static int slit_cluster_size(void)
>  {
> -	int i, j;
> -	int distance, nr_remote, total_distance;
> +	int i, j, n, m = num_possible_nodes();
>  
> -	if (sched_avg_remote_distance > 0)
> -		return sched_avg_remote_distance;
> -
> -	nr_remote = 0;
> -	total_distance = 0;
> -	for_each_node_state(i, N_CPU) {
> -		for_each_node_state(j, N_CPU) {
> -			distance = node_distance(i, j);
> -
> -			if (distance >= REMOTE_DISTANCE) {
> -				nr_remote++;
> -				total_distance += distance;
> +	for (n = 2; n < m; n++) {
> +		for (i = 0; i < n; i++) {
> +			for (j = i; j < n; j++) {
> +				if (node_distance(i, j) != node_distance(j, i))
> +					return n - 1;
>  			}
>  		}
>  	}
> -	if (nr_remote)
> -		sched_avg_remote_distance = total_distance / nr_remote;
> -	else
> -		sched_avg_remote_distance = REMOTE_DISTANCE;
>  
> -	return sched_avg_remote_distance;
> +	return m;
> +}
> +
> +static int slit_cluster_distance(int i, int j)
> +{
> +	static int u = 0;
> +	long d = 0;
> +	int x, y;
> +
> +	if (!u)
> +		u = slit_cluster_size();
> +
> +	/*
> +	 * Is this a unit cluster on the trace?
> +	 */
> +	if ((i / u) == (j / u))
> +		return node_distance(i, j);
> +
> +	/*
> +	 * Off-trace cluster, return average of the cluster to force symmetry.
> +	 */
> +	x = i - (i % u);
> +	y = j - (j % u);
> +
> +	for (i = x; i < x + u; i++) {
> +		for (j = y; j < y + u; j++)
> +			d += node_distance(i, j);
> +	}
> +
> +	return d / (u*u);
>  }
>  
>  int arch_sched_node_distance(int from, int to)
> @@ -550,8 +572,7 @@ int arch_sched_node_distance(int from, int to)
>  	case INTEL_GRANITERAPIDS_X:
>  	case INTEL_ATOM_DARKMONT_X:
>  
> -		if (!x86_has_numa_in_package || topology_max_packages() == 1 ||
> -		    d < REMOTE_DISTANCE)
> +		if (!x86_has_numa_in_package || topology_max_packages() == 1)
>  			return d;
>  
>  		/*
> @@ -571,12 +592,7 @@ int arch_sched_node_distance(int from, int to)
>  		 * packages as average distance to different remote packages
>  		 * could be different.
>  		 */
> -		WARN_ONCE(topology_max_packages() > 2,
> -			  "sched: Expect only up to 2 packages for GNR or CWF, "
> -			  "but saw %d packages when building sched domains.",
> -			  topology_max_packages());
> -
> -		d = avg_remote_numa_distance();
> +		return slit_cluster_distance(from, to);
>  	}
>  	return d;
>  }
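
To see what the averaging in the quoted patch would do, here is a userspace
sketch that runs the same logic over Tim's SNC-3 table from the quote. It is
only an approximation of the proposal: node_distance() is replaced by a lookup
into a hard-coded array, NR_NODES stands in for num_possible_nodes(), and the
cluster size is recomputed on every call instead of being cached.

```c
#include <assert.h>

#define NR_NODES 6

/* Tim's SNC-3 SLIT, copied from the quoted message. */
static const int slit[NR_NODES][NR_NODES] = {
	{ 10, 15, 17, 21, 28, 26 },
	{ 15, 10, 15, 23, 26, 23 },
	{ 17, 15, 10, 26, 23, 21 },
	{ 21, 28, 26, 10, 15, 17 },
	{ 23, 26, 23, 15, 10, 15 },
	{ 26, 23, 21, 17, 15, 10 },
};

/* Userspace stand-in for the kernel's node_distance(). */
static int node_distance(int i, int j)
{
	assert(i >= 0 && i < NR_NODES && j >= 0 && j < NR_NODES);
	return slit[i][j];
}

/*
 * Grow the leading n x n sub-matrix until it stops being symmetric; the last
 * symmetric size is taken as the unit cluster size (same loop as the patch).
 */
static int slit_cluster_size(void)
{
	int i, j, n, m = NR_NODES;

	for (n = 2; n < m; n++) {
		for (i = 0; i < n; i++) {
			for (j = i; j < n; j++) {
				if (node_distance(i, j) != node_distance(j, i))
					return n - 1;
			}
		}
	}
	return m;
}

static int slit_cluster_distance(int i, int j)
{
	int u = slit_cluster_size();
	long d = 0;
	int x, y, a, b;

	/* Same unit cluster (on the trace): keep the raw distance. */
	if (i / u == j / u)
		return node_distance(i, j);

	/* Off-trace cluster: average the whole u x u block. */
	x = i - (i % u);
	y = j - (j % u);

	for (a = x; a < x + u; a++) {
		for (b = y; b < y + u; b++)
			d += node_distance(a, b);
	}

	return (int)(d / (u * u));
}
```

For this table the cluster size comes out as 3, and the raw 0->4 / 4->0
distances of 28 and 23 both become 24 (217 / 9, rounded down), while distances
inside a package such as 0->1 stay at 15.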
