All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tim Chen <tim.c.chen@linux.intel.com>
To: "Chen, Yu C" <yu.c.chen@intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	 Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Libo Chen <libo.chen@oracle.com>,
	Abel Wu <wuyun.abel@bytedance.com>,
	Len Brown <len.brown@intel.com>,
	linux-kernel@vger.kernel.org,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
	 Zhao Liu <zhao1.liu@intel.com>,
	Vinicius Costa Gomes <vinicius.gomes@intel.com>,
	Chen Yu <yu.chen.surf@foxmail.com>
Subject: Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode
Date: Mon, 25 Aug 2025 13:05:35 -0700	[thread overview]
Message-ID: <4d9f64ea691b8a2f7571671156d511407aeee1c8.camel@linux.intel.com> (raw)
In-Reply-To: <c03c0137-931f-4dc9-b2c6-d01d4eb60010@intel.com>

On Mon, 2025-08-25 at 13:08 +0800, Chen, Yu C wrote:
> On 8/23/2025 4:14 AM, Tim Chen wrote:
> > 
... snip...
> > 
> > Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> > Tested-by: Zhao Liu <zhao1.liu@intel.com>
> > ---
> >   arch/x86/kernel/smpboot.c      | 28 ++++++++++++++++++++++++++++
> >   include/linux/sched/topology.h |  1 +
> >   kernel/sched/topology.c        | 25 +++++++++++++++++++------
> >   3 files changed, 48 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 33e166f6ab12..c425e84c88b5 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
> >   	set_sched_topology(topology);
> >   }
> >   
> > +int sched_node_distance(int from, int to)
> > +{
> > +	int d = node_distance(from, to);
> > +
> > +	if (!x86_has_numa_in_package)
> > +		return d;
> > +
> > +	switch (boot_cpu_data.x86_vfm) {
> > +	case INTEL_GRANITERAPIDS_X:
> > +	case INTEL_ATOM_DARKMONT_X:
> > +		if (d < REMOTE_DISTANCE)
> > +			return d;
> > +
> > +		/*
> > +		 * Trim finer distance tuning for nodes in remote package
> > +		 * for the purpose of building sched domains.
> > +		 * Put NUMA nodes in each remote package in a single sched group.
> > +		 * Simplify NUMA domains and avoid extra NUMA levels including different
> > +		 * NUMA nodes in remote packages.
> > +		 *
> > +		 * GNR-x and CWF-X has GLUELESS-MESH topology with SNC
> > +		 * turned on.
> > +		 */
> > +		d = (d / 10) * 10;
> 
> Does the '10' here mean that, the distance of the hierarchy socket
> is 10 from SLIT table? 
> 

Yes.

> For example, from a socket0 point of view,
> the distance of socket1 to socket0 is within [20, 29), the distance
> of socket2 to socket0 is [30,39), and so on. If this is the case,
> maybe add a comment above for future reference.
> 

We don't expect to have more than 2 sockets for GNR and CWF.
So the case of 2 hops like [30,39) should not happen.

> > +	}
> > +	return d;
> > +}
> > +
> >   void set_cpu_sibling_map(int cpu)
> >   {
> >   	bool has_smt = __max_threads_per_core > 1;
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 5263746b63e8..3b62226394af 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -59,6 +59,7 @@ static inline int cpu_numa_flags(void)
> >   #endif
> >   
> >   extern int arch_asym_cpu_priority(int cpu);
> > +extern int sched_node_distance(int from, int to);
> >   
> >   struct sched_domain_attr {
> >   	int relax_domain_level;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 9a7ac67e3d63..3f485da994a7 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1804,7 +1804,7 @@ bool find_numa_distance(int distance)
> >   	bool found = false;
> >   	int i, *distances;
> >   
> > -	if (distance == node_distance(0, 0))
> > +	if (distance == sched_node_distance(0, 0))
> >   		return true;
> > 
> 
> If I understand correct, this patch is trying to fix the sched
> domain issue during load balancing, and NUMA balance logic
> should not be changed because NUMA balancing is not based on
> sched domain?
> 
> That is to say, since the find_numa_distance() is only used by
> NUMA balance, should we keep find_numa_distance() to still use
> node_distance()?

The procedure here is using the distance matrix that's initialized
using sched_node_distance(). Hence the change.

Otherwise we could keep a separate sched_distance matrix and uses
only node_distance here.  Did not do that to minimize the change.

Tim

> 
> >   	rcu_read_lock();
> > @@ -1887,6 +1887,15 @@ static void init_numa_topology_type(int offline_node)
> >   
> >   #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
> >   
> > +/*
> > + * Architecture could simplify NUMA distance, to avoid
> > + * creating too many NUMA levels when SNC is turned on.
> > + */
> > +int __weak sched_node_distance(int from, int to)
> > +{
> > +	return node_distance(from, to);
> > +}
> > +
> >   void sched_init_numa(int offline_node)
> >   {
> >   	struct sched_domain_topology_level *tl;
> > @@ -1894,6 +1903,7 @@ void sched_init_numa(int offline_node)
> >   	int nr_levels = 0;
> >   	int i, j;
> >   	int *distances;
> > +	int max_dist = 0;
> >   	struct cpumask ***masks;
> >   
> >   	/*
> > @@ -1907,7 +1917,10 @@ void sched_init_numa(int offline_node)
> >   	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
> >   	for_each_cpu_node_but(i, offline_node) {
> >   		for_each_cpu_node_but(j, offline_node) {
> > -			int distance = node_distance(i, j);
> > +			int distance = sched_node_distance(i, j);
> > +
> > +			if (node_distance(i,j) > max_dist)
> > +				max_dist = node_distance(i,j);
> >   
> >   			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
> >   				sched_numa_warn("Invalid distance value range");
> > @@ -1979,10 +1992,10 @@ void sched_init_numa(int offline_node)
> >   			masks[i][j] = mask;
> >   
> >   			for_each_cpu_node_but(k, offline_node) {
> > -				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
> > +				if (sched_debug() && (sched_node_distance(j, k) != sched_node_distance(k, j)))
> >   					sched_numa_warn("Node-distance not symmetric");
> >   
> > -				if (node_distance(j, k) > sched_domains_numa_distance[i])
> > +				if (sched_node_distance(j, k) > sched_domains_numa_distance[i])
> >   					continue;
> >   
> >   				cpumask_or(mask, mask, cpumask_of_node(k));
> > @@ -2022,7 +2035,7 @@ void sched_init_numa(int offline_node)
> >   	sched_domain_topology = tl;
> >   
> >   	sched_domains_numa_levels = nr_levels;
> > -	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
> > +	WRITE_ONCE(sched_max_numa_distance, max_dist);
> 
> Above change is to use the original node_distance() rather than
> sched_node_distance() for sched_max_numa_distance, and
> sched_max_numa_distance is only used by NUMA balance to figure out
> the NUMA topology type as well as scaling the NUMA fault statistics
> for remote Nodes.
> 
> So I think we might want to keep it align by using node_distance()
> in find_numa_distance().
> 
> thanks,
> Chenyu
> >   
> >   	init_numa_topology_type(offline_node);
> >   }
> > @@ -2092,7 +2105,7 @@ void sched_domains_numa_masks_set(unsigned int cpu)
> >   				continue;
> >   
> >   			/* Set ourselves in the remote node's masks */
> > -			if (node_distance(j, node) <= sched_domains_numa_distance[i])
> > +			if (sched_node_distance(j, node) <= sched_domains_numa_distance[i])
> >   				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> >   		}
> >   	}


  parent reply	other threads:[~2025-08-25 20:05 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-22 20:14 [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X Tim Chen
2025-08-22 20:14 ` [PATCH 1/2] sched: topology: Fix topology validation error Tim Chen
2025-08-25  3:18   ` K Prateek Nayak
2025-08-25  7:58     ` Peter Zijlstra
2025-08-25  9:23       ` Peter Zijlstra
2025-08-25  7:25   ` Peter Zijlstra
2025-08-25 21:09     ` Tim Chen
2025-08-22 20:14 ` [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode Tim Chen
2025-08-25  5:08   ` Chen, Yu C
2025-08-25  7:56     ` Peter Zijlstra
2025-08-25 21:36       ` Tim Chen
2025-08-25 20:05     ` Tim Chen [this message]
2025-08-25  4:18 ` [PATCH 0/2] Fix NUMA sched domain build errors for GNR-X and CWF-X K Prateek Nayak
2025-08-25 21:38   ` Tim Chen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4d9f64ea691b8a2f7571671156d511407aeee1c8.camel@linux.intel.com \
    --to=tim.c.chen@linux.intel.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=gautham.shenoy@amd.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=len.brown@intel.com \
    --cc=libo.chen@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tim.c.chen@intel.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vinicius.gomes@intel.com \
    --cc=vschneid@redhat.com \
    --cc=wuyun.abel@bytedance.com \
    --cc=yu.c.chen@intel.com \
    --cc=yu.chen.surf@foxmail.com \
    --cc=zhao1.liu@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.