Re: [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andrea Righi <arighi@nvidia.com>
To: Yury Norov <yury.norov@gmail.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	bpf@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
Date: Tue, 24 Dec 2024 08:54:40 +0100	[thread overview]
Message-ID: <Z2powPXbWlZqzU6r@gpd3> (raw)
In-Reply-To: <Z2nTkshW2sUmNLVA@yury-ThinkPad>

Hi Yury,

On Mon, Dec 23, 2024 at 01:18:10PM -0800, Yury Norov wrote:
> On Fri, Dec 20, 2024 at 04:11:33PM +0100, Andrea Righi wrote:
> > Introduce for_each_numa_hop_node() and sched_numa_hop_node() to iterate
> > over node IDs in order of increasing NUMA distance from a given starting
> > node.
> > 
> > These iterator functions are similar to for_each_numa_hop_mask() and
> > sched_numa_hop_mask(), but instead of providing a cpumask at each
> > iteration, they provide a node ID.
> > 
> > Example usage:
> > 
> >   nodemask_t hop_nodes = NODE_MASK_NONE;
> >   int start = cpu_to_node(smp_processor_id());
> > 
> >   for_each_numa_hop_node(node, start, hop_nodes, N_ONLINE)
> >   	pr_info("node (%d, %d) -> \n",
> >   		 start, node, node_distance(start, node);
> 
> This iterates nodes, not hops. The hop is a set of equidistant nodes,
> and the iterator (the first argument) should be a nodemask. I'm OK with
> that as soon as you find it practical. But then you shouldn't mention
> hops in the patch.
> 
> Also, can you check that it works correctly against a configuration with
> equidistant nodes?

Ok, and yes, it should mention nodes, not hops.

> 
> > Simulating the following NUMA topology in virtme-ng:
> > 
> >  $ numactl -H
> >  available: 4 nodes (0-3)
> >  node 0 cpus: 0 1
> >  node 0 size: 1006 MB
> >  node 0 free: 928 MB
> >  node 1 cpus: 2 3
> >  node 1 size: 1007 MB
> >  node 1 free: 986 MB
> >  node 2 cpus: 4 5
> >  node 2 size: 889 MB
> >  node 2 free: 862 MB
> >  node 3 cpus: 6 7
> >  node 3 size: 1006 MB
> >  node 3 free: 983 MB
> >  node distances:
> >  node     0    1    2    3
> >     0:   10   51   31   41
> >     1:   51   10   21   61
> >     2:   31   21   10   11
> >     3:   41   61   11   10
> > 
> > The output of the example above (on node 0) is the following:
> > 
> >  [   84.953644] node (0, 0) -> 10
> >  [   84.953712] node (0, 2) -> 31
> >  [   84.953764] node (0, 3) -> 41
> >  [   84.953817] node (0, 1) -> 51
> > 
> > Cc: Yury Norov <yury.norov@gmail.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >  include/linux/topology.h | 28 ++++++++++++++++++++++-
> >  kernel/sched/topology.c  | 49 ++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 76 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index 52f5850730b3..d9014d90580d 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -248,12 +248,18 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
> >  #ifdef CONFIG_NUMA
> >  int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
> >  extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops);
> > -#else
> > +extern int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state);
> > +#else /* !CONFIG_NUMA */
> >  static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> >  {
> >  	return cpumask_nth_and(cpu, cpus, cpu_online_mask);
> >  }
> >  
> > +static inline int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
> > +{
> > +	return NUMA_NO_NODE;
> > +}
> > +
> >  static inline const struct cpumask *
> >  sched_numa_hop_mask(unsigned int node, unsigned int hops)
> >  {
> > @@ -261,6 +267,26 @@ sched_numa_hop_mask(unsigned int node, unsigned int hops)
> >  }
> >  #endif	/* CONFIG_NUMA */
> >  
> > +/**
> > + * for_each_numa_hop_node - iterate over NUMA nodes at increasing hop distances
> > + *                          from a given starting node.
> > + * @__node: the iteration variable, representing the current NUMA node.
> > + * @__start: the NUMA node to start the iteration from.
> > + * @__hop_nodes: a nodemask_t to track the visited nodes.
> > + * @__state: state of NUMA nodes to iterate.
> > + *
> > + * Requires rcu_lock to be held.
> > + *
> > + * This macro iterates over NUMA nodes in increasing distance from
> > + * @start_node.
> > + *
> > + * Yields NUMA_NO_NODE when all the nodes have been visited.
> > + */
> > +#define for_each_numa_hop_node(__node, __start, __hop_nodes, __state)		\
> 
> As soon as this is not the hops iterator, the proper name would be just
> 
>         for_each_numa_node()
> 
> And because the 'numa' prefix here doesn't look like a prefix, I think
> it would be nice to comment what this 'numa' means and what's the
> difference between this and for_each_node() iterator, especially in
> terms of complexity.

How about renaming this iterator for_each_node_by_distance()?

> 
> Also you don't need underscores in macro declarations unless
> absolutely necessary.

Ok.

> 
> > +	for (int __node = __start;						\
> 
> The __node declared in for() masks out the __node provided in the
> macro.

Ok will fix this.

> 
> > +	     __node != NUMA_NO_NODE;						\
> > +	     __node = sched_numa_hop_node(&(__hop_nodes), __start, __state))
> > +
> >  /**
> >   * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance
> >   *                          from a given node.
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 9748a4c8d668..8e77c235ad9a 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2185,6 +2185,55 @@ int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> >  }
> >  EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
> >  
> > +/**
> > + * sched_numa_hop_node - Find the NUMA node at the closest hop distance
> > + *                       from node @start.
> > + *
> > + * @hop_nodes: a pointer to a nodemask_t representing the visited nodes.
> > + * @start: the NUMA node to start the hop search from.
> > + * @state: the node state to filter nodes by.
> > + *
> > + * This function iterates over all NUMA nodes in the given state and
> > + * calculates the hop distance to the starting node. It returns the NUMA
> > + * node that is the closest in terms of hop distance
> > + * that has not already been considered (not set in @hop_nodes). If the
> > + * node is found, it is marked as considered in the @hop_nodes bitmask.
> > + *
> > + * The function checks if the node is not the start node and ensures it is
> > + * not already part of the hop_nodes set. It then computes the distance to
> > + * the start node using the node_distance() function. The closest node is
> > + * chosen based on the minimum distance.
> > + *
> > + * Returns the NUMA node ID closest in terms of hop distance from the
> > + * @start node, or NUMA_NO_NODE if no node is found (or all nodes have been
> > + * visited).
> 
> for_each_node_state() returns MAX_NUMNODES when it finishes
> traversing. I think you should do the same here. 

Ok.

> 
> > + */
> > +int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
> > +{
> > +	int dist, n, min_node, min_dist;
> > +
> > +	if (state >= NR_NODE_STATES)
> > +		return NUMA_NO_NODE;
> 
>  -EINVAL. But, do we need to check the parameter at all?

numa_nearest_node() has the same check (returning -EINVAL), it seems sane
to do this check here as well to prevent out-of-bounds access to
node_states[state].

> 
> > +
> > +	min_node = NUMA_NO_NODE;
> > +	min_dist = INT_MAX;
> > +
> > +	for_each_node_state(n, state) {
> > +		if (n == start || node_isset(n, *hop_nodes))
> > +			continue;
> > +		dist = node_distance(start, n);
> > +		if (dist < min_dist) {
> > +			min_dist = dist;
> > +			min_node = n;
> > +		}
> > +	}
> 
> This is a version of numa_nearest_node(). The only difference is that
> you add 'hop_nodes' mask, which in fact is 'visited' nodes.
> 
> I think it should be like:
> 
>  int numa_nearest_unvisited_node(nodemask_t *visited, int start, unsigned int state)
>  {
>         for_each_node_andnot(node_states[state], visited) (
>                 ...
>         }
>  }

Makes sense. I'll change this and at this point I'll move this code to
mm/mempolicy.c, close to numa_nearest_node().

Thanks!
-Andrea

next prev parent reply	other threads:[~2024-12-24  7:54 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-20 15:11 ` [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
2024-12-23 21:18   ` Yury Norov
2024-12-24  7:54     ` Andrea Righi [this message]
2024-12-24 17:33       ` Yury Norov
2024-12-20 15:11 ` [PATCH 02/10] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
2024-12-24 21:21   ` Tejun Heo
2024-12-20 15:11 ` [PATCH 03/10] sched_ext: idle: introduce check_builtin_idle_enabled() helper Andrea Righi
2024-12-20 15:11 ` [PATCH 04/10] sched_ext: idle: use assign_cpu() to update the idle cpumask Andrea Righi
2024-12-23 22:26   ` Yury Norov
2024-12-20 15:11 ` [PATCH 05/10] sched_ext: idle: clarify comments Andrea Righi
2024-12-23 22:28   ` Yury Norov
2024-12-20 15:11 ` [PATCH 06/10] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
2024-12-20 15:11 ` [PATCH 07/10] sched_ext: Introduce per-node idle cpumasks Andrea Righi
2024-12-24  4:05   ` Yury Norov
2024-12-24  8:18     ` Andrea Righi
2024-12-24 17:59       ` Yury Norov
2024-12-20 15:11 ` [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE Andrea Righi
2024-12-24  2:48   ` Yury Norov
2024-12-24  3:53     ` Yury Norov
2024-12-24  8:37       ` Andrea Righi
2024-12-24 18:15         ` Yury Norov
2024-12-24  8:22     ` Andrea Righi
2024-12-24 21:29       ` Tejun Heo
2024-12-20 15:11 ` [PATCH 09/10] sched_ext: idle: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-23 23:39   ` Yury Norov
2024-12-24  8:58     ` Andrea Righi
2024-12-20 15:11 ` [PATCH 10/10] sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-24  0:57   ` Yury Norov
2024-12-24  9:32     ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z2powPXbWlZqzU6r@gpd3 \
    --to=arighi@nvidia.com \
    --cc=bpf@vger.kernel.org \
    --cc=bsegall@google.com \
    --cc=changwoo@igalia.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    --cc=void@manifault.com \
    --cc=vschneid@redhat.com \
    --cc=yury.norov@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.