Re: [PATCH 5/6] sched_ext: idle: Per-node idle cpumasks

From: Yury Norov <yury.norov@gmail.com>
To: Andrea Righi <arighi@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Ian May <ianm@nvidia.com>,
	bpf@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 5/6] sched_ext: idle: Per-node idle cpumasks
Date: Mon, 10 Feb 2025 11:57:42 -0500
Message-ID: <Z6owBvYiArjXvIGC@thinkpad>
In-Reply-To: <Z6ju7vFK5TpJamn5@thinkpad>

On Sun, Feb 09, 2025 at 01:07:44PM -0500, Yury Norov wrote:
> On Fri, Feb 07, 2025 at 09:40:52PM +0100, Andrea Righi wrote:
> > Using a single global idle mask can lead to inefficiencies and a lot of
> > stress on the cache coherency protocol on large systems with multiple
> > NUMA nodes, since all the CPUs can create a really intense read/write
> > activity on the single global cpumask.
> 
> Can you put your perf numbers here too?
>  
> > Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> > to improve scalability and performance on large systems.
> > 
> > The concept is that each cpumask will track only the idle CPUs within
> > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> > In this way concurrent access to the idle cpumask will be restricted
> > within each NUMA node.
> > 
> > The split of multiple per-node idle cpumasks can be controlled using the
> > SCX_OPS_BUILTIN_IDLE_PER_NODE flag.
> > 
> > By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global
> > host-wide idle cpumask is used, maintaining the previous behavior.
> > 
> > NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via
> > SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
> > trigger an scx error, since there are no system-wide cpumasks.
> > 
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >  kernel/sched/ext_idle.c | 242 ++++++++++++++++++++++++++++++++--------
> >  kernel/sched/ext_idle.h |  11 +-
> >  2 files changed, 203 insertions(+), 50 deletions(-)
> > 
> > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> > index a3f2b00903ac2..4b90ec9018c1a 100644
> > --- a/kernel/sched/ext_idle.c
> > +++ b/kernel/sched/ext_idle.c
> > @@ -18,25 +18,88 @@ DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
> >  DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
> >  
> >  #ifdef CONFIG_SMP
> > -#ifdef CONFIG_CPUMASK_OFFSTACK
> > -#define CL_ALIGNED_IF_ONSTACK
> > -#else
> > -#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
> > -#endif
> > -
> >  /* Enable/disable LLC aware optimizations */
> >  DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
> >  
> >  /* Enable/disable NUMA aware optimizations */
> >  DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
> >  
> > -static struct {
> > +/*
> > + * cpumasks to track idle CPUs within each NUMA node.
> > + *
> > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask
> > > + * is used to track all the idle CPUs in the system.
> > + */
> > +struct idle_cpus {
> >  	cpumask_var_t cpu;
> >  	cpumask_var_t smt;
> > -} idle_masks CL_ALIGNED_IF_ONSTACK;
> > +};
> > +
> > +/*
> > + * Global host-wide idle cpumasks (used when SCX_OPS_BUILTIN_IDLE_PER_NODE
> > + * is not enabled).
> > + */
> > +static struct idle_cpus scx_idle_global_masks;
> > +
> > +/*
> > + * Per-node idle cpumasks.
> > + */
> > +static struct idle_cpus **scx_idle_node_masks;
> > +
> > +/*
> > + * Initialize per-node idle cpumasks.
> > + *
> > + * In case of a single NUMA node or if NUMA support is disabled, only a
> > + * single global host-wide cpumask will be initialized.
> > + */
> > +void scx_idle_init_masks(void)
> > +{
> > +	int node;
> > +
> > +	/* Allocate global idle cpumasks */
> > +	BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.cpu, GFP_KERNEL));
> > +	BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.smt, GFP_KERNEL));
> > +
> > +	/* Allocate per-node idle cpumasks */
> > +	scx_idle_node_masks = kcalloc(num_possible_nodes(),
> > +				      sizeof(*scx_idle_node_masks), GFP_KERNEL);
> > +	BUG_ON(!scx_idle_node_masks);
> > +
> > +	for_each_node(node) {
> > +		scx_idle_node_masks[node] = kzalloc_node(sizeof(**scx_idle_node_masks),
> > +							 GFP_KERNEL, node);
> > +		BUG_ON(!scx_idle_node_masks[node]);
> > +
> > +		BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->cpu, GFP_KERNEL, node));
> > +		BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->smt, GFP_KERNEL, node));
> > +	}
> > +}
> > +
> > +/*
> > + * Return the idle masks associated to a target @node.
> > + */
> > +static struct idle_cpus *idle_cpumask(int node)
> > +{
> > +	return node == NUMA_NO_NODE ? &scx_idle_global_masks : scx_idle_node_masks[node];
> > +}
> > +
> > +/*
> > + * Return the node id associated to a target idle CPU (used to determine
> > + * the proper idle cpumask).
> > + */
> > +static int idle_cpu_to_node(int cpu)
> > +{
> > +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > +		return NUMA_NO_NODE;
> > +
> > +	return cpu_to_node(cpu);
> > +}
> >  
> >  bool scx_idle_test_and_clear_cpu(int cpu)
> >  {
> > +	int node = idle_cpu_to_node(cpu);
> > +	struct cpumask *idle_cpus = idle_cpumask(node)->cpu;
> > +
> >  #ifdef CONFIG_SCHED_SMT
> >  	/*
> >  	 * SMT mask should be cleared whether we can claim @cpu or not. The SMT
> > @@ -45,33 +108,38 @@ bool scx_idle_test_and_clear_cpu(int cpu)
> >  	 */
> >  	if (sched_smt_active()) {
> >  		const struct cpumask *smt = cpu_smt_mask(cpu);
> > +		struct cpumask *idle_smts = idle_cpumask(node)->smt;
> >  
> >  		/*
> >  		 * If offline, @cpu is not its own sibling and
> >  		 * scx_pick_idle_cpu() can get caught in an infinite loop as
> > -		 * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> > -		 * is eventually cleared.
> > +		 * @cpu is never cleared from the idle SMT mask. Ensure that
> > +		 * @cpu is eventually cleared.
> >  		 *
> >  		 * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
> >  		 * reduce memory writes, which may help alleviate cache
> >  		 * coherence pressure.
> >  		 */
> > -		if (cpumask_intersects(smt, idle_masks.smt))
> > -			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> > -		else if (cpumask_test_cpu(cpu, idle_masks.smt))
> > -			__cpumask_clear_cpu(cpu, idle_masks.smt);
> > +		if (cpumask_intersects(smt, idle_smts))
> > +			cpumask_andnot(idle_smts, idle_smts, smt);
> > +		else if (cpumask_test_cpu(cpu, idle_smts))
> > +			__cpumask_clear_cpu(cpu, idle_smts);
> >  	}
> >  #endif
> > -	return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> > +
> > +	return cpumask_test_and_clear_cpu(cpu, idle_cpus);
> >  }
> >  
> > -s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> > +/*
> > + * Pick an idle CPU in a specific NUMA node.
> > + */
> > +s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags)
> >  {
> >  	int cpu;
> >  
> >  retry:
> >  	if (sched_smt_active()) {
> > -		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> > +		cpu = cpumask_any_and_distribute(idle_cpumask(node)->smt, cpus_allowed);
> >  		if (cpu < nr_cpu_ids)
> >  			goto found;
> >  
> > @@ -79,7 +147,7 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> >  			return -EBUSY;
> >  	}
> >  
> > -	cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> > +	cpu = cpumask_any_and_distribute(idle_cpumask(node)->cpu, cpus_allowed);
> >  	if (cpu >= nr_cpu_ids)
> >  		return -EBUSY;
> >  
> > @@ -90,6 +158,55 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> >  		goto retry;
> >  }
> >  
> > +/*
> > + * Find the best idle CPU in the system, relative to @node.
> > + */
> > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
> > +{
> > +	nodemask_t unvisited = NODE_MASK_ALL;

This should be a NODEMASK_ALLOC(). A nodemask_t holds MAX_NUMNODES bits,
and we don't want to eat up too much of the stack, right?
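
To illustrate the concern, here is a minimal userspace sketch, not kernel
code. It assumes a 1024-node configuration, and every MODEL_*/model_* name
is a hypothetical stand-in. The kernel's NODEMASK_ALLOC(type, name,
gfp_flags) pairs with NODEMASK_FREE() and, for large NODES_SHIFT values,
heap-allocates the mask instead of placing it on the stack; the sketch
models only that heap-backed case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-ins; none of these are kernel definitions. */
#define MODEL_MAX_NODES		1024
#define MODEL_BITS_PER_LONG	(8 * sizeof(unsigned long))

typedef struct {
	unsigned long bits[MODEL_MAX_NODES / MODEL_BITS_PER_LONG];
} model_nodemask_t;

/* Heap-backed declaration, modelling the large-NODES_SHIFT case. */
#define MODEL_NODEMASK_ALLOC(name) \
	model_nodemask_t *name = malloc(sizeof(*name))
#define MODEL_NODEMASK_FREE(name)	free(name)

/* Visit nodes 0..nr_nodes-1 once each, tracking them in a heap mask. */
int model_visit_nodes(int nr_nodes)
{
	MODEL_NODEMASK_ALLOC(unvisited);
	int visited = 0;

	assert(nr_nodes >= 0 && nr_nodes <= MODEL_MAX_NODES);
	if (!unvisited)
		return -1;	/* allocation can fail, unlike a stack variable */

	/* Start with every node marked unvisited (all bits set). */
	memset(unvisited, 0xff, sizeof(*unvisited));

	for (int node = 0; node < nr_nodes; node++) {
		unsigned long bit = 1UL << (node % MODEL_BITS_PER_LONG);
		unsigned long *word = &unvisited->bits[node / MODEL_BITS_PER_LONG];

		if (*word & bit) {
			*word &= ~bit;	/* mark the node as visited */
			visited++;
		}
	}

	MODEL_NODEMASK_FREE(unvisited);
	return visited;
}
```

With 1024 possible nodes the mask alone is 128 bytes, which is the stack
cost in question. The trade-off is that the heap variant adds an
allocation, and a failure path, to every call.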

> > +	s32 cpu = -EBUSY;
> > +
> > +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > +		return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags);
> > +
> > +	/*
> > +	 * If an initial node is not specified, start with the current
> > +	 * node.
> > +	 */
> > +	if (node == NUMA_NO_NODE)
> > +		node = numa_node_id();
> > +
> > +	/*
> > +	 * Traverse all nodes in order of increasing distance, starting
> > +	 * from @node.
> > +	 *
> > > +	 * This loop is O(N^2), with N being the number of NUMA nodes,
> > > +	 * which might be quite expensive in large NUMA systems. However,
> > > +	 * this complexity comes into play only when a scheduler enables
> > > +	 * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU
> > > +	 * without specifying a target NUMA node, so it shouldn't be a
> > > +	 * bottleneck in most cases.
> > +	 *
> > +	 * As a future optimization we may want to cache the list of hop
> > +	 * nodes in a per-node array, instead of actually traversing them
> > +	 * every time.
> > +	 */
> > +	for_each_numa_node(node, unvisited, N_POSSIBLE) {
> > +		cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags);
> > +		if (cpu >= 0)
> > +			break;
> > +
> > +		/*
> > +		 * Check if the search is restricted to the same core or
> > +		 * the same node.
> > +		 */
> > +		if (flags & SCX_PICK_IDLE_IN_NODE)
> > +			break;
> 
> If SCX_PICK_IDLE_IN_NODE is set, you can avoid the loop altogether, right?
> Just:
> 	if (flags & SCX_PICK_IDLE_IN_NODE)
> 		return pick_idle_cpu_from_node(cpus_allowed, node, flags);
> 
> 	for_each_numa_node(node, unvisited, N_POSSIBLE) {
> 		cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags);
> 		if (cpu >= 0)
> 			return cpu;
> 	}
> 
> Thanks,
> Yury
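
The restructured flow suggested above can be modelled in plain userspace C.
This is a sketch under stated assumptions, not the kernel implementation:
the 4-node distance table, the one-idle-CPU-per-node array, and every
*_model name are hypothetical, and a simple nearest-unvisited search stands
in for for_each_numa_node().

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

#define NR_NODES_MODEL 4

/* Hypothetical NUMA distance table (smaller means closer). */
static const int node_distance_model[NR_NODES_MODEL][NR_NODES_MODEL] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 40, 30, 20, 10 },
};

/* idle_cpu[n] is an idle CPU id on node n, or -1 if the node is busy. */
static int pick_idle_cpu_from_node_model(const int *idle_cpu, int node)
{
	return idle_cpu[node] >= 0 ? idle_cpu[node] : -EBUSY;
}

/* Return the unvisited node closest to @start, or -1 if none is left. */
static int nearest_unvisited_model(int start, unsigned int unvisited)
{
	int best = -1, best_dist = INT_MAX;

	for (int n = 0; n < NR_NODES_MODEL; n++) {
		if (!(unvisited & (1u << n)))
			continue;
		if (node_distance_model[start][n] < best_dist) {
			best_dist = node_distance_model[start][n];
			best = n;
		}
	}
	return best;
}

int pick_idle_cpu_model(const int *idle_cpu, int node, bool in_node)
{
	unsigned int unvisited = (1u << NR_NODES_MODEL) - 1;
	int n;

	assert(node >= 0 && node < NR_NODES_MODEL);

	/* Restricted search: a single lookup, no traversal needed. */
	if (in_node)
		return pick_idle_cpu_from_node_model(idle_cpu, node);

	/* Each step scans all nodes for the nearest one: O(N^2) overall. */
	while ((n = nearest_unvisited_model(node, unvisited)) >= 0) {
		int cpu = pick_idle_cpu_from_node_model(idle_cpu, n);

		if (cpu >= 0)
			return cpu;
		unvisited &= ~(1u << n);
	}
	return -EBUSY;
}
```

In this model, with nodes 0 and 1 busy and CPU 7 idle on node 2, a search
starting at node 0 tries nodes 0, 1 and 2 in that order and returns 7,
while the same search restricted to node 0 returns -EBUSY after one lookup.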
