Re: [PATCH 10/10] sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andrea Righi <arighi@nvidia.com>
To: Yury Norov <yury.norov@gmail.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	bpf@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 10/10] sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers
Date: Tue, 24 Dec 2024 10:32:48 +0100	[thread overview]
Message-ID: <Z2p_wI_YpG2Jlf3C@gpd3> (raw)
In-Reply-To: <Z2oG9-AS-2OwB7Ib@yury-ThinkPad>

On Mon, Dec 23, 2024 at 04:57:36PM -0800, Yury Norov wrote:
> On Fri, Dec 20, 2024 at 04:11:42PM +0100, Andrea Righi wrote:
> > Add the following kfunc's to provide scx schedulers direct access to
> > per-node idle cpumasks information:
> > 
> >  const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> >  const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> >  s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
> > 				int node, u64 flags)
> >  int scx_bpf_cpu_to_node(s32 cpu)
> > 
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >  kernel/sched/ext_idle.c                  | 163 ++++++++++++++++++++---
> >  tools/sched_ext/include/scx/common.bpf.h |   4 +
> >  tools/sched_ext/include/scx/compat.bpf.h |  19 +++
> >  3 files changed, 170 insertions(+), 16 deletions(-)
> > 
> > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> > index b36e93da1b75..0f8ccc1e290e 100644
> > --- a/kernel/sched/ext_idle.c
> > +++ b/kernel/sched/ext_idle.c
> > @@ -28,6 +28,60 @@ static bool check_builtin_idle_enabled(void)
> >  	return false;
> >  }
> >  
> > +static bool check_builtin_idle_per_node_enabled(void)
> > +{
> > +	if (static_branch_likely(&scx_builtin_idle_per_node))
> > +		return true;
> 
> return 0;
> 
> > +
> > +	scx_ops_error("per-node idle tracking is disabled");
> > +	return false;
> 
> return -ENOTSUP;

Ok.

> 
> > +}
> > +
> > +/*
> > + * Validate and resolve a NUMA node.
> > + *
> > + * Return the resolved node ID on success or a negative value otherwise.
> > + */
> > +static int validate_node(int node)
> > +{
> > +	if (!check_builtin_idle_per_node_enabled())
> > +		return -EINVAL;
> 
> So the node may be valid, but this validator may fail. EINVAL is a
> misleading error code for that. You need ENOTSUP.

Ok.

> 
> > +
> > +	/* If no node is specified, use the current one */
> > +	if (node == NUMA_NO_NODE)
> > +		return numa_node_id();
> > +
> > +	/* Make sure node is in a valid range */
> > +	if (node < 0 || node >= nr_node_ids) {
> > +		scx_ops_error("invalid node %d", node);
> > +		return -ENOENT;
> 
> No such file or directory? Hmm...
> 
> This should be EINVAL. I would join this one with node_possible()
> check. We probably need bpf_node_possible() or something...

Ok about EINVAL.

About bpf_node_possible() I'm not sure, it'd be convenient to have a kfunc
for the BPF code to validate a node, but then we may also need to introduce
bpf_node_online(), or even bpf_node_state(), ...?

This can be probably addressed in a separate patch.

> 
> > +	}
> > +
> > +	/* Make sure the node is part of the set of possible nodes */
> > +	if (!node_possible(node)) {
> > +		scx_ops_error("unavailable node %d", node);
> 
> Not that it's unavailable. It just doesn't exist... I'd say:
> 
> 	scx_ops_error("Non-existing node %d. The existing nodes are: %pbl",
>                       node, nodemask_pr_args(node_states[N_POSSIBLE]));
> 
> > +		return -EINVAL;
> > +	}
> 
> What if user provides offline or cpu-less nodes? Is that a normal usage?
> If not, it would be nice to print warning, or even return an error...

I think we're returning -EBUSY in this case, which might be a reasonable
error already. Triggering an scx_ops_error() seems a bit too aggressive.

> 
> > +
> > +	return node;
> > +}
> > +
> > +/*
> > + * Return the node id associated to a target idle CPU (used to determine
> > + * the proper idle cpumask).
> > + */
> > +static int idle_cpu_to_node(int cpu)
> > +{
> > +	int node;
> > +
> > +	if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > +		node = cpu_to_node(cpu);
> > +	else
> > +		node = NUMA_FLAT_NODE;
> > +
> > +	return node;
> > +}
> > +
> >  #ifdef CONFIG_SMP
> >  struct idle_cpumask {
> >  	cpumask_var_t cpu;
> > @@ -83,22 +137,6 @@ static void idle_masks_init(void)
> >  
> >  static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
> >  
> > -/*
> > - * Return the node id associated to a target idle CPU (used to determine
> > - * the proper idle cpumask).
> > - */
> > -static int idle_cpu_to_node(int cpu)
> > -{
> > -	int node;
> > -
> > -	if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > -		node = cpu_to_node(cpu);
> > -	else
> > -		node = NUMA_FLAT_NODE;
> > -
> > -	return node;
> > -}
> > -
> >  static bool test_and_clear_cpu_idle(int cpu)
> >  {
> >  	int node = idle_cpu_to_node(cpu);
> > @@ -613,6 +651,17 @@ static void reset_idle_masks(void) {}
> >   */
> >  __bpf_kfunc_start_defs();
> >  
> > +/**
> > + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to
> > + */
> > +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu)
> > +{
> > +	if (cpu < 0 || cpu >= nr_cpu_ids)
> > +		return -EINVAL;
> > +
> > +	return idle_cpu_to_node(cpu);
> > +}
> > +
> >  /**
> >   * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
> >   * @p: task_struct to select a CPU for
> > @@ -645,6 +694,28 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> >  	return prev_cpu;
> >  }
> >  
> > +/**
> > + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
> > + * per-CPU cpumask of a target NUMA node.
> > + *
> > + * NUMA_NO_NODE is interpreted as the current node.
> > + *
> > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> > + * valid, or running on a UP kernel.
> > + */
> > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> > +{
> > +	node = validate_node(node);
> > +	if (node < 0)
> > +		return cpu_none_mask;
> 
> I think I commented this in v7. This simply hides an error. You need to
> return ERR_PTR(node). And your user should check it with IS_ERR_VALUE().
> 
> This should be consistent with scx_bpf_pick_idle_cpu_node(), where you
> return an actual error.

I think I changed it... somewhere, but it looks like I missed this part. :)
Will change this as well, thanks!

> 
> > +
> > +#ifdef CONFIG_SMP
> > +	return get_idle_cpumask(node);
> > +#else
> > +	return cpu_none_mask;
> > +#endif
> > +}
> > +
> >  /**
> >   * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> >   * per-CPU cpumask.
> > @@ -664,6 +735,32 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> >  	return get_idle_cpumask(NUMA_FLAT_NODE);
> >  }
> >  
> > +/**
> > + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
> > + * per-physical-core cpumask of a target NUMA node. Can be used to determine
> > + * if an entire physical core is free.
> 
> If it goes to DOCs, it should have parameters section.

Ok.

> 
> > + *
> > + * NUMA_NO_NODE is interpreted as the current node.
> > + *
> > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> > + * valid, or running on a UP kernel.
> > + */
> > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> > +{
> > +	node = validate_node(node);
> > +	if (node < 0)
> > +		return cpu_none_mask;
> > +
> > +#ifdef CONFIG_SMP
> > +	if (sched_smt_active())
> > +		return get_idle_smtmask(node);
> > +	else
> > +		return get_idle_cpumask(node);
> > +#else
> > +	return cpu_none_mask;
> > +#endif
> > +}
> > +
> >  /**
> >   * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
> >   * per-physical-core cpumask. Can be used to determine if an entire physical
> > @@ -722,6 +819,36 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
> >  		return false;
> >  }
> >  
> > +/**
> > + * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
> > + * @cpus_allowed: Allowed cpumask
> > + * @node: target NUMA node
> > + * @flags: %SCX_PICK_IDLE_CPU_* flags
> > + *
> > + * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
> > + * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu
> > + * was found.
> 
> validate_node() returns more errors.
> 
> > + *
> > + * If @node is NUMA_NO_NODE, the search is restricted to the current NUMA
> > + * node. Otherwise, the search starts from @node and proceeds to other
> > + * online NUMA nodes in order of increasing distance (unless
> > + * SCX_PICK_IDLE_NODE is specified, in which case the search is limited to
> > + * the target @node).
> 
> Can you reorder statements, like:
> 
> Restricted to current node if NUMA_NO_NODE.
> Restricted to @node if SCX_PICK_IDLE_NODE is specified
> Otherwise ...
> 
> What if NUMA_NO_NODE + SCX_PICK_IDLE_NODE? Seems to be OK, but looks
> redundant and non-intuitive. Why not if NUMA_NO_NODE provided, start
> from current node, but not restrict with it?

The more I think about NUMA_NO_NODE behavior, the more I'm convinved we
should just return -EBUSY (or a similar error). Implicitly assuming
NUMA_NO_NODE == current node seems a bit confusing in some cases.

Moreover, BPF already has the bpf_get_numa_node_id() helper, so there's
no reason to introduce this NUMA_NO_NODE == current node assumption.

> 
> > + *
> > + * Unavailable if ops.update_idle() is implemented and
> > + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set or if %SCX_OPS_KEEP_BUILTIN_IDLE is
> > + * not set.
> > + */
> > +__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(const struct cpumask *cpus_allowed,
> > +					   int node, u64 flags)
> > +{
> > +	node = validate_node(node);
> 
> Hold on! This validate_node() replaces NO_NODE with current node but
> doesn't touch flags. It means that scx_pick_idle_cpu() will never see
> NO_NODE, and will not be able to restrict to current node. The comment
> above is incorrect, right?

Yes, the comment is incorrect, the logic here was to simply replace
NUMA_NO_NODE with current node, the restriction is only determined by
SCX_PICK_IDLE_NODE.

However, as mentioned above, I think we should just get rid of this
NO_NODE == current node assumption, this is yet another place where it adds
unnecessary complexity and it makes the code harder to follow.

Thanks,
-Andrea

     prev parent reply	other threads:[~2024-12-24  9:33 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-20 15:11 ` [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
2024-12-23 21:18   ` Yury Norov
2024-12-24  7:54     ` Andrea Righi
2024-12-24 17:33       ` Yury Norov
2024-12-20 15:11 ` [PATCH 02/10] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
2024-12-24 21:21   ` Tejun Heo
2024-12-20 15:11 ` [PATCH 03/10] sched_ext: idle: introduce check_builtin_idle_enabled() helper Andrea Righi
2024-12-20 15:11 ` [PATCH 04/10] sched_ext: idle: use assign_cpu() to update the idle cpumask Andrea Righi
2024-12-23 22:26   ` Yury Norov
2024-12-20 15:11 ` [PATCH 05/10] sched_ext: idle: clarify comments Andrea Righi
2024-12-23 22:28   ` Yury Norov
2024-12-20 15:11 ` [PATCH 06/10] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
2024-12-20 15:11 ` [PATCH 07/10] sched_ext: Introduce per-node idle cpumasks Andrea Righi
2024-12-24  4:05   ` Yury Norov
2024-12-24  8:18     ` Andrea Righi
2024-12-24 17:59       ` Yury Norov
2024-12-20 15:11 ` [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE Andrea Righi
2024-12-24  2:48   ` Yury Norov
2024-12-24  3:53     ` Yury Norov
2024-12-24  8:37       ` Andrea Righi
2024-12-24 18:15         ` Yury Norov
2024-12-24  8:22     ` Andrea Righi
2024-12-24 21:29       ` Tejun Heo
2024-12-20 15:11 ` [PATCH 09/10] sched_ext: idle: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-23 23:39   ` Yury Norov
2024-12-24  8:58     ` Andrea Righi
2024-12-20 15:11 ` [PATCH 10/10] sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-24  0:57   ` Yury Norov
2024-12-24  9:32     ` Andrea Righi [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z2p_wI_YpG2Jlf3C@gpd3 \
    --to=arighi@nvidia.com \
    --cc=bpf@vger.kernel.org \
    --cc=bsegall@google.com \
    --cc=changwoo@igalia.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    --cc=void@manifault.com \
    --cc=vschneid@redhat.com \
    --cc=yury.norov@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.