From: Andrea Righi <arighi@nvidia.com>
To: Yury Norov <yury.norov@gmail.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers
Date: Wed, 11 Dec 2024 21:55:36 +0100 [thread overview]
Message-ID: <Z1n8SC6Z298E28Zg@gpd3> (raw)
In-Reply-To: <Z1n6WgfTBBBIH2of@yury-ThinkPad>
On Wed, Dec 11, 2024 at 12:47:22PM -0800, Yury Norov wrote:
> On Wed, Dec 11, 2024 at 09:20:56PM +0100, Andrea Righi wrote:
> > On Wed, Dec 11, 2024 at 09:43:26AM -0800, Yury Norov wrote:
> > > On Mon, Dec 09, 2024 at 11:40:58AM +0100, Andrea Righi wrote:
> > > > Add the following kfunc's to provide scx schedulers direct access to
> > > > per-node idle cpumasks information:
> > > >
> > > > const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> > > > const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> > > > s32 scx_bpf_pick_idle_cpu_node(int node,
> > > > const cpumask_t *cpus_allowed, u64 flags)
> > > > int scx_bpf_cpu_to_node(s32 cpu)
> > > >
> > > > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > > > ---
> > > > kernel/sched/ext.c | 96 +++++++++++++++++++++++-
> > > > tools/sched_ext/include/scx/common.bpf.h | 4 +
> > > > tools/sched_ext/include/scx/compat.bpf.h | 19 +++++
> > > > 3 files changed, 117 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > > > index d0d57323bcfc..ea7cc481782c 100644
> > > > --- a/kernel/sched/ext.c
> > > > +++ b/kernel/sched/ext.c
> > > > @@ -433,6 +433,7 @@ struct sched_ext_ops {
> > > > * - scx_bpf_select_cpu_dfl()
> > > > * - scx_bpf_test_and_clear_cpu_idle()
> > > > * - scx_bpf_pick_idle_cpu()
> > > > + * - scx_bpf_pick_idle_cpu_node()
> > > > *
> > > > * The user also must implement ops.select_cpu() as the default
> > > > * implementation relies on scx_bpf_select_cpu_dfl().
> > > > @@ -955,6 +956,8 @@ static struct cpumask *get_idle_cpumask_node(int node)
> > > > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > > > return idle_masks[0]->cpu;
> > > >
> > > > + if (node < 0 || node >= num_possible_nodes())
> > > > + return NULL;
> > >
> > > 1. This sanity should go before the check above.
> > > 2. In-kernel users don't need to do sanity checks. BPF users should,
> > > but for them you need to move it in BPF wrapper.
> > > 3. -1 is a valid parameter, means NUMA_NO_NODE.
> >
> > Ok, but what would you return with NUMA_NO_NODE, in theory we should return
> > a global system-wide cpumask, that doesn't exist with the per-node
> > cpumasks. Maybe just return cpu_none_mask? That's what I've done in the
> > next version, that seems safer than returning NULL.
>
> To begin with, you can just disallow NUMA_NO_NODE for this interface.
> Put a corresponding comment or warning, and you're done.
Ok.
>
> On the other hand, you can treat it as 'I don't care' hint and return
> a cpumask for any node that has idle CPUs.
>
> Returning cpu_none_mask?.. OK, it's possible, but what does that
> bring? User will have to traverse empty mask just to find nothing.
> I'd rather disallow NUMA_NO_NODE than returning something useless.
I like the idea of returning a "random" node, or maybe
idle_masks[numa_node_id()]?
>
> > > > return idle_masks[node]->cpu;
> > > > }
> > > >
> > > > @@ -963,6 +966,8 @@ static struct cpumask *get_idle_smtmask_node(int node)
> > > > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > > > return idle_masks[0]->smt;
> > > >
> > > > + if (node < 0 || node >= num_possible_nodes())
> > > > + return NULL;
> > > > return idle_masks[node]->smt;
> > > > }
> > > >
> > > > @@ -7469,6 +7474,16 @@ __bpf_kfunc u32 scx_bpf_nr_cpu_ids(void)
> > > > return nr_cpu_ids;
> > > > }
> > > >
> > > > +/**
> > > > + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to
> > > > + */
> > > > +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu)
> > > > +{
> > > > + if (cpu < 0 || cpu >= nr_cpu_ids)
> > > > + return -EINVAL;
> > > > + return cpu_to_node(cpu);
> > > > +}
> > >
> > > I believe this wrapper should be declared somewhere in
> > > kernel/sched/topology.c, and better be a separate patch.
> >
> > Maybe kernel/bpf/helpers.c? And name it bpf_cpu_to_node()?
>
> Sure, even better
>
> > > > +
> > > > /**
> > > > * scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask
> > > > */
> > > > @@ -7499,11 +7514,32 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
> > > > */
> > > > }
> > > >
> > > > +/**
> > > > + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
> > > > + * per-CPU cpumask of a target NUMA node.
> > > > + *
> > > > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> > > > + * valid, or running on a UP kernel.
> > > > + */
> > > > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> > > > +{
> > > > + if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> > > > + scx_ops_error("built-in idle tracking is disabled");
> > > > + return cpu_none_mask;
> > > > + }
> > > > + if (!static_branch_likely(&scx_builtin_idle_per_node)) {
> > > > + scx_ops_error("per-node idle tracking is disabled");
> > > > + return cpu_none_mask;
> > > > + }
> > >
> > > Nub question: is it possible that scx_builtin_idle_per_node is enable,
> > > but scx_builtin_idle_enabled not? From my naive perspective, we can't
> > > enable per-node idle masks without enabling general idle masks. Or I
> > > mislead it?
> >
> > In theory a BPF scheduler could set SCX_OPS_BUILTIN_IDLE_PER_NODE (without
> > SCX_OPS_KEEP_BUILTIN_IDLE) in .flags while implementing ops.update_idle().
> >
> > In this way we would have scx_builtin_idle_enabled==false and
> > scx_builtin_idle_per_node==true, which doesn't make much sense, so we
> > should probably handle this case in validate_ops() and trigger an error.
> >
> > Good catch!
> >
> > >
> > > > +
> > > > + return get_idle_cpumask_node(node) ? : cpu_none_mask;
> > > > +}
> > > > /**
> > > > * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> > > > * per-CPU cpumask of the current NUMA node.
> > > > *
> > > > - * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
> > > > + * Returns an emtpy cpumask if idle tracking is not enabled, or running on a UP
> > > > + * kernel.
> > > > */
> > > > __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> > > > {
> > > > @@ -7515,12 +7551,35 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> > > > return get_curr_idle_cpumask();
> > > > }
> > > >
> > > > +/**
> > > > + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
> > > > + * per-physical-core cpumask of a target NUMA node. Can be used to determine
> > > > + * if an entire physical core is free.
> > > > + *
> > > > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> > > > + * valid, or running on a UP kernel.
> > > > + */
> > > > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> > > > +{
> > > > + if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> > > > + scx_ops_error("built-in idle tracking is disabled");
> > > > + return cpu_none_mask;
> > > > + }
> > > > + if (!static_branch_likely(&scx_builtin_idle_per_node)) {
> > > > + scx_ops_error("per-node idle tracking is disabled");
> > > > + return cpu_none_mask;
> > > > + }
> > >
> > > Can you add vertical spacing between blocks?
> >
> > You mean a blank between the two blocks, right?
>
> Yes
>
> > Anyway, ...
> >
> > >
> > > Also, because you use this construction more than once, I think it
> > > makes sense to make it a helper.
> >
> > With a proper error check in validate_ops() we can just get rid of the
> > scx_builtin_idle_enabled block and simply check scx_builtin_idle_per_node.
>
> But still, having a helper is better than opencoding the same 4-lines
> pattern again and again
Yep, makes sense. Will do that.
-Andrea
next prev parent reply other threads:[~2024-12-11 20:55 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-09 10:40 [PATCHSET v5 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-09 10:40 ` [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
2024-12-09 19:32 ` Yury Norov
2024-12-09 20:40 ` Andrea Righi
2024-12-10 0:14 ` Andrea Righi
2024-12-10 2:10 ` Yury Norov
2024-12-14 6:05 ` Andrea Righi
2024-12-11 17:46 ` Yury Norov
2024-12-09 10:40 ` [PATCH 2/4] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-11 8:05 ` Changwoo Min
2024-12-11 12:22 ` Andrea Righi
2024-12-09 10:40 ` [PATCH 3/4] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
2024-12-11 18:21 ` Yury Norov
2024-12-11 19:59 ` Andrea Righi
2024-12-09 10:40 ` [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-11 17:43 ` Yury Norov
2024-12-11 20:20 ` Andrea Righi
2024-12-11 20:47 ` Yury Norov
2024-12-11 20:55 ` Andrea Righi [this message]
-- strict thread matches above, loose matches on Subject: below --
2024-12-05 21:00 [PATCHSET v4 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-05 21:00 ` [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z1n8SC6Z298E28Zg@gpd3 \
--to=arighi@nvidia.com \
--cc=changwoo@igalia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=tj@kernel.org \
--cc=void@manifault.com \
--cc=yury.norov@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.