From: Yury Norov <yury.norov@gmail.com>
To: Andrea Righi <arighi@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>,
bpf@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 10/10] sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers
Date: Mon, 23 Dec 2024 16:57:36 -0800 [thread overview]
Message-ID: <Z2oG9-AS-2OwB7Ib@yury-ThinkPad> (raw)
In-Reply-To: <20241220154107.287478-11-arighi@nvidia.com>
On Fri, Dec 20, 2024 at 04:11:42PM +0100, Andrea Righi wrote:
> Add the following kfunc's to provide scx schedulers direct access to
> per-node idle cpumasks information:
>
> const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
> int node, u64 flags)
> int scx_bpf_cpu_to_node(s32 cpu)
>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/ext_idle.c | 163 ++++++++++++++++++++---
> tools/sched_ext/include/scx/common.bpf.h | 4 +
> tools/sched_ext/include/scx/compat.bpf.h | 19 +++
> 3 files changed, 170 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index b36e93da1b75..0f8ccc1e290e 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -28,6 +28,60 @@ static bool check_builtin_idle_enabled(void)
> return false;
> }
>
> +static bool check_builtin_idle_per_node_enabled(void)
> +{
> + if (static_branch_likely(&scx_builtin_idle_per_node))
> + return true;
return 0;
> +
> + scx_ops_error("per-node idle tracking is disabled");
> + return false;
return -ENOTSUP;
> +}
> +
> +/*
> + * Validate and resolve a NUMA node.
> + *
> + * Return the resolved node ID on success or a negative value otherwise.
> + */
> +static int validate_node(int node)
> +{
> + if (!check_builtin_idle_per_node_enabled())
> + return -EINVAL;
So the node may be valid, but this validator may fail. EINVAL is a
misleading error code for that. You need ENOTSUP.
> +
> + /* If no node is specified, use the current one */
> + if (node == NUMA_NO_NODE)
> + return numa_node_id();
> +
> + /* Make sure node is in a valid range */
> + if (node < 0 || node >= nr_node_ids) {
> + scx_ops_error("invalid node %d", node);
> + return -ENOENT;
No such file or directory? Hmm...
This should be EINVAL. I would join this one with node_possible()
check. We probably need bpf_node_possible() or something...
> + }
> +
> + /* Make sure the node is part of the set of possible nodes */
> + if (!node_possible(node)) {
> + scx_ops_error("unavailable node %d", node);
Not that it's unavailable. It just doesn't exist... I'd say:
scx_ops_error("Non-existing node %d. The existing nodes are: %pbl",
node, nodemask_pr_args(node_states[N_POSSIBLE]));
> + return -EINVAL;
> + }
What if user provides offline or cpu-less nodes? Is that a normal usage?
If not, it would be nice to print warning, or even return an error...
> +
> + return node;
> +}
> +
> +/*
> + * Return the node id associated to a target idle CPU (used to determine
> + * the proper idle cpumask).
> + */
> +static int idle_cpu_to_node(int cpu)
> +{
> + int node;
> +
> + if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> + node = cpu_to_node(cpu);
> + else
> + node = NUMA_FLAT_NODE;
> +
> + return node;
> +}
> +
> #ifdef CONFIG_SMP
> struct idle_cpumask {
> cpumask_var_t cpu;
> @@ -83,22 +137,6 @@ static void idle_masks_init(void)
>
> static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
>
> -/*
> - * Return the node id associated to a target idle CPU (used to determine
> - * the proper idle cpumask).
> - */
> -static int idle_cpu_to_node(int cpu)
> -{
> - int node;
> -
> - if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> - node = cpu_to_node(cpu);
> - else
> - node = NUMA_FLAT_NODE;
> -
> - return node;
> -}
> -
> static bool test_and_clear_cpu_idle(int cpu)
> {
> int node = idle_cpu_to_node(cpu);
> @@ -613,6 +651,17 @@ static void reset_idle_masks(void) {}
> */
> __bpf_kfunc_start_defs();
>
> +/**
> + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to
> + */
> +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu)
> +{
> + if (cpu < 0 || cpu >= nr_cpu_ids)
> + return -EINVAL;
> +
> + return idle_cpu_to_node(cpu);
> +}
> +
> /**
> * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
> * @p: task_struct to select a CPU for
> @@ -645,6 +694,28 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> return prev_cpu;
> }
>
> +/**
> + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
> + * per-CPU cpumask of a target NUMA node.
> + *
> + * NUMA_NO_NODE is interpreted as the current node.
> + *
> + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> + * valid, or running on a UP kernel.
> + */
> +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> +{
> + node = validate_node(node);
> + if (node < 0)
> + return cpu_none_mask;
I think I commented this in v7. This simply hides an error. You need to
return ERR_PTR(node). And your user should check it with IS_ERR_VALUE().
This should be consistent with scx_bpf_pick_idle_cpu_node(), where you
return an actual error.
> +
> +#ifdef CONFIG_SMP
> + return get_idle_cpumask(node);
> +#else
> + return cpu_none_mask;
> +#endif
> +}
> +
> /**
> * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> * per-CPU cpumask.
> @@ -664,6 +735,32 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> return get_idle_cpumask(NUMA_FLAT_NODE);
> }
>
> +/**
> + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
> + * per-physical-core cpumask of a target NUMA node. Can be used to determine
> + * if an entire physical core is free.
If it goes to DOCs, it should have parameters section.
> + *
> + * NUMA_NO_NODE is interpreted as the current node.
> + *
> + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> + * valid, or running on a UP kernel.
> + */
> +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> +{
> + node = validate_node(node);
> + if (node < 0)
> + return cpu_none_mask;
> +
> +#ifdef CONFIG_SMP
> + if (sched_smt_active())
> + return get_idle_smtmask(node);
> + else
> + return get_idle_cpumask(node);
> +#else
> + return cpu_none_mask;
> +#endif
> +}
> +
> /**
> * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
> * per-physical-core cpumask. Can be used to determine if an entire physical
> @@ -722,6 +819,36 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
> return false;
> }
>
> +/**
> + * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
> + * @cpus_allowed: Allowed cpumask
> + * @node: target NUMA node
> + * @flags: %SCX_PICK_IDLE_CPU_* flags
> + *
> + * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
> + * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu
> + * was found.
validate_node() returns more errors.
> + *
> + * If @node is NUMA_NO_NODE, the search is restricted to the current NUMA
> + * node. Otherwise, the search starts from @node and proceeds to other
> + * online NUMA nodes in order of increasing distance (unless
> + * SCX_PICK_IDLE_NODE is specified, in which case the search is limited to
> + * the target @node).
Can you reorder statements, like:
Restricted to current node if NUMA_NO_NODE.
Restricted to @node if SCX_PICK_IDLE_NODE is specified
Otherwise ...
What if NUMA_NO_NODE + SCX_PICK_IDLE_NODE? Seems to be OK, but looks
redundant and non-intuitive. Why not if NUMA_NO_NODE provided, start
from current node, but not restrict with it?
> + *
> + * Unavailable if ops.update_idle() is implemented and
> + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set or if %SCX_OPS_KEEP_BUILTIN_IDLE is
> + * not set.
> + */
> +__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(const struct cpumask *cpus_allowed,
> + int node, u64 flags)
> +{
> + node = validate_node(node);
Hold on! This validate_node() replaces NO_NODE with current node but
doesn't touch flags. It means that scx_pick_idle_cpu() will never see
NO_NODE, and will not be able to restrict to current node. The comment
above is incorrect, right?
> + if (node < 0)
> + return node;
> +
> + return scx_pick_idle_cpu(cpus_allowed, node, flags);
> +}
> +
> /**
> * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
> * @cpus_allowed: Allowed cpumask
> @@ -785,11 +912,15 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
> __bpf_kfunc_end_defs();
>
> BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
> +BTF_ID_FLAGS(func, scx_bpf_cpu_to_node)
> BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
> BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
> +BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
> BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
> diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
> index 858ba1f438f6..fe0433f7c4d9 100644
> --- a/tools/sched_ext/include/scx/common.bpf.h
> +++ b/tools/sched_ext/include/scx/common.bpf.h
> @@ -63,13 +63,17 @@ u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
> u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
> void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
> u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
> +int scx_bpf_cpu_to_node(s32 cpu) __ksym __weak;
> const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
> const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
> void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak;
> +const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) __ksym __weak;
> const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym;
> +const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) __ksym __weak;
> const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym;
> void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym;
> bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
> +s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags) __ksym __weak;
> s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
> s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
> bool scx_bpf_task_running(const struct task_struct *p) __ksym;
> diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
> index d56520100a26..dfc329d5a91e 100644
> --- a/tools/sched_ext/include/scx/compat.bpf.h
> +++ b/tools/sched_ext/include/scx/compat.bpf.h
> @@ -125,6 +125,25 @@ bool scx_bpf_dispatch_vtime_from_dsq___compat(struct bpf_iter_scx_dsq *it__iter,
> false; \
> })
>
> +#define __COMPAT_scx_bpf_cpu_to_node(cpu) \
> + (bpf_ksym_exists(scx_bpf_cpu_to_node) ? \
> + scx_bpf_cpu_to_node(cpu) : 0)
> +
> +#define __COMPAT_scx_bpf_get_idle_cpumask_node(node) \
> + (bpf_ksym_exists(scx_bpf_get_idle_cpumask_node) ? \
> + scx_bpf_get_idle_cpumask_node(node) : \
> + scx_bpf_get_idle_cpumask()) \
> +
> +#define __COMPAT_scx_bpf_get_idle_smtmask_node(node) \
> + (bpf_ksym_exists(scx_bpf_get_idle_smtmask_node) ? \
> + scx_bpf_get_idle_smtmask_node(node) : \
> + scx_bpf_get_idle_smtmask())
> +
> +#define __COMPAT_scx_bpf_pick_idle_cpu_node(cpus_allowed, node, flags) \
> + (bpf_ksym_exists(scx_bpf_pick_idle_cpu_node) ? \
> + scx_bpf_pick_idle_cpu_node(cpus_allowed, node, flags) : \
> + scx_bpf_pick_idle_cpu(cpus_allowed, flags))
> +
> /*
> * Define sched_ext_ops. This may be expanded to define multiple variants for
> * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
> --
> 2.47.1
next prev parent reply other threads:[~2024-12-24 0:57 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-20 15:11 ` [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
2024-12-23 21:18 ` Yury Norov
2024-12-24 7:54 ` Andrea Righi
2024-12-24 17:33 ` Yury Norov
2024-12-20 15:11 ` [PATCH 02/10] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
2024-12-24 21:21 ` Tejun Heo
2024-12-20 15:11 ` [PATCH 03/10] sched_ext: idle: introduce check_builtin_idle_enabled() helper Andrea Righi
2024-12-20 15:11 ` [PATCH 04/10] sched_ext: idle: use assign_cpu() to update the idle cpumask Andrea Righi
2024-12-23 22:26 ` Yury Norov
2024-12-20 15:11 ` [PATCH 05/10] sched_ext: idle: clarify comments Andrea Righi
2024-12-23 22:28 ` Yury Norov
2024-12-20 15:11 ` [PATCH 06/10] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
2024-12-20 15:11 ` [PATCH 07/10] sched_ext: Introduce per-node idle cpumasks Andrea Righi
2024-12-24 4:05 ` Yury Norov
2024-12-24 8:18 ` Andrea Righi
2024-12-24 17:59 ` Yury Norov
2024-12-20 15:11 ` [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE Andrea Righi
2024-12-24 2:48 ` Yury Norov
2024-12-24 3:53 ` Yury Norov
2024-12-24 8:37 ` Andrea Righi
2024-12-24 18:15 ` Yury Norov
2024-12-24 8:22 ` Andrea Righi
2024-12-24 21:29 ` Tejun Heo
2024-12-20 15:11 ` [PATCH 09/10] sched_ext: idle: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-23 23:39 ` Yury Norov
2024-12-24 8:58 ` Andrea Righi
2024-12-20 15:11 ` [PATCH 10/10] sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-24 0:57 ` Yury Norov [this message]
2024-12-24 9:32 ` Andrea Righi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z2oG9-AS-2OwB7Ib@yury-ThinkPad \
--to=yury.norov@gmail.com \
--cc=arighi@nvidia.com \
--cc=bpf@vger.kernel.org \
--cc=bsegall@google.com \
--cc=changwoo@igalia.com \
--cc=dietmar.eggemann@arm.com \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=tj@kernel.org \
--cc=vincent.guittot@linaro.org \
--cc=void@manifault.com \
--cc=vschneid@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.