Re: [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Yury Norov <yury.norov@gmail.com>
To: Andrea Righi <arighi@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers
Date: Wed, 11 Dec 2024 09:43:26 -0800	[thread overview]
Message-ID: <Z1nPPhe_83lBTna4@yury-ThinkPad> (raw)
In-Reply-To: <20241209104632.718085-5-arighi@nvidia.com>

On Mon, Dec 09, 2024 at 11:40:58AM +0100, Andrea Righi wrote:
> Add the following kfunc's to provide scx schedulers direct access to
> per-node idle cpumasks information:
> 
>  const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
>  const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
>  s32 scx_bpf_pick_idle_cpu_node(int node,
>                                 const cpumask_t *cpus_allowed, u64 flags)
>  int scx_bpf_cpu_to_node(s32 cpu)
> 
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>  kernel/sched/ext.c                       | 96 +++++++++++++++++++++++-
>  tools/sched_ext/include/scx/common.bpf.h |  4 +
>  tools/sched_ext/include/scx/compat.bpf.h | 19 +++++
>  3 files changed, 117 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index d0d57323bcfc..ea7cc481782c 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -433,6 +433,7 @@ struct sched_ext_ops {
>  	 * - scx_bpf_select_cpu_dfl()
>  	 * - scx_bpf_test_and_clear_cpu_idle()
>  	 * - scx_bpf_pick_idle_cpu()
> +	 * - scx_bpf_pick_idle_cpu_node()
>  	 *
>  	 * The user also must implement ops.select_cpu() as the default
>  	 * implementation relies on scx_bpf_select_cpu_dfl().
> @@ -955,6 +956,8 @@ static struct cpumask *get_idle_cpumask_node(int node)
>  	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
>  		return idle_masks[0]->cpu;
>  
> +	if (node < 0 || node >= num_possible_nodes())
> +		return NULL;

1. This sanity should go before the check above.
2. In-kernel users don't need to do sanity checks. BPF users should,
   but for them you need to move it in BPF wrapper.
3. -1 is a valid parameter, means NUMA_NO_NODE. 

>  	return idle_masks[node]->cpu;
>  }
>  
> @@ -963,6 +966,8 @@ static struct cpumask *get_idle_smtmask_node(int node)
>  	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
>  		return idle_masks[0]->smt;
>  
> +	if (node < 0 || node >= num_possible_nodes())
> +		return NULL;
>  	return idle_masks[node]->smt;
>  }
>  
> @@ -7469,6 +7474,16 @@ __bpf_kfunc u32 scx_bpf_nr_cpu_ids(void)
>  	return nr_cpu_ids;
>  }
>  
> +/**
> + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to
> + */
> +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu)
> +{
> +	if (cpu < 0 || cpu >= nr_cpu_ids)
> +		return -EINVAL;
> +	return cpu_to_node(cpu);
> +}

I believe this wrapper should be declared somewhere in
kernel/sched/topology.c, and better be a separate patch.

> +
>  /**
>   * scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask
>   */
> @@ -7499,11 +7514,32 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
>  	 */
>  }
>  
> +/**
> + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
> + * per-CPU cpumask of a target NUMA node.
> + *
> + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> + * valid, or running on a UP kernel.
> + */
> +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> +{
> +	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> +		scx_ops_error("built-in idle tracking is disabled");
> +		return cpu_none_mask;
> +	}
> +	if (!static_branch_likely(&scx_builtin_idle_per_node)) {
> +		scx_ops_error("per-node idle tracking is disabled");
> +		return cpu_none_mask;
> +	}

Nub question: is it possible that scx_builtin_idle_per_node is enable,
but scx_builtin_idle_enabled not? From my naive perspective, we can't
enable per-node idle masks without enabling general idle masks. Or I
mislead it?

> +
> +	return get_idle_cpumask_node(node) ? : cpu_none_mask;
> +}
>  /**
>   * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
>   * per-CPU cpumask of the current NUMA node.
>   *
> - * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
> + * Returns an emtpy cpumask if idle tracking is not enabled, or running on a UP
> + * kernel.
>   */
>  __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
>  {
> @@ -7515,12 +7551,35 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
>  	return get_curr_idle_cpumask();
>  }
>  
> +/**
> + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
> + * per-physical-core cpumask of a target NUMA node. Can be used to determine
> + * if an entire physical core is free.
> + *
> + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> + * valid, or running on a UP kernel.
> + */
> +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> +{
> +	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> +		scx_ops_error("built-in idle tracking is disabled");
> +		return cpu_none_mask;
> +	}
> +	if (!static_branch_likely(&scx_builtin_idle_per_node)) {
> +		scx_ops_error("per-node idle tracking is disabled");
> +		return cpu_none_mask;
> +	}

Can you add vertical spacing between blocks?

Also, because you use this construction more than once, I think it
makes sense to make it a helper.

> +
> +	return get_idle_smtmask_node(node) ? : cpu_none_mask;
> +}
> +
>  /**
>   * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
>   * per-physical-core cpumask of the current NUMA node. Can be used to determine
>   * if an entire physical core is free.
>   *
> - * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
> + * Returns an empty cumask if idle tracking is not enabled, or running on a UP
> + * kernel.
>   */
>  __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
>  {
> @@ -7569,6 +7628,35 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
>  		return false;
>  }
>  
> +/**
> + * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
> + * @node: target NUMA node
> + * @cpus_allowed: Allowed cpumask
> + * @flags: %SCX_PICK_IDLE_CPU_* flags
> + *
> + * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
> + * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu
> + * was found.
> + *
> + * Unavailable if ops.update_idle() is implemented and
> + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set or if %SCX_OPS_KEEP_BUILTIN_IDLE is
> + * not set.
> + */
> +__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(int node, const struct cpumask *cpus_allowed,
> +				      u64 flags)
> +{

Sanity checks here?

> +	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> +		scx_ops_error("built-in idle tracking is disabled");
> +		return -EBUSY;
> +	}
> +	if (!static_branch_likely(&scx_builtin_idle_per_node)) {
> +		scx_ops_error("per-node idle tracking is disabled");
> +		return -EBUSY;
> +	}
> +
> +	return scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> +}
> +
>  /**
>   * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
>   * @cpus_allowed: Allowed cpumask
> @@ -7705,14 +7793,18 @@ BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
>  BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
>  BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
>  BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
> +BTF_ID_FLAGS(func, scx_bpf_cpu_to_node)
>  BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
>  BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
>  BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
>  BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE)
>  BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE)
>  BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
>  BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
>  BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
>  BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
>  BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
>  BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
> diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
> index 625f5b046776..9bbf6d5083b5 100644
> --- a/tools/sched_ext/include/scx/common.bpf.h
> +++ b/tools/sched_ext/include/scx/common.bpf.h
> @@ -59,14 +59,18 @@ u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
>  u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
>  void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
>  u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
> +int scx_bpf_cpu_to_node(s32 cpu) __ksym __weak;
>  const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
>  const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
>  void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak;
>  const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym;
> +const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) __ksym __weak;
>  const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym;
> +const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) __ksym __weak;
>  void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym;
>  bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
>  s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
> +s32 scx_bpf_pick_idle_cpu_node(int node, const cpumask_t *cpus_allowed, u64 flags) __ksym __weak;
>  s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
>  bool scx_bpf_task_running(const struct task_struct *p) __ksym;
>  s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
> diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
> index d56520100a26..587650490743 100644
> --- a/tools/sched_ext/include/scx/compat.bpf.h
> +++ b/tools/sched_ext/include/scx/compat.bpf.h
> @@ -125,6 +125,25 @@ bool scx_bpf_dispatch_vtime_from_dsq___compat(struct bpf_iter_scx_dsq *it__iter,
>  	false;									\
>  })
>  
> +#define __COMPAT_scx_bpf_cpu_to_node(cpu)                                       \
> +	(bpf_ksym_exists(scx_bpf_cpu_to_node) ?                                 \
> +	 scx_bpf_cpu_to_node(cpu) : 0)
> +
> +#define __COMPAT_scx_bpf_get_idle_cpumask_node(node)                            \
> +	(bpf_ksym_exists(scx_bpf_get_idle_cpumask_node) ?                       \
> +	 scx_bpf_get_idle_cpumask_node(node) :                                  \
> +	 scx_bpf_get_idle_cpumask())                                            \
> +
> +#define __COMPAT_scx_bpf_get_idle_smtmask_node(node)                            \
> +	(bpf_ksym_exists(scx_bpf_get_idle_smtmask_node) ?                       \
> +	 scx_bpf_get_idle_smtmask_node(node) :                                  \
> +	 scx_bpf_get_idle_smtmask())                                            \
> +
> +#define __COMPAT_scx_bpf_pick_idle_cpu_node(node, cpus_allowed, flags)          \
> +	(bpf_ksym_exists(scx_bpf_pick_idle_cpu_node) ?                          \
> +	 scx_bpf_pick_idle_cpu_node(node, cpus_allowed, flags) :                \
> +	 scx_bpf_pick_idle_cpu(cpus_allowed, flags))
> +
>  /*
>   * Define sched_ext_ops. This may be expanded to define multiple variants for
>   * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
> -- 
> 2.47.1

next prev parent reply	other threads:[~2024-12-11 17:43 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-09 10:40 [PATCHSET v5 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-09 10:40 ` [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
2024-12-09 19:32   ` Yury Norov
2024-12-09 20:40     ` Andrea Righi
2024-12-10  0:14     ` Andrea Righi
2024-12-10  2:10       ` Yury Norov
2024-12-14  6:05         ` Andrea Righi
2024-12-11 17:46   ` Yury Norov
2024-12-09 10:40 ` [PATCH 2/4] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-11  8:05   ` Changwoo Min
2024-12-11 12:22     ` Andrea Righi
2024-12-09 10:40 ` [PATCH 3/4] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
2024-12-11 18:21   ` Yury Norov
2024-12-11 19:59     ` Andrea Righi
2024-12-09 10:40 ` [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-11 17:43   ` Yury Norov [this message]
2024-12-11 20:20     ` Andrea Righi
2024-12-11 20:47       ` Yury Norov
2024-12-11 20:55         ` Andrea Righi
  -- strict thread matches above, loose matches on Subject: below --
2024-12-05 21:00 [PATCHSET v4 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-05 21:00 ` [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z1nPPhe_83lBTna4@yury-ThinkPad \
    --to=yury.norov@gmail.com \
    --cc=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.