Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Yury Norov <yury.norov@gmail.com>
To: Andrea Righi <arighi@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks
Date: Fri, 20 Dec 2024 08:48:55 -0800	[thread overview]
Message-ID: <Z2Wf96x8PTNq8efH@yury-ThinkPad> (raw)
In-Reply-To: <20241217094156.577262-4-arighi@nvidia.com>

On Tue, Dec 17, 2024 at 10:32:28AM +0100, Andrea Righi wrote:
> Using a single global idle mask can lead to inefficiencies and a lot of
> stress on the cache coherency protocol on large systems with multiple
> NUMA nodes, since all the CPUs can create a really intense read/write
> activity on the single global cpumask.
> 
> Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> to improve scalability and performance on large systems.
> 
> The concept is that each cpumask will track only the idle CPUs within
> its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> In this way concurrent access to the idle cpumask will be restricted
> within each NUMA node.
> 
> NOTE: with per-node idle cpumasks enabled scx_bpf_get_idle_cpu/smtmask()
> are returning the cpumask of the current NUMA node, instead of a single
> cpumask for all the CPUs.
> 
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>  kernel/sched/ext.c | 281 +++++++++++++++++++++++++++++++++------------
>  1 file changed, 209 insertions(+), 72 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index a17abd2df4d4..d4666db4a212 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -894,6 +894,7 @@ static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
>  #ifdef CONFIG_SMP
>  static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
>  static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
> +static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
>  #endif
>  
>  static struct static_key_false scx_has_op[SCX_OPI_END] =
> @@ -937,11 +938,82 @@ static struct delayed_work scx_watchdog_work;
>  #define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
>  #endif
>  
> -static struct {
> +struct idle_cpumask {
>  	cpumask_var_t cpu;
>  	cpumask_var_t smt;
> -} idle_masks CL_ALIGNED_IF_ONSTACK;
> +};
> +
> +/*
> + * Make sure a NUMA node is in a valid range.
> + */
> +static int validate_node(int node)
> +{
> +	/* If no node is specified, return the current one */
> +	if (node == NUMA_NO_NODE)
> +		return numa_node_id();
> +
> +	/* Make sure node is in the range of possible nodes */
> +	if (node < 0 || node >= num_possible_nodes())
> +		return -EINVAL;
> +
> +	return node;
> +}

This is needed in BPF code, right? Kernel users should always provide
correct parameters.

> +/*
> + * cpumasks to track idle CPUs within each NUMA node.
> + *
> + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> + * from node 0 is used to track all idle CPUs system-wide.
> + */
> +static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
> +
> +static struct cpumask *get_idle_mask_node(int node, bool smt)

Like Rasmus said in another thread, this 'get' prefix looks weird.

  > +/**
  > + * get_parity8 - get the parity of an u8 value
  
  I know it's prototypical bikeshedding, but what's with the "get_"
  prefix? Certainly the purpose of any pure function is to "get" the
  result of some computation. We don't have "get_strlen()".
  
  Please either just "parity8", or if you want to emphasize that this
  belongs in the bit-munging family, "bit_parity8".

So maybe idle_nodes(), idle_smt_nodes() and so on?

> +{
> +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> +		return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
> +
> +	node = validate_node(node);
> + 	if (node < 0)
> +		return cpu_none_mask;

cpu_none_mask is a valid pointer. How should user distinguish invalid
parameter and empty nodemask? You should return -EINVAL.

Anyways...

This would make a horrid mess. Imagine I'm your user, and I'm
experimenting with this idle feature. I decided to begin with
per-node idle masks disabled and call it for example like:

        get_idle_mask_node(random(), true)

And it works! I get all my idle CPUs just well. So I'm happily build
my complex and huge codebase on top of it.

Then one day I decide to enable those fancy per-node idle masks
because you said they unload interconnect and whatever... At that
point what I want is to click a button just to collect some perf
numbers. So I do that, and I find all my codebase coredumped in some
random place... Because per-node idle masks basic functions
simply have different interface: they disallow node = random() as
parameter, while global idle mask is OK with it.

> +
> +	return smt ? idle_masks[node]->smt : idle_masks[node]->cpu;
> +}
> +
> +static struct cpumask *get_idle_cpumask_node(int node)

inline or __always_inline?

> +{
> +	return get_idle_mask_node(node, false);
> +}
> +
> +static struct cpumask *get_idle_smtmask_node(int node)
> +{
> +	return get_idle_mask_node(node, true);
> +}
> +
> +static void idle_masks_init(void)
> +{
> +	int node;
> +
> +	idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
> +	BUG_ON(!idle_masks);
> +
> +	for_each_node_state(node, N_POSSIBLE) {
> +		idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
> +		BUG_ON(!idle_masks[node]);
> +
> +		BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
> +		BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
> +	}
> +}
> +#else	/* !CONFIG_SMP */
> +static struct cpumask *get_idle_cpumask_node(int node)
> +{
> +	return cpu_none_mask;
> +}
>  
> +static struct cpumask *get_idle_smtmask_node(int node)
> +{
> +	return cpu_none_mask;
> +}
>  #endif	/* CONFIG_SMP */
>  
>  /* for %SCX_KICK_WAIT */
> @@ -3173,6 +3245,9 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
>  
>  static bool test_and_clear_cpu_idle(int cpu)
>  {
> +	int node = cpu_to_node(cpu);
> +	struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> +
>  #ifdef CONFIG_SCHED_SMT
>  	/*
>  	 * SMT mask should be cleared whether we can claim @cpu or not. The SMT
> @@ -3181,29 +3256,34 @@ static bool test_and_clear_cpu_idle(int cpu)
>  	 */
>  	if (sched_smt_active()) {
>  		const struct cpumask *smt = cpu_smt_mask(cpu);
> +		struct cpumask *idle_smt = get_idle_smtmask_node(node);
>  
>  		/*
>  		 * If offline, @cpu is not its own sibling and
>  		 * scx_pick_idle_cpu() can get caught in an infinite loop as
> -		 * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> -		 * is eventually cleared.
> +		 * @cpu is never cleared from the idle SMT mask. Ensure that
> +		 * @cpu is eventually cleared.
> +		 *
> +		 * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
> +		 * reduce memory writes, which may help alleviate cache
> +		 * coherence pressure.
>  		 */
> -		if (cpumask_intersects(smt, idle_masks.smt))
> -			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> -		else if (cpumask_test_cpu(cpu, idle_masks.smt))
> -			__cpumask_clear_cpu(cpu, idle_masks.smt);
> +		if (cpumask_intersects(smt, idle_smt))
> +			cpumask_andnot(idle_smt, idle_smt, smt);
> +		else if (cpumask_test_cpu(cpu, idle_smt))
> +			__cpumask_clear_cpu(cpu, idle_smt);
>  	}
>  #endif
> -	return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> +	return cpumask_test_and_clear_cpu(cpu, idle_cpu);
>  }
>  
> -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> +static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
>  {
>  	int cpu;
>  
>  retry:
>  	if (sched_smt_active()) {
> -		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> +		cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
>  		if (cpu < nr_cpu_ids)
>  			goto found;
>  
> @@ -3211,7 +3291,7 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
>  			return -EBUSY;
>  	}
>  
> -	cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> +	cpu = cpumask_any_and_distribute(get_idle_cpumask_node(node), cpus_allowed);
>  	if (cpu >= nr_cpu_ids)
>  		return -EBUSY;
>  
> @@ -3220,6 +3300,40 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
>  		return cpu;
>  	else
>  		goto retry;
> +
> +}
> +
> +static s32
> +scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)

We have a sort of convention here. If you pick a cpu based on NUMA
distances, the 'numa' should be a 2nd prefix after a subsystem where
it's used. Refer for example:

        sched_numa_find_nth_cpu()
        sched_numa_hop_mask()
        for_each_numa_hop_mask()

So I'd name it scx_numa_idle_cpu()

> +{
> +	nodemask_t hop_nodes = NODE_MASK_NONE;
> +	int start_node = cpu_to_node(prev_cpu);
> +	s32 cpu = -EBUSY;
> +
> +	/*
> +	 * Traverse all online nodes in order of increasing distance,
> +	 * starting from prev_cpu's node.
> +	 */
> +	rcu_read_lock();

RCU is a leftover from for_each_numa_hop_mask(), I guess?

> +	for_each_numa_hop_node(node, start_node, hop_nodes, N_ONLINE) {
> +		cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> +		if (cpu >= 0)
> +			break;
> +	}
> +	rcu_read_unlock();
> +
> +	return cpu;
> +}
> +
> +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> +{
> +	/*
> +	 * Only node 0 is used if per-node idle cpumasks are disabled.
> +	 */
> +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> +		return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags);
> +
> +	return scx_pick_idle_cpu_numa(cpus_allowed, prev_cpu, flags);
>  }
>  
>  /*
> @@ -3339,7 +3453,7 @@ static bool llc_numa_mismatch(void)
>   * CPU belongs to a single LLC domain, and that each LLC domain is entirely
>   * contained within a single NUMA node.
>   */
> -static void update_selcpu_topology(void)
> +static void update_selcpu_topology(struct sched_ext_ops *ops)
>  {
>  	bool enable_llc = false, enable_numa = false;
>  	unsigned int nr_cpus;
> @@ -3360,6 +3474,14 @@ static void update_selcpu_topology(void)
>  	if (nr_cpus > 0) {
>  		if (nr_cpus < num_online_cpus())
>  			enable_llc = true;
> +		/*
> +		 * No need to enable LLC optimization if the LLC domains are
> +		 * perfectly overlapping with the NUMA domains when per-node
> +		 * cpumasks are enabled.
> +		 */
> +		if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
> +		    !llc_numa_mismatch())
> +			enable_llc = false;
>  		pr_debug("sched_ext: LLC=%*pb weight=%u\n",
>  			 cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
>  	}
> @@ -3395,6 +3517,14 @@ static void update_selcpu_topology(void)
>  		static_branch_enable_cpuslocked(&scx_selcpu_topo_numa);
>  	else
>  		static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
> +
> +	/*
> +	 * Check if we need to enable per-node cpumasks.
> +	 */
> +	if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
> +		static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
> +	else
> +		static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
>  }
>  
>  /*
> @@ -3415,6 +3545,8 @@ static void update_selcpu_topology(void)
>   * 4. Pick a CPU within the same NUMA node, if enabled:
>   *   - choose a CPU from the same NUMA node to reduce memory access latency.
>   *
> + * 5. Pick any idle CPU usable by the task.
> + *

I would make it a separate patch, or even send it independently. It
doesn't look related to the series...

>   * Step 3 and 4 are performed only if the system has, respectively, multiple
>   * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
>   * scx_selcpu_topo_numa).
> @@ -3427,6 +3559,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  {
>  	const struct cpumask *llc_cpus = NULL;
>  	const struct cpumask *numa_cpus = NULL;
> +	int node = cpu_to_node(prev_cpu);
>  	s32 cpu;
>  
>  	*found = false;
> @@ -3484,9 +3617,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  		 * piled up on it even if there is an idle core elsewhere on
>  		 * the system.
>  		 */
> -		if (!cpumask_empty(idle_masks.cpu) &&
> -		    !(current->flags & PF_EXITING) &&
> -		    cpu_rq(cpu)->scx.local_dsq.nr == 0) {
> +		if (!(current->flags & PF_EXITING) &&
> +		    cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
> +		    !cpumask_empty(get_idle_cpumask_node(cpu_to_node(cpu)))) {
>  			if (cpumask_test_cpu(cpu, p->cpus_ptr))
>  				goto cpu_found;
>  		}
> @@ -3500,7 +3633,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  		/*
>  		 * Keep using @prev_cpu if it's part of a fully idle core.
>  		 */
> -		if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
> +		if (cpumask_test_cpu(prev_cpu, get_idle_smtmask_node(node)) &&
>  		    test_and_clear_cpu_idle(prev_cpu)) {
>  			cpu = prev_cpu;
>  			goto cpu_found;
> @@ -3510,7 +3643,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  		 * Search for any fully idle core in the same LLC domain.
>  		 */
>  		if (llc_cpus) {
> -			cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
> +			cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, SCX_PICK_IDLE_CORE);
>  			if (cpu >= 0)
>  				goto cpu_found;
>  		}
> @@ -3519,7 +3652,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  		 * Search for any fully idle core in the same NUMA node.
>  		 */
>  		if (numa_cpus) {
> -			cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
> +			cpu = scx_pick_idle_cpu_from_node(node, numa_cpus, SCX_PICK_IDLE_CORE);
>  			if (cpu >= 0)
>  				goto cpu_found;
>  		}
> @@ -3527,7 +3660,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  		/*
>  		 * Search for any full idle core usable by the task.
>  		 */
> -		cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
> +		cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, SCX_PICK_IDLE_CORE);
>  		if (cpu >= 0)
>  			goto cpu_found;
>  	}
> @@ -3544,7 +3677,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  	 * Search for any idle CPU in the same LLC domain.
>  	 */
>  	if (llc_cpus) {
> -		cpu = scx_pick_idle_cpu(llc_cpus, 0);
> +		cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, 0);
>  		if (cpu >= 0)
>  			goto cpu_found;
>  	}
> @@ -3553,7 +3686,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  	 * Search for any idle CPU in the same NUMA node.
>  	 */
>  	if (numa_cpus) {
> -		cpu = scx_pick_idle_cpu(numa_cpus, 0);
> +		cpu = scx_pick_idle_cpu_from_node(node, numa_cpus, 0);
>  		if (cpu >= 0)
>  			goto cpu_found;
>  	}
> @@ -3561,7 +3694,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  	/*
>  	 * Search for any idle CPU usable by the task.
>  	 */
> -	cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
> +	cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, 0);
>  	if (cpu >= 0)
>  		goto cpu_found;
>  
> @@ -3643,17 +3776,33 @@ static void set_cpus_allowed_scx(struct task_struct *p,
>  
>  static void reset_idle_masks(void)
>  {
> +	int node;
> +
> +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> +		cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
> +		cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
> +		return;
> +	}
> +
>  	/*
>  	 * Consider all online cpus idle. Should converge to the actual state
>  	 * quickly.
>  	 */
> -	cpumask_copy(idle_masks.cpu, cpu_online_mask);
> -	cpumask_copy(idle_masks.smt, cpu_online_mask);
> +	for_each_node_state(node, N_POSSIBLE) {

for_each_node(node)

If the comment above is still valid, you don't need to visit every
possible node. You need to visit only N_ONLINE, or even N_CPU nodes.

This adds to the discussion in patch #1. Taking care of CPU-less nodes
is useless for the scheduler purposes. And if you don't even initialize
them, then in for_each_numa_hop_node() you should skip those nodes.

> +		const struct cpumask *node_mask = cpumask_of_node(node);
> +		struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> +		struct cpumask *idle_smt = get_idle_smtmask_node(node);
> +
> +		cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> +		cpumask_copy(idle_smt, idle_cpu);
> +	}
>  }
>  
>  void __scx_update_idle(struct rq *rq, bool idle)
>  {
>  	int cpu = cpu_of(rq);
> +	int node = cpu_to_node(cpu);
> +	struct cpumask *idle_cpu = get_idle_cpumask_node(node);
>  
>  	if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
>  		SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
> @@ -3661,27 +3810,25 @@ void __scx_update_idle(struct rq *rq, bool idle)
>  			return;
>  	}
>  
> -	if (idle)
> -		cpumask_set_cpu(cpu, idle_masks.cpu);
> -	else
> -		cpumask_clear_cpu(cpu, idle_masks.cpu);
> +	assign_cpu(cpu, idle_cpu, idle);

Can you also make it a separate preparation patch?

>  
>  #ifdef CONFIG_SCHED_SMT
>  	if (sched_smt_active()) {
>  		const struct cpumask *smt = cpu_smt_mask(cpu);
> +		struct cpumask *idle_smt = get_idle_smtmask_node(node);
>  
>  		if (idle) {
>  			/*
> -			 * idle_masks.smt handling is racy but that's fine as
> -			 * it's only for optimization and self-correcting.
> +			 * idle_smt handling is racy but that's fine as it's
> +			 * only for optimization and self-correcting.
>  			 */
>  			for_each_cpu(cpu, smt) {
> -				if (!cpumask_test_cpu(cpu, idle_masks.cpu))
> +				if (!cpumask_test_cpu(cpu, idle_cpu))
>  					return;
>  			}

So, we continue only if all smt cpus are idle. This looks like an
opencoded cpumask_subset():

                        if (!cpumask_subset(idle_cpu, smt))
                                return;

Is that true? If so, can you please again make it a small
preparation patch?

> -			cpumask_or(idle_masks.smt, idle_masks.smt, smt);
> +			cpumask_or(idle_smt, idle_smt, smt);
>  		} else {
> -			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> +			cpumask_andnot(idle_smt, idle_smt, smt);
>  		}
>  	}
>  #endif
> @@ -3694,7 +3841,7 @@ static void handle_hotplug(struct rq *rq, bool online)
>  	atomic_long_inc(&scx_hotplug_seq);
>  
>  	if (scx_enabled())
> -		update_selcpu_topology();
> +		update_selcpu_topology(&scx_ops);
>  
>  	if (online && SCX_HAS_OP(cpu_online))
>  		SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu);
> @@ -3729,7 +3876,10 @@ static void rq_offline_scx(struct rq *rq)
>  #else	/* CONFIG_SMP */
>  
>  static bool test_and_clear_cpu_idle(int cpu) { return false; }
> -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; }
> +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> +{
> +	return -EBUSY;
> +}
>  static void reset_idle_masks(void) {}
>  
>  #endif	/* CONFIG_SMP */
> @@ -5579,7 +5729,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  
>  	check_hotplug_seq(ops);
>  #ifdef CONFIG_SMP
> -	update_selcpu_topology();
> +	update_selcpu_topology(ops);
>  #endif
>  	cpus_read_unlock();
>  
> @@ -6272,8 +6422,7 @@ void __init init_sched_ext_class(void)
>  
>  	BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
>  #ifdef CONFIG_SMP
> -	BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
> -	BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
> +	idle_masks_init();
>  #endif
>  	scx_kick_cpus_pnt_seqs =
>  		__alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
> @@ -6309,6 +6458,15 @@ void __init init_sched_ext_class(void)
>   */
>  #include <linux/btf_ids.h>
>  
> +static bool check_builtin_idle_enabled(void)
> +{
> +	if (static_branch_likely(&scx_builtin_idle_enabled))
> +		return true;
> +
> +	scx_ops_error("built-in idle tracking is disabled");
> +	return false;
> +}
> +
>  __bpf_kfunc_start_defs();
>  
>  /**
> @@ -6328,10 +6486,8 @@ __bpf_kfunc_start_defs();
>  __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  				       u64 wake_flags, bool *is_idle)
>  {
> -	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> -		scx_ops_error("built-in idle tracking is disabled");
> +	if (!check_builtin_idle_enabled())
>  		goto prev_cpu;
> -	}
>  
>  	if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
>  		goto prev_cpu;
> @@ -7419,46 +7575,31 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
>  
>  /**
>   * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> - * per-CPU cpumask.
> + * per-CPU cpumask of the current NUMA node.
>   *
>   * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
>   */
>  __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
>  {
> -	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> -		scx_ops_error("built-in idle tracking is disabled");
> +	if (!check_builtin_idle_enabled())
>  		return cpu_none_mask;
> -	}
>  
> -#ifdef CONFIG_SMP
> -	return idle_masks.cpu;
> -#else
> -	return cpu_none_mask;
> -#endif
> +	return get_idle_cpumask_node(NUMA_NO_NODE);
>  }
>  
>  /**
>   * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
> - * per-physical-core cpumask. Can be used to determine if an entire physical
> - * core is free.
> + * per-physical-core cpumask of the current NUMA node. Can be used to determine
> + * if an entire physical core is free.
>   *
>   * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
>   */
>  __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
>  {
> -	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> -		scx_ops_error("built-in idle tracking is disabled");
> +	if (!check_builtin_idle_enabled())
>  		return cpu_none_mask;
> -	}
>  
> -#ifdef CONFIG_SMP
> -	if (sched_smt_active())
> -		return idle_masks.smt;
> -	else
> -		return idle_masks.cpu;
> -#else
> -	return cpu_none_mask;
> -#endif
> +	return get_idle_smtmask_node(NUMA_NO_NODE);
>  }
>  
>  /**
> @@ -7487,10 +7628,8 @@ __bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
>   */
>  __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
>  {
> -	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> -		scx_ops_error("built-in idle tracking is disabled");
> +	if (!check_builtin_idle_enabled())
>  		return false;
> -	}
>  
>  	if (ops_cpu_valid(cpu, NULL))
>  		return test_and_clear_cpu_idle(cpu);
> @@ -7520,12 +7659,10 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
>  __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
>  				      u64 flags)
>  {
> -	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> -		scx_ops_error("built-in idle tracking is disabled");
> +	if (!check_builtin_idle_enabled())
>  		return -EBUSY;
> -	}
>  
> -	return scx_pick_idle_cpu(cpus_allowed, flags);
> +	return scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
>  }
>  
>  /**
> @@ -7548,7 +7685,7 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
>  	s32 cpu;
>  
>  	if (static_branch_likely(&scx_builtin_idle_enabled)) {
> -		cpu = scx_pick_idle_cpu(cpus_allowed, flags);
> +		cpu = scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
>  		if (cpu >= 0)
>  			return cpu;
>  	}
> -- 
> 2.47.1

next prev parent reply	other threads:[~2024-12-20 16:48 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-17  9:32 [PATCHSET v7 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-17  9:32 ` [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
2024-12-17 21:57   ` Tejun Heo
2024-12-18 10:23     ` Andrea Righi
2024-12-18 16:04       ` Tejun Heo
2024-12-19 18:26         ` Yury Norov
2024-12-19 19:43           ` Andrea Righi
2024-12-19 19:52           ` Peter Zijlstra
2024-12-19 21:16             ` Andrea Righi
2024-12-17  9:32 ` [PATCH 2/6] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
2024-12-17  9:32 ` [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks Andrea Righi
2024-12-17 23:22   ` Tejun Heo
2024-12-18 10:21     ` Andrea Righi
2024-12-18 16:10       ` Tejun Heo
2024-12-18 16:18         ` Andrea Righi
2024-12-17 23:23   ` Tejun Heo
2024-12-20 16:48   ` Yury Norov [this message]
2024-12-20 17:52     ` Andrea Righi
2024-12-17  9:32 ` [PATCH 4/6] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-17  9:32 ` [PATCH 5/6] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-17  9:32 ` [PATCH 6/6] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
2024-12-20 14:53   ` Yury Norov
2024-12-20 14:58     ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z2Wf96x8PTNq8efH@yury-ThinkPad \
    --to=yury.norov@gmail.com \
    --cc=arighi@nvidia.com \
    --cc=bsegall@google.com \
    --cc=changwoo@igalia.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@rasmusvillemoes.dk \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    --cc=void@manifault.com \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.