From: Andrea Righi <arighi@nvidia.com>
To: Cheng-Yang Chou <yphbchou0911@gmail.com>
Cc: sched-ext@lists.linux.dev, Tejun Heo <tj@kernel.org>,
	David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Ching-Chun Huang <jserv@ccns.ncku.edu.tw>,
	Chia-Ping Tsai <chia7712@gmail.com>
Subject: Re: [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks()
Date: Mon, 16 Mar 2026 16:51:34 +0100	[thread overview]
Message-ID: <abgnBgO83uXd7Xq-@gpd4> (raw)
In-Reply-To: <20260315181017.1377138-2-yphbchou0911@gmail.com>

Hi Cheng-Yang,

This has been sitting on my TODO list for a while, so thanks for looking
at it. :)

Comments below.

On Mon, Mar 16, 2026 at 02:10:14AM +0800, Cheng-Yang Chou wrote:
> Add scx_numa_node_order[] and scx_numa_node_order_cnt[], per-node
> arrays that store NUMA nodes sorted by increasing distance from each
> node. They are allocated and populated once during boot in
> scx_idle_init_masks() using the existing for_each_node_numadist()
> O(N^2) traversal, so the cost is paid only at init time.
> 
> pick_idle_cpu_from_online_nodes() will use these cached arrays to
> reduce its per-call complexity from O(N^2) to O(N).
> 
> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
> ---
>  kernel/sched/ext_idle.c | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index 03be4d664267..623a0356d6c3 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -144,6 +144,14 @@ static s32 pick_idle_cpu_in_node(const struct cpumask *cpus_allowed, int node, u
>   */
>  static DEFINE_PER_CPU(nodemask_t, per_cpu_unvisited);
>  
> +/*
> + * Per-node sorted arrays of NUMA nodes in increasing distance order,
> + * pre-computed at init time to avoid the O(N^2) traversal on each call
> + * to pick_idle_cpu_from_online_nodes().
> + */
> +static int **scx_numa_node_order;
> +static int *scx_numa_node_order_cnt;

Do we need the scx_numa_node_order_cnt[] array? Isn't the count always
nr_node_ids - 1 (all the other possible nodes)?

Also, if node IDs are sparse we should probably pre-fill the order array
with NUMA_NO_NODE entries and skip them when we scan it (in addition to
the node_online() check).

> +
>  /*
>   * Search for an idle CPU across all nodes, excluding @node.
>   */
> @@ -685,6 +693,36 @@ void scx_idle_init_masks(void)
>  		BUG_ON(!alloc_cpumask_var_node(&per_cpu(local_numa_idle_cpumask, i),
>  					       GFP_KERNEL, cpu_to_node(i)));
>  	}
> +
> +#ifdef CONFIG_NUMA
> +	/*
> +	 * Pre-compute per-node NUMA distance order so that
> +	 * pick_idle_cpu_from_online_nodes() can iterate in O(N) instead of
> +	 * the O(N^2) traversal required by for_each_node_numadist().
> +	 */
> +	scx_numa_node_order = kcalloc(nr_node_ids, sizeof(*scx_numa_node_order), GFP_KERNEL);

Considering the comment above, we probably need only nr_node_ids - 1 items.

> +	BUG_ON(!scx_numa_node_order);
> +	scx_numa_node_order_cnt = kcalloc(nr_node_ids, sizeof(*scx_numa_node_order_cnt), GFP_KERNEL);
> +	BUG_ON(!scx_numa_node_order_cnt);
> +
> +	for_each_node(i) {
> +		scx_numa_node_order[i] = kcalloc(nr_node_ids, sizeof(int), GFP_KERNEL);
> +		BUG_ON(!scx_numa_node_order[i]);
> +	}
> +
> +	rcu_read_lock();
> +	for_each_node(i) {
> +		nodemask_t unvisited;
> +		int node = i, j = 0;
> +
> +		nodes_copy(unvisited, node_states[N_POSSIBLE]);
> +		node_clear(i, unvisited);
> +		for_each_node_numadist(node, unvisited)
> +			scx_numa_node_order[i][j++] = node;
> +		scx_numa_node_order_cnt[i] = j;
> +	}
> +	rcu_read_unlock();
> +#endif
>  }
>  
>  static void update_builtin_idle(int cpu, bool idle)
> -- 
> 2.48.1
> 

Thanks,
-Andrea

Thread overview:
2026-03-15 18:10 [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N) Cheng-Yang Chou
2026-03-15 18:10 ` [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks() Cheng-Yang Chou
2026-03-16 15:51   ` Andrea Righi [this message]
2026-03-17 15:04     ` Cheng-Yang Chou
2026-03-17 17:32       ` Tejun Heo
2026-03-17 18:02         ` Andrea Righi
2026-03-17 19:00           ` Cheng-Yang Chou
2026-03-15 18:10 ` [PATCH 2/2] sched_ext: Use cached NUMA order in pick_idle_cpu_from_online_nodes() Cheng-Yang Chou
2026-03-16 12:27 ` [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N) Christian Loehle
2026-03-16 14:50   ` Andrea Righi
