* [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N)
@ 2026-03-15 18:10 Cheng-Yang Chou
  2026-03-15 18:10 ` [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks() Cheng-Yang Chou
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Cheng-Yang Chou @ 2026-03-15 18:10 UTC (permalink / raw)
  To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min
  Cc: Ching-Chun Huang, Chia-Ping Tsai, yphbchou0911

pick_idle_cpu_from_online_nodes() uses for_each_node_numadist() which
performs an O(N^2) traversal on every call to find the nearest idle
CPU across NUMA nodes.

Since NUMA distances are fixed at boot and do not change at runtime,
the per-node traversal order can be computed once during initialization
and reused on every subsequent call. This series pre-computes the
sorted order into scx_numa_node_order[], reducing the per-call cost
to O(N).

Patch 1 allocates and populates scx_numa_node_order[] and
scx_numa_node_order_cnt[] in scx_idle_init_masks() using the existing
for_each_node_numadist() traversal, paying the O(N^2) cost only once
at init time.

Patch 2 switches pick_idle_cpu_from_online_nodes() to iterate the
cached arrays, removes the now-unnecessary per_cpu_unvisited nodemask,
and adds a node_online() check so CPU hotplug is handled correctly
without needing to rebuild the arrays.

Tested with make -j only. I do not have access to a NUMA machine to
run runtime tests.

Thanks,
Cheng-Yang

---

Cheng-Yang Chou (2):
  sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks()
  sched_ext: Use cached NUMA order in pick_idle_cpu_from_online_nodes()

 kernel/sched/ext_idle.c | 81 +++++++++++++++++++++++++----------------
 1 file changed, 49 insertions(+), 32 deletions(-)

-- 
2.48.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks()
  2026-03-15 18:10 [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N) Cheng-Yang Chou
@ 2026-03-15 18:10 ` Cheng-Yang Chou
  2026-03-16 15:51   ` Andrea Righi
  2026-03-15 18:10 ` [PATCH 2/2] sched_ext: Use cached NUMA order in pick_idle_cpu_from_online_nodes() Cheng-Yang Chou
  2026-03-16 12:27 ` [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N) Christian Loehle
  2 siblings, 1 reply; 10+ messages in thread
From: Cheng-Yang Chou @ 2026-03-15 18:10 UTC (permalink / raw)
  To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min
  Cc: Ching-Chun Huang, Chia-Ping Tsai, yphbchou0911

Add scx_numa_node_order[] and scx_numa_node_order_cnt[], per-node
arrays that store NUMA nodes sorted by increasing distance from each
node. They are allocated and populated once during boot in
scx_idle_init_masks() using the existing for_each_node_numadist()
O(N^2) traversal, so the cost is paid only at init time.

pick_idle_cpu_from_online_nodes() will use these cached arrays to
reduce its per-call complexity from O(N^2) to O(N).

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
 kernel/sched/ext_idle.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 03be4d664267..623a0356d6c3 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -144,6 +144,14 @@ static s32 pick_idle_cpu_in_node(const struct cpumask *cpus_allowed, int node, u
  */
 static DEFINE_PER_CPU(nodemask_t, per_cpu_unvisited);
 
+/*
+ * Per-node sorted arrays of NUMA nodes in increasing distance order,
+ * pre-computed at init time to avoid the O(N^2) traversal on each call
+ * to pick_idle_cpu_from_online_nodes().
+ */
+static int **scx_numa_node_order;
+static int *scx_numa_node_order_cnt;
+
 /*
  * Search for an idle CPU across all nodes, excluding @node.
  */
@@ -685,6 +693,36 @@ void scx_idle_init_masks(void)
 		BUG_ON(!alloc_cpumask_var_node(&per_cpu(local_numa_idle_cpumask, i),
 					       GFP_KERNEL, cpu_to_node(i)));
 	}
+
+#ifdef CONFIG_NUMA
+	/*
+	 * Pre-compute per-node NUMA distance order so that
+	 * pick_idle_cpu_from_online_nodes() can iterate in O(N) instead of
+	 * the O(N^2) traversal required by for_each_node_numadist().
+	 */
+	scx_numa_node_order = kcalloc(nr_node_ids, sizeof(*scx_numa_node_order), GFP_KERNEL);
+	BUG_ON(!scx_numa_node_order);
+	scx_numa_node_order_cnt = kcalloc(nr_node_ids, sizeof(*scx_numa_node_order_cnt), GFP_KERNEL);
+	BUG_ON(!scx_numa_node_order_cnt);
+
+	for_each_node(i) {
+		scx_numa_node_order[i] = kcalloc(nr_node_ids, sizeof(int), GFP_KERNEL);
+		BUG_ON(!scx_numa_node_order[i]);
+	}
+
+	rcu_read_lock();
+	for_each_node(i) {
+		nodemask_t unvisited;
+		int node = i, j = 0;
+
+		nodes_copy(unvisited, node_states[N_POSSIBLE]);
+		node_clear(i, unvisited);
+		for_each_node_numadist(node, unvisited)
+			scx_numa_node_order[i][j++] = node;
+		scx_numa_node_order_cnt[i] = j;
+	}
+	rcu_read_unlock();
+#endif
 }
 
 static void update_builtin_idle(int cpu, bool idle)
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] sched_ext: Use cached NUMA order in pick_idle_cpu_from_online_nodes()
  2026-03-15 18:10 [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N) Cheng-Yang Chou
  2026-03-15 18:10 ` [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks() Cheng-Yang Chou
@ 2026-03-15 18:10 ` Cheng-Yang Chou
  2026-03-16 12:27 ` [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N) Christian Loehle
  2 siblings, 0 replies; 10+ messages in thread
From: Cheng-Yang Chou @ 2026-03-15 18:10 UTC (permalink / raw)
  To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min
  Cc: Ching-Chun Huang, Chia-Ping Tsai, yphbchou0911

Switch pick_idle_cpu_from_online_nodes() from the O(N^2)
for_each_node_numadist() traversal to the pre-sorted
scx_numa_node_order[] arrays introduced in the previous patch.
The per-call cost drops from O(N^2) to O(N), where N is the number
of NUMA nodes.

Nodes that went offline after boot are skipped with a node_online()
check, so CPU hotplug remains correct without needing to rebuild
the arrays.

The per_cpu_unvisited nodemask, used only by the old implementation
to track visited nodes across loop iterations, is no longer needed
and is removed.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
 kernel/sched/ext_idle.c | 49 ++++++++++++-----------------------------
 1 file changed, 14 insertions(+), 35 deletions(-)

diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 623a0356d6c3..90225bf3992c 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -138,12 +138,6 @@ static s32 pick_idle_cpu_in_node(const struct cpumask *cpus_allowed, int node, u
 }
 
 #ifdef CONFIG_NUMA
-/*
- * Tracks nodes that have not yet been visited when searching for an idle
- * CPU across all available nodes.
- */
-static DEFINE_PER_CPU(nodemask_t, per_cpu_unvisited);
-
 /*
  * Per-node sorted arrays of NUMA nodes in increasing distance order,
  * pre-computed at init time to avoid the O(N^2) traversal on each call
@@ -157,42 +151,27 @@ static int *scx_numa_node_order_cnt;
  */
 static s32 pick_idle_cpu_from_online_nodes(const struct cpumask *cpus_allowed, int node, u64 flags)
 {
-	nodemask_t *unvisited;
-	s32 cpu = -EBUSY;
-
-	preempt_disable();
-	unvisited = this_cpu_ptr(&per_cpu_unvisited);
+	int idx;
 
 	/*
-	 * Restrict the search to the online nodes (excluding the current
-	 * node that has been visited already).
+	 * The order is pre-computed at boot time in scx_numa_node_order[],
+	 * so each call is O(N) rather than O(N^2), where N is the number
+	 * of NUMA nodes.
 	 */
-	nodes_copy(*unvisited, node_states[N_ONLINE]);
-	node_clear(node, *unvisited);
+	for (idx = 0; idx < scx_numa_node_order_cnt[node]; idx++) {
+		int numa_node = scx_numa_node_order[node][idx];
+		s32 cpu;
 
-	/*
-	 * Traverse all nodes in order of increasing distance, starting
-	 * from @node.
-	 *
-	 * This loop is O(N^2), with N being the amount of NUMA nodes,
-	 * which might be quite expensive in large NUMA systems. However,
-	 * this complexity comes into play only when a scheduler enables
-	 * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU
-	 * without specifying a target NUMA node, so it shouldn't be a
-	 * bottleneck is most cases.
-	 *
-	 * As a future optimization we may want to cache the list of nodes
-	 * in a per-node array, instead of actually traversing them every
-	 * time.
-	 */
-	for_each_node_numadist(node, *unvisited) {
-		cpu = pick_idle_cpu_in_node(cpus_allowed, node, flags);
+		/* Nodes that went offline after boot are skipped. */
+		if (!node_online(numa_node))
+			continue;
+
+		cpu = pick_idle_cpu_in_node(cpus_allowed, numa_node, flags);
 		if (cpu >= 0)
-			break;
+			return cpu;
 	}
-	preempt_enable();
 
-	return cpu;
+	return -EBUSY;
 }
 #else
 static inline s32
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N)
  2026-03-15 18:10 [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N) Cheng-Yang Chou
  2026-03-15 18:10 ` [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks() Cheng-Yang Chou
  2026-03-15 18:10 ` [PATCH 2/2] sched_ext: Use cached NUMA order in pick_idle_cpu_from_online_nodes() Cheng-Yang Chou
@ 2026-03-16 12:27 ` Christian Loehle
  2026-03-16 14:50   ` Andrea Righi
  2 siblings, 1 reply; 10+ messages in thread
From: Christian Loehle @ 2026-03-16 12:27 UTC (permalink / raw)
  To: Cheng-Yang Chou, sched-ext, Tejun Heo, David Vernet, Andrea Righi,
	Changwoo Min
  Cc: Ching-Chun Huang, Chia-Ping Tsai

On 3/15/26 18:10, Cheng-Yang Chou wrote:
> pick_idle_cpu_from_online_nodes() uses for_each_node_numadist() which
> performs an O(N^2) traversal on every call to find the nearest idle
> CPU across NUMA nodes.
> 
> Since NUMA distances are fixed at boot and do not change at runtime,
> the per-node traversal order can be computed once during initialization
> and reused on every subsequent call. This series pre-computes the
> sorted order into scx_numa_node_order[], reducing the per-call cost
> to O(N).
> 
> Patch 1 allocates and populates scx_numa_node_order[] and
> scx_numa_node_order_cnt[] in scx_idle_init_masks() using the existing
> for_each_node_numadist() traversal, paying the O(N^2) cost only once
> at init time.
> 
> Patch 2 switches pick_idle_cpu_from_online_nodes() to iterate the
> cached arrays, removes the now-unnecessary per_cpu_unvisited nodemask,
> and adds a node_online() check so CPU hotplug is handled correctly
> without needing to rebuild the arrays.
> 
> Tested with make -j only. I do not have access to a NUMA machine to
> run runtime tests.

qemu can at least functionally test that using something like
-smp 8,sockets=2,clusters=2,cores=2,threads=1
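
To also expose NUMA nodes and distances to the guest, the -smp line can be
combined with -numa options, e.g. something like this (untested sketch, the
node layout and dist value are just examples):

  qemu-system-x86_64 -m 4G \
    -smp 8,sockets=2,clusters=2,cores=2,threads=1 \
    -object memory-backend-ram,id=m0,size=2G \
    -object memory-backend-ram,id=m1,size=2G \
    -numa node,nodeid=0,cpus=0-3,memdev=m0 \
    -numa node,nodeid=1,cpus=4-7,memdev=m1 \
    -numa dist,src=0,dst=1,val=20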

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N)
  2026-03-16 12:27 ` [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N) Christian Loehle
@ 2026-03-16 14:50   ` Andrea Righi
  0 siblings, 0 replies; 10+ messages in thread
From: Andrea Righi @ 2026-03-16 14:50 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Cheng-Yang Chou, sched-ext, Tejun Heo, David Vernet, Changwoo Min,
	Ching-Chun Huang, Chia-Ping Tsai

On Mon, Mar 16, 2026 at 12:27:28PM +0000, Christian Loehle wrote:
> On 3/15/26 18:10, Cheng-Yang Chou wrote:
> > pick_idle_cpu_from_online_nodes() uses for_each_node_numadist() which
> > performs an O(N^2) traversal on every call to find the nearest idle
> > CPU across NUMA nodes.
> > 
> > Since NUMA distances are fixed at boot and do not change at runtime,
> > the per-node traversal order can be computed once during initialization
> > and reused on every subsequent call. This series pre-computes the
> > sorted order into scx_numa_node_order[], reducing the per-call cost
> > to O(N).
> > 
> > Patch 1 allocates and populates scx_numa_node_order[] and
> > scx_numa_node_order_cnt[] in scx_idle_init_masks() using the existing
> > for_each_node_numadist() traversal, paying the O(N^2) cost only once
> > at init time.
> > 
> > Patch 2 switches pick_idle_cpu_from_online_nodes() to iterate the
> > cached arrays, removes the now-unnecessary per_cpu_unvisited nodemask,
> > and adds a node_online() check so CPU hotplug is handled correctly
> > without needing to rebuild the arrays.
> > 
> > Tested with make -j only. I do not have access to a NUMA machine to
> > run runtime tests.
> 
> qemu can at least functionally test that using something like
> -smp 8,sockets=2,clusters=2,cores=2,threads=1

You can also use virtme-ng to easily simulate different NUMA nodes and
distances, for example:

$ vng -v --cpu 16,sockets=4,cores=2,threads=2 -m 4G \
  --numa 1G,cpus=0-3 --numa 1G,cpus=4-7 --numa 1G,cpus=8-11 --numa 1G,cpus=12-15 \
  --numa-distance 0,1=51 --numa-distance 0,2=31 \
  --numa-distance 0,3=41 --numa-distance 1,2=21 \
  --numa-distance 1,3=61 --numa-distance 2,3=11
...
$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 1005 MB
node 0 free: 933 MB
node 1 cpus: 4 5 6 7
node 1 size: 1006 MB
node 1 free: 978 MB
node 2 cpus: 8 9 10 11
node 2 size: 888 MB
node 2 free: 855 MB
node 3 cpus: 12 13 14 15
node 3 size: 1005 MB
node 3 free: 984 MB
node distances:
node     0    1    2    3
   0:   10   51   31   41
   1:   51   10   21   61
   2:   31   21   10   11
   3:   41   61   11   10

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks()
  2026-03-15 18:10 ` [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks() Cheng-Yang Chou
@ 2026-03-16 15:51   ` Andrea Righi
  2026-03-17 15:04     ` Cheng-Yang Chou
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Righi @ 2026-03-16 15:51 UTC (permalink / raw)
  To: Cheng-Yang Chou
  Cc: sched-ext, Tejun Heo, David Vernet, Changwoo Min,
	Ching-Chun Huang, Chia-Ping Tsai

Hi Cheng-Yang,

this has been sitting in my TODO list for a while, so thanks for looking
at it. :)

Comments below.

On Mon, Mar 16, 2026 at 02:10:14AM +0800, Cheng-Yang Chou wrote:
> Add scx_numa_node_order[] and scx_numa_node_order_cnt[], per-node
> arrays that store NUMA nodes sorted by increasing distance from each
> node. They are allocated and populated once during boot in
> scx_idle_init_masks() using the existing for_each_node_numadist()
> O(N^2) traversal, so the cost is paid only at init time.
> 
> pick_idle_cpu_from_online_nodes() will use these cached arrays to
> reduce its per-call complexity from O(N^2) to O(N).
> 
> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
> ---
>  kernel/sched/ext_idle.c | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index 03be4d664267..623a0356d6c3 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -144,6 +144,14 @@ static s32 pick_idle_cpu_in_node(const struct cpumask *cpus_allowed, int node, u
>   */
>  static DEFINE_PER_CPU(nodemask_t, per_cpu_unvisited);
>  
> +/*
> + * Per-node sorted arrays of NUMA nodes in increasing distance order,
> + * pre-computed at init time to avoid the O(N^2) traversal on each call
> + * to pick_idle_cpu_from_online_nodes().
> + */
> +static int **scx_numa_node_order;
> +static int *scx_numa_node_order_cnt;

Do we need this array? Isn't this always nr_node_ids - 1 (all other
possible nodes)?

Also, if nodes are sparse we should probably initialize the array with
NUMA_NO_NODE items and skip them when we scan the array (in addition to the
node_online() check).
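
Something along these lines, for example (untested sketch, reusing the array
name from this patch, sizing each row to nr_node_ids - 1 and dropping the
_cnt array):

	/* at init: pre-fill each row with NUMA_NO_NODE */
	for (j = 0; j < nr_node_ids - 1; j++)
		scx_numa_node_order[i][j] = NUMA_NO_NODE;

	/* at lookup: skip the NUMA_NO_NODE holes (plus offline nodes) */
	for (idx = 0; idx < nr_node_ids - 1; idx++) {
		int numa_node = scx_numa_node_order[node][idx];

		if (numa_node == NUMA_NO_NODE || !node_online(numa_node))
			continue;

		cpu = pick_idle_cpu_in_node(cpus_allowed, numa_node, flags);
		if (cpu >= 0)
			return cpu;
	}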

> +
>  /*
>   * Search for an idle CPU across all nodes, excluding @node.
>   */
> @@ -685,6 +693,36 @@ void scx_idle_init_masks(void)
>  		BUG_ON(!alloc_cpumask_var_node(&per_cpu(local_numa_idle_cpumask, i),
>  					       GFP_KERNEL, cpu_to_node(i)));
>  	}
> +
> +#ifdef CONFIG_NUMA
> +	/*
> +	 * Pre-compute per-node NUMA distance order so that
> +	 * pick_idle_cpu_from_online_nodes() can iterate in O(N) instead of
> +	 * the O(N^2) traversal required by for_each_node_numadist().
> +	 */
> +	scx_numa_node_order = kcalloc(nr_node_ids, sizeof(*scx_numa_node_order), GFP_KERNEL);

Considering the comment above, we probably need only nr_node_ids - 1 items.

> +	BUG_ON(!scx_numa_node_order);
> +	scx_numa_node_order_cnt = kcalloc(nr_node_ids, sizeof(*scx_numa_node_order_cnt), GFP_KERNEL);
> +	BUG_ON(!scx_numa_node_order_cnt);
> +
> +	for_each_node(i) {
> +		scx_numa_node_order[i] = kcalloc(nr_node_ids, sizeof(int), GFP_KERNEL);
> +		BUG_ON(!scx_numa_node_order[i]);
> +	}
> +
> +	rcu_read_lock();
> +	for_each_node(i) {
> +		nodemask_t unvisited;
> +		int node = i, j = 0;
> +
> +		nodes_copy(unvisited, node_states[N_POSSIBLE]);
> +		node_clear(i, unvisited);
> +		for_each_node_numadist(node, unvisited)
> +			scx_numa_node_order[i][j++] = node;
> +		scx_numa_node_order_cnt[i] = j;
> +	}
> +	rcu_read_unlock();
> +#endif
>  }
>  
>  static void update_builtin_idle(int cpu, bool idle)
> -- 
> 2.48.1
> 

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks()
  2026-03-16 15:51   ` Andrea Righi
@ 2026-03-17 15:04     ` Cheng-Yang Chou
  2026-03-17 17:32       ` Tejun Heo
  0 siblings, 1 reply; 10+ messages in thread
From: Cheng-Yang Chou @ 2026-03-17 15:04 UTC (permalink / raw)
  To: Andrea Righi
  Cc: sched-ext, Tejun Heo, David Vernet, Changwoo Min,
	Ching-Chun Huang, Chia-Ping Tsai, Christian Loehle

Hi Andrea,

On Mon, Mar 16, 2026 at 04:51:34PM +0100, Andrea Righi wrote:
> Hi Cheng-Yang,
> 
> this has been sitting in my TODO list for a while, so thanks for looking
> at it. :)
> 
> Comments below.
> 
> On Mon, Mar 16, 2026 at 02:10:14AM +0800, Cheng-Yang Chou wrote:
> > Add scx_numa_node_order[] and scx_numa_node_order_cnt[], per-node
> > arrays that store NUMA nodes sorted by increasing distance from each
> > node. They are allocated and populated once during boot in
> > scx_idle_init_masks() using the existing for_each_node_numadist()
> > O(N^2) traversal, so the cost is paid only at init time.
> > 
> > pick_idle_cpu_from_online_nodes() will use these cached arrays to
> > reduce its per-call complexity from O(N^2) to O(N).
> > 
> > Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
> > ---
> >  kernel/sched/ext_idle.c | 38 ++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 38 insertions(+)
> > 
> > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> > index 03be4d664267..623a0356d6c3 100644
> > --- a/kernel/sched/ext_idle.c
> > +++ b/kernel/sched/ext_idle.c
> > @@ -144,6 +144,14 @@ static s32 pick_idle_cpu_in_node(const struct cpumask *cpus_allowed, int node, u
> >   */
> >  static DEFINE_PER_CPU(nodemask_t, per_cpu_unvisited);
> >  
> > +/*
> > + * Per-node sorted arrays of NUMA nodes in increasing distance order,
> > + * pre-computed at init time to avoid the O(N^2) traversal on each call
> > + * to pick_idle_cpu_from_online_nodes().
> > + */
> > +static int **scx_numa_node_order;
> > +static int *scx_numa_node_order_cnt;
> 
> Do we need this array? Isn't this always nr_node_ids - 1 (all other
> possible nodes)?
> 
> Also, if nodes are sparse we should probably initialize the array with
> NUMA_NO_NODE items and skip them when we scan the array (in addition to the
> node_online() check).
> 

Both are good points; they will be addressed in v2.
Also, thanks to Christian for the QEMU tip and to you for the virtme-ng
example; I will test with those before sending the next version.

> > +
> >  /*
> >   * Search for an idle CPU across all nodes, excluding @node.
> >   */
> > @@ -685,6 +693,36 @@ void scx_idle_init_masks(void)
> >  		BUG_ON(!alloc_cpumask_var_node(&per_cpu(local_numa_idle_cpumask, i),
> >  					       GFP_KERNEL, cpu_to_node(i)));
> >  	}
> > +
> > +#ifdef CONFIG_NUMA
> > +	/*
> > +	 * Pre-compute per-node NUMA distance order so that
> > +	 * pick_idle_cpu_from_online_nodes() can iterate in O(N) instead of
> > +	 * the O(N^2) traversal required by for_each_node_numadist().
> > +	 */
> > +	scx_numa_node_order = kcalloc(nr_node_ids, sizeof(*scx_numa_node_order), GFP_KERNEL);
> 
> Considering the comment above, we probably need only nr_node_ids - 1 items.

ACK.

-- 
Thanks,
Cheng-Yang

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks()
  2026-03-17 15:04     ` Cheng-Yang Chou
@ 2026-03-17 17:32       ` Tejun Heo
  2026-03-17 18:02         ` Andrea Righi
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2026-03-17 17:32 UTC (permalink / raw)
  To: Cheng-Yang Chou
  Cc: Andrea Righi, sched-ext, David Vernet, Changwoo Min,
	Ching-Chun Huang, Chia-Ping Tsai, Christian Loehle

On Tue, Mar 17, 2026 at 11:04:11PM +0800, Cheng-Yang Chou wrote:
> Hi Andrea,
> 
> On Mon, Mar 16, 2026 at 04:51:34PM +0100, Andrea Righi wrote:
> > Hi Cheng-Yang,
> > 
> > this has been sitting in my TODO list for a while, so thanks for looking
> > at it. :)
> > 
> > Comments below.
> > 
> > On Mon, Mar 16, 2026 at 02:10:14AM +0800, Cheng-Yang Chou wrote:
> > > Add scx_numa_node_order[] and scx_numa_node_order_cnt[], per-node
> > > arrays that store NUMA nodes sorted by increasing distance from each
> > > node. They are allocated and populated once during boot in
> > > scx_idle_init_masks() using the existing for_each_node_numadist()
> > > O(N^2) traversal, so the cost is paid only at init time.

Wasn't the conclusion that, given the low NUMA node count, the O(N^2)
doesn't matter here for now, although I can see the node count becoming
high enough with AMD fake LLC NUMA enabled on actual large NUMA machines?
But if we decide that's an actual problem, shouldn't it be addressed in
the topology code rather than from the sched_ext side?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks()
  2026-03-17 17:32       ` Tejun Heo
@ 2026-03-17 18:02         ` Andrea Righi
  2026-03-17 19:00           ` Cheng-Yang Chou
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Righi @ 2026-03-17 18:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Cheng-Yang Chou, sched-ext, David Vernet, Changwoo Min,
	Ching-Chun Huang, Chia-Ping Tsai, Christian Loehle

On Tue, Mar 17, 2026 at 07:32:54AM -1000, Tejun Heo wrote:
> On Tue, Mar 17, 2026 at 11:04:11PM +0800, Cheng-Yang Chou wrote:
> > Hi Andrea,
> > 
> > On Mon, Mar 16, 2026 at 04:51:34PM +0100, Andrea Righi wrote:
> > > Hi Cheng-Yang,
> > > 
> > > this has been sitting in my TODO list for a while, so thanks for looking
> > > at it. :)
> > > 
> > > Comments below.
> > > 
> > > On Mon, Mar 16, 2026 at 02:10:14AM +0800, Cheng-Yang Chou wrote:
> > > > Add scx_numa_node_order[] and scx_numa_node_order_cnt[], per-node
> > > > arrays that store NUMA nodes sorted by increasing distance from each
> > > > node. They are allocated and populated once during boot in
> > > > scx_idle_init_masks() using the existing for_each_node_numadist()
> > > > O(N^2) traversal, so the cost is paid only at init time.
> 
> Wasn't the conclusion that, given the low NUMA node count, the O(N^2)
> doesn't matter here for now, although I can see the node count becoming
> high enough with AMD fake LLC NUMA enabled on actual large NUMA machines?
> But if we decide that's an actual problem, shouldn't it be addressed in
> the topology code rather than from the sched_ext side?

Yes, that was the conclusion from the previous discussion on this topic.
Honestly, I've never seen a NUMA system where the O(N^2) could be an issue.
One thing that I like about this change is the removal of
preempt_disable/enable() when we iterate nodes.

However, speaking of performance, the O(N^2) bitmask iteration on a single
nodemask_t is probably more efficient than the O(N) iteration through the
scx_numa_node_order[] array (it should be more cache friendly). Ideally we
should run some tests on those big Intel NUMA machines that were showing
bad regressions with the single global idle cpumasks, but I don't think I
have access to them anymore...

So, thinking more about this, I'm still conflicted about whether we should
apply this change or not, especially because we don't have any performance
numbers...

-Andrea

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks()
  2026-03-17 18:02         ` Andrea Righi
@ 2026-03-17 19:00           ` Cheng-Yang Chou
  0 siblings, 0 replies; 10+ messages in thread
From: Cheng-Yang Chou @ 2026-03-17 19:00 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, sched-ext, David Vernet, Changwoo Min,
	Ching-Chun Huang, Chia-Ping Tsai, Christian Loehle

Hi Tejun, Andrea,

On Tue, Mar 17, 2026 at 07:02:12PM +0100, Andrea Righi wrote:
> On Tue, Mar 17, 2026 at 07:32:54AM -1000, Tejun Heo wrote:
> > On Tue, Mar 17, 2026 at 11:04:11PM +0800, Cheng-Yang Chou wrote:
> > > Hi Andrea,
> > > 
> > > On Mon, Mar 16, 2026 at 04:51:34PM +0100, Andrea Righi wrote:
> > > > Hi Cheng-Yang,
> > > > 
> > > > this has been sitting in my TODO list for a while, so thanks for looking
> > > > at it. :)
> > > > 
> > > > Comments below.
> > > > 
> > > > On Mon, Mar 16, 2026 at 02:10:14AM +0800, Cheng-Yang Chou wrote:
> > > > > Add scx_numa_node_order[] and scx_numa_node_order_cnt[], per-node
> > > > > arrays that store NUMA nodes sorted by increasing distance from each
> > > > > node. They are allocated and populated once during boot in
> > > > > scx_idle_init_masks() using the existing for_each_node_numadist()
> > > > > O(N^2) traversal, so the cost is paid only at init time.
> > 
> > Wasn't the conclusion that, given the low NUMA node count, the O(N^2)
> > doesn't matter here for now, although I can see the node count becoming
> > high enough with AMD fake LLC NUMA enabled on actual large NUMA machines?
> > But if we decide that's an actual problem, shouldn't it be addressed in
> > the topology code rather than from the sched_ext side?
> 
> Yes, that was the conclusion from the previous discussion on this topic.
> Honestly, I've never seen a NUMA system where the O(N^2) could be an issue.
> One thing that I like about this change is the removal of
> preempt_disable/enable() when we iterate nodes.
> 
> However, speaking of performance, the O(N^2) bitmask iteration on a single
> nodemask_t is probably more efficient than the O(N) iteration through the
> scx_numa_node_order[] array (it should be more cache friendly). Ideally we
> should run some tests on those big Intel NUMA machines that were showing
> bad regressions with the single global idle cpumasks, but I don't think I
> have access to them anymore...
> 
> So, thinking more about this, I'm still conflicted about whether we should
> apply this change or not, especially because we don't have any performance
> numbers...

Since I don't have access to large bare-metal NUMA machines to properly
benchmark the cache-friendly hypothesis versus the O(N^2) overhead, and
considering the architectural concerns Tejun mentioned earlier, I agree
it's best to just drop this patch for now.

Thanks both for the detailed review and insights!

-- 
Thanks,
Cheng-Yang

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-03-17 19:00 UTC | newest]

Thread overview: 10+ messages
2026-03-15 18:10 [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N) Cheng-Yang Chou
2026-03-15 18:10 ` [PATCH 1/2] sched_ext: Cache per-node NUMA distance order in scx_idle_init_masks() Cheng-Yang Chou
2026-03-16 15:51   ` Andrea Righi
2026-03-17 15:04     ` Cheng-Yang Chou
2026-03-17 17:32       ` Tejun Heo
2026-03-17 18:02         ` Andrea Righi
2026-03-17 19:00           ` Cheng-Yang Chou
2026-03-15 18:10 ` [PATCH 2/2] sched_ext: Use cached NUMA order in pick_idle_cpu_from_online_nodes() Cheng-Yang Chou
2026-03-16 12:27 ` [PATCH sched_ext/for-7.1 0/2] sched_ext: Reduce idle CPU NUMA traversal from O(N^2) to O(N) Christian Loehle
2026-03-16 14:50   ` Andrea Righi
