* [PATCHSET v7 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks
@ 2024-12-17 9:32 Andrea Righi
2024-12-17 9:32 ` [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
` (5 more replies)
0 siblings, 6 replies; 23+ messages in thread
From: Andrea Righi @ 2024-12-17 9:32 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel
= Overview =
As discussed during the sched_ext office hours, using a single global
cpumask to track idle CPUs can be inefficient and may not scale well on
large NUMA systems.
Therefore, split the idle cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on such large systems.
Scalability issues seem to be more noticeable on Intel Sapphire Rapids
dual-socket architectures.
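Conceptually, the series replaces the single global pair of idle cpumasks
with a per-node array, where each entry is allocated on the node it
describes (a condensed sketch of the data structure introduced in patch 3):

    struct idle_cpumask {
            cpumask_var_t cpu;      /* idle CPUs in the node */
            cpumask_var_t smt;      /* fully idle SMT cores in the node */
    };

    /* one entry per possible NUMA node */
    static struct idle_cpumask **idle_masks;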
= Test =
Hardware:
- System: DGX B200
- CPUs: 224 SMT threads (112 physical cores)
- Processor: INTEL(R) XEON(R) PLATINUM 8570
- 2 NUMA nodes
Scheduler:
- scx_simple [1] (so that we can focus on the built-in idle selection
policy rather than on the scheduling policy itself)
Test:
- Run a parallel kernel build `make -j $(nproc)` and measure the average
elapsed time over 10 runs:
avg time | stdev
---------+------
before: 52.431s | 2.895
after: 50.342s | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks helped to
achieve a speedup of approximately +4% with this particular architecture
and test case.
I've repeated the same test on a DGX-1 (40 physical cores, Intel Xeon
E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) and I didn't observe any measurable
difference.
In general, on smaller systems, I haven't noticed any measurable
regressions or improvements with the same test (parallel kernel build)
and scheduler (scx_simple).
NOTE: splitting the global cpumask into multiple per-node cpumasks may
increase the overhead of scx_bpf_pick_idle_cpu() or ops.select_cpu() (for
schedulers relying on the built-in CPU idle selection policy) in the
presence of multiple NUMA nodes, particularly under high system load,
since we may have to scan multiple cpumasks to find an idle CPU.
However, this increased overhead appears to be largely offset by the
lower cost of updating the idle state (__scx_update_idle()) and by the
fact that CPUs mostly operate on their local idle cpumask, reducing
pressure on the cache coherency protocol.
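The cross-node search mentioned above walks the nodes in order of
increasing distance from the previous CPU's node and stops at the first
node that yields an idle CPU. A condensed excerpt of what
scx_pick_idle_cpu_numa() in patch 3 does:

    nodemask_t hop_nodes = NODE_MASK_NONE;
    int start_node = cpu_to_node(prev_cpu);
    s32 cpu = -EBUSY;

    rcu_read_lock();
    for_each_numa_hop_node(node, start_node, hop_nodes, N_ONLINE) {
            cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
            if (cpu >= 0)
                    break;
    }
    rcu_read_unlock();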
= References =
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
ChangeLog v6 -> v7:
- addressed some issues based on Yury's review (thanks!)
- introduced a new iterator to navigate the NUMA nodes in order of
increasing distance
ChangeLog v5 -> v6:
- refactor patch set to introduce SCX_OPS_NODE_BUILTIN_IDLE before
the per-node cpumasks
- move idle CPU selection policy to a separate file (ext_idle.c)
(no functional change, just some code shuffling)
ChangeLog v4 -> v5:
- introduce new scx_bpf_cpu_to_node() kfunc
- provide __COMPAT_*() helpers for the new kfunc's
ChangeLog v3 -> v4:
- introduce SCX_OPS_NODE_BUILTIN_IDLE to select multiple per-node
cpumasks or single flat cpumask
- introduce new kfuncs to access per-node idle cpumasks information
- use for_each_numa_hop_mask() to traverse NUMA nodes in increasing
distance
- dropped nodemask helpers (not needed anymore)
- rebase to sched_ext/for-6.14
ChangeLog v2 -> v3:
- introduce for_each_online_node_wrap()
- re-introduce cpumask_intersects() in test_and_clear_cpu_idle() (to
reduce memory writes / cache coherence pressure)
- get rid of the redundant scx_selcpu_topo_numa logic
[test results are pretty much identical, so I haven't updated them from v2]
ChangeLog v1 -> v2:
- renamed for_each_node_mask|state_from() -> for_each_node_mask|state_wrap()
- misc cpumask optimizations (thanks to Yury)
Andrea Righi (6):
sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE
sched_ext: Introduce per-node idle cpumasks
sched_ext: Get rid of the scx_selcpu_topo_numa logic
sched_ext: Introduce NUMA aware idle cpu kfunc helpers
sched_ext: Move built-in idle CPU selection policy to a separate file
MAINTAINERS | 1 +
include/linux/topology.h | 28 +-
kernel/sched/ext.c | 742 ++-------------------------
kernel/sched/ext_idle.c | 835 +++++++++++++++++++++++++++++++
kernel/sched/topology.c | 49 ++
tools/sched_ext/include/scx/common.bpf.h | 4 +
tools/sched_ext/include/scx/compat.bpf.h | 19 +
7 files changed, 984 insertions(+), 694 deletions(-)
create mode 100644 kernel/sched/ext_idle.c
* [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
2024-12-17 9:32 [PATCHSET v7 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
@ 2024-12-17 9:32 ` Andrea Righi
2024-12-17 21:57 ` Tejun Heo
2024-12-17 9:32 ` [PATCH 2/6] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
` (4 subsequent siblings)
5 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2024-12-17 9:32 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel
Introduce for_each_numa_hop_node() and sched_numa_hop_node() to iterate
over node IDs in order of increasing NUMA distance from a given starting
node.
These iterator functions are similar to for_each_numa_hop_mask() and
sched_numa_hop_mask(), but instead of providing a cpumask at each
iteration, they provide a node ID.
Example usage:
nodemask_t hop_nodes = NODE_MASK_NONE;
int start = cpu_to_node(smp_processor_id());
for_each_numa_hop_node(node, start, hop_nodes, N_ONLINE)
pr_info("node (%d, %d) -> \n",
start, node, node_distance(start, node);
Simulating the following NUMA topology in virtme-ng:
$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1
node 0 size: 1006 MB
node 0 free: 928 MB
node 1 cpus: 2 3
node 1 size: 1007 MB
node 1 free: 986 MB
node 2 cpus: 4 5
node 2 size: 889 MB
node 2 free: 862 MB
node 3 cpus: 6 7
node 3 size: 1006 MB
node 3 free: 983 MB
node distances:
node   0   1   2   3
  0:  10  51  31  41
  1:  51  10  21  61
  2:  31  21  10  11
  3:  41  61  11  10
The output of the example above (on node 0) is the following:
[ 84.953644] node (0, 0) -> 10
[ 84.953712] node (0, 2) -> 31
[ 84.953764] node (0, 3) -> 41
[ 84.953817] node (0, 1) -> 51
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
include/linux/topology.h | 28 ++++++++++++++++++++++-
kernel/sched/topology.c | 49 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 76 insertions(+), 1 deletion(-)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 52f5850730b3..d9014d90580d 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -248,12 +248,18 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
#ifdef CONFIG_NUMA
int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops);
-#else
+extern int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state);
+#else /* !CONFIG_NUMA */
static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
{
return cpumask_nth_and(cpu, cpus, cpu_online_mask);
}
+static inline int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
+{
+ return NUMA_NO_NODE;
+}
+
static inline const struct cpumask *
sched_numa_hop_mask(unsigned int node, unsigned int hops)
{
@@ -261,6 +267,26 @@ sched_numa_hop_mask(unsigned int node, unsigned int hops)
}
#endif /* CONFIG_NUMA */
+/**
+ * for_each_numa_hop_node - iterate over NUMA nodes at increasing hop distances
+ * from a given starting node.
+ * @__node: the iteration variable, representing the current NUMA node.
+ * @__start: the NUMA node to start the iteration from.
+ * @__hop_nodes: a nodemask_t to track the visited nodes.
+ * @__state: state of NUMA nodes to iterate.
+ *
+ * Requires rcu_lock to be held.
+ *
+ * This macro iterates over NUMA nodes in increasing distance from
+ * @__start.
+ *
+ * Yields NUMA_NO_NODE when all the nodes have been visited.
+ */
+#define for_each_numa_hop_node(__node, __start, __hop_nodes, __state) \
+ for (int __node = __start; \
+ __node != NUMA_NO_NODE; \
+ __node = sched_numa_hop_node(&(__hop_nodes), __start, __state))
+
/**
* for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance
* from a given node.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..8e77c235ad9a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2185,6 +2185,55 @@ int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
}
EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
+/**
+ * sched_numa_hop_node - Find the NUMA node at the closest hop distance
+ * from node @start.
+ *
+ * @hop_nodes: a pointer to a nodemask_t representing the visited nodes.
+ * @start: the NUMA node to start the hop search from.
+ * @state: the node state to filter nodes by.
+ *
+ * This function iterates over all NUMA nodes in the given state, skipping
+ * @start itself and any node already set in @hop_nodes, and computes the
+ * distance to @start using node_distance(). The unvisited node with the
+ * minimum distance is returned and marked as visited in @hop_nodes.
+ *
+ * Returns the NUMA node ID closest in terms of hop distance from the
+ * @start node, or NUMA_NO_NODE if no node is found (or all nodes have been
+ * visited).
+ */
+int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
+{
+ int dist, n, min_node, min_dist;
+
+ if (state >= NR_NODE_STATES)
+ return NUMA_NO_NODE;
+
+ min_node = NUMA_NO_NODE;
+ min_dist = INT_MAX;
+
+ for_each_node_state(n, state) {
+ if (n == start || node_isset(n, *hop_nodes))
+ continue;
+ dist = node_distance(start, n);
+ if (dist < min_dist) {
+ min_dist = dist;
+ min_node = n;
+ }
+ }
+ if (min_node != NUMA_NO_NODE)
+ node_set(min_node, *hop_nodes);
+
+ return min_node;
+}
+EXPORT_SYMBOL_GPL(sched_numa_hop_node);
+
/**
* sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away from
* @node
--
2.47.1
* [PATCH 2/6] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE
2024-12-17 9:32 [PATCHSET v7 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-17 9:32 ` [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
@ 2024-12-17 9:32 ` Andrea Righi
2024-12-17 9:32 ` [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks Andrea Righi
` (3 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Andrea Righi @ 2024-12-17 9:32 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows
each scx scheduler to choose between using a single global flat idle
cpumask or multiple per-node idle cpumasks.
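For reference, a minimal sketch of how a BPF scheduler can opt in by
setting the flag in its struct sched_ext_ops (the callbacks and the
scheduler name below are illustrative, not part of this patch):

    SEC(".struct_ops.link")
    struct sched_ext_ops example_ops = {
            .select_cpu     = (void *)example_select_cpu,
            .enqueue        = (void *)example_enqueue,
            .flags          = SCX_OPS_BUILTIN_IDLE_PER_NODE,
            .name           = "example",
    };

If the flag is left clear, the behavior is unchanged and a single flat
idle cpumask is used.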
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 54e659ba9476..a17abd2df4d4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -122,6 +122,12 @@ enum scx_ops_flags {
*/
SCX_OPS_SWITCH_PARTIAL = 1LLU << 3,
+ /*
+ * If set, enable per-node idle cpumasks. If clear, use a single global
+ * flat idle cpumask.
+ */
+ SCX_OPS_BUILTIN_IDLE_PER_NODE = 1LLU << 4,
+
/*
* CPU cgroup support flags
*/
@@ -131,6 +137,7 @@ enum scx_ops_flags {
SCX_OPS_ENQ_LAST |
SCX_OPS_ENQ_EXITING |
SCX_OPS_SWITCH_PARTIAL |
+ SCX_OPS_BUILTIN_IDLE_PER_NODE |
SCX_OPS_HAS_CGROUP_WEIGHT,
};
@@ -5446,6 +5453,16 @@ static int validate_ops(const struct sched_ext_ops *ops)
return -EINVAL;
}
+ /*
+ * SCX_OPS_BUILTIN_IDLE_PER_NODE requires built-in CPU idle
+ * selection policy to be enabled.
+ */
+ if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
+ (ops->update_idle && !(ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE))) {
+ scx_ops_error("SCX_OPS_BUILTIN_IDLE_PER_NODE requires CPU idle selection enabled");
+ return -EINVAL;
+ }
+
return 0;
}
--
2.47.1
* [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks
2024-12-17 9:32 [PATCHSET v7 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-17 9:32 ` [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
2024-12-17 9:32 ` [PATCH 2/6] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
@ 2024-12-17 9:32 ` Andrea Righi
2024-12-17 23:22 ` Tejun Heo
` (2 more replies)
2024-12-17 9:32 ` [PATCH 4/6] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
` (2 subsequent siblings)
5 siblings, 3 replies; 23+ messages in thread
From: Andrea Righi @ 2024-12-17 9:32 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel
Using a single global idle mask can lead to inefficiencies and
significant stress on the cache coherency protocol on large systems with
multiple NUMA nodes, since all CPUs generate intense read/write activity
on the same global cpumask.
Therefore, split the global cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on large systems.
Each cpumask tracks only the idle CPUs within its corresponding NUMA
node, treating CPUs in other NUMA nodes as busy. In this way, concurrent
access to the idle cpumasks is mostly confined within each NUMA node.
NOTE: with per-node idle cpumasks enabled, scx_bpf_get_idle_cpu/smtmask()
return the cpumask of the current NUMA node, instead of a single cpumask
covering all CPUs.
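For example, a scheduler that inspects the idle mask from ops.select_cpu()
now observes only the CPUs of the NUMA node it is currently running on (a
hedged sketch using the existing kfuncs; prev_cpu is assumed to come from
the callback arguments):

    const struct cpumask *idle;
    bool is_idle_and_local;

    /* with SCX_OPS_BUILTIN_IDLE_PER_NODE set: current node's idle CPUs only */
    idle = scx_bpf_get_idle_cpumask();
    is_idle_and_local = bpf_cpumask_test_cpu(prev_cpu, idle);
    scx_bpf_put_idle_cpumask(idle);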
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 281 +++++++++++++++++++++++++++++++++------------
1 file changed, 209 insertions(+), 72 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a17abd2df4d4..d4666db4a212 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -894,6 +894,7 @@ static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
#ifdef CONFIG_SMP
static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
+static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
#endif
static struct static_key_false scx_has_op[SCX_OPI_END] =
@@ -937,11 +938,82 @@ static struct delayed_work scx_watchdog_work;
#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
#endif
-static struct {
+struct idle_cpumask {
cpumask_var_t cpu;
cpumask_var_t smt;
-} idle_masks CL_ALIGNED_IF_ONSTACK;
+};
+
+/*
+ * Make sure a NUMA node is in a valid range.
+ */
+static int validate_node(int node)
+{
+ /* If no node is specified, return the current one */
+ if (node == NUMA_NO_NODE)
+ return numa_node_id();
+
+ /* Make sure node is in the range of possible nodes */
+ if (node < 0 || node >= num_possible_nodes())
+ return -EINVAL;
+
+ return node;
+}
+
+/*
+ * cpumasks to track idle CPUs within each NUMA node.
+ *
+ * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
+ * from node 0 is used to track all idle CPUs system-wide.
+ */
+static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
+
+static struct cpumask *get_idle_mask_node(int node, bool smt)
+{
+ if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
+ return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
+
+ node = validate_node(node);
+ if (node < 0)
+ return cpu_none_mask;
+
+ return smt ? idle_masks[node]->smt : idle_masks[node]->cpu;
+}
+
+static struct cpumask *get_idle_cpumask_node(int node)
+{
+ return get_idle_mask_node(node, false);
+}
+
+static struct cpumask *get_idle_smtmask_node(int node)
+{
+ return get_idle_mask_node(node, true);
+}
+
+static void idle_masks_init(void)
+{
+ int node;
+
+ idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
+ BUG_ON(!idle_masks);
+
+ for_each_node_state(node, N_POSSIBLE) {
+ idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
+ BUG_ON(!idle_masks[node]);
+
+ BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
+ BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
+ }
+}
+#else /* !CONFIG_SMP */
+static struct cpumask *get_idle_cpumask_node(int node)
+{
+ return cpu_none_mask;
+}
+static struct cpumask *get_idle_smtmask_node(int node)
+{
+ return cpu_none_mask;
+}
#endif /* CONFIG_SMP */
/* for %SCX_KICK_WAIT */
@@ -3173,6 +3245,9 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
static bool test_and_clear_cpu_idle(int cpu)
{
+ int node = cpu_to_node(cpu);
+ struct cpumask *idle_cpu = get_idle_cpumask_node(node);
+
#ifdef CONFIG_SCHED_SMT
/*
* SMT mask should be cleared whether we can claim @cpu or not. The SMT
@@ -3181,29 +3256,34 @@ static bool test_and_clear_cpu_idle(int cpu)
*/
if (sched_smt_active()) {
const struct cpumask *smt = cpu_smt_mask(cpu);
+ struct cpumask *idle_smt = get_idle_smtmask_node(node);
/*
* If offline, @cpu is not its own sibling and
* scx_pick_idle_cpu() can get caught in an infinite loop as
- * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
- * is eventually cleared.
+ * @cpu is never cleared from the idle SMT mask. Ensure that
+ * @cpu is eventually cleared.
+ *
+ * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
+ * reduce memory writes, which may help alleviate cache
+ * coherence pressure.
*/
- if (cpumask_intersects(smt, idle_masks.smt))
- cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
- else if (cpumask_test_cpu(cpu, idle_masks.smt))
- __cpumask_clear_cpu(cpu, idle_masks.smt);
+ if (cpumask_intersects(smt, idle_smt))
+ cpumask_andnot(idle_smt, idle_smt, smt);
+ else if (cpumask_test_cpu(cpu, idle_smt))
+ __cpumask_clear_cpu(cpu, idle_smt);
}
#endif
- return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
+ return cpumask_test_and_clear_cpu(cpu, idle_cpu);
}
-static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
+static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
{
int cpu;
retry:
if (sched_smt_active()) {
- cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
+ cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
if (cpu < nr_cpu_ids)
goto found;
@@ -3211,7 +3291,7 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
return -EBUSY;
}
- cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
+ cpu = cpumask_any_and_distribute(get_idle_cpumask_node(node), cpus_allowed);
if (cpu >= nr_cpu_ids)
return -EBUSY;
@@ -3220,6 +3300,40 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
return cpu;
else
goto retry;
+
+}
+
+static s32
+scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
+{
+ nodemask_t hop_nodes = NODE_MASK_NONE;
+ int start_node = cpu_to_node(prev_cpu);
+ s32 cpu = -EBUSY;
+
+ /*
+ * Traverse all online nodes in order of increasing distance,
+ * starting from prev_cpu's node.
+ */
+ rcu_read_lock();
+ for_each_numa_hop_node(node, start_node, hop_nodes, N_ONLINE) {
+ cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
+ if (cpu >= 0)
+ break;
+ }
+ rcu_read_unlock();
+
+ return cpu;
+}
+
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
+{
+ /*
+ * Only node 0 is used if per-node idle cpumasks are disabled.
+ */
+ if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
+ return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags);
+
+ return scx_pick_idle_cpu_numa(cpus_allowed, prev_cpu, flags);
}
/*
@@ -3339,7 +3453,7 @@ static bool llc_numa_mismatch(void)
* CPU belongs to a single LLC domain, and that each LLC domain is entirely
* contained within a single NUMA node.
*/
-static void update_selcpu_topology(void)
+static void update_selcpu_topology(struct sched_ext_ops *ops)
{
bool enable_llc = false, enable_numa = false;
unsigned int nr_cpus;
@@ -3360,6 +3474,14 @@ static void update_selcpu_topology(void)
if (nr_cpus > 0) {
if (nr_cpus < num_online_cpus())
enable_llc = true;
+ /*
+ * No need to enable LLC optimization if the LLC domains are
+ * perfectly overlapping with the NUMA domains when per-node
+ * cpumasks are enabled.
+ */
+ if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
+ !llc_numa_mismatch())
+ enable_llc = false;
pr_debug("sched_ext: LLC=%*pb weight=%u\n",
cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
}
@@ -3395,6 +3517,14 @@ static void update_selcpu_topology(void)
static_branch_enable_cpuslocked(&scx_selcpu_topo_numa);
else
static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
+
+ /*
+ * Check if we need to enable per-node cpumasks.
+ */
+ if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
+ static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
+ else
+ static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
}
/*
@@ -3415,6 +3545,8 @@ static void update_selcpu_topology(void)
* 4. Pick a CPU within the same NUMA node, if enabled:
* - choose a CPU from the same NUMA node to reduce memory access latency.
*
+ * 5. Pick any idle CPU usable by the task.
+ *
* Step 3 and 4 are performed only if the system has, respectively, multiple
* LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
* scx_selcpu_topo_numa).
@@ -3427,6 +3559,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
{
const struct cpumask *llc_cpus = NULL;
const struct cpumask *numa_cpus = NULL;
+ int node = cpu_to_node(prev_cpu);
s32 cpu;
*found = false;
@@ -3484,9 +3617,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* piled up on it even if there is an idle core elsewhere on
* the system.
*/
- if (!cpumask_empty(idle_masks.cpu) &&
- !(current->flags & PF_EXITING) &&
- cpu_rq(cpu)->scx.local_dsq.nr == 0) {
+ if (!(current->flags & PF_EXITING) &&
+ cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
+ !cpumask_empty(get_idle_cpumask_node(cpu_to_node(cpu)))) {
if (cpumask_test_cpu(cpu, p->cpus_ptr))
goto cpu_found;
}
@@ -3500,7 +3633,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
/*
* Keep using @prev_cpu if it's part of a fully idle core.
*/
- if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
+ if (cpumask_test_cpu(prev_cpu, get_idle_smtmask_node(node)) &&
test_and_clear_cpu_idle(prev_cpu)) {
cpu = prev_cpu;
goto cpu_found;
@@ -3510,7 +3643,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* Search for any fully idle core in the same LLC domain.
*/
if (llc_cpus) {
- cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
+ cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, SCX_PICK_IDLE_CORE);
if (cpu >= 0)
goto cpu_found;
}
@@ -3519,7 +3652,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* Search for any fully idle core in the same NUMA node.
*/
if (numa_cpus) {
- cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
+ cpu = scx_pick_idle_cpu_from_node(node, numa_cpus, SCX_PICK_IDLE_CORE);
if (cpu >= 0)
goto cpu_found;
}
@@ -3527,7 +3660,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
/*
* Search for any full idle core usable by the task.
*/
- cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, SCX_PICK_IDLE_CORE);
if (cpu >= 0)
goto cpu_found;
}
@@ -3544,7 +3677,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* Search for any idle CPU in the same LLC domain.
*/
if (llc_cpus) {
- cpu = scx_pick_idle_cpu(llc_cpus, 0);
+ cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, 0);
if (cpu >= 0)
goto cpu_found;
}
@@ -3553,7 +3686,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* Search for any idle CPU in the same NUMA node.
*/
if (numa_cpus) {
- cpu = scx_pick_idle_cpu(numa_cpus, 0);
+ cpu = scx_pick_idle_cpu_from_node(node, numa_cpus, 0);
if (cpu >= 0)
goto cpu_found;
}
@@ -3561,7 +3694,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
/*
* Search for any idle CPU usable by the task.
*/
- cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, 0);
if (cpu >= 0)
goto cpu_found;
@@ -3643,17 +3776,33 @@ static void set_cpus_allowed_scx(struct task_struct *p,
static void reset_idle_masks(void)
{
+ int node;
+
+ if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
+ cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
+ cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
+ return;
+ }
+
/*
* Consider all online cpus idle. Should converge to the actual state
* quickly.
*/
- cpumask_copy(idle_masks.cpu, cpu_online_mask);
- cpumask_copy(idle_masks.smt, cpu_online_mask);
+ for_each_node_state(node, N_POSSIBLE) {
+ const struct cpumask *node_mask = cpumask_of_node(node);
+ struct cpumask *idle_cpu = get_idle_cpumask_node(node);
+ struct cpumask *idle_smt = get_idle_smtmask_node(node);
+
+ cpumask_and(idle_cpu, cpu_online_mask, node_mask);
+ cpumask_copy(idle_smt, idle_cpu);
+ }
}
void __scx_update_idle(struct rq *rq, bool idle)
{
int cpu = cpu_of(rq);
+ int node = cpu_to_node(cpu);
+ struct cpumask *idle_cpu = get_idle_cpumask_node(node);
if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
@@ -3661,27 +3810,25 @@ void __scx_update_idle(struct rq *rq, bool idle)
return;
}
- if (idle)
- cpumask_set_cpu(cpu, idle_masks.cpu);
- else
- cpumask_clear_cpu(cpu, idle_masks.cpu);
+ assign_cpu(cpu, idle_cpu, idle);
#ifdef CONFIG_SCHED_SMT
if (sched_smt_active()) {
const struct cpumask *smt = cpu_smt_mask(cpu);
+ struct cpumask *idle_smt = get_idle_smtmask_node(node);
if (idle) {
/*
- * idle_masks.smt handling is racy but that's fine as
- * it's only for optimization and self-correcting.
+ * idle_smt handling is racy but that's fine as it's
+ * only for optimization and self-correcting.
*/
for_each_cpu(cpu, smt) {
- if (!cpumask_test_cpu(cpu, idle_masks.cpu))
+ if (!cpumask_test_cpu(cpu, idle_cpu))
return;
}
- cpumask_or(idle_masks.smt, idle_masks.smt, smt);
+ cpumask_or(idle_smt, idle_smt, smt);
} else {
- cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
+ cpumask_andnot(idle_smt, idle_smt, smt);
}
}
#endif
@@ -3694,7 +3841,7 @@ static void handle_hotplug(struct rq *rq, bool online)
atomic_long_inc(&scx_hotplug_seq);
if (scx_enabled())
- update_selcpu_topology();
+ update_selcpu_topology(&scx_ops);
if (online && SCX_HAS_OP(cpu_online))
SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu);
@@ -3729,7 +3876,10 @@ static void rq_offline_scx(struct rq *rq)
#else /* CONFIG_SMP */
static bool test_and_clear_cpu_idle(int cpu) { return false; }
-static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; }
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
+{
+ return -EBUSY;
+}
static void reset_idle_masks(void) {}
#endif /* CONFIG_SMP */
@@ -5579,7 +5729,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
check_hotplug_seq(ops);
#ifdef CONFIG_SMP
- update_selcpu_topology();
+ update_selcpu_topology(ops);
#endif
cpus_read_unlock();
@@ -6272,8 +6422,7 @@ void __init init_sched_ext_class(void)
BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
#ifdef CONFIG_SMP
- BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
- BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
+ idle_masks_init();
#endif
scx_kick_cpus_pnt_seqs =
__alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
@@ -6309,6 +6458,15 @@ void __init init_sched_ext_class(void)
*/
#include <linux/btf_ids.h>
+static bool check_builtin_idle_enabled(void)
+{
+ if (static_branch_likely(&scx_builtin_idle_enabled))
+ return true;
+
+ scx_ops_error("built-in idle tracking is disabled");
+ return false;
+}
+
__bpf_kfunc_start_defs();
/**
@@ -6328,10 +6486,8 @@ __bpf_kfunc_start_defs();
__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
u64 wake_flags, bool *is_idle)
{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
+ if (!check_builtin_idle_enabled())
goto prev_cpu;
- }
if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
goto prev_cpu;
@@ -7419,46 +7575,31 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
/**
* scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
- * per-CPU cpumask.
+ * per-CPU cpumask of the current NUMA node.
*
* Returns NULL if idle tracking is not enabled, or running on a UP kernel.
*/
__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
+ if (!check_builtin_idle_enabled())
return cpu_none_mask;
- }
-#ifdef CONFIG_SMP
- return idle_masks.cpu;
-#else
- return cpu_none_mask;
-#endif
+ return get_idle_cpumask_node(NUMA_NO_NODE);
}
/**
* scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
- * per-physical-core cpumask. Can be used to determine if an entire physical
- * core is free.
+ * per-physical-core cpumask of the current NUMA node. Can be used to determine
+ * if an entire physical core is free.
*
* Returns NULL if idle tracking is not enabled, or running on a UP kernel.
*/
__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
+ if (!check_builtin_idle_enabled())
return cpu_none_mask;
- }
-#ifdef CONFIG_SMP
- if (sched_smt_active())
- return idle_masks.smt;
- else
- return idle_masks.cpu;
-#else
- return cpu_none_mask;
-#endif
+ return get_idle_smtmask_node(NUMA_NO_NODE);
}
/**
@@ -7487,10 +7628,8 @@ __bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
*/
__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
+ if (!check_builtin_idle_enabled())
return false;
- }
if (ops_cpu_valid(cpu, NULL))
return test_and_clear_cpu_idle(cpu);
@@ -7520,12 +7659,10 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
__bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
u64 flags)
{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
+ if (!check_builtin_idle_enabled())
return -EBUSY;
- }
- return scx_pick_idle_cpu(cpus_allowed, flags);
+ return scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
}
/**
@@ -7548,7 +7685,7 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
s32 cpu;
if (static_branch_likely(&scx_builtin_idle_enabled)) {
- cpu = scx_pick_idle_cpu(cpus_allowed, flags);
+ cpu = scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
if (cpu >= 0)
return cpu;
}
--
2.47.1
* [PATCH 4/6] sched_ext: Get rid of the scx_selcpu_topo_numa logic
2024-12-17 9:32 [PATCHSET v7 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
` (2 preceding siblings ...)
2024-12-17 9:32 ` [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks Andrea Righi
@ 2024-12-17 9:32 ` Andrea Righi
2024-12-17 9:32 ` [PATCH 5/6] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-17 9:32 ` [PATCH 6/6] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
5 siblings, 0 replies; 23+ messages in thread
From: Andrea Righi @ 2024-12-17 9:32 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel
With the introduction of separate per-NUMA node cpumasks, we
automatically track idle CPUs within each NUMA node.
This makes the special NUMA-aware idle CPU selection logic redundant, so
we can get rid of it.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 75 +++-------------------------------------------
1 file changed, 4 insertions(+), 71 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d4666db4a212..80778d660be6 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -893,7 +893,6 @@ static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
#ifdef CONFIG_SMP
static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
-static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
#endif
@@ -3385,25 +3384,6 @@ static unsigned int numa_weight(s32 cpu)
return sg->group_weight;
}
-/*
- * Return the cpumask representing the NUMA domain of @cpu (or NULL if the NUMA
- * domain is not defined).
- */
-static struct cpumask *numa_span(s32 cpu)
-{
- struct sched_domain *sd;
- struct sched_group *sg;
-
- sd = rcu_dereference(per_cpu(sd_numa, cpu));
- if (!sd)
- return NULL;
- sg = sd->groups;
- if (!sg)
- return NULL;
-
- return sched_group_span(sg);
-}
-
/*
* Return true if the LLC domains do not perfectly overlap with the NUMA
* domains, false otherwise.
@@ -3455,7 +3435,7 @@ static bool llc_numa_mismatch(void)
*/
static void update_selcpu_topology(struct sched_ext_ops *ops)
{
- bool enable_llc = false, enable_numa = false;
+ bool enable_llc = false;
unsigned int nr_cpus;
s32 cpu = cpumask_first(cpu_online_mask);
@@ -3485,38 +3465,15 @@ static void update_selcpu_topology(struct sched_ext_ops *ops)
pr_debug("sched_ext: LLC=%*pb weight=%u\n",
cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
}
-
- /*
- * Enable NUMA optimization only when there are multiple NUMA domains
- * among the online CPUs and the NUMA domains don't perfectly overlaps
- * with the LLC domains.
- *
- * If all CPUs belong to the same NUMA node and the same LLC domain,
- * enabling both NUMA and LLC optimizations is unnecessary, as checking
- * for an idle CPU in the same domain twice is redundant.
- */
- nr_cpus = numa_weight(cpu);
- if (nr_cpus > 0) {
- if (nr_cpus < num_online_cpus() && llc_numa_mismatch())
- enable_numa = true;
- pr_debug("sched_ext: NUMA=%*pb weight=%u\n",
- cpumask_pr_args(numa_span(cpu)), numa_weight(cpu));
- }
rcu_read_unlock();
pr_debug("sched_ext: LLC idle selection %s\n",
enable_llc ? "enabled" : "disabled");
- pr_debug("sched_ext: NUMA idle selection %s\n",
- enable_numa ? "enabled" : "disabled");
if (enable_llc)
static_branch_enable_cpuslocked(&scx_selcpu_topo_llc);
else
static_branch_disable_cpuslocked(&scx_selcpu_topo_llc);
- if (enable_numa)
- static_branch_enable_cpuslocked(&scx_selcpu_topo_numa);
- else
- static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
/*
* Check if we need to enable per-node cpumasks.
@@ -3547,9 +3504,8 @@ static void update_selcpu_topology(struct sched_ext_ops *ops)
*
* 5. Pick any idle CPU usable by the task.
*
- * Step 3 and 4 are performed only if the system has, respectively, multiple
- * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
- * scx_selcpu_topo_numa).
+ * Step 3 is performed only if the system has multiple LLC domains that are not
+ * perfectly overlapping with the NUMA domains (see scx_selcpu_topo_llc).
*
* NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
* we never call ops.select_cpu() for them, see select_task_rq().
@@ -3558,7 +3514,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
u64 wake_flags, bool *found)
{
const struct cpumask *llc_cpus = NULL;
- const struct cpumask *numa_cpus = NULL;
int node = cpu_to_node(prev_cpu);
s32 cpu;
@@ -3580,13 +3535,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* CPU affinity), the task will simply use the flat scheduling domain
* defined by user-space.
*/
- if (p->nr_cpus_allowed >= num_possible_cpus()) {
- if (static_branch_maybe(CONFIG_NUMA, &scx_selcpu_topo_numa))
- numa_cpus = numa_span(prev_cpu);
-
+ if (p->nr_cpus_allowed >= num_possible_cpus())
if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
llc_cpus = llc_span(prev_cpu);
- }
/*
* If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
@@ -3648,15 +3599,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
goto cpu_found;
}
- /*
- * Search for any fully idle core in the same NUMA node.
- */
- if (numa_cpus) {
- cpu = scx_pick_idle_cpu_from_node(node, numa_cpus, SCX_PICK_IDLE_CORE);
- if (cpu >= 0)
- goto cpu_found;
- }
-
/*
* Search for any full idle core usable by the task.
*/
@@ -3682,15 +3624,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
goto cpu_found;
}
- /*
- * Search for any idle CPU in the same NUMA node.
- */
- if (numa_cpus) {
- cpu = scx_pick_idle_cpu_from_node(node, numa_cpus, 0);
- if (cpu >= 0)
- goto cpu_found;
- }
-
/*
* Search for any idle CPU usable by the task.
*/
--
2.47.1
* [PATCH 5/6] sched_ext: Introduce NUMA aware idle cpu kfunc helpers
2024-12-17 9:32 [PATCHSET v7 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
` (3 preceding siblings ...)
2024-12-17 9:32 ` [PATCH 4/6] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
@ 2024-12-17 9:32 ` Andrea Righi
2024-12-17 9:32 ` [PATCH 6/6] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
5 siblings, 0 replies; 23+ messages in thread
From: Andrea Righi @ 2024-12-17 9:32 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel
Add the following kfuncs to give scx schedulers direct access to the
per-node idle cpumask information (a usage sketch follows the list):
const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
s32 scx_bpf_pick_idle_cpu_node(int node,
const cpumask_t *cpus_allowed, u64 flags)
int scx_bpf_cpu_to_node(s32 cpu)
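A possible use from ops.select_cpu(), preferring an idle CPU in prev_cpu's
node before falling back to any idle CPU (a hedged sketch assuming the
scheduler sets SCX_OPS_BUILTIN_IDLE_PER_NODE; the callback name is
illustrative):

    s32 BPF_STRUCT_OPS(example_select_cpu, struct task_struct *p,
                       s32 prev_cpu, u64 wake_flags)
    {
            int node = scx_bpf_cpu_to_node(prev_cpu);
            s32 cpu;

            /* try an idle CPU in prev_cpu's NUMA node first */
            cpu = scx_bpf_pick_idle_cpu_node(node, p->cpus_ptr, 0);
            if (cpu >= 0)
                    return cpu;

            /* fall back to any idle CPU usable by the task */
            cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
            return cpu >= 0 ? cpu : prev_cpu;
    }

On kernels without this series, the __COMPAT_*() wrappers added to
compat.bpf.h can be used in place of the direct kfunc calls.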
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 98 +++++++++++++++++++++++-
tools/sched_ext/include/scx/common.bpf.h | 4 +
tools/sched_ext/include/scx/compat.bpf.h | 19 +++++
3 files changed, 118 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 80778d660be6..1edf4206aab5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -433,6 +433,7 @@ struct sched_ext_ops {
* - scx_bpf_select_cpu_dfl()
* - scx_bpf_test_and_clear_cpu_idle()
* - scx_bpf_pick_idle_cpu()
+ * - scx_bpf_pick_idle_cpu_node()
*
* The user also must implement ops.select_cpu() as the default
* implementation relies on scx_bpf_select_cpu_dfl().
@@ -890,10 +891,10 @@ static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last);
static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting);
static DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt);
static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
+static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
#ifdef CONFIG_SMP
static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
-static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
#endif
static struct static_key_false scx_has_op[SCX_OPI_END] =
@@ -1004,6 +1005,11 @@ static void idle_masks_init(void)
}
}
#else /* !CONFIG_SMP */
+static int validate_node(int node)
+{
+ return numa_node_id();
+}
+
static struct cpumask *get_idle_cpumask_node(int node)
{
return cpu_none_mask;
@@ -3813,6 +3819,10 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u
{
return -EBUSY;
}
+static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
+{
+ return -EBUSY;
+}
static void reset_idle_masks(void) {}
#endif /* CONFIG_SMP */
@@ -6400,6 +6410,15 @@ static bool check_builtin_idle_enabled(void)
return false;
}
+static bool check_builtin_idle_per_node_enabled(void)
+{
+ if (static_branch_likely(&scx_builtin_idle_per_node))
+ return true;
+
+ scx_ops_error("per-node idle tracking is disabled");
+ return false;
+}
+
__bpf_kfunc_start_defs();
/**
@@ -7476,6 +7495,16 @@ __bpf_kfunc u32 scx_bpf_nr_cpu_ids(void)
return nr_cpu_ids;
}
+/**
+ * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to
+ */
+__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu)
+{
+ if (cpu < 0 || cpu >= nr_cpu_ids)
+ return -EINVAL;
+ return cpu_to_node(cpu);
+}
+
/**
* scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask
*/
@@ -7506,11 +7535,26 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
*/
}
+/**
+ * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
+ * per-CPU cpumask of a target NUMA node.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, if @node is not
+ * valid, or running on a UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
+{
+ if (!check_builtin_idle_per_node_enabled())
+ return cpu_none_mask;
+
+ return get_idle_cpumask_node(node);
+}
/**
* scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
* per-CPU cpumask of the current NUMA node.
*
- * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
+ * Returns an empty cpumask if idle tracking is not enabled, or running on a UP
+ * kernel.
*/
__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
{
@@ -7520,12 +7564,29 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
return get_idle_cpumask_node(NUMA_NO_NODE);
}
+/**
+ * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
+ * per-physical-core cpumask of a target NUMA node. Can be used to determine
+ * if an entire physical core is free.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, if @node is not
+ * valid, or running on a UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
+{
+ if (!check_builtin_idle_per_node_enabled())
+ return cpu_none_mask;
+
+ return get_idle_smtmask_node(node);
+}
+
/**
* scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
* per-physical-core cpumask of the current NUMA node. Can be used to determine
* if an entire physical core is free.
*
- * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
+ * Returns an empty cpumask if idle tracking is not enabled, or running on a UP
+ * kernel.
*/
__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
{
@@ -7570,6 +7631,33 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
return false;
}
+/**
+ * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
+ * @node: target NUMA node
+ * @cpus_allowed: Allowed cpumask
+ * @flags: %SCX_PICK_IDLE_CPU_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
+ * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu
+ * was found.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(int node, const struct cpumask *cpus_allowed,
+ u64 flags)
+{
+ node = validate_node(node);
+ if (node < 0)
+ return node;
+
+ if (!check_builtin_idle_per_node_enabled())
+ return -EBUSY;
+
+ return scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
+}
+
/**
* scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
* @cpus_allowed: Allowed cpumask
@@ -7704,14 +7792,18 @@ BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
+BTF_ID_FLAGS(func, scx_bpf_cpu_to_node)
BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 858ba1f438f6..0ad368401579 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -63,14 +63,18 @@ u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
+int scx_bpf_cpu_to_node(s32 cpu) __ksym __weak;
const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak;
const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym;
+const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) __ksym __weak;
const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym;
+const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) __ksym __weak;
void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym;
bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
+s32 scx_bpf_pick_idle_cpu_node(int node, const cpumask_t *cpus_allowed, u64 flags) __ksym __weak;
s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
bool scx_bpf_task_running(const struct task_struct *p) __ksym;
s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index d56520100a26..587650490743 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -125,6 +125,25 @@ bool scx_bpf_dispatch_vtime_from_dsq___compat(struct bpf_iter_scx_dsq *it__iter,
false; \
})
+#define __COMPAT_scx_bpf_cpu_to_node(cpu) \
+ (bpf_ksym_exists(scx_bpf_cpu_to_node) ? \
+ scx_bpf_cpu_to_node(cpu) : 0)
+
+#define __COMPAT_scx_bpf_get_idle_cpumask_node(node) \
+ (bpf_ksym_exists(scx_bpf_get_idle_cpumask_node) ? \
+ scx_bpf_get_idle_cpumask_node(node) : \
+ scx_bpf_get_idle_cpumask()) \
+
+#define __COMPAT_scx_bpf_get_idle_smtmask_node(node) \
+ (bpf_ksym_exists(scx_bpf_get_idle_smtmask_node) ? \
+ scx_bpf_get_idle_smtmask_node(node) : \
+ scx_bpf_get_idle_smtmask()) \
+
+#define __COMPAT_scx_bpf_pick_idle_cpu_node(node, cpus_allowed, flags) \
+ (bpf_ksym_exists(scx_bpf_pick_idle_cpu_node) ? \
+ scx_bpf_pick_idle_cpu_node(node, cpus_allowed, flags) : \
+ scx_bpf_pick_idle_cpu(cpus_allowed, flags))
+
/*
* Define sched_ext_ops. This may be expanded to define multiple variants for
* backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
--
2.47.1
* [PATCH 6/6] sched_ext: Move built-in idle CPU selection policy to a separate file
2024-12-17 9:32 [PATCHSET v7 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
` (4 preceding siblings ...)
2024-12-17 9:32 ` [PATCH 5/6] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
@ 2024-12-17 9:32 ` Andrea Righi
2024-12-20 14:53 ` Yury Norov
5 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2024-12-17 9:32 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel
As ext.c is becoming quite large, move the idle CPU selection policy
to a separate file (ext_idle.c) for better code readability.
No functional change; this is purely code reorganization.
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
MAINTAINERS | 1 +
kernel/sched/ext.c | 857 +---------------------------------------
kernel/sched/ext_idle.c | 835 ++++++++++++++++++++++++++++++++++++++
3 files changed, 853 insertions(+), 840 deletions(-)
create mode 100644 kernel/sched/ext_idle.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 1e930c7a58b1..02960d1b9ee9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20909,6 +20909,7 @@ T: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git
F: include/linux/sched/ext.h
F: kernel/sched/ext.h
F: kernel/sched/ext.c
+F: kernel/sched/ext_idle.c
F: tools/sched_ext/
F: tools/testing/selftests/sched_ext
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 1edf4206aab5..40d13594c1d1 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6,6 +6,8 @@
* Copyright (c) 2022 Tejun Heo <tj@kernel.org>
* Copyright (c) 2022 David Vernet <dvernet@meta.com>
*/
+#include <linux/btf_ids.h>
+
#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
enum scx_consts {
@@ -890,12 +892,6 @@ static bool scx_warned_zero_slice;
static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last);
static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting);
static DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt);
-static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
-static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
-
-#ifdef CONFIG_SMP
-static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
-#endif
static struct static_key_false scx_has_op[SCX_OPI_END] =
{ [0 ... SCX_OPI_END-1] = STATIC_KEY_FALSE_INIT };
@@ -903,6 +899,17 @@ static struct static_key_false scx_has_op[SCX_OPI_END] =
static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE);
static struct scx_exit_info *scx_exit_info;
+#define scx_ops_error_kind(err, fmt, args...) \
+ scx_ops_exit_kind((err), 0, fmt, ##args)
+
+#define scx_ops_exit(code, fmt, args...) \
+ scx_ops_exit_kind(SCX_EXIT_UNREG_KERN, (code), fmt, ##args)
+
+#define scx_ops_error(fmt, args...) \
+ scx_ops_error_kind(SCX_EXIT_ERROR, fmt, ##args)
+
+#define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)])
+
static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0);
static atomic_long_t scx_hotplug_seq = ATOMIC_LONG_INIT(0);
@@ -930,97 +937,6 @@ static unsigned long scx_watchdog_timestamp = INITIAL_JIFFIES;
static struct delayed_work scx_watchdog_work;
-/* idle tracking */
-#ifdef CONFIG_SMP
-#ifdef CONFIG_CPUMASK_OFFSTACK
-#define CL_ALIGNED_IF_ONSTACK
-#else
-#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
-#endif
-
-struct idle_cpumask {
- cpumask_var_t cpu;
- cpumask_var_t smt;
-};
-
-/*
- * Make sure a NUMA node is in a valid range.
- */
-static int validate_node(int node)
-{
- /* If no node is specified, return the current one */
- if (node == NUMA_NO_NODE)
- return numa_node_id();
-
- /* Make sure node is in the range of possible nodes */
- if (node < 0 || node >= num_possible_nodes())
- return -EINVAL;
-
- return node;
-}
-
-/*
- * cpumasks to track idle CPUs within each NUMA node.
- *
- * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
- * from node 0 is used to track all idle CPUs system-wide.
- */
-static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
-
-static struct cpumask *get_idle_mask_node(int node, bool smt)
-{
- if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
- return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
-
- node = validate_node(node);
- if (node < 0)
- return cpu_none_mask;
-
- return smt ? idle_masks[node]->smt : idle_masks[node]->cpu;
-}
-
-static struct cpumask *get_idle_cpumask_node(int node)
-{
- return get_idle_mask_node(node, false);
-}
-
-static struct cpumask *get_idle_smtmask_node(int node)
-{
- return get_idle_mask_node(node, true);
-}
-
-static void idle_masks_init(void)
-{
- int node;
-
- idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
- BUG_ON(!idle_masks);
-
- for_each_node_state(node, N_POSSIBLE) {
- idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
- BUG_ON(!idle_masks[node]);
-
- BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
- BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
- }
-}
-#else /* !CONFIG_SMP */
-static int validate_node(int node)
-{
- return numa_node_id();
-}
-
-static struct cpumask *get_idle_cpumask_node(int node)
-{
- return cpu_none_mask;
-}
-
-static struct cpumask *get_idle_smtmask_node(int node)
-{
- return cpu_none_mask;
-}
-#endif /* CONFIG_SMP */
-
/* for %SCX_KICK_WAIT */
static unsigned long __percpu *scx_kick_cpus_pnt_seqs;
@@ -1107,17 +1023,6 @@ static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
s64 exit_code,
const char *fmt, ...);
-#define scx_ops_error_kind(err, fmt, args...) \
- scx_ops_exit_kind((err), 0, fmt, ##args)
-
-#define scx_ops_exit(code, fmt, args...) \
- scx_ops_exit_kind(SCX_EXIT_UNREG_KERN, (code), fmt, ##args)
-
-#define scx_ops_error(fmt, args...) \
- scx_ops_error_kind(SCX_EXIT_ERROR, fmt, ##args)
-
-#define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)])
-
static long jiffies_delta_msecs(unsigned long at, unsigned long now)
{
if (time_after(at, now))
@@ -1624,6 +1529,9 @@ static int ops_sanitize_err(const char *ops_name, s32 err)
return -EPROTO;
}
+/* Built-in idle CPU selection policy */
+#include "ext_idle.c"
+
static void run_deferred(struct rq *rq)
{
process_ddsp_deferred_locals(rq);
@@ -3247,406 +3155,6 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
#endif /* CONFIG_SCHED_CORE */
#ifdef CONFIG_SMP
-
-static bool test_and_clear_cpu_idle(int cpu)
-{
- int node = cpu_to_node(cpu);
- struct cpumask *idle_cpu = get_idle_cpumask_node(node);
-
-#ifdef CONFIG_SCHED_SMT
- /*
- * SMT mask should be cleared whether we can claim @cpu or not. The SMT
- * cluster is not wholly idle either way. This also prevents
- * scx_pick_idle_cpu() from getting caught in an infinite loop.
- */
- if (sched_smt_active()) {
- const struct cpumask *smt = cpu_smt_mask(cpu);
- struct cpumask *idle_smt = get_idle_smtmask_node(node);
-
- /*
- * If offline, @cpu is not its own sibling and
- * scx_pick_idle_cpu() can get caught in an infinite loop as
- * @cpu is never cleared from the idle SMT mask. Ensure that
- * @cpu is eventually cleared.
- *
- * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
- * reduce memory writes, which may help alleviate cache
- * coherence pressure.
- */
- if (cpumask_intersects(smt, idle_smt))
- cpumask_andnot(idle_smt, idle_smt, smt);
- else if (cpumask_test_cpu(cpu, idle_smt))
- __cpumask_clear_cpu(cpu, idle_smt);
- }
-#endif
- return cpumask_test_and_clear_cpu(cpu, idle_cpu);
-}
-
-static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
-{
- int cpu;
-
-retry:
- if (sched_smt_active()) {
- cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
- if (cpu < nr_cpu_ids)
- goto found;
-
- if (flags & SCX_PICK_IDLE_CORE)
- return -EBUSY;
- }
-
- cpu = cpumask_any_and_distribute(get_idle_cpumask_node(node), cpus_allowed);
- if (cpu >= nr_cpu_ids)
- return -EBUSY;
-
-found:
- if (test_and_clear_cpu_idle(cpu))
- return cpu;
- else
- goto retry;
-
-}
-
-static s32
-scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
-{
- nodemask_t hop_nodes = NODE_MASK_NONE;
- int start_node = cpu_to_node(prev_cpu);
- s32 cpu = -EBUSY;
-
- /*
- * Traverse all online nodes in order of increasing distance,
- * starting from prev_cpu's node.
- */
- rcu_read_lock();
- for_each_numa_hop_node(node, start_node, hop_nodes, N_ONLINE) {
- cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
- if (cpu >= 0)
- break;
- }
- rcu_read_unlock();
-
- return cpu;
-}
-
-static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
-{
- /*
- * Only node 0 is used if per-node idle cpumasks are disabled.
- */
- if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
- return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags);
-
- return scx_pick_idle_cpu_numa(cpus_allowed, prev_cpu, flags);
-}
-
-/*
- * Return the amount of CPUs in the same LLC domain of @cpu (or zero if the LLC
- * domain is not defined).
- */
-static unsigned int llc_weight(s32 cpu)
-{
- struct sched_domain *sd;
-
- sd = rcu_dereference(per_cpu(sd_llc, cpu));
- if (!sd)
- return 0;
-
- return sd->span_weight;
-}
-
-/*
- * Return the cpumask representing the LLC domain of @cpu (or NULL if the LLC
- * domain is not defined).
- */
-static struct cpumask *llc_span(s32 cpu)
-{
- struct sched_domain *sd;
-
- sd = rcu_dereference(per_cpu(sd_llc, cpu));
- if (!sd)
- return 0;
-
- return sched_domain_span(sd);
-}
-
-/*
- * Return the amount of CPUs in the same NUMA domain of @cpu (or zero if the
- * NUMA domain is not defined).
- */
-static unsigned int numa_weight(s32 cpu)
-{
- struct sched_domain *sd;
- struct sched_group *sg;
-
- sd = rcu_dereference(per_cpu(sd_numa, cpu));
- if (!sd)
- return 0;
- sg = sd->groups;
- if (!sg)
- return 0;
-
- return sg->group_weight;
-}
-
-/*
- * Return true if the LLC domains do not perfectly overlap with the NUMA
- * domains, false otherwise.
- */
-static bool llc_numa_mismatch(void)
-{
- int cpu;
-
- /*
- * We need to scan all online CPUs to verify whether their scheduling
- * domains overlap.
- *
- * While it is rare to encounter architectures with asymmetric NUMA
- * topologies, CPU hotplugging or virtualized environments can result
- * in asymmetric configurations.
- *
- * For example:
- *
- * NUMA 0:
- * - LLC 0: cpu0..cpu7
- * - LLC 1: cpu8..cpu15 [offline]
- *
- * NUMA 1:
- * - LLC 0: cpu16..cpu23
- * - LLC 1: cpu24..cpu31
- *
- * In this case, if we only check the first online CPU (cpu0), we might
- * incorrectly assume that the LLC and NUMA domains are fully
- * overlapping, which is incorrect (as NUMA 1 has two distinct LLC
- * domains).
- */
- for_each_online_cpu(cpu)
- if (llc_weight(cpu) != numa_weight(cpu))
- return true;
-
- return false;
-}
-
-/*
- * Initialize topology-aware scheduling.
- *
- * Detect if the system has multiple LLC or multiple NUMA domains and enable
- * cache-aware / NUMA-aware scheduling optimizations in the default CPU idle
- * selection policy.
- *
- * Assumption: the kernel's internal topology representation assumes that each
- * CPU belongs to a single LLC domain, and that each LLC domain is entirely
- * contained within a single NUMA node.
- */
-static void update_selcpu_topology(struct sched_ext_ops *ops)
-{
- bool enable_llc = false;
- unsigned int nr_cpus;
- s32 cpu = cpumask_first(cpu_online_mask);
-
- /*
- * Enable LLC domain optimization only when there are multiple LLC
- * domains among the online CPUs. If all online CPUs are part of a
- * single LLC domain, the idle CPU selection logic can choose any
- * online CPU without bias.
- *
- * Note that it is sufficient to check the LLC domain of the first
- * online CPU to determine whether a single LLC domain includes all
- * CPUs.
- */
- rcu_read_lock();
- nr_cpus = llc_weight(cpu);
- if (nr_cpus > 0) {
- if (nr_cpus < num_online_cpus())
- enable_llc = true;
- /*
- * No need to enable LLC optimization if the LLC domains are
- * perfectly overlapping with the NUMA domains when per-node
- * cpumasks are enabled.
- */
- if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
- !llc_numa_mismatch())
- enable_llc = false;
- pr_debug("sched_ext: LLC=%*pb weight=%u\n",
- cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
- }
- rcu_read_unlock();
-
- pr_debug("sched_ext: LLC idle selection %s\n",
- enable_llc ? "enabled" : "disabled");
-
- if (enable_llc)
- static_branch_enable_cpuslocked(&scx_selcpu_topo_llc);
- else
- static_branch_disable_cpuslocked(&scx_selcpu_topo_llc);
-
- /*
- * Check if we need to enable per-node cpumasks.
- */
- if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
- static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
- else
- static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
-}
-
-/*
- * Built-in CPU idle selection policy:
- *
- * 1. Prioritize full-idle cores:
- * - always prioritize CPUs from fully idle cores (both logical CPUs are
- * idle) to avoid interference caused by SMT.
- *
- * 2. Reuse the same CPU:
- * - prefer the last used CPU to take advantage of cached data (L1, L2) and
- * branch prediction optimizations.
- *
- * 3. Pick a CPU within the same LLC (Last-Level Cache):
- * - if the above conditions aren't met, pick a CPU that shares the same LLC
- * to maintain cache locality.
- *
- * 4. Pick a CPU within the same NUMA node, if enabled:
- * - choose a CPU from the same NUMA node to reduce memory access latency.
- *
- * 5. Pick any idle CPU usable by the task.
- *
- * Step 3 is performed only if the system has multiple LLC domains that are not
- * perfectly overlapping with the NUMA domains (see scx_selcpu_topo_llc).
- *
- * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
- * we never call ops.select_cpu() for them, see select_task_rq().
- */
-static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
- u64 wake_flags, bool *found)
-{
- const struct cpumask *llc_cpus = NULL;
- int node = cpu_to_node(prev_cpu);
- s32 cpu;
-
- *found = false;
-
- /*
- * This is necessary to protect llc_cpus.
- */
- rcu_read_lock();
-
- /*
- * Determine the scheduling domain only if the task is allowed to run
- * on all CPUs.
- *
- * This is done primarily for efficiency, as it avoids the overhead of
- * updating a cpumask every time we need to select an idle CPU (which
- * can be costly in large SMP systems), but it also aligns logically:
- * if a task's scheduling domain is restricted by user-space (through
- * CPU affinity), the task will simply use the flat scheduling domain
- * defined by user-space.
- */
- if (p->nr_cpus_allowed >= num_possible_cpus())
- if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
- llc_cpus = llc_span(prev_cpu);
-
- /*
- * If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
- */
- if (wake_flags & SCX_WAKE_SYNC) {
- cpu = smp_processor_id();
-
- /*
- * If the waker's CPU is cache affine and prev_cpu is idle,
- * then avoid a migration.
- */
- if (cpus_share_cache(cpu, prev_cpu) &&
- test_and_clear_cpu_idle(prev_cpu)) {
- cpu = prev_cpu;
- goto cpu_found;
- }
-
- /*
- * If the waker's local DSQ is empty, and the system is under
- * utilized, try to wake up @p to the local DSQ of the waker.
- *
- * Checking only for an empty local DSQ is insufficient as it
- * could give the wakee an unfair advantage when the system is
- * oversaturated.
- *
- * Checking only for the presence of idle CPUs is also
- * insufficient as the local DSQ of the waker could have tasks
- * piled up on it even if there is an idle core elsewhere on
- * the system.
- */
- if (!(current->flags & PF_EXITING) &&
- cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
- !cpumask_empty(get_idle_cpumask_node(cpu_to_node(cpu)))) {
- if (cpumask_test_cpu(cpu, p->cpus_ptr))
- goto cpu_found;
- }
- }
-
- /*
- * If CPU has SMT, any wholly idle CPU is likely a better pick than
- * partially idle @prev_cpu.
- */
- if (sched_smt_active()) {
- /*
- * Keep using @prev_cpu if it's part of a fully idle core.
- */
- if (cpumask_test_cpu(prev_cpu, get_idle_smtmask_node(node)) &&
- test_and_clear_cpu_idle(prev_cpu)) {
- cpu = prev_cpu;
- goto cpu_found;
- }
-
- /*
- * Search for any fully idle core in the same LLC domain.
- */
- if (llc_cpus) {
- cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, SCX_PICK_IDLE_CORE);
- if (cpu >= 0)
- goto cpu_found;
- }
-
- /*
- * Search for any full idle core usable by the task.
- */
- cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, SCX_PICK_IDLE_CORE);
- if (cpu >= 0)
- goto cpu_found;
- }
-
- /*
- * Use @prev_cpu if it's idle.
- */
- if (test_and_clear_cpu_idle(prev_cpu)) {
- cpu = prev_cpu;
- goto cpu_found;
- }
-
- /*
- * Search for any idle CPU in the same LLC domain.
- */
- if (llc_cpus) {
- cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, 0);
- if (cpu >= 0)
- goto cpu_found;
- }
-
- /*
- * Search for any idle CPU usable by the task.
- */
- cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, 0);
- if (cpu >= 0)
- goto cpu_found;
-
- rcu_read_unlock();
- return prev_cpu;
-
-cpu_found:
- rcu_read_unlock();
-
- *found = true;
- return cpu;
-}
-
static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags)
{
/*
@@ -3713,66 +3221,6 @@ static void set_cpus_allowed_scx(struct task_struct *p,
(struct cpumask *)p->cpus_ptr);
}
-static void reset_idle_masks(void)
-{
- int node;
-
- if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
- cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
- cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
- return;
- }
-
- /*
- * Consider all online cpus idle. Should converge to the actual state
- * quickly.
- */
- for_each_node_state(node, N_POSSIBLE) {
- const struct cpumask *node_mask = cpumask_of_node(node);
- struct cpumask *idle_cpu = get_idle_cpumask_node(node);
- struct cpumask *idle_smt = get_idle_smtmask_node(node);
-
- cpumask_and(idle_cpu, cpu_online_mask, node_mask);
- cpumask_copy(idle_smt, idle_cpu);
- }
-}
-
-void __scx_update_idle(struct rq *rq, bool idle)
-{
- int cpu = cpu_of(rq);
- int node = cpu_to_node(cpu);
- struct cpumask *idle_cpu = get_idle_cpumask_node(node);
-
- if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
- SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
- if (!static_branch_unlikely(&scx_builtin_idle_enabled))
- return;
- }
-
- assign_cpu(cpu, idle_cpu, idle);
-
-#ifdef CONFIG_SCHED_SMT
- if (sched_smt_active()) {
- const struct cpumask *smt = cpu_smt_mask(cpu);
- struct cpumask *idle_smt = get_idle_smtmask_node(node);
-
- if (idle) {
- /*
- * idle_smt handling is racy but that's fine as it's
- * only for optimization and self-correcting.
- */
- for_each_cpu(cpu, smt) {
- if (!cpumask_test_cpu(cpu, idle_cpu))
- return;
- }
- cpumask_or(idle_smt, idle_smt, smt);
- } else {
- cpumask_andnot(idle_smt, idle_smt, smt);
- }
- }
-#endif
-}
-
static void handle_hotplug(struct rq *rq, bool online)
{
int cpu = cpu_of(rq);
@@ -3811,21 +3259,7 @@ static void rq_offline_scx(struct rq *rq)
{
rq->scx.flags &= ~SCX_RQ_ONLINE;
}
-
-#else /* CONFIG_SMP */
-
-static bool test_and_clear_cpu_idle(int cpu) { return false; }
-static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
-{
- return -EBUSY;
-}
-static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
-{
- return -EBUSY;
-}
-static void reset_idle_masks(void) {}
-
-#endif /* CONFIG_SMP */
+#endif /* CONFIG_SMP */
static bool check_rq_for_timeouts(struct rq *rq)
{
@@ -6399,71 +5833,6 @@ void __init init_sched_ext_class(void)
/********************************************************************************
* Helpers that can be called from the BPF scheduler.
*/
-#include <linux/btf_ids.h>
-
-static bool check_builtin_idle_enabled(void)
-{
- if (static_branch_likely(&scx_builtin_idle_enabled))
- return true;
-
- scx_ops_error("built-in idle tracking is disabled");
- return false;
-}
-
-static bool check_builtin_idle_per_node_enabled(void)
-{
- if (static_branch_likely(&scx_builtin_idle_per_node))
- return true;
-
- scx_ops_error("per-node idle tracking is disabled");
- return false;
-}
-
-__bpf_kfunc_start_defs();
-
-/**
- * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
- * @p: task_struct to select a CPU for
- * @prev_cpu: CPU @p was on previously
- * @wake_flags: %SCX_WAKE_* flags
- * @is_idle: out parameter indicating whether the returned CPU is idle
- *
- * Can only be called from ops.select_cpu() if the built-in CPU selection is
- * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set.
- * @p, @prev_cpu and @wake_flags match ops.select_cpu().
- *
- * Returns the picked CPU with *@is_idle indicating whether the picked CPU is
- * currently idle and thus a good candidate for direct dispatching.
- */
-__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
- u64 wake_flags, bool *is_idle)
-{
- if (!check_builtin_idle_enabled())
- goto prev_cpu;
-
- if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
- goto prev_cpu;
-
-#ifdef CONFIG_SMP
- return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
-#endif
-
-prev_cpu:
- *is_idle = false;
- return prev_cpu;
-}
-
-__bpf_kfunc_end_defs();
-
-BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
-BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
-BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
-
-static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
- .owner = THIS_MODULE,
- .set = &scx_kfunc_ids_select_cpu,
-};
-
static bool scx_dsq_insert_preamble(struct task_struct *p, u64 enq_flags)
{
if (!scx_kf_allowed(SCX_KF_ENQUEUE | SCX_KF_DISPATCH))
@@ -7535,189 +6904,6 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
*/
}
-/**
- * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
- * per-CPU cpumask of a target NUMA node.
- *
- * Returns an empty cpumask if idle tracking is not enabled, if @node is not
- * valid, or running on a UP kernel.
- */
-__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
-{
- if (!check_builtin_idle_per_node_enabled())
- return cpu_none_mask;
-
- return get_idle_cpumask_node(node);
-}
-/**
- * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
- * per-CPU cpumask of the current NUMA node.
- *
- * Returns an emtpy cpumask if idle tracking is not enabled, or running on a UP
- * kernel.
- */
-__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
-{
- if (!check_builtin_idle_enabled())
- return cpu_none_mask;
-
- return get_idle_cpumask_node(NUMA_NO_NODE);
-}
-
-/**
- * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
- * per-physical-core cpumask of a target NUMA node. Can be used to determine
- * if an entire physical core is free.
- *
- * Returns an empty cpumask if idle tracking is not enabled, if @node is not
- * valid, or running on a UP kernel.
- */
-__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
-{
- if (!check_builtin_idle_per_node_enabled())
- return cpu_none_mask;
-
- return get_idle_smtmask_node(node);
-}
-
-/**
- * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
- * per-physical-core cpumask of the current NUMA node. Can be used to determine
- * if an entire physical core is free.
- *
- * Returns an empty cumask if idle tracking is not enabled, or running on a UP
- * kernel.
- */
-__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
-{
- if (!check_builtin_idle_enabled())
- return cpu_none_mask;
-
- return get_idle_smtmask_node(NUMA_NO_NODE);
-}
-
-/**
- * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to
- * either the percpu, or SMT idle-tracking cpumask.
- */
-__bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
-{
- /*
- * Empty function body because we aren't actually acquiring or releasing
- * a reference to a global idle cpumask, which is read-only in the
- * caller and is never released. The acquire / release semantics here
- * are just used to make the cpumask a trusted pointer in the caller.
- */
-}
-
-/**
- * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state
- * @cpu: cpu to test and clear idle for
- *
- * Returns %true if @cpu was idle and its idle state was successfully cleared.
- * %false otherwise.
- *
- * Unavailable if ops.update_idle() is implemented and
- * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
- */
-__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
-{
- if (!check_builtin_idle_enabled())
- return false;
-
- if (ops_cpu_valid(cpu, NULL))
- return test_and_clear_cpu_idle(cpu);
- else
- return false;
-}
-
-/**
- * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
- * @node: target NUMA node
- * @cpus_allowed: Allowed cpumask
- * @flags: %SCX_PICK_IDLE_CPU_* flags
- *
- * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
- * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu
- * was found.
- *
- * Unavailable if ops.update_idle() is implemented and
- * %SCX_OPS_KEEP_BUILTIN_IDLE is not set or if %SCX_OPS_KEEP_BUILTIN_IDLE is
- * not set.
- */
-__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(int node, const struct cpumask *cpus_allowed,
- u64 flags)
-{
- node = validate_node(node);
- if (node < 0)
- return node;
-
- if (!check_builtin_idle_per_node_enabled())
- return -EBUSY;
-
- return scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
-}
-
-/**
- * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
- * @cpus_allowed: Allowed cpumask
- * @flags: %SCX_PICK_IDLE_CPU_* flags
- *
- * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu
- * number on success. -%EBUSY if no matching cpu was found.
- *
- * Idle CPU tracking may race against CPU scheduling state transitions. For
- * example, this function may return -%EBUSY as CPUs are transitioning into the
- * idle state. If the caller then assumes that there will be dispatch events on
- * the CPUs as they were all busy, the scheduler may end up stalling with CPUs
- * idling while there are pending tasks. Use scx_bpf_pick_any_cpu() and
- * scx_bpf_kick_cpu() to guarantee that there will be at least one dispatch
- * event in the near future.
- *
- * Unavailable if ops.update_idle() is implemented and
- * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
- */
-__bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
- u64 flags)
-{
- if (!check_builtin_idle_enabled())
- return -EBUSY;
-
- return scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
-}
-
-/**
- * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU
- * @cpus_allowed: Allowed cpumask
- * @flags: %SCX_PICK_IDLE_CPU_* flags
- *
- * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any
- * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu
- * number if @cpus_allowed is not empty. -%EBUSY is returned if @cpus_allowed is
- * empty.
- *
- * If ops.update_idle() is implemented and %SCX_OPS_KEEP_BUILTIN_IDLE is not
- * set, this function can't tell which CPUs are idle and will always pick any
- * CPU.
- */
-__bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
- u64 flags)
-{
- s32 cpu;
-
- if (static_branch_likely(&scx_builtin_idle_enabled)) {
- cpu = scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
- if (cpu >= 0)
- return cpu;
- }
-
- cpu = cpumask_any_distribute(cpus_allowed);
- if (cpu < nr_cpu_ids)
- return cpu;
- else
- return -EBUSY;
-}
-
/**
* scx_bpf_task_running - Is task currently running?
* @p: task of interest
@@ -7796,15 +6982,6 @@ BTF_ID_FLAGS(func, scx_bpf_cpu_to_node)
BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
-BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
-BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE)
-BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
-BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE)
-BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
-BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
-BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
-BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
-BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
new file mode 100644
index 000000000000..ec36848c1a10
--- /dev/null
+++ b/kernel/sched/ext_idle.c
@@ -0,0 +1,835 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Built-in idle CPU tracking policy.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
+ * Copyright (c) 2022 David Vernet <dvernet@meta.com>
+ * Copyright (c) 2024 Andrea Righi <arighi@nvidia.com>
+ */
+
+static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
+static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
+
+static bool check_builtin_idle_enabled(void)
+{
+ if (static_branch_likely(&scx_builtin_idle_enabled))
+ return true;
+
+ scx_ops_error("built-in idle tracking is disabled");
+ return false;
+}
+
+static bool check_builtin_idle_per_node_enabled(void)
+{
+ if (static_branch_likely(&scx_builtin_idle_per_node))
+ return true;
+
+ scx_ops_error("per-node idle tracking is disabled");
+ return false;
+}
+
+#ifdef CONFIG_SMP
+static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
+
+#ifdef CONFIG_CPUMASK_OFFSTACK
+#define CL_ALIGNED_IF_ONSTACK
+#else
+#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
+#endif
+
+struct idle_cpumask {
+ cpumask_var_t cpu;
+ cpumask_var_t smt;
+};
+
+/*
+ * cpumasks to track idle CPUs within each NUMA node.
+ *
+ * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
+ * from node 0 is used to track all idle CPUs system-wide.
+ */
+static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
+
+/*
+ * Make sure a NUMA node is in a valid range.
+ */
+static int validate_node(int node)
+{
+ /* If no node is specified, return the current one */
+ if (node == NUMA_NO_NODE)
+ return numa_node_id();
+
+ /* Make sure node is in the range of possible nodes */
+ if (node < 0 || node >= num_possible_nodes())
+ return -EINVAL;
+
+ return node;
+}
+
+static struct cpumask *get_idle_mask_node(int node, bool smt)
+{
+ if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
+ return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
+
+ node = validate_node(node);
+ if (node < 0)
+ return cpu_none_mask;
+
+ return smt ? idle_masks[node]->smt : idle_masks[node]->cpu;
+}
+
+static struct cpumask *get_idle_cpumask_node(int node)
+{
+ return get_idle_mask_node(node, false);
+}
+
+static struct cpumask *get_idle_smtmask_node(int node)
+{
+ return get_idle_mask_node(node, true);
+}
+
+static void idle_masks_init(void)
+{
+ int node;
+
+ idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
+ BUG_ON(!idle_masks);
+
+ for_each_node_state(node, N_POSSIBLE) {
+ idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
+ BUG_ON(!idle_masks[node]);
+
+ BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
+ BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
+ }
+}
+
+static bool test_and_clear_cpu_idle(int cpu)
+{
+ int node = cpu_to_node(cpu);
+ struct cpumask *idle_cpu = get_idle_cpumask_node(node);
+
+#ifdef CONFIG_SCHED_SMT
+ /*
+ * SMT mask should be cleared whether we can claim @cpu or not. The SMT
+ * cluster is not wholly idle either way. This also prevents
+ * scx_pick_idle_cpu() from getting caught in an infinite loop.
+ */
+ if (sched_smt_active()) {
+ const struct cpumask *smt = cpu_smt_mask(cpu);
+ struct cpumask *idle_smt = get_idle_smtmask_node(node);
+
+ /*
+ * If offline, @cpu is not its own sibling and
+ * scx_pick_idle_cpu() can get caught in an infinite loop as
+ * @cpu is never cleared from the idle SMT mask. Ensure that
+ * @cpu is eventually cleared.
+ *
+ * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
+ * reduce memory writes, which may help alleviate cache
+ * coherence pressure.
+ */
+ if (cpumask_intersects(smt, idle_smt))
+ cpumask_andnot(idle_smt, idle_smt, smt);
+ else if (cpumask_test_cpu(cpu, idle_smt))
+ __cpumask_clear_cpu(cpu, idle_smt);
+ }
+#endif
+ return cpumask_test_and_clear_cpu(cpu, idle_cpu);
+}
+
+static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
+{
+ int cpu;
+
+retry:
+ if (sched_smt_active()) {
+ cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
+ if (cpu < nr_cpu_ids)
+ goto found;
+
+ if (flags & SCX_PICK_IDLE_CORE)
+ return -EBUSY;
+ }
+
+ cpu = cpumask_any_and_distribute(get_idle_cpumask_node(node), cpus_allowed);
+ if (cpu >= nr_cpu_ids)
+ return -EBUSY;
+
+found:
+ if (test_and_clear_cpu_idle(cpu))
+ return cpu;
+ goto retry;
+
+}
+
+static s32
+scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
+{
+ nodemask_t hop_nodes = NODE_MASK_NONE;
+ int start_node = cpu_to_node(prev_cpu);
+ s32 cpu = -EBUSY;
+
+ /*
+ * Traverse all online nodes in order of increasing distance,
+ * starting from prev_cpu's node.
+ */
+ rcu_read_lock();
+ for_each_numa_hop_node(node, start_node, hop_nodes, N_ONLINE) {
+ cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
+ if (cpu >= 0)
+ break;
+ }
+ rcu_read_unlock();
+
+ return cpu;
+}
+
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
+{
+ /*
+ * Only node 0 is used if per-node idle cpumasks are disabled.
+ */
+ if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
+ return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags);
+
+ return scx_pick_idle_cpu_numa(cpus_allowed, prev_cpu, flags);
+}
+
+/*
+ * Return the amount of CPUs in the same LLC domain of @cpu (or zero if the LLC
+ * domain is not defined).
+ */
+static unsigned int llc_weight(s32 cpu)
+{
+ struct sched_domain *sd;
+
+ sd = rcu_dereference(per_cpu(sd_llc, cpu));
+ if (!sd)
+ return 0;
+
+ return sd->span_weight;
+}
+
+/*
+ * Return the cpumask representing the LLC domain of @cpu (or NULL if the LLC
+ * domain is not defined).
+ */
+static struct cpumask *llc_span(s32 cpu)
+{
+ struct sched_domain *sd;
+
+ sd = rcu_dereference(per_cpu(sd_llc, cpu));
+ if (!sd)
+		return NULL;
+
+ return sched_domain_span(sd);
+}
+
+/*
+ * Return the amount of CPUs in the same NUMA domain of @cpu (or zero if the
+ * NUMA domain is not defined).
+ */
+static unsigned int numa_weight(s32 cpu)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg;
+
+ sd = rcu_dereference(per_cpu(sd_numa, cpu));
+ if (!sd)
+ return 0;
+ sg = sd->groups;
+ if (!sg)
+ return 0;
+
+ return sg->group_weight;
+}
+
+/*
+ * Return true if the LLC domains do not perfectly overlap with the NUMA
+ * domains, false otherwise.
+ */
+static bool llc_numa_mismatch(void)
+{
+ int cpu;
+
+ /*
+ * We need to scan all online CPUs to verify whether their scheduling
+ * domains overlap.
+ *
+ * While it is rare to encounter architectures with asymmetric NUMA
+ * topologies, CPU hotplugging or virtualized environments can result
+ * in asymmetric configurations.
+ *
+ * For example:
+ *
+ * NUMA 0:
+ * - LLC 0: cpu0..cpu7
+ * - LLC 1: cpu8..cpu15 [offline]
+ *
+ * NUMA 1:
+ * - LLC 0: cpu16..cpu23
+ * - LLC 1: cpu24..cpu31
+ *
+ * In this case, if we only check the first online CPU (cpu0), we might
+ * incorrectly assume that the LLC and NUMA domains are fully
+ * overlapping, which is incorrect (as NUMA 1 has two distinct LLC
+ * domains).
+ */
+ for_each_online_cpu(cpu)
+ if (llc_weight(cpu) != numa_weight(cpu))
+ return true;
+
+ return false;
+}
+
+/*
+ * Initialize topology-aware scheduling.
+ *
+ * Detect if the system has multiple LLC or multiple NUMA domains and enable
+ * cache-aware / NUMA-aware scheduling optimizations in the default CPU idle
+ * selection policy.
+ *
+ * Assumption: the kernel's internal topology representation assumes that each
+ * CPU belongs to a single LLC domain, and that each LLC domain is entirely
+ * contained within a single NUMA node.
+ */
+static void update_selcpu_topology(struct sched_ext_ops *ops)
+{
+ bool enable_llc = false;
+ unsigned int nr_cpus;
+ s32 cpu = cpumask_first(cpu_online_mask);
+
+ /*
+ * Enable LLC domain optimization only when there are multiple LLC
+ * domains among the online CPUs. If all online CPUs are part of a
+ * single LLC domain, the idle CPU selection logic can choose any
+ * online CPU without bias.
+ *
+ * Note that it is sufficient to check the LLC domain of the first
+ * online CPU to determine whether a single LLC domain includes all
+ * CPUs.
+ */
+ rcu_read_lock();
+ nr_cpus = llc_weight(cpu);
+ if (nr_cpus > 0) {
+ if (nr_cpus < num_online_cpus())
+ enable_llc = true;
+ /*
+ * No need to enable LLC optimization if the LLC domains are
+ * perfectly overlapping with the NUMA domains when per-node
+ * cpumasks are enabled.
+ */
+ if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
+ !llc_numa_mismatch())
+ enable_llc = false;
+ pr_debug("sched_ext: LLC=%*pb weight=%u\n",
+ cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
+ }
+ rcu_read_unlock();
+
+ pr_debug("sched_ext: LLC idle selection %s\n",
+ enable_llc ? "enabled" : "disabled");
+
+ if (enable_llc)
+ static_branch_enable_cpuslocked(&scx_selcpu_topo_llc);
+ else
+ static_branch_disable_cpuslocked(&scx_selcpu_topo_llc);
+
+ /*
+ * Check if we need to enable per-node cpumasks.
+ */
+ if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
+ static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
+ else
+ static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
+}
+
+/*
+ * Built-in CPU idle selection policy:
+ *
+ * 1. Prioritize full-idle cores:
+ * - always prioritize CPUs from fully idle cores (both logical CPUs are
+ * idle) to avoid interference caused by SMT.
+ *
+ * 2. Reuse the same CPU:
+ * - prefer the last used CPU to take advantage of cached data (L1, L2) and
+ * branch prediction optimizations.
+ *
+ * 3. Pick a CPU within the same LLC (Last-Level Cache):
+ * - if the above conditions aren't met, pick a CPU that shares the same LLC
+ * to maintain cache locality.
+ *
+ * 4. Pick a CPU within the same NUMA node, if enabled:
+ * - choose a CPU from the same NUMA node to reduce memory access latency.
+ *
+ * 5. Pick any idle CPU usable by the task.
+ *
+ * Step 3 is performed only if the system has multiple LLC domains that are not
+ * perfectly overlapping with the NUMA domains (see scx_selcpu_topo_llc).
+ *
+ * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
+ * we never call ops.select_cpu() for them, see select_task_rq().
+ */
+static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
+ u64 wake_flags, bool *found)
+{
+ const struct cpumask *llc_cpus = NULL;
+ int node = cpu_to_node(prev_cpu);
+ s32 cpu;
+
+ *found = false;
+
+ /*
+ * This is necessary to protect llc_cpus.
+ */
+ rcu_read_lock();
+
+ /*
+ * Determine the scheduling domain only if the task is allowed to run
+ * on all CPUs.
+ *
+ * This is done primarily for efficiency, as it avoids the overhead of
+ * updating a cpumask every time we need to select an idle CPU (which
+ * can be costly in large SMP systems), but it also aligns logically:
+ * if a task's scheduling domain is restricted by user-space (through
+ * CPU affinity), the task will simply use the flat scheduling domain
+ * defined by user-space.
+ */
+ if (p->nr_cpus_allowed >= num_possible_cpus())
+ if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
+ llc_cpus = llc_span(prev_cpu);
+
+ /*
+ * If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
+ */
+ if (wake_flags & SCX_WAKE_SYNC) {
+ cpu = smp_processor_id();
+
+ /*
+ * If the waker's CPU is cache affine and prev_cpu is idle,
+ * then avoid a migration.
+ */
+ if (cpus_share_cache(cpu, prev_cpu) &&
+ test_and_clear_cpu_idle(prev_cpu)) {
+ cpu = prev_cpu;
+ goto cpu_found;
+ }
+
+ /*
+ * If the waker's local DSQ is empty, and the system is under
+ * utilized, try to wake up @p to the local DSQ of the waker.
+ *
+ * Checking only for an empty local DSQ is insufficient as it
+ * could give the wakee an unfair advantage when the system is
+ * oversaturated.
+ *
+ * Checking only for the presence of idle CPUs is also
+ * insufficient as the local DSQ of the waker could have tasks
+ * piled up on it even if there is an idle core elsewhere on
+ * the system.
+ */
+ if (!(current->flags & PF_EXITING) &&
+ cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
+ !cpumask_empty(get_idle_cpumask_node(cpu_to_node(cpu)))) {
+ if (cpumask_test_cpu(cpu, p->cpus_ptr))
+ goto cpu_found;
+ }
+ }
+
+ /*
+ * If CPU has SMT, any wholly idle CPU is likely a better pick than
+ * partially idle @prev_cpu.
+ */
+ if (sched_smt_active()) {
+ /*
+ * Keep using @prev_cpu if it's part of a fully idle core.
+ */
+ if (cpumask_test_cpu(prev_cpu, get_idle_smtmask_node(node)) &&
+ test_and_clear_cpu_idle(prev_cpu)) {
+ cpu = prev_cpu;
+ goto cpu_found;
+ }
+
+ /*
+ * Search for any fully idle core in the same LLC domain.
+ */
+ if (llc_cpus) {
+ cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, SCX_PICK_IDLE_CORE);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
+ /*
+ * Search for any full idle core usable by the task.
+ */
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, SCX_PICK_IDLE_CORE);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
+ /*
+ * Use @prev_cpu if it's idle.
+ */
+ if (test_and_clear_cpu_idle(prev_cpu)) {
+ cpu = prev_cpu;
+ goto cpu_found;
+ }
+
+ /*
+ * Search for any idle CPU in the same LLC domain.
+ */
+ if (llc_cpus) {
+ cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, 0);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
+ /*
+ * Search for any idle CPU usable by the task.
+ */
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, 0);
+ if (cpu >= 0)
+ goto cpu_found;
+
+ rcu_read_unlock();
+ return prev_cpu;
+
+cpu_found:
+ rcu_read_unlock();
+
+ *found = true;
+ return cpu;
+}
+
+static void reset_idle_masks(void)
+{
+ int node;
+
+ if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
+ cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
+ cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
+ return;
+ }
+
+ /*
+ * Consider all online cpus idle. Should converge to the actual state
+ * quickly.
+ */
+ for_each_node_state(node, N_POSSIBLE) {
+ const struct cpumask *node_mask = cpumask_of_node(node);
+ struct cpumask *idle_cpu = get_idle_cpumask_node(node);
+ struct cpumask *idle_smt = get_idle_smtmask_node(node);
+
+ cpumask_and(idle_cpu, cpu_online_mask, node_mask);
+ cpumask_copy(idle_smt, idle_cpu);
+ }
+}
+
+void __scx_update_idle(struct rq *rq, bool idle)
+{
+ int cpu = cpu_of(rq);
+ int node = cpu_to_node(cpu);
+ struct cpumask *idle_cpu = get_idle_cpumask_node(node);
+
+ if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
+ SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
+ if (!static_branch_unlikely(&scx_builtin_idle_enabled))
+ return;
+ }
+
+ assign_cpu(cpu, idle_cpu, idle);
+
+#ifdef CONFIG_SCHED_SMT
+ if (sched_smt_active()) {
+ const struct cpumask *smt = cpu_smt_mask(cpu);
+ struct cpumask *idle_smt = get_idle_smtmask_node(node);
+
+ if (idle) {
+ /*
+ * idle_smt handling is racy but that's fine as it's
+ * only for optimization and self-correcting.
+ */
+ for_each_cpu(cpu, smt) {
+ if (!cpumask_test_cpu(cpu, idle_cpu))
+ return;
+ }
+ cpumask_or(idle_smt, idle_smt, smt);
+ } else {
+ cpumask_andnot(idle_smt, idle_smt, smt);
+ }
+ }
+#endif
+}
+#else /* !CONFIG_SMP */
+static inline int validate_node(int node)
+{
+ return numa_node_id();
+}
+
+static struct cpumask *get_idle_cpumask_node(int node)
+{
+ return cpu_none_mask;
+}
+
+static struct cpumask *get_idle_smtmask_node(int node)
+{
+ return cpu_none_mask;
+}
+
+static bool test_and_clear_cpu_idle(int cpu) { return false; }
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
+{
+ return -EBUSY;
+}
+
+static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
+{
+ return -EBUSY;
+}
+static void reset_idle_masks(void) {}
+#endif /* CONFIG_SMP */
+
+
+/********************************************************************************
+ * Helpers that can be called from the BPF scheduler.
+ */
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
+ * @p: task_struct to select a CPU for
+ * @prev_cpu: CPU @p was on previously
+ * @wake_flags: %SCX_WAKE_* flags
+ * @is_idle: out parameter indicating whether the returned CPU is idle
+ *
+ * Can only be called from ops.select_cpu() if the built-in CPU selection is
+ * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set.
+ * @p, @prev_cpu and @wake_flags match ops.select_cpu().
+ *
+ * Returns the picked CPU with *@is_idle indicating whether the picked CPU is
+ * currently idle and thus a good candidate for direct dispatching.
+ */
+__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
+ u64 wake_flags, bool *is_idle)
+{
+ if (!check_builtin_idle_enabled())
+ goto prev_cpu;
+
+ if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
+ goto prev_cpu;
+
+#ifdef CONFIG_SMP
+ return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
+#endif
+
+prev_cpu:
+ *is_idle = false;
+ return prev_cpu;
+}
+
+/**
+ * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
+ * per-CPU cpumask of a target NUMA node.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, if @node is not
+ * valid, or running on a UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
+{
+ if (!check_builtin_idle_per_node_enabled())
+ return cpu_none_mask;
+
+ return get_idle_cpumask_node(node);
+}
+/**
+ * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
+ * per-CPU cpumask of the current NUMA node.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, or running on a UP
+ * kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
+{
+ if (!check_builtin_idle_enabled())
+ return cpu_none_mask;
+
+ return get_idle_cpumask_node(NUMA_NO_NODE);
+}
+
+/**
+ * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
+ * per-physical-core cpumask of a target NUMA node. Can be used to determine
+ * if an entire physical core is free.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, if @node is not
+ * valid, or running on a UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
+{
+ if (!check_builtin_idle_per_node_enabled())
+ return cpu_none_mask;
+
+ return get_idle_smtmask_node(node);
+}
+
+/**
+ * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
+ * per-physical-core cpumask of the current NUMA node. Can be used to determine
+ * if an entire physical core is free.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, or running on a UP
+ * kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
+{
+ if (!check_builtin_idle_enabled())
+ return cpu_none_mask;
+
+ return get_idle_smtmask_node(NUMA_NO_NODE);
+}
+
+/**
+ * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to
+ * either the percpu, or SMT idle-tracking cpumask.
+ */
+__bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
+{
+ /*
+ * Empty function body because we aren't actually acquiring or releasing
+ * a reference to a global idle cpumask, which is read-only in the
+ * caller and is never released. The acquire / release semantics here
+ * are just used to make the cpumask a trusted pointer in the caller.
+ */
+}
+
+/**
+ * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state
+ * @cpu: cpu to test and clear idle for
+ *
+ * Returns %true if @cpu was idle and its idle state was successfully cleared.
+ * %false otherwise.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
+{
+ if (!check_builtin_idle_enabled())
+ return false;
+
+ if (ops_cpu_valid(cpu, NULL))
+ return test_and_clear_cpu_idle(cpu);
+ else
+ return false;
+}
+
+/**
+ * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
+ * @node: target NUMA node
+ * @cpus_allowed: Allowed cpumask
+ * @flags: %SCX_PICK_IDLE_CPU_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
+ * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu
+ * was found.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set, or if %SCX_OPS_BUILTIN_IDLE_PER_NODE
+ * is not set.
+ */
+__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(int node, const struct cpumask *cpus_allowed,
+ u64 flags)
+{
+ node = validate_node(node);
+ if (node < 0)
+ return node;
+
+ if (!check_builtin_idle_per_node_enabled())
+ return -EBUSY;
+
+ return scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
+}
+
+/**
+ * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
+ * @cpus_allowed: Allowed cpumask
+ * @flags: %SCX_PICK_IDLE_CPU_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu
+ * number on success. -%EBUSY if no matching cpu was found.
+ *
+ * Idle CPU tracking may race against CPU scheduling state transitions. For
+ * example, this function may return -%EBUSY as CPUs are transitioning into the
+ * idle state. If the caller then assumes that there will be dispatch events on
+ * the CPUs as they were all busy, the scheduler may end up stalling with CPUs
+ * idling while there are pending tasks. Use scx_bpf_pick_any_cpu() and
+ * scx_bpf_kick_cpu() to guarantee that there will be at least one dispatch
+ * event in the near future.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+__bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
+ u64 flags)
+{
+ if (!check_builtin_idle_enabled())
+ return -EBUSY;
+
+ return scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
+}
+
+/**
+ * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU
+ * @cpus_allowed: Allowed cpumask
+ * @flags: %SCX_PICK_IDLE_CPU_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any
+ * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu
+ * number if @cpus_allowed is not empty. -%EBUSY is returned if @cpus_allowed is
+ * empty.
+ *
+ * If ops.update_idle() is implemented and %SCX_OPS_KEEP_BUILTIN_IDLE is not
+ * set, this function can't tell which CPUs are idle and will always pick any
+ * CPU.
+ */
+__bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
+ u64 flags)
+{
+ s32 cpu;
+
+ if (static_branch_likely(&scx_builtin_idle_enabled)) {
+ cpu = scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
+ if (cpu >= 0)
+ return cpu;
+ }
+
+ cpu = cpumask_any_distribute(cpus_allowed);
+ if (cpu < nr_cpu_ids)
+ return cpu;
+ else
+ return -EBUSY;
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
+BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
+BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_select_cpu,
+};
--
2.47.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
2024-12-17 9:32 ` [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
@ 2024-12-17 21:57 ` Tejun Heo
2024-12-18 10:23 ` Andrea Righi
0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2024-12-17 21:57 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Yury Norov, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel
Hello,
On Tue, Dec 17, 2024 at 10:32:26AM +0100, Andrea Righi wrote:
...
> +int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
> +{
> + int dist, n, min_node, min_dist;
> +
> + if (state >= NR_NODE_STATES)
> + return NUMA_NO_NODE;
> +
> + min_node = NUMA_NO_NODE;
> + min_dist = INT_MAX;
> +
> + for_each_node_state(n, state) {
> + if (n == start || node_isset(n, *hop_nodes))
> + continue;
> + dist = node_distance(start, n);
> + if (dist < min_dist) {
> + min_dist = dist;
> + min_node = n;
> + }
> + }
> + if (min_node != NUMA_NO_NODE)
> + node_set(min_node, *hop_nodes);
> +
> + return min_node;
> +}
So, this would work but given that there is nothing dynamic about this
ordering, would it make more sense to build the ordering and store it
per-node? Then, the iteration just becomes walking that array.
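Just to illustrate the idea, a rough, untested sketch on top of
sched_numa_hop_node() (all names made up):

	/*
	 * Hypothetical per-node table: scx_hop_order[node][i] is the i-th
	 * closest node to @node, with @node itself at i == 0.
	 */
	static int **scx_hop_order;

	static void scx_build_hop_order(void)
	{
		int node;

		scx_hop_order = kcalloc(nr_node_ids, sizeof(*scx_hop_order), GFP_KERNEL);
		BUG_ON(!scx_hop_order);

		for_each_node_state(node, N_POSSIBLE) {
			nodemask_t visited = NODE_MASK_NONE;
			int i = 0, n;

			scx_hop_order[node] = kcalloc(nr_node_ids, sizeof(int), GFP_KERNEL);
			BUG_ON(!scx_hop_order[node]);

			/* @node is always the closest to itself */
			scx_hop_order[node][i++] = node;

			/* append the remaining nodes by increasing distance */
			while ((n = sched_numa_hop_node(&visited, node, N_POSSIBLE)) != NUMA_NO_NODE)
				scx_hop_order[node][i++] = n;
		}
	}

The pick path would then just walk scx_hop_order[cpu_to_node(prev_cpu)]
instead of recomputing the next closest node on every invocation.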
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks
2024-12-17 9:32 ` [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks Andrea Righi
@ 2024-12-17 23:22 ` Tejun Heo
2024-12-18 10:21 ` Andrea Righi
2024-12-17 23:23 ` Tejun Heo
2024-12-20 16:48 ` Yury Norov
2 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2024-12-17 23:22 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Yury Norov, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel
On Tue, Dec 17, 2024 at 10:32:28AM +0100, Andrea Righi wrote:
> +static int validate_node(int node)
> +{
> + /* If no node is specified, return the current one */
> + if (node == NUMA_NO_NODE)
> + return numa_node_id();
> +
> + /* Make sure node is in the range of possible nodes */
> + if (node < 0 || node >= num_possible_nodes())
> + return -EINVAL;
Are node IDs guaranteed to be consecutive? Shouldn't it be `node >=
nr_node_ids`? Also, should probably add node_possible(node)?
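IOW something along these lines (untested):

	static int validate_node(int node)
	{
		/* If no node is specified, return the current one */
		if (node == NUMA_NO_NODE)
			return numa_node_id();

		/* Make sure node is in a valid range */
		if (node < 0 || node >= nr_node_ids)
			return -EINVAL;

		/* And that it can actually exist */
		if (!node_possible(node))
			return -EINVAL;

		return node;
	}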
> +/*
> + * cpumasks to track idle CPUs within each NUMA node.
> + *
> + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> + * from node 0 is used to track all idle CPUs system-wide.
> + */
> +static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
As the masks are allocated separately anyway, the aligned attribute can be
dropped. There's no reason to align the index array.
> +static struct cpumask *get_idle_mask_node(int node, bool smt)
> +{
> + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> + return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
> +
> + node = validate_node(node);
It's odd to validate input node in an internal function. If node is being
passed from BPF side, we should validate it and trigger scx_ops_error() if
invalid, but once the node number is inside the kernel, we should be able to
trust it.
> +static struct cpumask *get_idle_cpumask_node(int node)
> +{
> + return get_idle_mask_node(node, false);
Maybe make the inner function return `struct idle_cpumasks *` so that the
caller can pick between cpu and smt?
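e.g. (untested sketch, reusing struct idle_cpumask from the patch):

	static struct idle_cpumask *idle_cpumask(int node)
	{
		if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
			return idle_masks[0];

		return idle_masks[node];
	}

with callers then picking idle_cpumask(node)->cpu or idle_cpumask(node)->smt
as needed.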
> +static void idle_masks_init(void)
> +{
> + int node;
> +
> + idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
We probably want to use a variable name which is more qualified for a global
variable - scx_idle_masks?
> @@ -3173,6 +3245,9 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
>
> static bool test_and_clear_cpu_idle(int cpu)
> {
> + int node = cpu_to_node(cpu);
> + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
Can we use plurals for cpumask variables - idle_cpus here?
> -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> +static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
Do we need "from_node"?
> {
> int cpu;
>
> retry:
> if (sched_smt_active()) {
> - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> + cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
This too, would s/get_idle_smtmask_node(node)/idle_smtmask(node)/ work?
There are no node-unaware counterparts to these functions, right?
> +static s32
> +scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> +{
> + nodemask_t hop_nodes = NODE_MASK_NONE;
> + int start_node = cpu_to_node(prev_cpu);
> + s32 cpu = -EBUSY;
> +
> + /*
> + * Traverse all online nodes in order of increasing distance,
> + * starting from prev_cpu's node.
> + */
> + rcu_read_lock();
Is rcu_read_lock() necessary? Does lockdep warn if the explicit
rcu_read_lock() is dropped?
> @@ -3643,17 +3776,33 @@ static void set_cpus_allowed_scx(struct task_struct *p,
>
> static void reset_idle_masks(void)
> {
> + int node;
> +
> + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> + cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
> + cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
> + return;
> + }
> +
> /*
> * Consider all online cpus idle. Should converge to the actual state
> * quickly.
> */
> - cpumask_copy(idle_masks.cpu, cpu_online_mask);
> - cpumask_copy(idle_masks.smt, cpu_online_mask);
> + for_each_node_state(node, N_POSSIBLE) {
> + const struct cpumask *node_mask = cpumask_of_node(node);
> + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> + struct cpumask *idle_smt = get_idle_smtmask_node(node);
> +
> + cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> + cpumask_copy(idle_smt, idle_cpu);
Can you do the same cpumask_and() here? I don't think it'll cause practical
problems but idle_cpus can be updated in between and e.g. we can end up with
idle_smts that have different idle states between siblings.
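IOW (untested):

	cpumask_and(idle_smt, cpu_online_mask, node_mask);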
> /**
> * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> - * per-CPU cpumask.
> + * per-CPU cpumask of the current NUMA node.
This is a bit misleading as it can be system-wide too.
It's a bit confusing for scx_bpf_get_idle_cpu/smtmask() to return per-node
mask while scx_bpf_pick_idle_cpu() and friends are not scoped to the node.
Also, scx_bpf_pick_idle_cpu() picking the local node as the origin probably
doesn't make sense for most use cases as it's usually called from
ops.select_cpu() and the waker won't necessarily run on the same node as the
wakee.
Maybe disallow scx_bpf_get_idle_cpu/smtmask() if idle_per_node is enabled
and add scx_bpf_get_idle_cpu/smtmask_node()? Ditto for
scx_bpf_pick_idle_cpu() and we can add a PICK_IDLE flag to allow/inhibit
CPUs outside the specified node.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks
2024-12-17 9:32 ` [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks Andrea Righi
2024-12-17 23:22 ` Tejun Heo
@ 2024-12-17 23:23 ` Tejun Heo
2024-12-20 16:48 ` Yury Norov
2 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2024-12-17 23:23 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Yury Norov, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel
Also, can you split out a prep patch? The patch is rather big and there are
a couple of changes which can be separated out (e.g.
check_builtin_idle_enabled() and assign_cpu() changes).
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks
2024-12-17 23:22 ` Tejun Heo
@ 2024-12-18 10:21 ` Andrea Righi
2024-12-18 16:10 ` Tejun Heo
0 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2024-12-18 10:21 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, Yury Norov, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel
Hi Tejun,
On Tue, Dec 17, 2024 at 01:22:26PM -1000, Tejun Heo wrote:
> On Tue, Dec 17, 2024 at 10:32:28AM +0100, Andrea Righi wrote:
> > +static int validate_node(int node)
> > +{
> > + /* If no node is specified, return the current one */
> > + if (node == NUMA_NO_NODE)
> > + return numa_node_id();
> > +
> > + /* Make sure node is in the range of possible nodes */
> > + if (node < 0 || node >= num_possible_nodes())
> > + return -EINVAL;
>
> Are node IDs guaranteed to be consecutive? Shouldn't it be `node >=
> nr_node_ids`? Also, should probably add node_possible(node)?
Or even better, add node_online(node); an offline NUMA node shouldn't be
used in this context.
>
> > +/*
> > + * cpumasks to track idle CPUs within each NUMA node.
> > + *
> > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> > + * from node 0 is used to track all idle CPUs system-wide.
> > + */
> > +static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
>
> As the masks are allocated separately anyway, the aligned attribute can be
> dropped. There's no reason to align the index array.
Right.
>
> > +static struct cpumask *get_idle_mask_node(int node, bool smt)
> > +{
> > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > + return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
> > +
> > + node = validate_node(node);
>
> It's odd to validate input node in an internal function. If node is being
> passed from BPF side, we should validate it and trigger scx_ops_error() if
> invalid, but once the node number is inside the kernel, we should be able to
> trust it.
Makes sense, I'll move the validation into the kfuncs and trigger
scx_ops_error() if the validation fails.
>
> > +static struct cpumask *get_idle_cpumask_node(int node)
> > +{
> > + return get_idle_mask_node(node, false);
>
> Maybe make the inner function return `struct idle_cpumasks *` so that the
> caller can pick between cpu and smt?
Ok.
>
> > +static void idle_masks_init(void)
> > +{
> > + int node;
> > +
> > + idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
>
> We probably want to use a variable name which is more qualified for a global
> variable - scx_idle_masks?
Ok.
>
> > @@ -3173,6 +3245,9 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
> >
> > static bool test_and_clear_cpu_idle(int cpu)
> > {
> > + int node = cpu_to_node(cpu);
> > + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
>
> Can we use plurals for cpumask variables - idle_cpus here?
Ok.
>
> > -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> > +static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
>
> Do we need "from_node"?
>
> > {
> > int cpu;
> >
> > retry:
> > if (sched_smt_active()) {
> > - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> > + cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
>
> This too, would s/get_idle_smtmask_node(node)/idle_smtmask(node)/ work?
> There are no node-unaware counterparts to these functions, right?
Correct, we can just get rid of the _from_node() part.
>
> > +static s32
> > +scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> > +{
> > + nodemask_t hop_nodes = NODE_MASK_NONE;
> > + int start_node = cpu_to_node(prev_cpu);
> > + s32 cpu = -EBUSY;
> > +
> > + /*
> > + * Traverse all online nodes in order of increasing distance,
> > + * starting from prev_cpu's node.
> > + */
> > + rcu_read_lock();
>
> Is rcu_read_lock() necessary? Does lockdep warn if the explicit
> rcu_read_lock() is dropped?
Good point: the other for_each_numa_hop_mask() iterator requires it, but
only to access the cpumasks via rcu_dereference(). Since we are iterating
node IDs, I think we can get rid of rcu_read_lock/unlock() here. I'll
double-check if lockdep complains without it.
>
> > @@ -3643,17 +3776,33 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> >
> > static void reset_idle_masks(void)
> > {
> > + int node;
> > +
> > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> > + cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
> > + cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
> > + return;
> > + }
> > +
> > /*
> > * Consider all online cpus idle. Should converge to the actual state
> > * quickly.
> > */
> > - cpumask_copy(idle_masks.cpu, cpu_online_mask);
> > - cpumask_copy(idle_masks.smt, cpu_online_mask);
> > + for_each_node_state(node, N_POSSIBLE) {
> > + const struct cpumask *node_mask = cpumask_of_node(node);
> > + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> > + struct cpumask *idle_smt = get_idle_smtmask_node(node);
> > +
> > + cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> > + cpumask_copy(idle_smt, idle_cpu);
>
> Can you do the same cpumask_and() here? I don't think it'll cause practical
> problems but idle_cpus can be updated in between and e.g. we can end up with
> idle_smts that have different idle states between siblings.
Makes sense, the state should still converge to the right one in any case,
but I agree that it's more accurate to use cpumask_and() for idle_smt as
well. Will change that.
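IOW, something like this inside the per-node loop:

	cpumask_and(idle_cpu, cpu_online_mask, node_mask);
	cpumask_and(idle_smt, cpu_online_mask, node_mask);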
>
> > /**
> > * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> > - * per-CPU cpumask.
> > + * per-CPU cpumask of the current NUMA node.
>
> This is a bit misleading as it can be system-wide too.
>
> It's a bit confusing for scx_bpf_get_idle_cpu/smtmask() to return per-node
> mask while scx_bpf_pick_idle_cpu() and friends are not scoped to the node.
> Also, scx_bpf_pick_idle_cpu() picking the local node as the origin probably
> doesn't make sense for most use cases as it's usually called from
> ops.select_cpu() and the waker won't necessarily run on the same node as the
> wakee.
>
> Maybe disallow scx_bpf_get_idle_cpu/smtmask() if idle_per_node is enabled
> and add scx_bpf_get_idle_cpu/smtmask_node()? Ditto for
> scx_bpf_pick_idle_cpu() and we can add a PICK_IDLE flag to allow/inhibit
> CPUs outside the specified node.
Yeah, I also don't much like the idea of implicitly using the current node
when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.
I think it's totally reasonable to disallow the system-wide
scx_bpf_get_idle_cpu/smtmask() when the flag is enabled. Ultimately, it's
the scheduler's responsibility to enable or disable this feature, and if
it's enabled, the scheduler is expected to implement NUMA-aware logic.
I'm also fine with adding SCX_PICK_IDLE_NODE (or similar) to restrict the
search for an idle CPU to the specified node.
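Something like this, for example (rough sketch, the flag name is just a
placeholder):

static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
{
	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
		return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags);

	/* Restrict the search to @prev_cpu's node if the flag is set */
	if (flags & SCX_PICK_IDLE_NODE)
		return scx_pick_idle_cpu_from_node(cpu_to_node(prev_cpu),
						   cpus_allowed, flags);

	return scx_pick_idle_cpu_numa(cpus_allowed, prev_cpu, flags);
}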
Thanks!
-Andrea
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
2024-12-17 21:57 ` Tejun Heo
@ 2024-12-18 10:23 ` Andrea Righi
2024-12-18 16:04 ` Tejun Heo
0 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2024-12-18 10:23 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, Yury Norov, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel
On Tue, Dec 17, 2024 at 11:57:44AM -1000, Tejun Heo wrote:
> Hello,
>
> On Tue, Dec 17, 2024 at 10:32:26AM +0100, Andrea Righi wrote:
> ...
> > +int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
> > +{
> > + int dist, n, min_node, min_dist;
> > +
> > + if (state >= NR_NODE_STATES)
> > + return NUMA_NO_NODE;
> > +
> > + min_node = NUMA_NO_NODE;
> > + min_dist = INT_MAX;
> > +
> > + for_each_node_state(n, state) {
> > + if (n == start || node_isset(n, *hop_nodes))
> > + continue;
> > + dist = node_distance(start, n);
> > + if (dist < min_dist) {
> > + min_dist = dist;
> > + min_node = n;
> > + }
> > + }
> > + if (min_node != NUMA_NO_NODE)
> > + node_set(min_node, *hop_nodes);
> > +
> > + return min_node;
> > +}
>
> So, this would work but given that there is nothing dynamic about this
> ordering, would it make more sense to build the ordering and store it
> per-node? Then, the iteration just becomes walking that array.
I've also considered doing that. I don't know if it'd work with offline
nodes, but maybe we can just check node_online(node) at each iteration and
skip those that are not online.
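Something like this, maybe (very rough sketch, where numa_node_order[] would
be a made-up per-node array pre-sorted by distance at topology init time):

	for (i = 0; i < nr_node_ids; i++) {
		int node = numa_node_order[start_node][i];

		/* Skip nodes that went offline since the array was built */
		if (!node_online(node))
			continue;

		cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
		if (cpu >= 0)
			break;
	}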
-Andrea
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
2024-12-18 10:23 ` Andrea Righi
@ 2024-12-18 16:04 ` Tejun Heo
2024-12-19 18:26 ` Yury Norov
0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2024-12-18 16:04 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Yury Norov, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel
Hello,
On Wed, Dec 18, 2024 at 11:23:40AM +0100, Andrea Righi wrote:
...
> > So, this would work but given that there is nothing dynamic about this
> > ordering, would it make more sense to build the ordering and store it
> > per-node? Then, the iteration just becomes walking that array.
>
> I've also considered doing that. I don't know if it'd work with offline
> nodes, but maybe we can just check node_online(node) at each iteration and
> skip those that are not online.
Yeah, there can be e.g. for_each_possible_node_by_dist() where nodes with
unknown distances (offline ones?) are put at the end and then there's also
for_each_online_node_by_dist() which filters out offline ones, and the
ordering can be updated from a CPU hotplug callback. The ordering can be
probably put in an rcu protected array? I'm not sure what's the
synchronization convention around node on/offlining. Is that protected
together with CPU on/offlining?
Given that there usually aren't that many nodes, the current implementation
is probably fine too, so please feel free to ignore this suggestion for now
too.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks
2024-12-18 10:21 ` Andrea Righi
@ 2024-12-18 16:10 ` Tejun Heo
2024-12-18 16:18 ` Andrea Righi
0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2024-12-18 16:10 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Yury Norov, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel
Hello,
On Wed, Dec 18, 2024 at 11:21:30AM +0100, Andrea Righi wrote:
...
> > Are node IDs guaranteed to be consecutive? Shouldn't it be `node >=
> > nr_node_ids`? Also, should probably add node_possible(node)?
>
> > Or even better, add node_online(node); an offline NUMA node shouldn't be
> > used in this context.
That could work too, but then we'd have to worry about synchronizing against
nodes going on/offline. Looks like that's protected by mem_hotplug_lock, so
we'd have to require get_online_mems() around these iterations, which might
not be worth it. Besides, if we want to trigger an abort on incorrect input,
we'd have to call sched_ext ops under mem_hotplug_lock, which we probably
can't do.
...
> > Is rcu_read_lock() necessary? Does lockdep warn if the explicit
> > rcu_read_lock() is dropped?
>
> Good point, the other for_each_numa_hop_mask() iterator requires it, but
> only to access the cpumasks via rcu_dereference(). Since we are iterating
> node IDs I think we can get rid of rcu_read_lock/unlock() here. I'll double
> check if lockdep complains without it.
Yeah, this function should always be called with preemption disabled, so
even if rcu_read_lock() is required, it should already be implied by the
context.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks
2024-12-18 16:10 ` Tejun Heo
@ 2024-12-18 16:18 ` Andrea Righi
0 siblings, 0 replies; 23+ messages in thread
From: Andrea Righi @ 2024-12-18 16:18 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, Yury Norov, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel
On Wed, Dec 18, 2024 at 06:10:17AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Dec 18, 2024 at 11:21:30AM +0100, Andrea Righi wrote:
> ...
> > > Are node IDs guaranteed to be consecutive? Shouldn't it be `node >=
> > > nr_node_ids`? Also, should probably add node_possible(node)?
> >
> > Or even better, add node_online(node); an offline NUMA node shouldn't be
> > used in this context.
>
> That could work too, but then we'd have to worry about synchronizing against
> nodes going on/offline. Looks like that's protected by mem_hotplug_lock, so
> we'd have to require get_online_mems() around these iterations, which might
> not be worth it. Besides, if we want to trigger an abort on incorrect input,
> we'd have to call sched_ext ops under mem_hotplug_lock, which we probably
> can't do.
We can probably ignore this for now and add "restart" logic like we do
with CPU hotplugging at some point.
-Andrea
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
2024-12-18 16:04 ` Tejun Heo
@ 2024-12-19 18:26 ` Yury Norov
2024-12-19 19:43 ` Andrea Righi
2024-12-19 19:52 ` Peter Zijlstra
0 siblings, 2 replies; 23+ messages in thread
From: Yury Norov @ 2024-12-19 18:26 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrea Righi, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel
On Wed, Dec 18, 2024 at 06:04:53AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Dec 18, 2024 at 11:23:40AM +0100, Andrea Righi wrote:
> ...
> > > So, this would work but given that there is nothing dynamic about this
> > > ordering, would it make more sense to build the ordering and store it
> > > per-node? Then, the iteration just becomes walking that array.
> >
> > I've also considered doing that. I don't know if it'd work with offline
> > nodes, but maybe we can just check node_online(node) at each iteration and
> > skip those that are not online.
for_each_numa_hop_mask() only traverses N_CPU nodes, and N_CPU nodes have
proper distances.
I think that for_each_numa_hop_node() should match for_each_numa_hop_mask().
It would be good to cross-test them to ensure that they generate the same
order at least for N_CPU nodes.
If you think that for_each_numa_hop_node() should traverse non-N_CPU nodes,
you need a 'node_state' parameter. This will allow us to make sure that at
least the N_CPU portion works correctly.
> Yeah, there can be e.g. for_each_possible_node_by_dist() where nodes with
> unknown distances (offline ones?) are put at the end and then there's also
> for_each_online_node_by_dist() which filters out offline ones, and the
> ordering can be updated from a CPU hotplug callback.
We can assign UINT_MAX to those nodes, I guess?
> The ordering can be
> probably put in an rcu protected array? I'm not sure what's the
> synchronization convention around node on/offlining. Is that protected
> together with CPU on/offlining?
The machinery is already there, we just need another array of nodemasks -
sched_domains_numa_nodes in addition to sched_domains_numa_masks. The
latter is already protected by RCU, and we'd need to update the new array
every time sched_domains_numa_masks is updated.
> Given that there usually aren't that many nodes, the current implementation
> is probably fine too, so please feel free to ignore this suggestion for now
> too.
I agree. The number of nodes on a typical system is 1 or 2. Even if
it's 8, Andrea's bubble sort will still be acceptable. So, I'm OK
with O(N^2) if you guys are OK with it. I would only like to have
this choice explained in the commit message.
Thanks,
Yury
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
2024-12-19 18:26 ` Yury Norov
@ 2024-12-19 19:43 ` Andrea Righi
2024-12-19 19:52 ` Peter Zijlstra
1 sibling, 0 replies; 23+ messages in thread
From: Andrea Righi @ 2024-12-19 19:43 UTC (permalink / raw)
To: Yury Norov
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel
On Thu, Dec 19, 2024 at 10:26:59AM -0800, Yury Norov wrote:
> On Wed, Dec 18, 2024 at 06:04:53AM -1000, Tejun Heo wrote:
> > Hello,
> >
> > On Wed, Dec 18, 2024 at 11:23:40AM +0100, Andrea Righi wrote:
> > ...
> > > > So, this would work but given that there is nothing dynamic about this
> > > > ordering, would it make more sense to build the ordering and store it
> > > > per-node? Then, the iteration just becomes walking that array.
> > >
> > > I've also considered doing that. I don't know if it'd work with offline
> > > nodes, but maybe we can just check node_online(node) at each iteration and
> > > skip those that are not online.
>
> for_each_numa_hop_mask() only traverses N_CPU nodes, and N_CPU nodes have
> proper distances.
>
> I think that for_each_numa_hop_node() should match for_each_numa_hop_mask().
> It would be good to cross-test them to ensure that they generate the same
> order at least for N_CPU nodes.
It'd be nice to have a kunit test; I can take a look at this (in a separate
patch, I think we can add it later).
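Something along these lines, maybe (just a sketch of the kind of property we
could assert, not the full cross-test against for_each_numa_hop_mask()):

static void sched_numa_hop_node_order_test(struct kunit *test)
{
	nodemask_t hop_nodes = NODE_MASK_NONE;
	int start = numa_node_id();
	int prev_dist = 0;
	int node;

	/* Nodes must be returned in non-decreasing distance order */
	while ((node = sched_numa_hop_node(&hop_nodes, start, N_ONLINE)) != NUMA_NO_NODE) {
		KUNIT_EXPECT_GE(test, node_distance(start, node), prev_dist);
		prev_dist = node_distance(start, node);
	}
}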
>
> If you think that for_each_numa_hop_node() should traverse non-N_CPU nodes,
> you need a 'node_state' parameter. This will allow us to make sure that at
> least the N_CPU portion works correctly.
>
> > Yeah, there can be e.g. for_each_possible_node_by_dist() where nodes with
> > unknown distances (offline ones?) are put at the end and then there's also
> > for_each_online_node_by_dist() which filters out offline ones, and the
> > ordering can be updated from a CPU hotplug callback.
>
> We can assign UINT_MAX to those nodes, I guess?
>
> > The ordering can be
> > probably put in an rcu protected array? I'm not sure what's the
> > synchronization convention around node on/offlining. Is that protected
> > together with CPU on/offlining?
>
> The machinery is already there, we just need another array of nodemasks -
> sched_domains_numa_nodes in addition to sched_domains_numa_masks. The
> latter is already protected by RCU, and we'd need to update the new array
> every time sched_domains_numa_masks is updated.
>
> > Given that there usually aren't that many nodes, the current implementation
> > is probably fine too, so please feel free to ignore this suggestion for now
> > too.
>
> I agree. The number of nodes on a typical system is 1 or 2. Even if
> it's 8, Andrea's bubble sort will still be acceptable. So, I'm OK
> with O(N^2) if you guys are OK with it. I would only like to have
> this choice explained in the commit message.
Good point, I'll add a comment about that.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
2024-12-19 18:26 ` Yury Norov
2024-12-19 19:43 ` Andrea Righi
@ 2024-12-19 19:52 ` Peter Zijlstra
2024-12-19 21:16 ` Andrea Righi
1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2024-12-19 19:52 UTC (permalink / raw)
To: Yury Norov
Cc: Tejun Heo, Andrea Righi, David Vernet, Changwoo Min, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel
On Thu, Dec 19, 2024 at 10:26:59AM -0800, Yury Norov wrote:
> > Given that there usually aren't that many nodes, the current implementation
> > is probably fine too, so please feel free to ignore this suggestion for now
> > too.
>
> > I agree. The number of nodes on a typical system is 1 or 2. Even if
> > it's 8, Andrea's bubble sort will still be acceptable. So, I'm OK
> > with O(N^2) if you guys are OK with it. I would only like to have
> > this choice explained in the commit message.
There are systems with 100s or 1000s of nodes out there. As long as
hitting this code path is optional I suppose that's not a problem, but
if not, they're going to be rather upset.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
2024-12-19 19:52 ` Peter Zijlstra
@ 2024-12-19 21:16 ` Andrea Righi
0 siblings, 0 replies; 23+ messages in thread
From: Andrea Righi @ 2024-12-19 21:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Yury Norov, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel
On Thu, Dec 19, 2024 at 08:52:30PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 19, 2024 at 10:26:59AM -0800, Yury Norov wrote:
> > > Given that there usually aren't that many nodes, the current implementation
> > > is probably fine too, so please feel free to ignore this suggestion for now
> > > too.
> >
> > I agree. The number of nodes on a typical system is 1 or 2. Even if
> > it's 8, Andrea's bubble sort will still be acceptable. So, I'm OK
> > with O(N^2) if you guys are OK with it. I would only like to have
> > this choice explained in the commit message.
>
> There are systems with 100s or 1000s of nodes out there. As long as
> hitting this code path is optional I suppose that's not a problem, but
> if not, they're going to be rather upset.
Right, this code is optional: it's only hit when SCX_OPS_KEEP_BUILTIN_IDLE
is enabled (off by default) in an scx scheduler and the scheduler is asking
for any idle CPU in the system without specifying a target node.
So, it shouldn't be a big concern for now, and we can probably add
optimizations for special cases later. I'll add a comment to explain this
as well.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 6/6] sched_ext: Move built-in idle CPU selection policy to a separate file
2024-12-17 9:32 ` [PATCH 6/6] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
@ 2024-12-20 14:53 ` Yury Norov
2024-12-20 14:58 ` Andrea Righi
0 siblings, 1 reply; 23+ messages in thread
From: Yury Norov @ 2024-12-20 14:53 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel
On Tue, Dec 17, 2024 at 10:32:31AM +0100, Andrea Righi wrote:
> As ext.c is becoming quite large, move the idle CPU selection policy
> to a separate file (ext_idle.c) for better code readability.
>
> No functional change, this is purely code reorganization.
>
> Suggested-by: Yury Norov <yury.norov@gmail.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
I think it should be a preparation patch at the beginning of the
series. The one major downside of patches of that sort is that when
moving code, you wipe git history. If you have to do that, you need
to move code before doing the work, such that at least your updates
will be accessible to git blame.
> ---
> MAINTAINERS | 1 +
> kernel/sched/ext.c | 857 +---------------------------------------
> kernel/sched/ext_idle.c | 835 ++++++++++++++++++++++++++++++++++++++
> 3 files changed, 853 insertions(+), 840 deletions(-)
> create mode 100644 kernel/sched/ext_idle.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1e930c7a58b1..02960d1b9ee9 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -20909,6 +20909,7 @@ T: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git
> F: include/linux/sched/ext.h
> F: kernel/sched/ext.h
> F: kernel/sched/ext.c
> +F: kernel/sched/ext_idle.c
> F: tools/sched_ext/
> F: tools/testing/selftests/sched_ext
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 1edf4206aab5..40d13594c1d1 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -6,6 +6,8 @@
> * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
> * Copyright (c) 2022 David Vernet <dvernet@meta.com>
> */
> +#include <linux/btf_ids.h>
> +
> #define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
>
> enum scx_consts {
> @@ -890,12 +892,6 @@ static bool scx_warned_zero_slice;
> static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last);
> static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting);
> static DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt);
> -static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
> -static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
> -
> -#ifdef CONFIG_SMP
> -static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
> -#endif
>
> static struct static_key_false scx_has_op[SCX_OPI_END] =
> { [0 ... SCX_OPI_END-1] = STATIC_KEY_FALSE_INIT };
> @@ -903,6 +899,17 @@ static struct static_key_false scx_has_op[SCX_OPI_END] =
> static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE);
> static struct scx_exit_info *scx_exit_info;
>
> +#define scx_ops_error_kind(err, fmt, args...) \
> + scx_ops_exit_kind((err), 0, fmt, ##args)
> +
> +#define scx_ops_exit(code, fmt, args...) \
> + scx_ops_exit_kind(SCX_EXIT_UNREG_KERN, (code), fmt, ##args)
> +
> +#define scx_ops_error(fmt, args...) \
> + scx_ops_error_kind(SCX_EXIT_ERROR, fmt, ##args)
> +
> +#define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)])
> +
> static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0);
> static atomic_long_t scx_hotplug_seq = ATOMIC_LONG_INIT(0);
>
> @@ -930,97 +937,6 @@ static unsigned long scx_watchdog_timestamp = INITIAL_JIFFIES;
>
> static struct delayed_work scx_watchdog_work;
>
> -/* idle tracking */
> -#ifdef CONFIG_SMP
> -#ifdef CONFIG_CPUMASK_OFFSTACK
> -#define CL_ALIGNED_IF_ONSTACK
> -#else
> -#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
> -#endif
> -
> -struct idle_cpumask {
> - cpumask_var_t cpu;
> - cpumask_var_t smt;
> -};
> -
> -/*
> - * Make sure a NUMA node is in a valid range.
> - */
> -static int validate_node(int node)
> -{
> - /* If no node is specified, return the current one */
> - if (node == NUMA_NO_NODE)
> - return numa_node_id();
> -
> - /* Make sure node is in the range of possible nodes */
> - if (node < 0 || node >= num_possible_nodes())
> - return -EINVAL;
> -
> - return node;
> -}
> -
> -/*
> - * cpumasks to track idle CPUs within each NUMA node.
> - *
> - * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> - * from node 0 is used to track all idle CPUs system-wide.
> - */
> -static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
> -
> -static struct cpumask *get_idle_mask_node(int node, bool smt)
> -{
> - if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> - return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
> -
> - node = validate_node(node);
> - if (node < 0)
> - return cpu_none_mask;
> -
> - return smt ? idle_masks[node]->smt : idle_masks[node]->cpu;
> -}
> -
> -static struct cpumask *get_idle_cpumask_node(int node)
> -{
> - return get_idle_mask_node(node, false);
> -}
> -
> -static struct cpumask *get_idle_smtmask_node(int node)
> -{
> - return get_idle_mask_node(node, true);
> -}
> -
> -static void idle_masks_init(void)
> -{
> - int node;
> -
> - idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
> - BUG_ON(!idle_masks);
> -
> - for_each_node_state(node, N_POSSIBLE) {
> - idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
> - BUG_ON(!idle_masks[node]);
> -
> - BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
> - BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
> - }
> -}
> -#else /* !CONFIG_SMP */
> -static int validate_node(int node)
> -{
> - return numa_node_id();
> -}
> -
> -static struct cpumask *get_idle_cpumask_node(int node)
> -{
> - return cpu_none_mask;
> -}
> -
> -static struct cpumask *get_idle_smtmask_node(int node)
> -{
> - return cpu_none_mask;
> -}
> -#endif /* CONFIG_SMP */
> -
> /* for %SCX_KICK_WAIT */
> static unsigned long __percpu *scx_kick_cpus_pnt_seqs;
>
> @@ -1107,17 +1023,6 @@ static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
> s64 exit_code,
> const char *fmt, ...);
>
> -#define scx_ops_error_kind(err, fmt, args...) \
> - scx_ops_exit_kind((err), 0, fmt, ##args)
> -
> -#define scx_ops_exit(code, fmt, args...) \
> - scx_ops_exit_kind(SCX_EXIT_UNREG_KERN, (code), fmt, ##args)
> -
> -#define scx_ops_error(fmt, args...) \
> - scx_ops_error_kind(SCX_EXIT_ERROR, fmt, ##args)
> -
> -#define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)])
> -
> static long jiffies_delta_msecs(unsigned long at, unsigned long now)
> {
> if (time_after(at, now))
> @@ -1624,6 +1529,9 @@ static int ops_sanitize_err(const char *ops_name, s32 err)
> return -EPROTO;
> }
>
> +/* Built-in idle CPU selection policy */
> +#include "ext_idle.c"
> +
> static void run_deferred(struct rq *rq)
> {
> process_ddsp_deferred_locals(rq);
> @@ -3247,406 +3155,6 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
> #endif /* CONFIG_SCHED_CORE */
>
> #ifdef CONFIG_SMP
> -
> -static bool test_and_clear_cpu_idle(int cpu)
> -{
> - int node = cpu_to_node(cpu);
> - struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> -
> -#ifdef CONFIG_SCHED_SMT
> - /*
> - * SMT mask should be cleared whether we can claim @cpu or not. The SMT
> - * cluster is not wholly idle either way. This also prevents
> - * scx_pick_idle_cpu() from getting caught in an infinite loop.
> - */
> - if (sched_smt_active()) {
> - const struct cpumask *smt = cpu_smt_mask(cpu);
> - struct cpumask *idle_smt = get_idle_smtmask_node(node);
> -
> - /*
> - * If offline, @cpu is not its own sibling and
> - * scx_pick_idle_cpu() can get caught in an infinite loop as
> - * @cpu is never cleared from the idle SMT mask. Ensure that
> - * @cpu is eventually cleared.
> - *
> - * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
> - * reduce memory writes, which may help alleviate cache
> - * coherence pressure.
> - */
> - if (cpumask_intersects(smt, idle_smt))
> - cpumask_andnot(idle_smt, idle_smt, smt);
> - else if (cpumask_test_cpu(cpu, idle_smt))
> - __cpumask_clear_cpu(cpu, idle_smt);
> - }
> -#endif
> - return cpumask_test_and_clear_cpu(cpu, idle_cpu);
> -}
> -
> -static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
> -{
> - int cpu;
> -
> -retry:
> - if (sched_smt_active()) {
> - cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
> - if (cpu < nr_cpu_ids)
> - goto found;
> -
> - if (flags & SCX_PICK_IDLE_CORE)
> - return -EBUSY;
> - }
> -
> - cpu = cpumask_any_and_distribute(get_idle_cpumask_node(node), cpus_allowed);
> - if (cpu >= nr_cpu_ids)
> - return -EBUSY;
> -
> -found:
> - if (test_and_clear_cpu_idle(cpu))
> - return cpu;
> - else
> - goto retry;
> -
> -}
> -
> -static s32
> -scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> -{
> - nodemask_t hop_nodes = NODE_MASK_NONE;
> - int start_node = cpu_to_node(prev_cpu);
> - s32 cpu = -EBUSY;
> -
> - /*
> - * Traverse all online nodes in order of increasing distance,
> - * starting from prev_cpu's node.
> - */
> - rcu_read_lock();
> - for_each_numa_hop_node(node, start_node, hop_nodes, N_ONLINE) {
> - cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> - if (cpu >= 0)
> - break;
> - }
> - rcu_read_unlock();
> -
> - return cpu;
> -}
> -
> -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> -{
> - /*
> - * Only node 0 is used if per-node idle cpumasks are disabled.
> - */
> - if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> - return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags);
> -
> - return scx_pick_idle_cpu_numa(cpus_allowed, prev_cpu, flags);
> -}
> -
> -/*
> - * Return the amount of CPUs in the same LLC domain of @cpu (or zero if the LLC
> - * domain is not defined).
> - */
> -static unsigned int llc_weight(s32 cpu)
> -{
> - struct sched_domain *sd;
> -
> - sd = rcu_dereference(per_cpu(sd_llc, cpu));
> - if (!sd)
> - return 0;
> -
> - return sd->span_weight;
> -}
> -
> -/*
> - * Return the cpumask representing the LLC domain of @cpu (or NULL if the LLC
> - * domain is not defined).
> - */
> -static struct cpumask *llc_span(s32 cpu)
> -{
> - struct sched_domain *sd;
> -
> - sd = rcu_dereference(per_cpu(sd_llc, cpu));
> - if (!sd)
> - return 0;
> -
> - return sched_domain_span(sd);
> -}
> -
> -/*
> - * Return the amount of CPUs in the same NUMA domain of @cpu (or zero if the
> - * NUMA domain is not defined).
> - */
> -static unsigned int numa_weight(s32 cpu)
> -{
> - struct sched_domain *sd;
> - struct sched_group *sg;
> -
> - sd = rcu_dereference(per_cpu(sd_numa, cpu));
> - if (!sd)
> - return 0;
> - sg = sd->groups;
> - if (!sg)
> - return 0;
> -
> - return sg->group_weight;
> -}
> -
> -/*
> - * Return true if the LLC domains do not perfectly overlap with the NUMA
> - * domains, false otherwise.
> - */
> -static bool llc_numa_mismatch(void)
> -{
> - int cpu;
> -
> - /*
> - * We need to scan all online CPUs to verify whether their scheduling
> - * domains overlap.
> - *
> - * While it is rare to encounter architectures with asymmetric NUMA
> - * topologies, CPU hotplugging or virtualized environments can result
> - * in asymmetric configurations.
> - *
> - * For example:
> - *
> - * NUMA 0:
> - * - LLC 0: cpu0..cpu7
> - * - LLC 1: cpu8..cpu15 [offline]
> - *
> - * NUMA 1:
> - * - LLC 0: cpu16..cpu23
> - * - LLC 1: cpu24..cpu31
> - *
> - * In this case, if we only check the first online CPU (cpu0), we might
> - * incorrectly assume that the LLC and NUMA domains are fully
> - * overlapping, which is incorrect (as NUMA 1 has two distinct LLC
> - * domains).
> - */
> - for_each_online_cpu(cpu)
> - if (llc_weight(cpu) != numa_weight(cpu))
> - return true;
> -
> - return false;
> -}
> -
> -/*
> - * Initialize topology-aware scheduling.
> - *
> - * Detect if the system has multiple LLC or multiple NUMA domains and enable
> - * cache-aware / NUMA-aware scheduling optimizations in the default CPU idle
> - * selection policy.
> - *
> - * Assumption: the kernel's internal topology representation assumes that each
> - * CPU belongs to a single LLC domain, and that each LLC domain is entirely
> - * contained within a single NUMA node.
> - */
> -static void update_selcpu_topology(struct sched_ext_ops *ops)
> -{
> - bool enable_llc = false;
> - unsigned int nr_cpus;
> - s32 cpu = cpumask_first(cpu_online_mask);
> -
> - /*
> - * Enable LLC domain optimization only when there are multiple LLC
> - * domains among the online CPUs. If all online CPUs are part of a
> - * single LLC domain, the idle CPU selection logic can choose any
> - * online CPU without bias.
> - *
> - * Note that it is sufficient to check the LLC domain of the first
> - * online CPU to determine whether a single LLC domain includes all
> - * CPUs.
> - */
> - rcu_read_lock();
> - nr_cpus = llc_weight(cpu);
> - if (nr_cpus > 0) {
> - if (nr_cpus < num_online_cpus())
> - enable_llc = true;
> - /*
> - * No need to enable LLC optimization if the LLC domains are
> - * perfectly overlapping with the NUMA domains when per-node
> - * cpumasks are enabled.
> - */
> - if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
> - !llc_numa_mismatch())
> - enable_llc = false;
> - pr_debug("sched_ext: LLC=%*pb weight=%u\n",
> - cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
> - }
> - rcu_read_unlock();
> -
> - pr_debug("sched_ext: LLC idle selection %s\n",
> - enable_llc ? "enabled" : "disabled");
> -
> - if (enable_llc)
> - static_branch_enable_cpuslocked(&scx_selcpu_topo_llc);
> - else
> - static_branch_disable_cpuslocked(&scx_selcpu_topo_llc);
> -
> - /*
> - * Check if we need to enable per-node cpumasks.
> - */
> - if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
> - static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
> - else
> - static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
> -}
> -
> -/*
> - * Built-in CPU idle selection policy:
> - *
> - * 1. Prioritize full-idle cores:
> - * - always prioritize CPUs from fully idle cores (both logical CPUs are
> - * idle) to avoid interference caused by SMT.
> - *
> - * 2. Reuse the same CPU:
> - * - prefer the last used CPU to take advantage of cached data (L1, L2) and
> - * branch prediction optimizations.
> - *
> - * 3. Pick a CPU within the same LLC (Last-Level Cache):
> - * - if the above conditions aren't met, pick a CPU that shares the same LLC
> - * to maintain cache locality.
> - *
> - * 4. Pick a CPU within the same NUMA node, if enabled:
> - * - choose a CPU from the same NUMA node to reduce memory access latency.
> - *
> - * 5. Pick any idle CPU usable by the task.
> - *
> - * Step 3 is performed only if the system has multiple LLC domains that are not
> - * perfectly overlapping with the NUMA domains (see scx_selcpu_topo_llc).
> - *
> - * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
> - * we never call ops.select_cpu() for them, see select_task_rq().
> - */
> -static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> - u64 wake_flags, bool *found)
> -{
> - const struct cpumask *llc_cpus = NULL;
> - int node = cpu_to_node(prev_cpu);
> - s32 cpu;
> -
> - *found = false;
> -
> - /*
> - * This is necessary to protect llc_cpus.
> - */
> - rcu_read_lock();
> -
> - /*
> - * Determine the scheduling domain only if the task is allowed to run
> - * on all CPUs.
> - *
> - * This is done primarily for efficiency, as it avoids the overhead of
> - * updating a cpumask every time we need to select an idle CPU (which
> - * can be costly in large SMP systems), but it also aligns logically:
> - * if a task's scheduling domain is restricted by user-space (through
> - * CPU affinity), the task will simply use the flat scheduling domain
> - * defined by user-space.
> - */
> - if (p->nr_cpus_allowed >= num_possible_cpus())
> - if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
> - llc_cpus = llc_span(prev_cpu);
> -
> - /*
> - * If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
> - */
> - if (wake_flags & SCX_WAKE_SYNC) {
> - cpu = smp_processor_id();
> -
> - /*
> - * If the waker's CPU is cache affine and prev_cpu is idle,
> - * then avoid a migration.
> - */
> - if (cpus_share_cache(cpu, prev_cpu) &&
> - test_and_clear_cpu_idle(prev_cpu)) {
> - cpu = prev_cpu;
> - goto cpu_found;
> - }
> -
> - /*
> - * If the waker's local DSQ is empty, and the system is under
> - * utilized, try to wake up @p to the local DSQ of the waker.
> - *
> - * Checking only for an empty local DSQ is insufficient as it
> - * could give the wakee an unfair advantage when the system is
> - * oversaturated.
> - *
> - * Checking only for the presence of idle CPUs is also
> - * insufficient as the local DSQ of the waker could have tasks
> - * piled up on it even if there is an idle core elsewhere on
> - * the system.
> - */
> - if (!(current->flags & PF_EXITING) &&
> - cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
> - !cpumask_empty(get_idle_cpumask_node(cpu_to_node(cpu)))) {
> - if (cpumask_test_cpu(cpu, p->cpus_ptr))
> - goto cpu_found;
> - }
> - }
> -
> - /*
> - * If CPU has SMT, any wholly idle CPU is likely a better pick than
> - * partially idle @prev_cpu.
> - */
> - if (sched_smt_active()) {
> - /*
> - * Keep using @prev_cpu if it's part of a fully idle core.
> - */
> - if (cpumask_test_cpu(prev_cpu, get_idle_smtmask_node(node)) &&
> - test_and_clear_cpu_idle(prev_cpu)) {
> - cpu = prev_cpu;
> - goto cpu_found;
> - }
> -
> - /*
> - * Search for any fully idle core in the same LLC domain.
> - */
> - if (llc_cpus) {
> - cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, SCX_PICK_IDLE_CORE);
> - if (cpu >= 0)
> - goto cpu_found;
> - }
> -
> - /*
> - * Search for any full idle core usable by the task.
> - */
> - cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, SCX_PICK_IDLE_CORE);
> - if (cpu >= 0)
> - goto cpu_found;
> - }
> -
> - /*
> - * Use @prev_cpu if it's idle.
> - */
> - if (test_and_clear_cpu_idle(prev_cpu)) {
> - cpu = prev_cpu;
> - goto cpu_found;
> - }
> -
> - /*
> - * Search for any idle CPU in the same LLC domain.
> - */
> - if (llc_cpus) {
> - cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, 0);
> - if (cpu >= 0)
> - goto cpu_found;
> - }
> -
> - /*
> - * Search for any idle CPU usable by the task.
> - */
> - cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, 0);
> - if (cpu >= 0)
> - goto cpu_found;
> -
> - rcu_read_unlock();
> - return prev_cpu;
> -
> -cpu_found:
> - rcu_read_unlock();
> -
> - *found = true;
> - return cpu;
> -}
> -
> static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags)
> {
> /*
> @@ -3713,66 +3221,6 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> (struct cpumask *)p->cpus_ptr);
> }
>
> -static void reset_idle_masks(void)
> -{
> - int node;
> -
> - if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> - cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
> - cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
> - return;
> - }
> -
> - /*
> - * Consider all online cpus idle. Should converge to the actual state
> - * quickly.
> - */
> - for_each_node_state(node, N_POSSIBLE) {
> - const struct cpumask *node_mask = cpumask_of_node(node);
> - struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> - struct cpumask *idle_smt = get_idle_smtmask_node(node);
> -
> - cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> - cpumask_copy(idle_smt, idle_cpu);
> - }
> -}
> -
> -void __scx_update_idle(struct rq *rq, bool idle)
> -{
> - int cpu = cpu_of(rq);
> - int node = cpu_to_node(cpu);
> - struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> -
> - if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
> - SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
> - if (!static_branch_unlikely(&scx_builtin_idle_enabled))
> - return;
> - }
> -
> - assign_cpu(cpu, idle_cpu, idle);
> -
> -#ifdef CONFIG_SCHED_SMT
> - if (sched_smt_active()) {
> - const struct cpumask *smt = cpu_smt_mask(cpu);
> - struct cpumask *idle_smt = get_idle_smtmask_node(node);
> -
> - if (idle) {
> - /*
> - * idle_smt handling is racy but that's fine as it's
> - * only for optimization and self-correcting.
> - */
> - for_each_cpu(cpu, smt) {
> - if (!cpumask_test_cpu(cpu, idle_cpu))
> - return;
> - }
> - cpumask_or(idle_smt, idle_smt, smt);
> - } else {
> - cpumask_andnot(idle_smt, idle_smt, smt);
> - }
> - }
> -#endif
> -}
> -
> static void handle_hotplug(struct rq *rq, bool online)
> {
> int cpu = cpu_of(rq);
> @@ -3811,21 +3259,7 @@ static void rq_offline_scx(struct rq *rq)
> {
> rq->scx.flags &= ~SCX_RQ_ONLINE;
> }
> -
> -#else /* CONFIG_SMP */
> -
> -static bool test_and_clear_cpu_idle(int cpu) { return false; }
> -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> -{
> - return -EBUSY;
> -}
> -static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
> -{
> - return -EBUSY;
> -}
> -static void reset_idle_masks(void) {}
> -
> -#endif /* CONFIG_SMP */
> +#endif /* CONFIG_SMP */
>
> static bool check_rq_for_timeouts(struct rq *rq)
> {
> @@ -6399,71 +5833,6 @@ void __init init_sched_ext_class(void)
> /********************************************************************************
> * Helpers that can be called from the BPF scheduler.
> */
> -#include <linux/btf_ids.h>
> -
> -static bool check_builtin_idle_enabled(void)
> -{
> - if (static_branch_likely(&scx_builtin_idle_enabled))
> - return true;
> -
> - scx_ops_error("built-in idle tracking is disabled");
> - return false;
> -}
> -
> -static bool check_builtin_idle_per_node_enabled(void)
> -{
> - if (static_branch_likely(&scx_builtin_idle_per_node))
> - return true;
> -
> - scx_ops_error("per-node idle tracking is disabled");
> - return false;
> -}
> -
> -__bpf_kfunc_start_defs();
> -
> -/**
> - * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
> - * @p: task_struct to select a CPU for
> - * @prev_cpu: CPU @p was on previously
> - * @wake_flags: %SCX_WAKE_* flags
> - * @is_idle: out parameter indicating whether the returned CPU is idle
> - *
> - * Can only be called from ops.select_cpu() if the built-in CPU selection is
> - * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set.
> - * @p, @prev_cpu and @wake_flags match ops.select_cpu().
> - *
> - * Returns the picked CPU with *@is_idle indicating whether the picked CPU is
> - * currently idle and thus a good candidate for direct dispatching.
> - */
> -__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> - u64 wake_flags, bool *is_idle)
> -{
> - if (!check_builtin_idle_enabled())
> - goto prev_cpu;
> -
> - if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
> - goto prev_cpu;
> -
> -#ifdef CONFIG_SMP
> - return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
> -#endif
> -
> -prev_cpu:
> - *is_idle = false;
> - return prev_cpu;
> -}
> -
> -__bpf_kfunc_end_defs();
> -
> -BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
> -BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
> -BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
> -
> -static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
> - .owner = THIS_MODULE,
> - .set = &scx_kfunc_ids_select_cpu,
> -};
> -
> static bool scx_dsq_insert_preamble(struct task_struct *p, u64 enq_flags)
> {
> if (!scx_kf_allowed(SCX_KF_ENQUEUE | SCX_KF_DISPATCH))
> @@ -7535,189 +6904,6 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
> */
> }
>
> -/**
> - * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
> - * per-CPU cpumask of a target NUMA node.
> - *
> - * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> - * valid, or running on a UP kernel.
> - */
> -__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> -{
> - if (!check_builtin_idle_per_node_enabled())
> - return cpu_none_mask;
> -
> - return get_idle_cpumask_node(node);
> -}
> -/**
> - * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> - * per-CPU cpumask of the current NUMA node.
> - *
> - * Returns an emtpy cpumask if idle tracking is not enabled, or running on a UP
> - * kernel.
> - */
> -__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> -{
> - if (!check_builtin_idle_enabled())
> - return cpu_none_mask;
> -
> - return get_idle_cpumask_node(NUMA_NO_NODE);
> -}
> -
> -/**
> - * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
> - * per-physical-core cpumask of a target NUMA node. Can be used to determine
> - * if an entire physical core is free.
> - *
> - * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> - * valid, or running on a UP kernel.
> - */
> -__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> -{
> - if (!check_builtin_idle_per_node_enabled())
> - return cpu_none_mask;
> -
> - return get_idle_smtmask_node(node);
> -}
> -
> -/**
> - * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
> - * per-physical-core cpumask of the current NUMA node. Can be used to determine
> - * if an entire physical core is free.
> - *
> - * Returns an empty cumask if idle tracking is not enabled, or running on a UP
> - * kernel.
> - */
> -__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
> -{
> - if (!check_builtin_idle_enabled())
> - return cpu_none_mask;
> -
> - return get_idle_smtmask_node(NUMA_NO_NODE);
> -}
> -
> -/**
> - * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to
> - * either the percpu, or SMT idle-tracking cpumask.
> - */
> -__bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
> -{
> - /*
> - * Empty function body because we aren't actually acquiring or releasing
> - * a reference to a global idle cpumask, which is read-only in the
> - * caller and is never released. The acquire / release semantics here
> - * are just used to make the cpumask a trusted pointer in the caller.
> - */
> -}
> -
> -/**
> - * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state
> - * @cpu: cpu to test and clear idle for
> - *
> - * Returns %true if @cpu was idle and its idle state was successfully cleared.
> - * %false otherwise.
> - *
> - * Unavailable if ops.update_idle() is implemented and
> - * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
> - */
> -__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
> -{
> - if (!check_builtin_idle_enabled())
> - return false;
> -
> - if (ops_cpu_valid(cpu, NULL))
> - return test_and_clear_cpu_idle(cpu);
> - else
> - return false;
> -}
> -
> -/**
> - * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
> - * @node: target NUMA node
> - * @cpus_allowed: Allowed cpumask
> - * @flags: %SCX_PICK_IDLE_CPU_* flags
> - *
> - * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
> - * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu
> - * was found.
> - *
> - * Unavailable if ops.update_idle() is implemented and
> - * %SCX_OPS_KEEP_BUILTIN_IDLE is not set or if %SCX_OPS_KEEP_BUILTIN_IDLE is
> - * not set.
> - */
> -__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(int node, const struct cpumask *cpus_allowed,
> - u64 flags)
> -{
> - node = validate_node(node);
> - if (node < 0)
> - return node;
> -
> - if (!check_builtin_idle_per_node_enabled())
> - return -EBUSY;
> -
> - return scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> -}
> -
> -/**
> - * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
> - * @cpus_allowed: Allowed cpumask
> - * @flags: %SCX_PICK_IDLE_CPU_* flags
> - *
> - * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu
> - * number on success. -%EBUSY if no matching cpu was found.
> - *
> - * Idle CPU tracking may race against CPU scheduling state transitions. For
> - * example, this function may return -%EBUSY as CPUs are transitioning into the
> - * idle state. If the caller then assumes that there will be dispatch events on
> - * the CPUs as they were all busy, the scheduler may end up stalling with CPUs
> - * idling while there are pending tasks. Use scx_bpf_pick_any_cpu() and
> - * scx_bpf_kick_cpu() to guarantee that there will be at least one dispatch
> - * event in the near future.
> - *
> - * Unavailable if ops.update_idle() is implemented and
> - * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
> - */
> -__bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
> - u64 flags)
> -{
> - if (!check_builtin_idle_enabled())
> - return -EBUSY;
> -
> - return scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
> -}
> -
> -/**
> - * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU
> - * @cpus_allowed: Allowed cpumask
> - * @flags: %SCX_PICK_IDLE_CPU_* flags
> - *
> - * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any
> - * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu
> - * number if @cpus_allowed is not empty. -%EBUSY is returned if @cpus_allowed is
> - * empty.
> - *
> - * If ops.update_idle() is implemented and %SCX_OPS_KEEP_BUILTIN_IDLE is not
> - * set, this function can't tell which CPUs are idle and will always pick any
> - * CPU.
> - */
> -__bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
> - u64 flags)
> -{
> - s32 cpu;
> -
> - if (static_branch_likely(&scx_builtin_idle_enabled)) {
> - cpu = scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
> - if (cpu >= 0)
> - return cpu;
> - }
> -
> - cpu = cpumask_any_distribute(cpus_allowed);
> - if (cpu < nr_cpu_ids)
> - return cpu;
> - else
> - return -EBUSY;
> -}
> -
> /**
> * scx_bpf_task_running - Is task currently running?
> * @p: task of interest
> @@ -7796,15 +6982,6 @@ BTF_ID_FLAGS(func, scx_bpf_cpu_to_node)
> BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
> -BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
> -BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE)
> -BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
> -BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE)
> -BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
> -BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
> -BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
> -BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
> -BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> new file mode 100644
> index 000000000000..ec36848c1a10
> --- /dev/null
> +++ b/kernel/sched/ext_idle.c
> @@ -0,0 +1,835 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
> + *
> + * Built-in idle CPU tracking policy.
> + *
> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
> + * Copyright (c) 2022 David Vernet <dvernet@meta.com>
> + * Copyright (c) 2024 Andrea Righi <arighi@nvidia.com>
> + */
> +
> +static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
> +static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
> +
> +static bool check_builtin_idle_enabled(void)
> +{
> + if (static_branch_likely(&scx_builtin_idle_enabled))
> + return true;
> +
> + scx_ops_error("built-in idle tracking is disabled");
> + return false;
> +}
> +
> +static bool check_builtin_idle_per_node_enabled(void)
> +{
> + if (static_branch_likely(&scx_builtin_idle_per_node))
> + return true;
> +
> + scx_ops_error("per-node idle tracking is disabled");
> + return false;
> +}
> +
> +#ifdef CONFIG_SMP
> +static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
> +
> +#ifdef CONFIG_CPUMASK_OFFSTACK
> +#define CL_ALIGNED_IF_ONSTACK
> +#else
> +#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
> +#endif
> +
> +struct idle_cpumask {
> + cpumask_var_t cpu;
> + cpumask_var_t smt;
> +};
> +
> +/*
> + * cpumasks to track idle CPUs within each NUMA node.
> + *
> + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> + * from node 0 is used to track all idle CPUs system-wide.
> + */
> +static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
> +
> +/*
> + * Make sure a NUMA node is in a valid range.
> + */
> +static int validate_node(int node)
> +{
> + /* If no node is specified, return the current one */
> + if (node == NUMA_NO_NODE)
> + return numa_node_id();
> +
> + /* Make sure node is in the range of possible nodes */
> + if (node < 0 || node >= num_possible_nodes())
> + return -EINVAL;
> +
> + return node;
> +}
> +
> +static struct cpumask *get_idle_mask_node(int node, bool smt)
> +{
> + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> + return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
> +
> + node = validate_node(node);
> + if (node < 0)
> + return cpu_none_mask;
> +
> + return smt ? idle_masks[node]->smt : idle_masks[node]->cpu;
> +}
> +
> +static struct cpumask *get_idle_cpumask_node(int node)
> +{
> + return get_idle_mask_node(node, false);
> +}
> +
> +static struct cpumask *get_idle_smtmask_node(int node)
> +{
> + return get_idle_mask_node(node, true);
> +}
> +
> +static void idle_masks_init(void)
> +{
> + int node;
> +
> + idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
> + BUG_ON(!idle_masks);
> +
> + for_each_node_state(node, N_POSSIBLE) {
> + idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
> + BUG_ON(!idle_masks[node]);
> +
> + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
> + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
> + }
> +}
> +
> +static bool test_and_clear_cpu_idle(int cpu)
> +{
> + int node = cpu_to_node(cpu);
> + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> +
> +#ifdef CONFIG_SCHED_SMT
> + /*
> + * SMT mask should be cleared whether we can claim @cpu or not. The SMT
> + * cluster is not wholly idle either way. This also prevents
> + * scx_pick_idle_cpu() from getting caught in an infinite loop.
> + */
> + if (sched_smt_active()) {
> + const struct cpumask *smt = cpu_smt_mask(cpu);
> + struct cpumask *idle_smt = get_idle_smtmask_node(node);
> +
> + /*
> + * If offline, @cpu is not its own sibling and
> + * scx_pick_idle_cpu() can get caught in an infinite loop as
> + * @cpu is never cleared from the idle SMT mask. Ensure that
> + * @cpu is eventually cleared.
> + *
> + * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
> + * reduce memory writes, which may help alleviate cache
> + * coherence pressure.
> + */
> + if (cpumask_intersects(smt, idle_smt))
> + cpumask_andnot(idle_smt, idle_smt, smt);
> + else if (cpumask_test_cpu(cpu, idle_smt))
> + __cpumask_clear_cpu(cpu, idle_smt);
> + }
> +#endif
> + return cpumask_test_and_clear_cpu(cpu, idle_cpu);
> +}
> +
> +static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
> +{
> + int cpu;
> +
> +retry:
> + if (sched_smt_active()) {
> + cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
> + if (cpu < nr_cpu_ids)
> + goto found;
> +
> + if (flags & SCX_PICK_IDLE_CORE)
> + return -EBUSY;
> + }
> +
> + cpu = cpumask_any_and_distribute(get_idle_cpumask_node(node), cpus_allowed);
> + if (cpu >= nr_cpu_ids)
> + return -EBUSY;
> +
> +found:
> + if (test_and_clear_cpu_idle(cpu))
> + return cpu;
> + goto retry;
> +
> +}
> +
> +static s32
> +scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> +{
> + nodemask_t hop_nodes = NODE_MASK_NONE;
> + int start_node = cpu_to_node(prev_cpu);
> + s32 cpu = -EBUSY;
> +
> + /*
> + * Traverse all online nodes in order of increasing distance,
> + * starting from prev_cpu's node.
> + */
> + rcu_read_lock();
> + for_each_numa_hop_node(node, start_node, hop_nodes, N_ONLINE) {
> + cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> + if (cpu >= 0)
> + break;
> + }
> + rcu_read_unlock();
> +
> + return cpu;
> +}
> +
> +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> +{
> + /*
> + * Only node 0 is used if per-node idle cpumasks are disabled.
> + */
> + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> + return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags);
> +
> + return scx_pick_idle_cpu_numa(cpus_allowed, prev_cpu, flags);
> +}
> +
> +/*
> + * Return the amount of CPUs in the same LLC domain of @cpu (or zero if the LLC
> + * domain is not defined).
> + */
> +static unsigned int llc_weight(s32 cpu)
> +{
> + struct sched_domain *sd;
> +
> + sd = rcu_dereference(per_cpu(sd_llc, cpu));
> + if (!sd)
> + return 0;
> +
> + return sd->span_weight;
> +}
> +
> +/*
> + * Return the cpumask representing the LLC domain of @cpu (or NULL if the LLC
> + * domain is not defined).
> + */
> +static struct cpumask *llc_span(s32 cpu)
> +{
> + struct sched_domain *sd;
> +
> + sd = rcu_dereference(per_cpu(sd_llc, cpu));
> + if (!sd)
> + return 0;
> +
> + return sched_domain_span(sd);
> +}
> +
> +/*
> + * Return the amount of CPUs in the same NUMA domain of @cpu (or zero if the
> + * NUMA domain is not defined).
> + */
> +static unsigned int numa_weight(s32 cpu)
> +{
> + struct sched_domain *sd;
> + struct sched_group *sg;
> +
> + sd = rcu_dereference(per_cpu(sd_numa, cpu));
> + if (!sd)
> + return 0;
> + sg = sd->groups;
> + if (!sg)
> + return 0;
> +
> + return sg->group_weight;
> +}
> +
> +/*
> + * Return true if the LLC domains do not perfectly overlap with the NUMA
> + * domains, false otherwise.
> + */
> +static bool llc_numa_mismatch(void)
> +{
> + int cpu;
> +
> + /*
> + * We need to scan all online CPUs to verify whether their scheduling
> + * domains overlap.
> + *
> + * While it is rare to encounter architectures with asymmetric NUMA
> + * topologies, CPU hotplugging or virtualized environments can result
> + * in asymmetric configurations.
> + *
> + * For example:
> + *
> + * NUMA 0:
> + * - LLC 0: cpu0..cpu7
> + * - LLC 1: cpu8..cpu15 [offline]
> + *
> + * NUMA 1:
> + * - LLC 0: cpu16..cpu23
> + * - LLC 1: cpu24..cpu31
> + *
> + * In this case, if we only check the first online CPU (cpu0), we might
> + * incorrectly assume that the LLC and NUMA domains are fully
> + * overlapping, which is incorrect (as NUMA 1 has two distinct LLC
> + * domains).
> + */
> + for_each_online_cpu(cpu)
> + if (llc_weight(cpu) != numa_weight(cpu))
> + return true;
> +
> + return false;
> +}
> +
> +/*
> + * Initialize topology-aware scheduling.
> + *
> + * Detect if the system has multiple LLC or multiple NUMA domains and enable
> + * cache-aware / NUMA-aware scheduling optimizations in the default CPU idle
> + * selection policy.
> + *
> + * Assumption: the kernel's internal topology representation assumes that each
> + * CPU belongs to a single LLC domain, and that each LLC domain is entirely
> + * contained within a single NUMA node.
> + */
> +static void update_selcpu_topology(struct sched_ext_ops *ops)
> +{
> + bool enable_llc = false;
> + unsigned int nr_cpus;
> + s32 cpu = cpumask_first(cpu_online_mask);
> +
> + /*
> + * Enable LLC domain optimization only when there are multiple LLC
> + * domains among the online CPUs. If all online CPUs are part of a
> + * single LLC domain, the idle CPU selection logic can choose any
> + * online CPU without bias.
> + *
> + * Note that it is sufficient to check the LLC domain of the first
> + * online CPU to determine whether a single LLC domain includes all
> + * CPUs.
> + */
> + rcu_read_lock();
> + nr_cpus = llc_weight(cpu);
> + if (nr_cpus > 0) {
> + if (nr_cpus < num_online_cpus())
> + enable_llc = true;
> + /*
> + * No need to enable LLC optimization if the LLC domains are
> + * perfectly overlapping with the NUMA domains when per-node
> + * cpumasks are enabled.
> + */
> + if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
> + !llc_numa_mismatch())
> + enable_llc = false;
> + pr_debug("sched_ext: LLC=%*pb weight=%u\n",
> + cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
> + }
> + rcu_read_unlock();
> +
> + pr_debug("sched_ext: LLC idle selection %s\n",
> + enable_llc ? "enabled" : "disabled");
> +
> + if (enable_llc)
> + static_branch_enable_cpuslocked(&scx_selcpu_topo_llc);
> + else
> + static_branch_disable_cpuslocked(&scx_selcpu_topo_llc);
> +
> + /*
> + * Check if we need to enable per-node cpumasks.
> + */
> + if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
> + static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
> + else
> + static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
> +}
> +
> +/*
> + * Built-in CPU idle selection policy:
> + *
> + * 1. Prioritize full-idle cores:
> + * - always prioritize CPUs from fully idle cores (both logical CPUs are
> + * idle) to avoid interference caused by SMT.
> + *
> + * 2. Reuse the same CPU:
> + * - prefer the last used CPU to take advantage of cached data (L1, L2) and
> + * branch prediction optimizations.
> + *
> + * 3. Pick a CPU within the same LLC (Last-Level Cache):
> + * - if the above conditions aren't met, pick a CPU that shares the same LLC
> + * to maintain cache locality.
> + *
> + * 4. Pick a CPU within the same NUMA node, if enabled:
> + * - choose a CPU from the same NUMA node to reduce memory access latency.
> + *
> + * 5. Pick any idle CPU usable by the task.
> + *
> + * Step 3 is performed only if the system has multiple LLC domains that are not
> + * perfectly overlapping with the NUMA domains (see scx_selcpu_topo_llc).
> + *
> + * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
> + * we never call ops.select_cpu() for them, see select_task_rq().
> + */
> +static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> + u64 wake_flags, bool *found)
> +{
> + const struct cpumask *llc_cpus = NULL;
> + int node = cpu_to_node(prev_cpu);
> + s32 cpu;
> +
> + *found = false;
> +
> + /*
> + * This is necessary to protect llc_cpus.
> + */
> + rcu_read_lock();
> +
> + /*
> + * Determine the scheduling domain only if the task is allowed to run
> + * on all CPUs.
> + *
> + * This is done primarily for efficiency, as it avoids the overhead of
> + * updating a cpumask every time we need to select an idle CPU (which
> + * can be costly in large SMP systems), but it also aligns logically:
> + * if a task's scheduling domain is restricted by user-space (through
> + * CPU affinity), the task will simply use the flat scheduling domain
> + * defined by user-space.
> + */
> + if (p->nr_cpus_allowed >= num_possible_cpus())
> + if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
> + llc_cpus = llc_span(prev_cpu);
> +
> + /*
> + * If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
> + */
> + if (wake_flags & SCX_WAKE_SYNC) {
> + cpu = smp_processor_id();
> +
> + /*
> + * If the waker's CPU is cache affine and prev_cpu is idle,
> + * then avoid a migration.
> + */
> + if (cpus_share_cache(cpu, prev_cpu) &&
> + test_and_clear_cpu_idle(prev_cpu)) {
> + cpu = prev_cpu;
> + goto cpu_found;
> + }
> +
> + /*
> + * If the waker's local DSQ is empty, and the system is under
> + * utilized, try to wake up @p to the local DSQ of the waker.
> + *
> + * Checking only for an empty local DSQ is insufficient as it
> + * could give the wakee an unfair advantage when the system is
> + * oversaturated.
> + *
> + * Checking only for the presence of idle CPUs is also
> + * insufficient as the local DSQ of the waker could have tasks
> + * piled up on it even if there is an idle core elsewhere on
> + * the system.
> + */
> + if (!(current->flags & PF_EXITING) &&
> + cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
> + !cpumask_empty(get_idle_cpumask_node(cpu_to_node(cpu)))) {
> + if (cpumask_test_cpu(cpu, p->cpus_ptr))
> + goto cpu_found;
> + }
> + }
> +
> + /*
> + * If CPU has SMT, any wholly idle CPU is likely a better pick than
> + * partially idle @prev_cpu.
> + */
> + if (sched_smt_active()) {
> + /*
> + * Keep using @prev_cpu if it's part of a fully idle core.
> + */
> + if (cpumask_test_cpu(prev_cpu, get_idle_smtmask_node(node)) &&
> + test_and_clear_cpu_idle(prev_cpu)) {
> + cpu = prev_cpu;
> + goto cpu_found;
> + }
> +
> + /*
> + * Search for any fully idle core in the same LLC domain.
> + */
> + if (llc_cpus) {
> + cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, SCX_PICK_IDLE_CORE);
> + if (cpu >= 0)
> + goto cpu_found;
> + }
> +
> + /*
> + * Search for any fully idle core usable by the task.
> + */
> + cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, SCX_PICK_IDLE_CORE);
> + if (cpu >= 0)
> + goto cpu_found;
> + }
> +
> + /*
> + * Use @prev_cpu if it's idle.
> + */
> + if (test_and_clear_cpu_idle(prev_cpu)) {
> + cpu = prev_cpu;
> + goto cpu_found;
> + }
> +
> + /*
> + * Search for any idle CPU in the same LLC domain.
> + */
> + if (llc_cpus) {
> + cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, 0);
> + if (cpu >= 0)
> + goto cpu_found;
> + }
> +
> + /*
> + * Search for any idle CPU usable by the task.
> + */
> + cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, 0);
> + if (cpu >= 0)
> + goto cpu_found;
> +
> + rcu_read_unlock();
> + return prev_cpu;
> +
> +cpu_found:
> + rcu_read_unlock();
> +
> + *found = true;
> + return cpu;
> +}
> +
> +static void reset_idle_masks(void)
> +{
> + int node;
> +
> + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> + cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
> + cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
> + return;
> + }
> +
> + /*
> + * Consider all online cpus idle. Should converge to the actual state
> + * quickly.
> + */
> + for_each_node_state(node, N_POSSIBLE) {
> + const struct cpumask *node_mask = cpumask_of_node(node);
> + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> + struct cpumask *idle_smt = get_idle_smtmask_node(node);
> +
> + cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> + cpumask_copy(idle_smt, idle_cpu);
> + }
> +}
> +
> +void __scx_update_idle(struct rq *rq, bool idle)
> +{
> + int cpu = cpu_of(rq);
> + int node = cpu_to_node(cpu);
> + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> +
> + if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
> + SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
> + if (!static_branch_unlikely(&scx_builtin_idle_enabled))
> + return;
> + }
> +
> + assign_cpu(cpu, idle_cpu, idle);
> +
> +#ifdef CONFIG_SCHED_SMT
> + if (sched_smt_active()) {
> + const struct cpumask *smt = cpu_smt_mask(cpu);
> + struct cpumask *idle_smt = get_idle_smtmask_node(node);
> +
> + if (idle) {
> + /*
> + * idle_smt handling is racy but that's fine as it's
> + * only for optimization and self-correcting.
> + */
> + for_each_cpu(cpu, smt) {
> + if (!cpumask_test_cpu(cpu, idle_cpu))
> + return;
> + }
> + cpumask_or(idle_smt, idle_smt, smt);
> + } else {
> + cpumask_andnot(idle_smt, idle_smt, smt);
> + }
> + }
> +#endif
> +}
> +#else /* !CONFIG_SMP */
> +static inline int validate_node(int node)
> +{
> + return numa_node_id();
> +}
> +
> +static struct cpumask *get_idle_cpumask_node(int node)
> +{
> + return cpu_none_mask;
> +}
> +
> +static struct cpumask *get_idle_smtmask_node(int node)
> +{
> + return cpu_none_mask;
> +}
> +
> +static bool test_and_clear_cpu_idle(int cpu) { return false; }
> +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> +{
> + return -EBUSY;
> +}
> +
> +static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
> +{
> + return -EBUSY;
> +}
> +static void reset_idle_masks(void) {}
> +#endif /* CONFIG_SMP */
> +
> +
> +/********************************************************************************
> + * Helpers that can be called from the BPF scheduler.
> + */
> +__bpf_kfunc_start_defs();
> +
> +/**
> + * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
> + * @p: task_struct to select a CPU for
> + * @prev_cpu: CPU @p was on previously
> + * @wake_flags: %SCX_WAKE_* flags
> + * @is_idle: out parameter indicating whether the returned CPU is idle
> + *
> + * Can only be called from ops.select_cpu() if the built-in CPU selection is
> + * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set.
> + * @p, @prev_cpu and @wake_flags match ops.select_cpu().
> + *
> + * Returns the picked CPU with *@is_idle indicating whether the picked CPU is
> + * currently idle and thus a good candidate for direct dispatching.
> + */
> +__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> + u64 wake_flags, bool *is_idle)
> +{
> + if (!check_builtin_idle_enabled())
> + goto prev_cpu;
> +
> + if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
> + goto prev_cpu;
> +
> +#ifdef CONFIG_SMP
> + return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
> +#endif
> +
> +prev_cpu:
> + *is_idle = false;
> + return prev_cpu;
> +}
> +
> +/**
> + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
> + * per-CPU cpumask of a target NUMA node.
> + * @node: target NUMA node
> + *
> + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> + * valid, or running on a UP kernel.
> + */
> +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> +{
> + if (!check_builtin_idle_per_node_enabled())
> + return cpu_none_mask;
> +
> + return get_idle_cpumask_node(node);
> +}
> +
> +/**
> + * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> + * per-CPU cpumask of the current NUMA node.
> + *
> + * Returns an empty cpumask if idle tracking is not enabled, or running on a UP
> + * kernel.
> + */
> +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> +{
> + if (!check_builtin_idle_enabled())
> + return cpu_none_mask;
> +
> + return get_idle_cpumask_node(NUMA_NO_NODE);
> +}
> +
> +/**
> + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
> + * per-physical-core cpumask of a target NUMA node. Can be used to determine
> + * if an entire physical core is free.
> + * @node: target NUMA node
> + *
> + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> + * valid, or running on a UP kernel.
> + */
> +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> +{
> + if (!check_builtin_idle_per_node_enabled())
> + return cpu_none_mask;
> +
> + return get_idle_smtmask_node(node);
> +}
> +
> +/**
> + * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
> + * per-physical-core cpumask of the current NUMA node. Can be used to determine
> + * if an entire physical core is free.
> + *
> + * Returns an empty cpumask if idle tracking is not enabled, or running on a UP
> + * kernel.
> + */
> +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
> +{
> + if (!check_builtin_idle_enabled())
> + return cpu_none_mask;
> +
> + return get_idle_smtmask_node(NUMA_NO_NODE);
> +}
> +
> +/**
> + * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to
> + * either the per-CPU or SMT idle-tracking cpumask.
> + * @idle_mask: &cpumask acquired through one of the scx_bpf_get_idle_*mask() kfuncs
> + */
> +__bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
> +{
> + /*
> + * Empty function body because we aren't actually acquiring or releasing
> + * a reference to a global idle cpumask, which is read-only in the
> + * caller and is never released. The acquire / release semantics here
> + * are just used to make the cpumask a trusted pointer in the caller.
> + */
> +}
> +
> +/**
> + * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state
> + * @cpu: cpu to test and clear idle for
> + *
> + * Returns %true if @cpu was idle and its idle state was successfully cleared.
> + * %false otherwise.
> + *
> + * Unavailable if ops.update_idle() is implemented and
> + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
> + */
> +__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
> +{
> + if (!check_builtin_idle_enabled())
> + return false;
> +
> + if (ops_cpu_valid(cpu, NULL))
> + return test_and_clear_cpu_idle(cpu);
> + else
> + return false;
> +}
> +
> +/**
> + * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
> + * @node: target NUMA node
> + * @cpus_allowed: Allowed cpumask
> + * @flags: %SCX_PICK_IDLE_CPU_* flags
> + *
> + * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
> + * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu
> + * was found.
> + *
> + * Unavailable if ops.update_idle() is implemented and
> + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set, or if %SCX_OPS_BUILTIN_IDLE_PER_NODE
> + * is not set.
> + */
> +__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(int node, const struct cpumask *cpus_allowed,
> + u64 flags)
> +{
> + node = validate_node(node);
> + if (node < 0)
> + return node;
> +
> + if (!check_builtin_idle_per_node_enabled())
> + return -EBUSY;
> +
> + return scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> +}
> +
> +/**
> + * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
> + * @cpus_allowed: Allowed cpumask
> + * @flags: %SCX_PICK_IDLE_CPU_* flags
> + *
> + * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu
> + * number on success. -%EBUSY if no matching cpu was found.
> + *
> + * Idle CPU tracking may race against CPU scheduling state transitions. For
> + * example, this function may return -%EBUSY as CPUs are transitioning into the
> + * idle state. If the caller then assumes that there will be dispatch events on
> + * the CPUs as they were all busy, the scheduler may end up stalling with CPUs
> + * idling while there are pending tasks. Use scx_bpf_pick_any_cpu() and
> + * scx_bpf_kick_cpu() to guarantee that there will be at least one dispatch
> + * event in the near future.
> + *
> + * Unavailable if ops.update_idle() is implemented and
> + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
> + */
> +__bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
> + u64 flags)
> +{
> + if (!check_builtin_idle_enabled())
> + return -EBUSY;
> +
> + return scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
> +}
> +
> +/**
> + * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU
> + * @cpus_allowed: Allowed cpumask
> + * @flags: %SCX_PICK_IDLE_CPU_* flags
> + *
> + * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any
> + * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu
> + * number if @cpus_allowed is not empty. -%EBUSY is returned if @cpus_allowed is
> + * empty.
> + *
> + * If ops.update_idle() is implemented and %SCX_OPS_KEEP_BUILTIN_IDLE is not
> + * set, this function can't tell which CPUs are idle and will always pick any
> + * CPU.
> + */
> +__bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
> + u64 flags)
> +{
> + s32 cpu;
> +
> + if (static_branch_likely(&scx_builtin_idle_enabled)) {
> + cpu = scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
> + if (cpu >= 0)
> + return cpu;
> + }
> +
> + cpu = cpumask_any_distribute(cpus_allowed);
> + if (cpu < nr_cpu_ids)
> + return cpu;
> + else
> + return -EBUSY;
> +}
> +
> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
> +BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE)
> +BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
> +BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
> +BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
> +BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
> +
> +static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
> + .owner = THIS_MODULE,
> + .set = &scx_kfunc_ids_select_cpu,
> +};
> --
> 2.47.1
* Re: [PATCH 6/6] sched_ext: Move built-in idle CPU selection policy to a separate file
2024-12-20 14:53 ` Yury Norov
@ 2024-12-20 14:58 ` Andrea Righi
0 siblings, 0 replies; 23+ messages in thread
From: Andrea Righi @ 2024-12-20 14:58 UTC (permalink / raw)
To: Yury Norov
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel
On Fri, Dec 20, 2024 at 06:53:28AM -0800, Yury Norov wrote:
> On Tue, Dec 17, 2024 at 10:32:31AM +0100, Andrea Righi wrote:
> > As ext.c is becoming quite large, move the idle CPU selection policy
> > to a separate file (ext_idle.c) for better code readability.
> >
> > No functional change, this is purely code reorganization.
> >
> > Suggested-by: Yury Norov <yury.norov@gmail.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
>
> I think it should be a preparation patch at the beginning of the
> series. The one major downside of patches of that sort is that when
> moving code, you wipe git history. If you have to do that, you need
> to move code before doing the work, such that at least your updates
> will be accessible to git blame.
Right, I've already moved this at the beginning. :)
I'm re-running the usual set of tests and will likely send a new patch set
if everything looks good.
Thanks!
-Andrea
* Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks
2024-12-17 9:32 ` [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks Andrea Righi
2024-12-17 23:22 ` Tejun Heo
2024-12-17 23:23 ` Tejun Heo
@ 2024-12-20 16:48 ` Yury Norov
2024-12-20 17:52 ` Andrea Righi
2 siblings, 1 reply; 23+ messages in thread
From: Yury Norov @ 2024-12-20 16:48 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Rasmus Villemoes, linux-kernel
On Tue, Dec 17, 2024 at 10:32:28AM +0100, Andrea Righi wrote:
> Using a single global idle mask can lead to inefficiencies and a lot of
> stress on the cache coherency protocol on large systems with multiple
> NUMA nodes, since all the CPUs can create a really intense read/write
> activity on the single global cpumask.
>
> Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> to improve scalability and performance on large systems.
>
> The concept is that each cpumask will track only the idle CPUs within
> its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> In this way concurrent access to the idle cpumask will be restricted
> within each NUMA node.
>
> NOTE: with per-node idle cpumasks enabled scx_bpf_get_idle_cpu/smtmask()
> are returning the cpumask of the current NUMA node, instead of a single
> cpumask for all the CPUs.
>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/ext.c | 281 +++++++++++++++++++++++++++++++++------------
> 1 file changed, 209 insertions(+), 72 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index a17abd2df4d4..d4666db4a212 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -894,6 +894,7 @@ static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
> #ifdef CONFIG_SMP
> static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
> static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
> +static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
> #endif
>
> static struct static_key_false scx_has_op[SCX_OPI_END] =
> @@ -937,11 +938,82 @@ static struct delayed_work scx_watchdog_work;
> #define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
> #endif
>
> -static struct {
> +struct idle_cpumask {
> cpumask_var_t cpu;
> cpumask_var_t smt;
> -} idle_masks CL_ALIGNED_IF_ONSTACK;
> +};
> +
> +/*
> + * Make sure a NUMA node is in a valid range.
> + */
> +static int validate_node(int node)
> +{
> + /* If no node is specified, return the current one */
> + if (node == NUMA_NO_NODE)
> + return numa_node_id();
> +
> + /* Make sure node is in the range of possible nodes */
> + if (node < 0 || node >= num_possible_nodes())
> + return -EINVAL;
> +
> + return node;
> +}
This is needed in BPF code, right? Kernel users should always provide
correct parameters.
> +/*
> + * cpumasks to track idle CPUs within each NUMA node.
> + *
> + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> + * from node 0 is used to track all idle CPUs system-wide.
> + */
> +static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
> +
> +static struct cpumask *get_idle_mask_node(int node, bool smt)
Like Rasmus said in another thread, this 'get' prefix looks weird.
> +/**
> + * get_parity8 - get the parity of an u8 value
I know it's prototypical bikeshedding, but what's with the "get_"
prefix? Certainly the purpose of any pure function is to "get" the
result of some computation. We don't have "get_strlen()".
Please either just "parity8", or if you want to emphasize that this
belongs in the bit-munging family, "bit_parity8".
So maybe idle_nodes(), idle_smt_nodes() and so on?
> +{
> + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> + return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
> +
> + node = validate_node(node);
> + if (node < 0)
> + return cpu_none_mask;
cpu_none_mask is a valid pointer. How should user distinguish invalid
parameter and empty nodemask? You should return -EINVAL.
Anyways...
This would make a horrid mess. Imagine I'm your user, and I'm
experimenting with this idle feature. I decided to begin with
per-node idle masks disabled and call it for example like:
get_idle_mask_node(random(), true)
And it works! I get all my idle CPUs just fine. So I happily build
my complex and huge codebase on top of it.
Then one day I decide to enable those fancy per-node idle masks
because you said they unload interconnect and whatever... At that
point what I want is to click a button just to collect some perf
numbers. So I do that, and I find all my codebase coredumped in some
random place... Because the per-node idle mask basic functions
simply have a different interface: they disallow node = random() as a
parameter, while the global idle mask is OK with it.
> +
> + return smt ? idle_masks[node]->smt : idle_masks[node]->cpu;
> +}
> +
> +static struct cpumask *get_idle_cpumask_node(int node)
inline or __always_inline?
> +{
> + return get_idle_mask_node(node, false);
> +}
> +
> +static struct cpumask *get_idle_smtmask_node(int node)
> +{
> + return get_idle_mask_node(node, true);
> +}
> +
> +static void idle_masks_init(void)
> +{
> + int node;
> +
> + idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
> + BUG_ON(!idle_masks);
> +
> + for_each_node_state(node, N_POSSIBLE) {
> + idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
> + BUG_ON(!idle_masks[node]);
> +
> + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
> + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
> + }
> +}
> +#else /* !CONFIG_SMP */
> +static struct cpumask *get_idle_cpumask_node(int node)
> +{
> + return cpu_none_mask;
> +}
>
> +static struct cpumask *get_idle_smtmask_node(int node)
> +{
> + return cpu_none_mask;
> +}
> #endif /* CONFIG_SMP */
>
> /* for %SCX_KICK_WAIT */
> @@ -3173,6 +3245,9 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
>
> static bool test_and_clear_cpu_idle(int cpu)
> {
> + int node = cpu_to_node(cpu);
> + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> +
> #ifdef CONFIG_SCHED_SMT
> /*
> * SMT mask should be cleared whether we can claim @cpu or not. The SMT
> @@ -3181,29 +3256,34 @@ static bool test_and_clear_cpu_idle(int cpu)
> */
> if (sched_smt_active()) {
> const struct cpumask *smt = cpu_smt_mask(cpu);
> + struct cpumask *idle_smt = get_idle_smtmask_node(node);
>
> /*
> * If offline, @cpu is not its own sibling and
> * scx_pick_idle_cpu() can get caught in an infinite loop as
> - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> - * is eventually cleared.
> + * @cpu is never cleared from the idle SMT mask. Ensure that
> + * @cpu is eventually cleared.
> + *
> + * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
> + * reduce memory writes, which may help alleviate cache
> + * coherence pressure.
> */
> - if (cpumask_intersects(smt, idle_masks.smt))
> - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> - else if (cpumask_test_cpu(cpu, idle_masks.smt))
> - __cpumask_clear_cpu(cpu, idle_masks.smt);
> + if (cpumask_intersects(smt, idle_smt))
> + cpumask_andnot(idle_smt, idle_smt, smt);
> + else if (cpumask_test_cpu(cpu, idle_smt))
> + __cpumask_clear_cpu(cpu, idle_smt);
> }
> #endif
> - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> + return cpumask_test_and_clear_cpu(cpu, idle_cpu);
> }
>
> -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> +static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
> {
> int cpu;
>
> retry:
> if (sched_smt_active()) {
> - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> + cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
> if (cpu < nr_cpu_ids)
> goto found;
>
> @@ -3211,7 +3291,7 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> return -EBUSY;
> }
>
> - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> + cpu = cpumask_any_and_distribute(get_idle_cpumask_node(node), cpus_allowed);
> if (cpu >= nr_cpu_ids)
> return -EBUSY;
>
> @@ -3220,6 +3300,40 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> return cpu;
> else
> goto retry;
> +
> +}
> +
> +static s32
> +scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
We have a sort of convention here. If you pick a cpu based on NUMA
distances, the 'numa' should be a 2nd prefix after a subsystem where
it's used. Refer for example:
sched_numa_find_nth_cpu()
sched_numa_hop_mask()
for_each_numa_hop_mask()
So I'd name it scx_numa_idle_cpu()
> +{
> + nodemask_t hop_nodes = NODE_MASK_NONE;
> + int start_node = cpu_to_node(prev_cpu);
> + s32 cpu = -EBUSY;
> +
> + /*
> + * Traverse all online nodes in order of increasing distance,
> + * starting from prev_cpu's node.
> + */
> + rcu_read_lock();
RCU is a leftover from for_each_numa_hop_mask(), I guess?
> + for_each_numa_hop_node(node, start_node, hop_nodes, N_ONLINE) {
> + cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> + if (cpu >= 0)
> + break;
> + }
> + rcu_read_unlock();
> +
> + return cpu;
> +}
> +
> +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> +{
> + /*
> + * Only node 0 is used if per-node idle cpumasks are disabled.
> + */
> + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> + return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags);
> +
> + return scx_pick_idle_cpu_numa(cpus_allowed, prev_cpu, flags);
> }
>
> /*
> @@ -3339,7 +3453,7 @@ static bool llc_numa_mismatch(void)
> * CPU belongs to a single LLC domain, and that each LLC domain is entirely
> * contained within a single NUMA node.
> */
> -static void update_selcpu_topology(void)
> +static void update_selcpu_topology(struct sched_ext_ops *ops)
> {
> bool enable_llc = false, enable_numa = false;
> unsigned int nr_cpus;
> @@ -3360,6 +3474,14 @@ static void update_selcpu_topology(void)
> if (nr_cpus > 0) {
> if (nr_cpus < num_online_cpus())
> enable_llc = true;
> + /*
> + * No need to enable LLC optimization if the LLC domains are
> + * perfectly overlapping with the NUMA domains when per-node
> + * cpumasks are enabled.
> + */
> + if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
> + !llc_numa_mismatch())
> + enable_llc = false;
> pr_debug("sched_ext: LLC=%*pb weight=%u\n",
> cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
> }
> @@ -3395,6 +3517,14 @@ static void update_selcpu_topology(void)
> static_branch_enable_cpuslocked(&scx_selcpu_topo_numa);
> else
> static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
> +
> + /*
> + * Check if we need to enable per-node cpumasks.
> + */
> + if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
> + static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
> + else
> + static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
> }
>
> /*
> @@ -3415,6 +3545,8 @@ static void update_selcpu_topology(void)
> * 4. Pick a CPU within the same NUMA node, if enabled:
> * - choose a CPU from the same NUMA node to reduce memory access latency.
> *
> + * 5. Pick any idle CPU usable by the task.
> + *
I would make it a separate patch, or even send it independently. It
doesn't look related to the series...
> * Step 3 and 4 are performed only if the system has, respectively, multiple
> * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
> * scx_selcpu_topo_numa).
> @@ -3427,6 +3559,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> {
> const struct cpumask *llc_cpus = NULL;
> const struct cpumask *numa_cpus = NULL;
> + int node = cpu_to_node(prev_cpu);
> s32 cpu;
>
> *found = false;
> @@ -3484,9 +3617,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * piled up on it even if there is an idle core elsewhere on
> * the system.
> */
> - if (!cpumask_empty(idle_masks.cpu) &&
> - !(current->flags & PF_EXITING) &&
> - cpu_rq(cpu)->scx.local_dsq.nr == 0) {
> + if (!(current->flags & PF_EXITING) &&
> + cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
> + !cpumask_empty(get_idle_cpumask_node(cpu_to_node(cpu)))) {
> if (cpumask_test_cpu(cpu, p->cpus_ptr))
> goto cpu_found;
> }
> @@ -3500,7 +3633,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> /*
> * Keep using @prev_cpu if it's part of a fully idle core.
> */
> - if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
> + if (cpumask_test_cpu(prev_cpu, get_idle_smtmask_node(node)) &&
> test_and_clear_cpu_idle(prev_cpu)) {
> cpu = prev_cpu;
> goto cpu_found;
> @@ -3510,7 +3643,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * Search for any fully idle core in the same LLC domain.
> */
> if (llc_cpus) {
> - cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
> + cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, SCX_PICK_IDLE_CORE);
> if (cpu >= 0)
> goto cpu_found;
> }
> @@ -3519,7 +3652,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * Search for any fully idle core in the same NUMA node.
> */
> if (numa_cpus) {
> - cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
> + cpu = scx_pick_idle_cpu_from_node(node, numa_cpus, SCX_PICK_IDLE_CORE);
> if (cpu >= 0)
> goto cpu_found;
> }
> @@ -3527,7 +3660,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> /*
> * Search for any full idle core usable by the task.
> */
> - cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
> + cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, SCX_PICK_IDLE_CORE);
> if (cpu >= 0)
> goto cpu_found;
> }
> @@ -3544,7 +3677,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * Search for any idle CPU in the same LLC domain.
> */
> if (llc_cpus) {
> - cpu = scx_pick_idle_cpu(llc_cpus, 0);
> + cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, 0);
> if (cpu >= 0)
> goto cpu_found;
> }
> @@ -3553,7 +3686,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * Search for any idle CPU in the same NUMA node.
> */
> if (numa_cpus) {
> - cpu = scx_pick_idle_cpu(numa_cpus, 0);
> + cpu = scx_pick_idle_cpu_from_node(node, numa_cpus, 0);
> if (cpu >= 0)
> goto cpu_found;
> }
> @@ -3561,7 +3694,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> /*
> * Search for any idle CPU usable by the task.
> */
> - cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
> + cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, 0);
> if (cpu >= 0)
> goto cpu_found;
>
> @@ -3643,17 +3776,33 @@ static void set_cpus_allowed_scx(struct task_struct *p,
>
> static void reset_idle_masks(void)
> {
> + int node;
> +
> + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> + cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
> + cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
> + return;
> + }
> +
> /*
> * Consider all online cpus idle. Should converge to the actual state
> * quickly.
> */
> - cpumask_copy(idle_masks.cpu, cpu_online_mask);
> - cpumask_copy(idle_masks.smt, cpu_online_mask);
> + for_each_node_state(node, N_POSSIBLE) {
for_each_node(node)
If the comment above is still valid, you don't need to visit every
possible node. You need to visit only N_ONLINE, or even N_CPU nodes.
This adds to the discussion in patch #1. Taking care of CPU-less nodes
is useless for the scheduler purposes. And if you don't even initialize
them, then in for_each_numa_hop_node() you should skip those nodes.
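
With the iterator added in patch #1 that should just be a matter of passing a
different node state, e.g. (assuming the signature used in this series):

    for_each_numa_hop_node(node, start_node, hop_nodes, N_CPU)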
> + const struct cpumask *node_mask = cpumask_of_node(node);
> + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> + struct cpumask *idle_smt = get_idle_smtmask_node(node);
> +
> + cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> + cpumask_copy(idle_smt, idle_cpu);
> + }
> }
>
> void __scx_update_idle(struct rq *rq, bool idle)
> {
> int cpu = cpu_of(rq);
> + int node = cpu_to_node(cpu);
> + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
>
> if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
> SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
> @@ -3661,27 +3810,25 @@ void __scx_update_idle(struct rq *rq, bool idle)
> return;
> }
>
> - if (idle)
> - cpumask_set_cpu(cpu, idle_masks.cpu);
> - else
> - cpumask_clear_cpu(cpu, idle_masks.cpu);
> + assign_cpu(cpu, idle_cpu, idle);
Can you also make it a separate preparation patch?
>
> #ifdef CONFIG_SCHED_SMT
> if (sched_smt_active()) {
> const struct cpumask *smt = cpu_smt_mask(cpu);
> + struct cpumask *idle_smt = get_idle_smtmask_node(node);
>
> if (idle) {
> /*
> - * idle_masks.smt handling is racy but that's fine as
> - * it's only for optimization and self-correcting.
> + * idle_smt handling is racy but that's fine as it's
> + * only for optimization and self-correcting.
> */
> for_each_cpu(cpu, smt) {
> - if (!cpumask_test_cpu(cpu, idle_masks.cpu))
> + if (!cpumask_test_cpu(cpu, idle_cpu))
> return;
> }
So, we continue only if all smt cpus are idle. This looks like an
opencoded cpumask_subset():
if (!cpumask_subset(idle_cpu, smt))
return;
Is that true? If so, can you please again make it a small
preparation patch?
> - cpumask_or(idle_masks.smt, idle_masks.smt, smt);
> + cpumask_or(idle_smt, idle_smt, smt);
> } else {
> - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> + cpumask_andnot(idle_smt, idle_smt, smt);
> }
> }
> #endif
> @@ -3694,7 +3841,7 @@ static void handle_hotplug(struct rq *rq, bool online)
> atomic_long_inc(&scx_hotplug_seq);
>
> if (scx_enabled())
> - update_selcpu_topology();
> + update_selcpu_topology(&scx_ops);
>
> if (online && SCX_HAS_OP(cpu_online))
> SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu);
> @@ -3729,7 +3876,10 @@ static void rq_offline_scx(struct rq *rq)
> #else /* CONFIG_SMP */
>
> static bool test_and_clear_cpu_idle(int cpu) { return false; }
> -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; }
> +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> +{
> + return -EBUSY;
> +}
> static void reset_idle_masks(void) {}
>
> #endif /* CONFIG_SMP */
> @@ -5579,7 +5729,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>
> check_hotplug_seq(ops);
> #ifdef CONFIG_SMP
> - update_selcpu_topology();
> + update_selcpu_topology(ops);
> #endif
> cpus_read_unlock();
>
> @@ -6272,8 +6422,7 @@ void __init init_sched_ext_class(void)
>
> BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
> #ifdef CONFIG_SMP
> - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
> - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
> + idle_masks_init();
> #endif
> scx_kick_cpus_pnt_seqs =
> __alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
> @@ -6309,6 +6458,15 @@ void __init init_sched_ext_class(void)
> */
> #include <linux/btf_ids.h>
>
> +static bool check_builtin_idle_enabled(void)
> +{
> + if (static_branch_likely(&scx_builtin_idle_enabled))
> + return true;
> +
> + scx_ops_error("built-in idle tracking is disabled");
> + return false;
> +}
> +
> __bpf_kfunc_start_defs();
>
> /**
> @@ -6328,10 +6486,8 @@ __bpf_kfunc_start_defs();
> __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> u64 wake_flags, bool *is_idle)
> {
> - if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> - scx_ops_error("built-in idle tracking is disabled");
> + if (!check_builtin_idle_enabled())
> goto prev_cpu;
> - }
>
> if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
> goto prev_cpu;
> @@ -7419,46 +7575,31 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
>
> /**
> * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> - * per-CPU cpumask.
> + * per-CPU cpumask of the current NUMA node.
> *
> * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
> */
> __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> {
> - if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> - scx_ops_error("built-in idle tracking is disabled");
> + if (!check_builtin_idle_enabled())
> return cpu_none_mask;
> - }
>
> -#ifdef CONFIG_SMP
> - return idle_masks.cpu;
> -#else
> - return cpu_none_mask;
> -#endif
> + return get_idle_cpumask_node(NUMA_NO_NODE);
> }
>
> /**
> * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
> - * per-physical-core cpumask. Can be used to determine if an entire physical
> - * core is free.
> + * per-physical-core cpumask of the current NUMA node. Can be used to determine
> + * if an entire physical core is free.
> *
> * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
> */
> __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
> {
> - if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> - scx_ops_error("built-in idle tracking is disabled");
> + if (!check_builtin_idle_enabled())
> return cpu_none_mask;
> - }
>
> -#ifdef CONFIG_SMP
> - if (sched_smt_active())
> - return idle_masks.smt;
> - else
> - return idle_masks.cpu;
> -#else
> - return cpu_none_mask;
> -#endif
> + return get_idle_smtmask_node(NUMA_NO_NODE);
> }
>
> /**
> @@ -7487,10 +7628,8 @@ __bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
> */
> __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
> {
> - if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> - scx_ops_error("built-in idle tracking is disabled");
> + if (!check_builtin_idle_enabled())
> return false;
> - }
>
> if (ops_cpu_valid(cpu, NULL))
> return test_and_clear_cpu_idle(cpu);
> @@ -7520,12 +7659,10 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
> __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
> u64 flags)
> {
> - if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> - scx_ops_error("built-in idle tracking is disabled");
> + if (!check_builtin_idle_enabled())
> return -EBUSY;
> - }
>
> - return scx_pick_idle_cpu(cpus_allowed, flags);
> + return scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
> }
>
> /**
> @@ -7548,7 +7685,7 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
> s32 cpu;
>
> if (static_branch_likely(&scx_builtin_idle_enabled)) {
> - cpu = scx_pick_idle_cpu(cpus_allowed, flags);
> + cpu = scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
> if (cpu >= 0)
> return cpu;
> }
> --
> 2.47.1
* Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks
2024-12-20 16:48 ` Yury Norov
@ 2024-12-20 17:52 ` Andrea Righi
0 siblings, 0 replies; 23+ messages in thread
From: Andrea Righi @ 2024-12-20 17:52 UTC (permalink / raw)
To: Yury Norov
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Rasmus Villemoes, linux-kernel
On Fri, Dec 20, 2024 at 08:48:55AM -0800, Yury Norov wrote:
> On Tue, Dec 17, 2024 at 10:32:28AM +0100, Andrea Righi wrote:
> > Using a single global idle mask can lead to inefficiencies and a lot of
> > stress on the cache coherency protocol on large systems with multiple
> > NUMA nodes, since all the CPUs can create a really intense read/write
> > activity on the single global cpumask.
> >
> > Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> > to improve scalability and performance on large systems.
> >
> > The concept is that each cpumask will track only the idle CPUs within
> > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> > In this way concurrent access to the idle cpumask will be restricted
> > within each NUMA node.
> >
> > NOTE: with per-node idle cpumasks enabled scx_bpf_get_idle_cpu/smtmask()
> > are returning the cpumask of the current NUMA node, instead of a single
> > cpumask for all the CPUs.
> >
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/ext.c | 281 +++++++++++++++++++++++++++++++++------------
> > 1 file changed, 209 insertions(+), 72 deletions(-)
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index a17abd2df4d4..d4666db4a212 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -894,6 +894,7 @@ static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
> > #ifdef CONFIG_SMP
> > static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
> > static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
> > +static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
> > #endif
> >
> > static struct static_key_false scx_has_op[SCX_OPI_END] =
> > @@ -937,11 +938,82 @@ static struct delayed_work scx_watchdog_work;
> > #define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
> > #endif
> >
> > -static struct {
> > +struct idle_cpumask {
> > cpumask_var_t cpu;
> > cpumask_var_t smt;
> > -} idle_masks CL_ALIGNED_IF_ONSTACK;
> > +};
> > +
> > +/*
> > + * Make sure a NUMA node is in a valid range.
> > + */
> > +static int validate_node(int node)
> > +{
> > + /* If no node is specified, return the current one */
> > + if (node == NUMA_NO_NODE)
> > + return numa_node_id();
> > +
> > + /* Make sure node is in the range of possible nodes */
> > + if (node < 0 || node >= num_possible_nodes())
> > + return -EINVAL;
> > +
> > + return node;
> > +}
>
> This is needed in BPF code, right? Kernel users should always provide
> correct parameters.
This should be used only in the kfuncs; I should probably move it down to
the BPF helpers section. For internal kernel functions we may want to use
WARN_ON_ONCE() instead.
>
> > +/*
> > + * cpumasks to track idle CPUs within each NUMA node.
> > + *
> > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> > + * from node 0 is used to track all idle CPUs system-wide.
> > + */
> > +static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
> > +
> > +static struct cpumask *get_idle_mask_node(int node, bool smt)
>
> Like Rasmus said in another thread, this 'get' prefix looks weird.
>
> > +/**
> > + * get_parity8 - get the parity of an u8 value
>
> I know it's prototypical bikeshedding, but what's with the "get_"
> prefix? Certainly the purpose of any pure function is to "get" the
> result of some computation. We don't have "get_strlen()".
>
> Please either just "parity8", or if you want to emphasize that this
> belongs in the bit-munging family, "bit_parity8".
>
> So maybe idle_nodes(), idle_smt_nodes() and so on?
It's a bit different here, because in theory we get a "reference" to the
object (even if we're not refcounting it), so it's not a pure function.
But I'm also totally fine with getting rid of the "get_" prefix.
>
> > +{
> > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > + return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
> > +
> > + node = validate_node(node);
> > + if (node < 0)
> > + return cpu_none_mask;
>
> cpu_none_mask is a valid pointer. How should user distinguish invalid
> parameter and empty nodemask? You should return -EINVAL.
>
> Anyways...
>
> This would make a horrid mess. Imagine I'm your user, and I'm
> experimenting with this idle feature. I decided to begin with
> per-node idle masks disabled and call it for example like:
>
> get_idle_mask_node(random(), true)
>
> And it works! I get all my idle CPUs just fine. So I happily build
> my complex and huge codebase on top of it.
>
> Then one day I decide to enable those fancy per-node idle masks
> because you said they unload interconnect and whatever... At that
> point what I want is to click a button just to collect some perf
> numbers. So I do that, and I find all my codebase coredumped in some
> random place... Because the per-node idle mask basic functions
> simply have a different interface: they disallow node = random() as a
> parameter, while the global idle mask is OK with it.
So, the idea here is to use validate_node() to catch incorrect nodes
provided by the user (scx scheduler) and trigger an scx_ops_error(), which
will force the scheduler to exit with an error.
In this context cpu_none_mask serves as a "temporary harmless value" that
can be returned to the caller, which will then just exit.
Again, we could add a sanity check for the kernel code here as well, like a
WARN_ON_ONCE() to catch potential incorrect values that might be used by
other internal kernel functions (and still return cpu_none_mask).
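Something along these lines, purely as a sketch on top of the helpers in this
patch (same names, only the error handling changes):

static struct cpumask *get_idle_mask_node(int node, bool smt)
{
        if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
                return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;

        /*
         * Kernel-internal callers are expected to pass NUMA_NO_NODE or a
         * valid node id: warn once on anything else and fall back to the
         * harmless empty cpumask.
         */
        node = validate_node(node);
        if (WARN_ON_ONCE(node < 0))
                return cpu_none_mask;

        return smt ? idle_masks[node]->smt : idle_masks[node]->cpu;
}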
>
> > +
> > + return smt ? idle_masks[node]->smt : idle_masks[node]->cpu;
> > +}
> > +
> > +static struct cpumask *get_idle_cpumask_node(int node)
>
> inline or __always_inline?
Ok.
> > +
> > +static s32
> > +scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
>
> We have a sort of convention here. If you pick a cpu based on NUMA
> distances, the 'numa' should be a 2nd prefix after a subsystem where
> it's used. Refer for example:
>
> sched_numa_find_nth_cpu()
> sched_numa_hop_mask()
> for_each_numa_hop_mask()
>
> So I'd name it scx_numa_idle_cpu()
Ah, that's good to know, thanks!
(the _numa variant has been removed in v8, but it's still good to know)
>
> > +{
> > + nodemask_t hop_nodes = NODE_MASK_NONE;
> > + int start_node = cpu_to_node(prev_cpu);
> > + s32 cpu = -EBUSY;
> > +
> > + /*
> > + * Traverse all online nodes in order of increasing distance,
> > + * starting from prev_cpu's node.
> > + */
> > + rcu_read_lock();
>
> RCU is a leftover from for_each_numa_hop_mask(), I guess?
Correct! Already removed in the next version.
>
> > + for_each_numa_hop_node(node, start_node, hop_nodes, N_ONLINE) {
> > + cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> > + if (cpu >= 0)
> > + break;
> > + }
> > + rcu_read_unlock();
> > +
> > + return cpu;
> > +}
> > +
> > +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> > +{
> > + /*
> > + * Only node 0 is used if per-node idle cpumasks are disabled.
> > + */
> > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > + return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags);
> > +
> > + return scx_pick_idle_cpu_numa(cpus_allowed, prev_cpu, flags);
> > }
> >
> > /*
> > @@ -3339,7 +3453,7 @@ static bool llc_numa_mismatch(void)
> > * CPU belongs to a single LLC domain, and that each LLC domain is entirely
> > * contained within a single NUMA node.
> > */
> > -static void update_selcpu_topology(void)
> > +static void update_selcpu_topology(struct sched_ext_ops *ops)
> > {
> > bool enable_llc = false, enable_numa = false;
> > unsigned int nr_cpus;
> > @@ -3360,6 +3474,14 @@ static void update_selcpu_topology(void)
> > if (nr_cpus > 0) {
> > if (nr_cpus < num_online_cpus())
> > enable_llc = true;
> > + /*
> > + * No need to enable LLC optimization if the LLC domains are
> > + * perfectly overlapping with the NUMA domains when per-node
> > + * cpumasks are enabled.
> > + */
> > + if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
> > + !llc_numa_mismatch())
> > + enable_llc = false;
> > pr_debug("sched_ext: LLC=%*pb weight=%u\n",
> > cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
> > }
> > @@ -3395,6 +3517,14 @@ static void update_selcpu_topology(void)
> > static_branch_enable_cpuslocked(&scx_selcpu_topo_numa);
> > else
> > static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
> > +
> > + /*
> > + * Check if we need to enable per-node cpumasks.
> > + */
> > + if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
> > + static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
> > + else
> > + static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
> > }
> >
> > /*
> > @@ -3415,6 +3545,8 @@ static void update_selcpu_topology(void)
> > * 4. Pick a CPU within the same NUMA node, if enabled:
> > * - choose a CPU from the same NUMA node to reduce memory access latency.
> > *
> > + * 5. Pick any idle CPU usable by the task.
> > + *
>
> I would make it a separate patch, or even send it independently. It
> doesn't look related to the series...
Right, it's a separate preparation patch in the next version. But I'll
probably send this as a totally separate patch.
> > static void reset_idle_masks(void)
> > {
> > + int node;
> > +
> > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> > + cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
> > + cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
> > + return;
> > + }
> > +
> > /*
> > * Consider all online cpus idle. Should converge to the actual state
> > * quickly.
> > */
> > - cpumask_copy(idle_masks.cpu, cpu_online_mask);
> > - cpumask_copy(idle_masks.smt, cpu_online_mask);
> > + for_each_node_state(node, N_POSSIBLE) {
>
> for_each_node(node)
>
> If the comment above is still valid, you don't need to visit every
> possible node. You need to visit only N_ONLINE, or even N_CPU nodes.
>
> This adds to the discussion in patch #1. Taking care of CPU-less nodes
> is useless for the scheduler purposes. And if you don't even initialize
> them, then in for_each_numa_hop_node() you should skip those nodes.
Don't we need to worry about nodes going online/offline and dealing with
mem_hotplug_lock as mentioned by Tejun?
Maybe it's safer to initialize all of them here and visit the N_CPU nodes
when looking for idle CPUs? Or am I missing something?
Or maybe we could even initialize and navigate the N_CPU nodes and trigger
a scheduler restart on hotplug events if SCX_OPS_BUILTIN_IDLE_PER_NODE is
enabled...
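
For reference, restricting the refresh to CPU-bearing nodes would just mean
iterating N_CPU instead of N_POSSIBLE in reset_idle_masks(), e.g. (sketch,
assuming idle_masks_init() keeps allocating the cpumasks for every possible
node, so a node that gains CPUs later still has its masks in place):

for_each_node_state(node, N_CPU) {
        const struct cpumask *node_mask = cpumask_of_node(node);
        struct cpumask *idle_cpu = get_idle_cpumask_node(node);
        struct cpumask *idle_smt = get_idle_smtmask_node(node);

        /* Consider all online CPUs of this node idle */
        cpumask_and(idle_cpu, cpu_online_mask, node_mask);
        cpumask_copy(idle_smt, idle_cpu);
}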
>
> > + const struct cpumask *node_mask = cpumask_of_node(node);
> > + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> > + struct cpumask *idle_smt = get_idle_smtmask_node(node);
> > +
> > + cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> > + cpumask_copy(idle_smt, idle_cpu);
> > + }
> > }
> >
> > void __scx_update_idle(struct rq *rq, bool idle)
> > {
> > int cpu = cpu_of(rq);
> > + int node = cpu_to_node(cpu);
> > + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> >
> > if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
> > SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
> > @@ -3661,27 +3810,25 @@ void __scx_update_idle(struct rq *rq, bool idle)
> > return;
> > }
> >
> > - if (idle)
> > - cpumask_set_cpu(cpu, idle_masks.cpu);
> > - else
> > - cpumask_clear_cpu(cpu, idle_masks.cpu);
> > + assign_cpu(cpu, idle_cpu, idle);
>
> Can you also make it a separate preparation patch?
Yep, done in the next version.
>
> >
> > #ifdef CONFIG_SCHED_SMT
> > if (sched_smt_active()) {
> > const struct cpumask *smt = cpu_smt_mask(cpu);
> > + struct cpumask *idle_smt = get_idle_smtmask_node(node);
> >
> > if (idle) {
> > /*
> > - * idle_masks.smt handling is racy but that's fine as
> > - * it's only for optimization and self-correcting.
> > + * idle_smt handling is racy but that's fine as it's
> > + * only for optimization and self-correcting.
> > */
> > for_each_cpu(cpu, smt) {
> > - if (!cpumask_test_cpu(cpu, idle_masks.cpu))
> > + if (!cpumask_test_cpu(cpu, idle_cpu))
> > return;
> > }
>
> So, we continue only if all smt cpus are idle. This looks like an
> opencoded cpumask_subset():
>
> if (!cpumask_subset(idle_cpu, smt))
> return;
>
> Is that true? If so, can you please again make it a small
> preparation patch?
Good point: if all the CPUs in smt are idle (== smt is a subset of
idle_cpu), we want to set all the smt bits in idle_smt. So I think the
following should be correct:
if (!cpumask_subset(smt, idle_cpu))
return;
cpumask_or(idle_smt, idle_smt, smt);
Will test this & make a preparation patch.
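
In context, the idle branch of __scx_update_idle() would then look something
like this (just a sketch against this series; it sits in the existing
sched_smt_active() branch, with smt, idle_cpu and idle_smt as already defined
there):

if (idle) {
        /*
         * idle_smt handling is racy but that's fine as it's only
         * for optimization and self-correcting.
         *
         * Mark the whole core idle only if all of its siblings are
         * idle, i.e. @smt is a subset of the idle cpumask.
         */
        if (cpumask_subset(smt, idle_cpu))
                cpumask_or(idle_smt, idle_smt, smt);
} else {
        cpumask_andnot(idle_smt, idle_smt, smt);
}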
Thanks!
-Andrea
end of thread [~2024-12-20 17:53 UTC | newest]
Thread overview: 23+ messages
2024-12-17 9:32 [PATCHSET v7 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-17 9:32 ` [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
2024-12-17 21:57 ` Tejun Heo
2024-12-18 10:23 ` Andrea Righi
2024-12-18 16:04 ` Tejun Heo
2024-12-19 18:26 ` Yury Norov
2024-12-19 19:43 ` Andrea Righi
2024-12-19 19:52 ` Peter Zijlstra
2024-12-19 21:16 ` Andrea Righi
2024-12-17 9:32 ` [PATCH 2/6] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
2024-12-17 9:32 ` [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks Andrea Righi
2024-12-17 23:22 ` Tejun Heo
2024-12-18 10:21 ` Andrea Righi
2024-12-18 16:10 ` Tejun Heo
2024-12-18 16:18 ` Andrea Righi
2024-12-17 23:23 ` Tejun Heo
2024-12-20 16:48 ` Yury Norov
2024-12-20 17:52 ` Andrea Righi
2024-12-17 9:32 ` [PATCH 4/6] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-17 9:32 ` [PATCH 5/6] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-17 9:32 ` [PATCH 6/6] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
2024-12-20 14:53 ` Yury Norov
2024-12-20 14:58 ` Andrea Righi