* [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks
@ 2024-12-20 15:11 Andrea Righi
2024-12-20 15:11 ` [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
` (9 more replies)
0 siblings, 10 replies; 30+ messages in thread
From: Andrea Righi @ 2024-12-20 15:11 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, bpf, linux-kernel
= Overview =
As discussed during the sched_ext office hours, using a global cpumask to
track idle CPUs can be inefficient and may not scale well on large NUMA
systems.
Therefore, split the idle cpumask into multiple per-NUMA node cpumasks to
improve scalability and performance on such large systems.
Scalability issues seem to be more noticeable on Intel Sapphire Rapids
dual-socket architectures.
= Test =
Hardware:
- System: DGX B200
- CPUs: 224 SMT threads (112 physical cores)
- Processor: INTEL(R) XEON(R) PLATINUM 8570
- 2 NUMA nodes
Scheduler:
- scx_simple [1] (so that we can focus on the built-in idle selection
  policy rather than on the scheduling policy itself)
Test:
- Run a parallel kernel build `make -j $(nproc)` and measure the average
elapsed time over 10 runs:
avg time | stdev
---------+------
before: 52.431s | 2.895
after: 50.342s | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks helped to
achieve a speedup of approximately +4% with this particular architecture
and test case.
I've repeated the same test on a DGX-1 (40 physical cores, Intel Xeon
E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) and I didn't observe any measurable
difference.
In general, on smaller systems, I haven't noticed any measurable
regressions or improvements with the same test (parallel kernel build) and
scheduler (scx_simple).
Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs I
observed an additional +2-2.5% performance improvement in the same test.
NOTE: splitting the global cpumask into multiple cpumasks may increase the
overhead of scx_bpf_pick_idle_cpu() or ops.select_cpu() (for schedulers
relying on the built-in CPU idle selection policy) in the presence of
multiple NUMA nodes, particularly under high system load, since we may have
to access multiple cpumasks to find an idle CPU.
However, this increased overhead appears to be more than offset by the
lower overhead of updating the idle state (__scx_update_idle()) and by the
fact that CPUs are more likely to operate within their local idle cpumask,
reducing the stress on the cache coherency protocol.
= References =
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
ChangeLog v7 -> v8:
- patch set refactoring: move ext_idle.c as first patch and introduce more
preparation patches
- introduce SCX_PICK_IDLE_NODE to restrict idle CPU selection to a single
specified node
- trigger scx_ops_error() when the *_node() kfunc's are used without
  enabling SCX_OPS_NODE_BUILTIN_IDLE
- check for node_possible() in validate_node()
- do node validation in the kfunc's (instead of the internal kernel
  functions) and trigger scx_ops_error() in case of failure
- rename idle_masks -> scx_idle_masks
- drop unused CL_ALIGNED_IF_ONSTACK
- drop unnecessary rcu_read_lock/unlock() when iterating NUMA nodes
ChangeLog v6 -> v7:
- addressed some issues based on Yury's review (thanks!)
- introduced a new iterator to navigate the NUMA nodes in order of
increasing distance
ChangeLog v5 -> v6:
- refactor patch set to introduce SCX_OPS_NODE_BUILTIN_IDLE before
the per-node cpumasks
- move idle CPU selection policy to a separate file (ext_idle.c)
(no functional change, just some code shuffling)
ChangeLog v4 -> v5:
- introduce new scx_bpf_cpu_to_node() kfunc
- provide __COMPAT_*() helpers for the new kfunc's
ChangeLog v3 -> v4:
- introduce SCX_OPS_NODE_BUILTIN_IDLE to select multiple per-node
cpumasks or single flat cpumask
- introduce new kfuncs to access per-node idle cpumasks information
- use for_each_numa_hop_mask() to traverse NUMA nodes in increasing
distance
- dropped nodemask helpers (not needed anymore)
- rebase to sched_ext/for-6.14
ChangeLog v2 -> v3:
- introduce for_each_online_node_wrap()
- re-introduce cpumask_intersects() in test_and_clear_cpu_idle() (to
reduce memory writes / cache coherence pressure)
- get rid of the redundant scx_selcpu_topo_numa logic
[test results are pretty much identical, so I haven't updated them from v2]
ChangeLog v1 -> v2:
- renamed for_each_node_mask|state_from() -> for_each_node_mask|state_wrap()
- misc cpumask optimizations (thanks to Yury)
Andrea Righi (10):
sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
sched_ext: Move built-in idle CPU selection policy to a separate file
sched_ext: idle: introduce check_builtin_idle_enabled() helper
sched_ext: idle: use assign_cpu() to update the idle cpumask
sched_ext: idle: clarify comments
sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE
sched_ext: Introduce per-node idle cpumasks
sched_ext: idle: introduce SCX_PICK_IDLE_NODE
sched_ext: idle: Get rid of the scx_selcpu_topo_numa logic
sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers
MAINTAINERS | 1 +
include/linux/topology.h | 28 +-
kernel/sched/ext.c | 727 ++----------------------
kernel/sched/ext_idle.c | 931 +++++++++++++++++++++++++++++++
kernel/sched/topology.c | 49 ++
tools/sched_ext/include/scx/common.bpf.h | 4 +
tools/sched_ext/include/scx/compat.bpf.h | 19 +
7 files changed, 1068 insertions(+), 691 deletions(-)
create mode 100644 kernel/sched/ext_idle.c
* [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
@ 2024-12-20 15:11 ` Andrea Righi
2024-12-23 21:18 ` Yury Norov
2024-12-20 15:11 ` [PATCH 02/10] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
` (8 subsequent siblings)
9 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2024-12-20 15:11 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, bpf, linux-kernel
Introduce for_each_numa_hop_node() and sched_numa_hop_node() to iterate
over node IDs in order of increasing NUMA distance from a given starting
node.
These iterator functions are similar to for_each_numa_hop_mask() and
sched_numa_hop_mask(), but instead of providing a cpumask at each
iteration, they provide a node ID.
Example usage:
nodemask_t hop_nodes = NODE_MASK_NONE;
int start = cpu_to_node(smp_processor_id());
for_each_numa_hop_node(node, start, hop_nodes, N_ONLINE)
pr_info("node (%d, %d) -> %d\n",
        start, node, node_distance(start, node));
Simulating the following NUMA topology in virtme-ng:
$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1
node 0 size: 1006 MB
node 0 free: 928 MB
node 1 cpus: 2 3
node 1 size: 1007 MB
node 1 free: 986 MB
node 2 cpus: 4 5
node 2 size: 889 MB
node 2 free: 862 MB
node 3 cpus: 6 7
node 3 size: 1006 MB
node 3 free: 983 MB
node distances:
node 0 1 2 3
0: 10 51 31 41
1: 51 10 21 61
2: 31 21 10 11
3: 41 61 11 10
The output of the example above (on node 0) is the following:
[ 84.953644] node (0, 0) -> 10
[ 84.953712] node (0, 2) -> 31
[ 84.953764] node (0, 3) -> 41
[ 84.953817] node (0, 1) -> 51
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
include/linux/topology.h | 28 ++++++++++++++++++++++-
kernel/sched/topology.c | 49 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 76 insertions(+), 1 deletion(-)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 52f5850730b3..d9014d90580d 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -248,12 +248,18 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
#ifdef CONFIG_NUMA
int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops);
-#else
+extern int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state);
+#else /* !CONFIG_NUMA */
static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
{
return cpumask_nth_and(cpu, cpus, cpu_online_mask);
}
+static inline int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
+{
+ return NUMA_NO_NODE;
+}
+
static inline const struct cpumask *
sched_numa_hop_mask(unsigned int node, unsigned int hops)
{
@@ -261,6 +267,26 @@ sched_numa_hop_mask(unsigned int node, unsigned int hops)
}
#endif /* CONFIG_NUMA */
+/**
+ * for_each_numa_hop_node - iterate over NUMA nodes at increasing hop distances
+ * from a given starting node.
+ * @__node: the iteration variable, representing the current NUMA node.
+ * @__start: the NUMA node to start the iteration from.
+ * @__hop_nodes: a nodemask_t to track the visited nodes.
+ * @__state: state of NUMA nodes to iterate.
+ *
+ * Requires rcu_lock to be held.
+ *
+ * This macro iterates over NUMA nodes in increasing distance from
+ * @start_node.
+ *
+ * Yields NUMA_NO_NODE when all the nodes have been visited.
+ */
+#define for_each_numa_hop_node(__node, __start, __hop_nodes, __state) \
+ for (int __node = __start; \
+ __node != NUMA_NO_NODE; \
+ __node = sched_numa_hop_node(&(__hop_nodes), __start, __state))
+
/**
* for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance
* from a given node.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..8e77c235ad9a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2185,6 +2185,55 @@ int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
}
EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
+/**
+ * sched_numa_hop_node - Find the NUMA node at the closest hop distance
+ * from node @start.
+ *
+ * @hop_nodes: a pointer to a nodemask_t representing the visited nodes.
+ * @start: the NUMA node to start the hop search from.
+ * @state: the node state to filter nodes by.
+ *
+ * This function iterates over all NUMA nodes in the given state and
+ * calculates the hop distance to the starting node. It returns the NUMA
+ * node that is the closest in terms of hop distance
+ * that has not already been considered (not set in @hop_nodes). If the
+ * node is found, it is marked as considered in the @hop_nodes bitmask.
+ *
+ * The function checks if the node is not the start node and ensures it is
+ * not already part of the hop_nodes set. It then computes the distance to
+ * the start node using the node_distance() function. The closest node is
+ * chosen based on the minimum distance.
+ *
+ * Returns the NUMA node ID closest in terms of hop distance from the
+ * @start node, or NUMA_NO_NODE if no node is found (or all nodes have been
+ * visited).
+ */
+int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
+{
+ int dist, n, min_node, min_dist;
+
+ if (state >= NR_NODE_STATES)
+ return NUMA_NO_NODE;
+
+ min_node = NUMA_NO_NODE;
+ min_dist = INT_MAX;
+
+ for_each_node_state(n, state) {
+ if (n == start || node_isset(n, *hop_nodes))
+ continue;
+ dist = node_distance(start, n);
+ if (dist < min_dist) {
+ min_dist = dist;
+ min_node = n;
+ }
+ }
+ if (min_node != NUMA_NO_NODE)
+ node_set(min_node, *hop_nodes);
+
+ return min_node;
+}
+EXPORT_SYMBOL_GPL(sched_numa_hop_node);
+
/**
* sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away from
* @node
--
2.47.1
* [PATCH 02/10] sched_ext: Move built-in idle CPU selection policy to a separate file
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-20 15:11 ` [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
@ 2024-12-20 15:11 ` Andrea Righi
2024-12-24 21:21 ` Tejun Heo
2024-12-20 15:11 ` [PATCH 03/10] sched_ext: idle: introduce check_builtin_idle_enabled() helper Andrea Righi
` (7 subsequent siblings)
9 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2024-12-20 15:11 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, bpf, linux-kernel
As ext.c is becoming quite large, move the idle CPU selection policy
to a separate file (ext_idle.c) for better code readability.
Also group all the idle CPU selection kfunc's into the same
btf_kfunc_id_set block.
No functional change, this is purely code reorganization.
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
MAINTAINERS | 1 +
kernel/sched/ext.c | 702 +---------------------------------------
kernel/sched/ext_idle.c | 686 +++++++++++++++++++++++++++++++++++++++
3 files changed, 703 insertions(+), 686 deletions(-)
create mode 100644 kernel/sched/ext_idle.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 1e930c7a58b1..02960d1b9ee9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20909,6 +20909,7 @@ T: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git
F: include/linux/sched/ext.h
F: kernel/sched/ext.h
F: kernel/sched/ext.c
+F: kernel/sched/ext_idle.c
F: tools/sched_ext/
F: tools/testing/selftests/sched_ext
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 54e659ba9476..769e43fdea1e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6,6 +6,8 @@
* Copyright (c) 2022 Tejun Heo <tj@kernel.org>
* Copyright (c) 2022 David Vernet <dvernet@meta.com>
*/
+#include <linux/btf_ids.h>
+
#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
enum scx_consts {
@@ -882,12 +884,6 @@ static bool scx_warned_zero_slice;
static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last);
static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting);
static DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt);
-static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
-
-#ifdef CONFIG_SMP
-static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
-static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
-#endif
static struct static_key_false scx_has_op[SCX_OPI_END] =
{ [0 ... SCX_OPI_END-1] = STATIC_KEY_FALSE_INIT };
@@ -895,6 +891,17 @@ static struct static_key_false scx_has_op[SCX_OPI_END] =
static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE);
static struct scx_exit_info *scx_exit_info;
+#define scx_ops_error_kind(err, fmt, args...) \
+ scx_ops_exit_kind((err), 0, fmt, ##args)
+
+#define scx_ops_exit(code, fmt, args...) \
+ scx_ops_exit_kind(SCX_EXIT_UNREG_KERN, (code), fmt, ##args)
+
+#define scx_ops_error(fmt, args...) \
+ scx_ops_error_kind(SCX_EXIT_ERROR, fmt, ##args)
+
+#define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)])
+
static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0);
static atomic_long_t scx_hotplug_seq = ATOMIC_LONG_INIT(0);
@@ -922,21 +929,6 @@ static unsigned long scx_watchdog_timestamp = INITIAL_JIFFIES;
static struct delayed_work scx_watchdog_work;
-/* idle tracking */
-#ifdef CONFIG_SMP
-#ifdef CONFIG_CPUMASK_OFFSTACK
-#define CL_ALIGNED_IF_ONSTACK
-#else
-#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
-#endif
-
-static struct {
- cpumask_var_t cpu;
- cpumask_var_t smt;
-} idle_masks CL_ALIGNED_IF_ONSTACK;
-
-#endif /* CONFIG_SMP */
-
/* for %SCX_KICK_WAIT */
static unsigned long __percpu *scx_kick_cpus_pnt_seqs;
@@ -1023,17 +1015,6 @@ static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
s64 exit_code,
const char *fmt, ...);
-#define scx_ops_error_kind(err, fmt, args...) \
- scx_ops_exit_kind((err), 0, fmt, ##args)
-
-#define scx_ops_exit(code, fmt, args...) \
- scx_ops_exit_kind(SCX_EXIT_UNREG_KERN, (code), fmt, ##args)
-
-#define scx_ops_error(fmt, args...) \
- scx_ops_error_kind(SCX_EXIT_ERROR, fmt, ##args)
-
-#define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)])
-
static long jiffies_delta_msecs(unsigned long at, unsigned long now)
{
if (time_after(at, now))
@@ -1540,6 +1521,9 @@ static int ops_sanitize_err(const char *ops_name, s32 err)
return -EPROTO;
}
+/* Built-in idle CPU selection policy */
+#include "ext_idle.c"
+
static void run_deferred(struct rq *rq)
{
process_ddsp_deferred_locals(rq);
@@ -3164,410 +3148,6 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
#ifdef CONFIG_SMP
-static bool test_and_clear_cpu_idle(int cpu)
-{
-#ifdef CONFIG_SCHED_SMT
- /*
- * SMT mask should be cleared whether we can claim @cpu or not. The SMT
- * cluster is not wholly idle either way. This also prevents
- * scx_pick_idle_cpu() from getting caught in an infinite loop.
- */
- if (sched_smt_active()) {
- const struct cpumask *smt = cpu_smt_mask(cpu);
-
- /*
- * If offline, @cpu is not its own sibling and
- * scx_pick_idle_cpu() can get caught in an infinite loop as
- * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
- * is eventually cleared.
- */
- if (cpumask_intersects(smt, idle_masks.smt))
- cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
- else if (cpumask_test_cpu(cpu, idle_masks.smt))
- __cpumask_clear_cpu(cpu, idle_masks.smt);
- }
-#endif
- return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
-}
-
-static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
-{
- int cpu;
-
-retry:
- if (sched_smt_active()) {
- cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
- if (cpu < nr_cpu_ids)
- goto found;
-
- if (flags & SCX_PICK_IDLE_CORE)
- return -EBUSY;
- }
-
- cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
- if (cpu >= nr_cpu_ids)
- return -EBUSY;
-
-found:
- if (test_and_clear_cpu_idle(cpu))
- return cpu;
- else
- goto retry;
-}
-
-/*
- * Return the amount of CPUs in the same LLC domain of @cpu (or zero if the LLC
- * domain is not defined).
- */
-static unsigned int llc_weight(s32 cpu)
-{
- struct sched_domain *sd;
-
- sd = rcu_dereference(per_cpu(sd_llc, cpu));
- if (!sd)
- return 0;
-
- return sd->span_weight;
-}
-
-/*
- * Return the cpumask representing the LLC domain of @cpu (or NULL if the LLC
- * domain is not defined).
- */
-static struct cpumask *llc_span(s32 cpu)
-{
- struct sched_domain *sd;
-
- sd = rcu_dereference(per_cpu(sd_llc, cpu));
- if (!sd)
- return 0;
-
- return sched_domain_span(sd);
-}
-
-/*
- * Return the amount of CPUs in the same NUMA domain of @cpu (or zero if the
- * NUMA domain is not defined).
- */
-static unsigned int numa_weight(s32 cpu)
-{
- struct sched_domain *sd;
- struct sched_group *sg;
-
- sd = rcu_dereference(per_cpu(sd_numa, cpu));
- if (!sd)
- return 0;
- sg = sd->groups;
- if (!sg)
- return 0;
-
- return sg->group_weight;
-}
-
-/*
- * Return the cpumask representing the NUMA domain of @cpu (or NULL if the NUMA
- * domain is not defined).
- */
-static struct cpumask *numa_span(s32 cpu)
-{
- struct sched_domain *sd;
- struct sched_group *sg;
-
- sd = rcu_dereference(per_cpu(sd_numa, cpu));
- if (!sd)
- return NULL;
- sg = sd->groups;
- if (!sg)
- return NULL;
-
- return sched_group_span(sg);
-}
-
-/*
- * Return true if the LLC domains do not perfectly overlap with the NUMA
- * domains, false otherwise.
- */
-static bool llc_numa_mismatch(void)
-{
- int cpu;
-
- /*
- * We need to scan all online CPUs to verify whether their scheduling
- * domains overlap.
- *
- * While it is rare to encounter architectures with asymmetric NUMA
- * topologies, CPU hotplugging or virtualized environments can result
- * in asymmetric configurations.
- *
- * For example:
- *
- * NUMA 0:
- * - LLC 0: cpu0..cpu7
- * - LLC 1: cpu8..cpu15 [offline]
- *
- * NUMA 1:
- * - LLC 0: cpu16..cpu23
- * - LLC 1: cpu24..cpu31
- *
- * In this case, if we only check the first online CPU (cpu0), we might
- * incorrectly assume that the LLC and NUMA domains are fully
- * overlapping, which is incorrect (as NUMA 1 has two distinct LLC
- * domains).
- */
- for_each_online_cpu(cpu)
- if (llc_weight(cpu) != numa_weight(cpu))
- return true;
-
- return false;
-}
-
-/*
- * Initialize topology-aware scheduling.
- *
- * Detect if the system has multiple LLC or multiple NUMA domains and enable
- * cache-aware / NUMA-aware scheduling optimizations in the default CPU idle
- * selection policy.
- *
- * Assumption: the kernel's internal topology representation assumes that each
- * CPU belongs to a single LLC domain, and that each LLC domain is entirely
- * contained within a single NUMA node.
- */
-static void update_selcpu_topology(void)
-{
- bool enable_llc = false, enable_numa = false;
- unsigned int nr_cpus;
- s32 cpu = cpumask_first(cpu_online_mask);
-
- /*
- * Enable LLC domain optimization only when there are multiple LLC
- * domains among the online CPUs. If all online CPUs are part of a
- * single LLC domain, the idle CPU selection logic can choose any
- * online CPU without bias.
- *
- * Note that it is sufficient to check the LLC domain of the first
- * online CPU to determine whether a single LLC domain includes all
- * CPUs.
- */
- rcu_read_lock();
- nr_cpus = llc_weight(cpu);
- if (nr_cpus > 0) {
- if (nr_cpus < num_online_cpus())
- enable_llc = true;
- pr_debug("sched_ext: LLC=%*pb weight=%u\n",
- cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
- }
-
- /*
- * Enable NUMA optimization only when there are multiple NUMA domains
- * among the online CPUs and the NUMA domains don't perfectly overlaps
- * with the LLC domains.
- *
- * If all CPUs belong to the same NUMA node and the same LLC domain,
- * enabling both NUMA and LLC optimizations is unnecessary, as checking
- * for an idle CPU in the same domain twice is redundant.
- */
- nr_cpus = numa_weight(cpu);
- if (nr_cpus > 0) {
- if (nr_cpus < num_online_cpus() && llc_numa_mismatch())
- enable_numa = true;
- pr_debug("sched_ext: NUMA=%*pb weight=%u\n",
- cpumask_pr_args(numa_span(cpu)), numa_weight(cpu));
- }
- rcu_read_unlock();
-
- pr_debug("sched_ext: LLC idle selection %s\n",
- enable_llc ? "enabled" : "disabled");
- pr_debug("sched_ext: NUMA idle selection %s\n",
- enable_numa ? "enabled" : "disabled");
-
- if (enable_llc)
- static_branch_enable_cpuslocked(&scx_selcpu_topo_llc);
- else
- static_branch_disable_cpuslocked(&scx_selcpu_topo_llc);
- if (enable_numa)
- static_branch_enable_cpuslocked(&scx_selcpu_topo_numa);
- else
- static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
-}
-
-/*
- * Built-in CPU idle selection policy:
- *
- * 1. Prioritize full-idle cores:
- * - always prioritize CPUs from fully idle cores (both logical CPUs are
- * idle) to avoid interference caused by SMT.
- *
- * 2. Reuse the same CPU:
- * - prefer the last used CPU to take advantage of cached data (L1, L2) and
- * branch prediction optimizations.
- *
- * 3. Pick a CPU within the same LLC (Last-Level Cache):
- * - if the above conditions aren't met, pick a CPU that shares the same LLC
- * to maintain cache locality.
- *
- * 4. Pick a CPU within the same NUMA node, if enabled:
- * - choose a CPU from the same NUMA node to reduce memory access latency.
- *
- * Step 3 and 4 are performed only if the system has, respectively, multiple
- * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
- * scx_selcpu_topo_numa).
- *
- * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
- * we never call ops.select_cpu() for them, see select_task_rq().
- */
-static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
- u64 wake_flags, bool *found)
-{
- const struct cpumask *llc_cpus = NULL;
- const struct cpumask *numa_cpus = NULL;
- s32 cpu;
-
- *found = false;
-
- /*
- * This is necessary to protect llc_cpus.
- */
- rcu_read_lock();
-
- /*
- * Determine the scheduling domain only if the task is allowed to run
- * on all CPUs.
- *
- * This is done primarily for efficiency, as it avoids the overhead of
- * updating a cpumask every time we need to select an idle CPU (which
- * can be costly in large SMP systems), but it also aligns logically:
- * if a task's scheduling domain is restricted by user-space (through
- * CPU affinity), the task will simply use the flat scheduling domain
- * defined by user-space.
- */
- if (p->nr_cpus_allowed >= num_possible_cpus()) {
- if (static_branch_maybe(CONFIG_NUMA, &scx_selcpu_topo_numa))
- numa_cpus = numa_span(prev_cpu);
-
- if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
- llc_cpus = llc_span(prev_cpu);
- }
-
- /*
- * If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
- */
- if (wake_flags & SCX_WAKE_SYNC) {
- cpu = smp_processor_id();
-
- /*
- * If the waker's CPU is cache affine and prev_cpu is idle,
- * then avoid a migration.
- */
- if (cpus_share_cache(cpu, prev_cpu) &&
- test_and_clear_cpu_idle(prev_cpu)) {
- cpu = prev_cpu;
- goto cpu_found;
- }
-
- /*
- * If the waker's local DSQ is empty, and the system is under
- * utilized, try to wake up @p to the local DSQ of the waker.
- *
- * Checking only for an empty local DSQ is insufficient as it
- * could give the wakee an unfair advantage when the system is
- * oversaturated.
- *
- * Checking only for the presence of idle CPUs is also
- * insufficient as the local DSQ of the waker could have tasks
- * piled up on it even if there is an idle core elsewhere on
- * the system.
- */
- if (!cpumask_empty(idle_masks.cpu) &&
- !(current->flags & PF_EXITING) &&
- cpu_rq(cpu)->scx.local_dsq.nr == 0) {
- if (cpumask_test_cpu(cpu, p->cpus_ptr))
- goto cpu_found;
- }
- }
-
- /*
- * If CPU has SMT, any wholly idle CPU is likely a better pick than
- * partially idle @prev_cpu.
- */
- if (sched_smt_active()) {
- /*
- * Keep using @prev_cpu if it's part of a fully idle core.
- */
- if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
- test_and_clear_cpu_idle(prev_cpu)) {
- cpu = prev_cpu;
- goto cpu_found;
- }
-
- /*
- * Search for any fully idle core in the same LLC domain.
- */
- if (llc_cpus) {
- cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
- if (cpu >= 0)
- goto cpu_found;
- }
-
- /*
- * Search for any fully idle core in the same NUMA node.
- */
- if (numa_cpus) {
- cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
- if (cpu >= 0)
- goto cpu_found;
- }
-
- /*
- * Search for any full idle core usable by the task.
- */
- cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
- if (cpu >= 0)
- goto cpu_found;
- }
-
- /*
- * Use @prev_cpu if it's idle.
- */
- if (test_and_clear_cpu_idle(prev_cpu)) {
- cpu = prev_cpu;
- goto cpu_found;
- }
-
- /*
- * Search for any idle CPU in the same LLC domain.
- */
- if (llc_cpus) {
- cpu = scx_pick_idle_cpu(llc_cpus, 0);
- if (cpu >= 0)
- goto cpu_found;
- }
-
- /*
- * Search for any idle CPU in the same NUMA node.
- */
- if (numa_cpus) {
- cpu = scx_pick_idle_cpu(numa_cpus, 0);
- if (cpu >= 0)
- goto cpu_found;
- }
-
- /*
- * Search for any idle CPU usable by the task.
- */
- cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
- if (cpu >= 0)
- goto cpu_found;
-
- rcu_read_unlock();
- return prev_cpu;
-
-cpu_found:
- rcu_read_unlock();
-
- *found = true;
- return cpu;
-}
-
static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags)
{
/*
@@ -3634,52 +3214,6 @@ static void set_cpus_allowed_scx(struct task_struct *p,
(struct cpumask *)p->cpus_ptr);
}
-static void reset_idle_masks(void)
-{
- /*
- * Consider all online cpus idle. Should converge to the actual state
- * quickly.
- */
- cpumask_copy(idle_masks.cpu, cpu_online_mask);
- cpumask_copy(idle_masks.smt, cpu_online_mask);
-}
-
-void __scx_update_idle(struct rq *rq, bool idle)
-{
- int cpu = cpu_of(rq);
-
- if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
- SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
- if (!static_branch_unlikely(&scx_builtin_idle_enabled))
- return;
- }
-
- if (idle)
- cpumask_set_cpu(cpu, idle_masks.cpu);
- else
- cpumask_clear_cpu(cpu, idle_masks.cpu);
-
-#ifdef CONFIG_SCHED_SMT
- if (sched_smt_active()) {
- const struct cpumask *smt = cpu_smt_mask(cpu);
-
- if (idle) {
- /*
- * idle_masks.smt handling is racy but that's fine as
- * it's only for optimization and self-correcting.
- */
- for_each_cpu(cpu, smt) {
- if (!cpumask_test_cpu(cpu, idle_masks.cpu))
- return;
- }
- cpumask_or(idle_masks.smt, idle_masks.smt, smt);
- } else {
- cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
- }
- }
-#endif
-}
-
static void handle_hotplug(struct rq *rq, bool online)
{
int cpu = cpu_of(rq);
@@ -3719,12 +3253,6 @@ static void rq_offline_scx(struct rq *rq)
rq->scx.flags &= ~SCX_RQ_ONLINE;
}
-#else /* CONFIG_SMP */
-
-static bool test_and_clear_cpu_idle(int cpu) { return false; }
-static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; }
-static void reset_idle_masks(void) {}
-
#endif /* CONFIG_SMP */
static bool check_rq_for_timeouts(struct rq *rq)
@@ -6290,55 +5818,6 @@ void __init init_sched_ext_class(void)
/********************************************************************************
* Helpers that can be called from the BPF scheduler.
*/
-#include <linux/btf_ids.h>
-
-__bpf_kfunc_start_defs();
-
-/**
- * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
- * @p: task_struct to select a CPU for
- * @prev_cpu: CPU @p was on previously
- * @wake_flags: %SCX_WAKE_* flags
- * @is_idle: out parameter indicating whether the returned CPU is idle
- *
- * Can only be called from ops.select_cpu() if the built-in CPU selection is
- * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set.
- * @p, @prev_cpu and @wake_flags match ops.select_cpu().
- *
- * Returns the picked CPU with *@is_idle indicating whether the picked CPU is
- * currently idle and thus a good candidate for direct dispatching.
- */
-__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
- u64 wake_flags, bool *is_idle)
-{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
- goto prev_cpu;
- }
-
- if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
- goto prev_cpu;
-
-#ifdef CONFIG_SMP
- return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
-#endif
-
-prev_cpu:
- *is_idle = false;
- return prev_cpu;
-}
-
-__bpf_kfunc_end_defs();
-
-BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
-BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
-BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
-
-static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
- .owner = THIS_MODULE,
- .set = &scx_kfunc_ids_select_cpu,
-};
-
static bool scx_dsq_insert_preamble(struct task_struct *p, u64 enq_flags)
{
if (!scx_kf_allowed(SCX_KF_ENQUEUE | SCX_KF_DISPATCH))
@@ -7400,149 +6879,6 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
*/
}
-/**
- * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
- * per-CPU cpumask.
- *
- * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
- */
-__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
-{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
- return cpu_none_mask;
- }
-
-#ifdef CONFIG_SMP
- return idle_masks.cpu;
-#else
- return cpu_none_mask;
-#endif
-}
-
-/**
- * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
- * per-physical-core cpumask. Can be used to determine if an entire physical
- * core is free.
- *
- * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
- */
-__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
-{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
- return cpu_none_mask;
- }
-
-#ifdef CONFIG_SMP
- if (sched_smt_active())
- return idle_masks.smt;
- else
- return idle_masks.cpu;
-#else
- return cpu_none_mask;
-#endif
-}
-
-/**
- * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to
- * either the percpu, or SMT idle-tracking cpumask.
- */
-__bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
-{
- /*
- * Empty function body because we aren't actually acquiring or releasing
- * a reference to a global idle cpumask, which is read-only in the
- * caller and is never released. The acquire / release semantics here
- * are just used to make the cpumask a trusted pointer in the caller.
- */
-}
-
-/**
- * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state
- * @cpu: cpu to test and clear idle for
- *
- * Returns %true if @cpu was idle and its idle state was successfully cleared.
- * %false otherwise.
- *
- * Unavailable if ops.update_idle() is implemented and
- * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
- */
-__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
-{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
- return false;
- }
-
- if (ops_cpu_valid(cpu, NULL))
- return test_and_clear_cpu_idle(cpu);
- else
- return false;
-}
-
-/**
- * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
- * @cpus_allowed: Allowed cpumask
- * @flags: %SCX_PICK_IDLE_CPU_* flags
- *
- * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu
- * number on success. -%EBUSY if no matching cpu was found.
- *
- * Idle CPU tracking may race against CPU scheduling state transitions. For
- * example, this function may return -%EBUSY as CPUs are transitioning into the
- * idle state. If the caller then assumes that there will be dispatch events on
- * the CPUs as they were all busy, the scheduler may end up stalling with CPUs
- * idling while there are pending tasks. Use scx_bpf_pick_any_cpu() and
- * scx_bpf_kick_cpu() to guarantee that there will be at least one dispatch
- * event in the near future.
- *
- * Unavailable if ops.update_idle() is implemented and
- * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
- */
-__bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
- u64 flags)
-{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
- return -EBUSY;
- }
-
- return scx_pick_idle_cpu(cpus_allowed, flags);
-}
-
-/**
- * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU
- * @cpus_allowed: Allowed cpumask
- * @flags: %SCX_PICK_IDLE_CPU_* flags
- *
- * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any
- * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu
- * number if @cpus_allowed is not empty. -%EBUSY is returned if @cpus_allowed is
- * empty.
- *
- * If ops.update_idle() is implemented and %SCX_OPS_KEEP_BUILTIN_IDLE is not
- * set, this function can't tell which CPUs are idle and will always pick any
- * CPU.
- */
-__bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
- u64 flags)
-{
- s32 cpu;
-
- if (static_branch_likely(&scx_builtin_idle_enabled)) {
- cpu = scx_pick_idle_cpu(cpus_allowed, flags);
- if (cpu >= 0)
- return cpu;
- }
-
- cpu = cpumask_any_distribute(cpus_allowed);
- if (cpu < nr_cpu_ids)
- return cpu;
- else
- return -EBUSY;
-}
-
/**
* scx_bpf_task_running - Is task currently running?
* @p: task of interest
@@ -7620,12 +6956,6 @@ BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
-BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
-BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
-BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
-BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
-BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
-BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
new file mode 100644
index 000000000000..9e8479dd7277
--- /dev/null
+++ b/kernel/sched/ext_idle.c
@@ -0,0 +1,686 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Built-in idle CPU tracking policy.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
+ * Copyright (c) 2022 David Vernet <dvernet@meta.com>
+ * Copyright (c) 2024 Andrea Righi <arighi@nvidia.com>
+ */
+
+static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
+
+#ifdef CONFIG_SMP
+#ifdef CONFIG_CPUMASK_OFFSTACK
+#define CL_ALIGNED_IF_ONSTACK
+#else
+#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
+#endif
+
+static struct {
+ cpumask_var_t cpu;
+ cpumask_var_t smt;
+} idle_masks CL_ALIGNED_IF_ONSTACK;
+
+static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
+static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
+
+static bool test_and_clear_cpu_idle(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+ /*
+ * SMT mask should be cleared whether we can claim @cpu or not. The SMT
+ * cluster is not wholly idle either way. This also prevents
+ * scx_pick_idle_cpu() from getting caught in an infinite loop.
+ */
+ if (sched_smt_active()) {
+ const struct cpumask *smt = cpu_smt_mask(cpu);
+
+ /*
+ * If offline, @cpu is not its own sibling and
+ * scx_pick_idle_cpu() can get caught in an infinite loop as
+ * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
+ * is eventually cleared.
+ */
+ if (cpumask_intersects(smt, idle_masks.smt))
+ cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
+ else if (cpumask_test_cpu(cpu, idle_masks.smt))
+ __cpumask_clear_cpu(cpu, idle_masks.smt);
+ }
+#endif
+ return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
+}
+
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
+{
+ int cpu;
+
+retry:
+ if (sched_smt_active()) {
+ cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
+ if (cpu < nr_cpu_ids)
+ goto found;
+
+ if (flags & SCX_PICK_IDLE_CORE)
+ return -EBUSY;
+ }
+
+ cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
+ if (cpu >= nr_cpu_ids)
+ return -EBUSY;
+
+found:
+ if (test_and_clear_cpu_idle(cpu))
+ return cpu;
+ else
+ goto retry;
+}
+
+/*
+ * Return the number of CPUs in the same LLC domain as @cpu (or zero if the
+ * LLC domain is not defined).
+ */
+static unsigned int llc_weight(s32 cpu)
+{
+ struct sched_domain *sd;
+
+ sd = rcu_dereference(per_cpu(sd_llc, cpu));
+ if (!sd)
+ return 0;
+
+ return sd->span_weight;
+}
+
+/*
+ * Return the cpumask representing the LLC domain of @cpu (or NULL if the LLC
+ * domain is not defined).
+ */
+static struct cpumask *llc_span(s32 cpu)
+{
+ struct sched_domain *sd;
+
+ sd = rcu_dereference(per_cpu(sd_llc, cpu));
+ if (!sd)
+		return NULL;
+
+ return sched_domain_span(sd);
+}
+
+/*
+ * Return the number of CPUs in the same NUMA domain as @cpu (or zero if the
+ * NUMA domain is not defined).
+ */
+static unsigned int numa_weight(s32 cpu)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg;
+
+ sd = rcu_dereference(per_cpu(sd_numa, cpu));
+ if (!sd)
+ return 0;
+ sg = sd->groups;
+ if (!sg)
+ return 0;
+
+ return sg->group_weight;
+}
+
+/*
+ * Return the cpumask representing the NUMA domain of @cpu (or NULL if the NUMA
+ * domain is not defined).
+ */
+static struct cpumask *numa_span(s32 cpu)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg;
+
+ sd = rcu_dereference(per_cpu(sd_numa, cpu));
+ if (!sd)
+ return NULL;
+ sg = sd->groups;
+ if (!sg)
+ return NULL;
+
+ return sched_group_span(sg);
+}
+
+/*
+ * Return true if the LLC domains do not perfectly overlap with the NUMA
+ * domains, false otherwise.
+ */
+static bool llc_numa_mismatch(void)
+{
+ int cpu;
+
+ /*
+ * We need to scan all online CPUs to verify whether their scheduling
+ * domains overlap.
+ *
+ * While it is rare to encounter architectures with asymmetric NUMA
+ * topologies, CPU hotplugging or virtualized environments can result
+ * in asymmetric configurations.
+ *
+ * For example:
+ *
+ * NUMA 0:
+ * - LLC 0: cpu0..cpu7
+ * - LLC 1: cpu8..cpu15 [offline]
+ *
+ * NUMA 1:
+ * - LLC 0: cpu16..cpu23
+ * - LLC 1: cpu24..cpu31
+ *
+	 * In this case, if we only checked the first online CPU (cpu0), we
+	 * might incorrectly conclude that the LLC and NUMA domains fully
+	 * overlap, when in fact NUMA 1 spans two distinct LLC domains.
+ */
+ for_each_online_cpu(cpu)
+ if (llc_weight(cpu) != numa_weight(cpu))
+ return true;
+
+ return false;
+}
+
+/*
+ * Initialize topology-aware scheduling.
+ *
+ * Detect if the system has multiple LLC or multiple NUMA domains and enable
+ * cache-aware / NUMA-aware scheduling optimizations in the default CPU idle
+ * selection policy.
+ *
+ * Assumption: each CPU belongs to a single LLC domain, and each LLC domain
+ * is entirely contained within a single NUMA node (this matches the
+ * kernel's internal topology representation).
+ */
+static void update_selcpu_topology(void)
+{
+ bool enable_llc = false, enable_numa = false;
+ unsigned int nr_cpus;
+ s32 cpu = cpumask_first(cpu_online_mask);
+
+ /*
+ * Enable LLC domain optimization only when there are multiple LLC
+ * domains among the online CPUs. If all online CPUs are part of a
+ * single LLC domain, the idle CPU selection logic can choose any
+ * online CPU without bias.
+ *
+ * Note that it is sufficient to check the LLC domain of the first
+ * online CPU to determine whether a single LLC domain includes all
+ * CPUs.
+ */
+ rcu_read_lock();
+ nr_cpus = llc_weight(cpu);
+ if (nr_cpus > 0) {
+ if (nr_cpus < num_online_cpus())
+ enable_llc = true;
+ pr_debug("sched_ext: LLC=%*pb weight=%u\n",
+ cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
+ }
+
+ /*
+ * Enable NUMA optimization only when there are multiple NUMA domains
+	 * among the online CPUs and the NUMA domains don't perfectly overlap
+ * with the LLC domains.
+ *
+ * If all CPUs belong to the same NUMA node and the same LLC domain,
+ * enabling both NUMA and LLC optimizations is unnecessary, as checking
+ * for an idle CPU in the same domain twice is redundant.
+ */
+ nr_cpus = numa_weight(cpu);
+ if (nr_cpus > 0) {
+ if (nr_cpus < num_online_cpus() && llc_numa_mismatch())
+ enable_numa = true;
+ pr_debug("sched_ext: NUMA=%*pb weight=%u\n",
+ cpumask_pr_args(numa_span(cpu)), numa_weight(cpu));
+ }
+ rcu_read_unlock();
+
+ pr_debug("sched_ext: LLC idle selection %s\n",
+ enable_llc ? "enabled" : "disabled");
+ pr_debug("sched_ext: NUMA idle selection %s\n",
+ enable_numa ? "enabled" : "disabled");
+
+ if (enable_llc)
+ static_branch_enable_cpuslocked(&scx_selcpu_topo_llc);
+ else
+ static_branch_disable_cpuslocked(&scx_selcpu_topo_llc);
+ if (enable_numa)
+ static_branch_enable_cpuslocked(&scx_selcpu_topo_numa);
+ else
+ static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
+}
+
+/*
+ * Built-in CPU idle selection policy:
+ *
+ * 1. Prioritize full-idle cores:
+ * - always prioritize CPUs from fully idle cores (both logical CPUs are
+ * idle) to avoid interference caused by SMT.
+ *
+ * 2. Reuse the same CPU:
+ * - prefer the last used CPU to take advantage of cached data (L1, L2) and
+ * branch prediction optimizations.
+ *
+ * 3. Pick a CPU within the same LLC (Last-Level Cache):
+ * - if the above conditions aren't met, pick a CPU that shares the same LLC
+ * to maintain cache locality.
+ *
+ * 4. Pick a CPU within the same NUMA node, if enabled:
+ * - choose a CPU from the same NUMA node to reduce memory access latency.
+ *
+ * Step 3 and 4 are performed only if the system has, respectively, multiple
+ * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
+ * scx_selcpu_topo_numa).
+ *
+ * NOTE: tasks that can only run on 1 CPU are excluded from this logic, because
+ * we never call ops.select_cpu() for them, see select_task_rq().
+ */
+static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
+ u64 wake_flags, bool *found)
+{
+ const struct cpumask *llc_cpus = NULL;
+ const struct cpumask *numa_cpus = NULL;
+ s32 cpu;
+
+ *found = false;
+
+ /*
+	 * Acquire the RCU read lock to protect llc_cpus and numa_cpus.
+ */
+ rcu_read_lock();
+
+ /*
+ * Determine the scheduling domain only if the task is allowed to run
+ * on all CPUs.
+ *
+ * This is done primarily for efficiency, as it avoids the overhead of
+ * updating a cpumask every time we need to select an idle CPU (which
+ * can be costly in large SMP systems), but it also aligns logically:
+ * if a task's scheduling domain is restricted by user-space (through
+ * CPU affinity), the task will simply use the flat scheduling domain
+ * defined by user-space.
+ */
+ if (p->nr_cpus_allowed >= num_possible_cpus()) {
+ if (static_branch_maybe(CONFIG_NUMA, &scx_selcpu_topo_numa))
+ numa_cpus = numa_span(prev_cpu);
+
+ if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
+ llc_cpus = llc_span(prev_cpu);
+ }
+
+ /*
+ * If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
+ */
+ if (wake_flags & SCX_WAKE_SYNC) {
+ cpu = smp_processor_id();
+
+ /*
+ * If the waker's CPU is cache affine and prev_cpu is idle,
+ * then avoid a migration.
+ */
+ if (cpus_share_cache(cpu, prev_cpu) &&
+ test_and_clear_cpu_idle(prev_cpu)) {
+ cpu = prev_cpu;
+ goto cpu_found;
+ }
+
+ /*
+ * If the waker's local DSQ is empty, and the system is under
+ * utilized, try to wake up @p to the local DSQ of the waker.
+ *
+ * Checking only for an empty local DSQ is insufficient as it
+ * could give the wakee an unfair advantage when the system is
+ * oversaturated.
+ *
+ * Checking only for the presence of idle CPUs is also
+ * insufficient as the local DSQ of the waker could have tasks
+ * piled up on it even if there is an idle core elsewhere on
+ * the system.
+ */
+ if (!cpumask_empty(idle_masks.cpu) &&
+ !(current->flags & PF_EXITING) &&
+ cpu_rq(cpu)->scx.local_dsq.nr == 0) {
+ if (cpumask_test_cpu(cpu, p->cpus_ptr))
+ goto cpu_found;
+ }
+ }
+
+ /*
+ * If CPU has SMT, any wholly idle CPU is likely a better pick than
+ * partially idle @prev_cpu.
+ */
+ if (sched_smt_active()) {
+ /*
+ * Keep using @prev_cpu if it's part of a fully idle core.
+ */
+ if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
+ test_and_clear_cpu_idle(prev_cpu)) {
+ cpu = prev_cpu;
+ goto cpu_found;
+ }
+
+ /*
+ * Search for any fully idle core in the same LLC domain.
+ */
+ if (llc_cpus) {
+ cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
+ /*
+ * Search for any fully idle core in the same NUMA node.
+ */
+ if (numa_cpus) {
+ cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
+ /*
+	 * Search for any fully idle core usable by the task.
+ */
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
+ /*
+ * Use @prev_cpu if it's idle.
+ */
+ if (test_and_clear_cpu_idle(prev_cpu)) {
+ cpu = prev_cpu;
+ goto cpu_found;
+ }
+
+ /*
+ * Search for any idle CPU in the same LLC domain.
+ */
+ if (llc_cpus) {
+ cpu = scx_pick_idle_cpu(llc_cpus, 0);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
+ /*
+ * Search for any idle CPU in the same NUMA node.
+ */
+ if (numa_cpus) {
+ cpu = scx_pick_idle_cpu(numa_cpus, 0);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
+ /*
+ * Search for any idle CPU usable by the task.
+ */
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
+ if (cpu >= 0)
+ goto cpu_found;
+
+ rcu_read_unlock();
+ return prev_cpu;
+
+cpu_found:
+ rcu_read_unlock();
+
+ *found = true;
+ return cpu;
+}
+
+static void reset_idle_masks(void)
+{
+ /*
+ * Consider all online cpus idle. Should converge to the actual state
+ * quickly.
+ */
+ cpumask_copy(idle_masks.cpu, cpu_online_mask);
+ cpumask_copy(idle_masks.smt, cpu_online_mask);
+}
+
+void __scx_update_idle(struct rq *rq, bool idle)
+{
+ int cpu = cpu_of(rq);
+
+ if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
+ SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
+ if (!static_branch_unlikely(&scx_builtin_idle_enabled))
+ return;
+ }
+
+ if (idle)
+ cpumask_set_cpu(cpu, idle_masks.cpu);
+ else
+ cpumask_clear_cpu(cpu, idle_masks.cpu);
+
+#ifdef CONFIG_SCHED_SMT
+ if (sched_smt_active()) {
+ const struct cpumask *smt = cpu_smt_mask(cpu);
+
+ if (idle) {
+ /*
+ * idle_masks.smt handling is racy but that's fine as
+ * it's only for optimization and self-correcting.
+ */
+ for_each_cpu(cpu, smt) {
+ if (!cpumask_test_cpu(cpu, idle_masks.cpu))
+ return;
+ }
+ cpumask_or(idle_masks.smt, idle_masks.smt, smt);
+ } else {
+ cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
+ }
+ }
+#endif
+}
+
+#else /* !CONFIG_SMP */
+
+static bool test_and_clear_cpu_idle(int cpu) { return false; }
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; }
+static void reset_idle_masks(void) {}
+
+#endif /* CONFIG_SMP */
+
+
+/********************************************************************************
+ * Helpers that can be called from the BPF scheduler.
+ */
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
+ * @p: task_struct to select a CPU for
+ * @prev_cpu: CPU @p was on previously
+ * @wake_flags: %SCX_WAKE_* flags
+ * @is_idle: out parameter indicating whether the returned CPU is idle
+ *
+ * Can only be called from ops.select_cpu() if the built-in CPU selection is
+ * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set.
+ * @p, @prev_cpu and @wake_flags match ops.select_cpu().
+ *
+ * Returns the picked CPU with *@is_idle indicating whether the picked CPU is
+ * currently idle and thus a good candidate for direct dispatching.
+ */
+__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
+ u64 wake_flags, bool *is_idle)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ goto prev_cpu;
+ }
+
+ if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
+ goto prev_cpu;
+
+#ifdef CONFIG_SMP
+ return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
+#endif
+
+prev_cpu:
+ *is_idle = false;
+ return prev_cpu;
+}
+
+/**
+ * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
+ * per-CPU cpumask.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, or if running
+ * on a UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return cpu_none_mask;
+ }
+
+#ifdef CONFIG_SMP
+ return idle_masks.cpu;
+#else
+ return cpu_none_mask;
+#endif
+}
+
+/**
+ * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
+ * per-physical-core cpumask. Can be used to determine if an entire physical
+ * core is free.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, or if running
+ * on a UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return cpu_none_mask;
+ }
+
+#ifdef CONFIG_SMP
+ if (sched_smt_active())
+ return idle_masks.smt;
+ else
+ return idle_masks.cpu;
+#else
+ return cpu_none_mask;
+#endif
+}
+
+/**
+ * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to
+ * either the percpu, or SMT idle-tracking cpumask.
+ */
+__bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
+{
+ /*
+ * Empty function body because we aren't actually acquiring or releasing
+ * a reference to a global idle cpumask, which is read-only in the
+ * caller and is never released. The acquire / release semantics here
+ * are just used to make the cpumask a trusted pointer in the caller.
+ */
+}
+
+/**
+ * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state
+ * @cpu: cpu to test and clear idle for
+ *
+ * Returns %true if @cpu was idle and its idle state was successfully cleared.
+ * %false otherwise.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return false;
+ }
+
+ if (ops_cpu_valid(cpu, NULL))
+ return test_and_clear_cpu_idle(cpu);
+ else
+ return false;
+}
+
+/**
+ * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
+ * @cpus_allowed: Allowed cpumask
+ * @flags: %SCX_PICK_IDLE_CPU_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu
+ * number on success. -%EBUSY if no matching cpu was found.
+ *
+ * Idle CPU tracking may race against CPU scheduling state transitions. For
+ * example, this function may return -%EBUSY as CPUs are transitioning into the
+ * idle state. If the caller then assumes that there will be dispatch events on
+ * the CPUs as they were all busy, the scheduler may end up stalling with CPUs
+ * idling while there are pending tasks. Use scx_bpf_pick_any_cpu() and
+ * scx_bpf_kick_cpu() to guarantee that there will be at least one dispatch
+ * event in the near future.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+__bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
+ u64 flags)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return -EBUSY;
+ }
+
+ return scx_pick_idle_cpu(cpus_allowed, flags);
+}
+
+/**
+ * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU
+ * @cpus_allowed: Allowed cpumask
+ * @flags: %SCX_PICK_IDLE_CPU_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any
+ * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu
+ * number if @cpus_allowed is not empty. -%EBUSY is returned if @cpus_allowed is
+ * empty.
+ *
+ * If ops.update_idle() is implemented and %SCX_OPS_KEEP_BUILTIN_IDLE is not
+ * set, this function can't tell which CPUs are idle and will always pick any
+ * CPU.
+ */
+__bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
+ u64 flags)
+{
+ s32 cpu;
+
+ if (static_branch_likely(&scx_builtin_idle_enabled)) {
+ cpu = scx_pick_idle_cpu(cpus_allowed, flags);
+ if (cpu >= 0)
+ return cpu;
+ }
+
+ cpu = cpumask_any_distribute(cpus_allowed);
+ if (cpu < nr_cpu_ids)
+ return cpu;
+ else
+ return -EBUSY;
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
+BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
+BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_select_cpu,
+};
--
2.47.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 03/10] sched_ext: idle: introduce check_builtin_idle_enabled() helper
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-20 15:11 ` [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
2024-12-20 15:11 ` [PATCH 02/10] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
@ 2024-12-20 15:11 ` Andrea Righi
2024-12-20 15:11 ` [PATCH 04/10] sched_ext: idle: use assign_cpu() to update the idle cpumask Andrea Righi
` (6 subsequent siblings)
9 siblings, 0 replies; 30+ messages in thread
From: Andrea Righi @ 2024-12-20 15:11 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, bpf, linux-kernel
Minor refactoring to add a helper function for checking if the built-in
idle CPU selection policy is enabled.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext_idle.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 9e8479dd7277..0e57830072e4 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -12,6 +12,15 @@
static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
+static bool check_builtin_idle_enabled(void)
+{
+ if (static_branch_likely(&scx_builtin_idle_enabled))
+ return true;
+
+ scx_ops_error("built-in idle tracking is disabled");
+ return false;
+}
+
#ifdef CONFIG_SMP
#ifdef CONFIG_CPUMASK_OFFSTACK
#define CL_ALIGNED_IF_ONSTACK
@@ -508,10 +517,8 @@ __bpf_kfunc_start_defs();
__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
u64 wake_flags, bool *is_idle)
{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
+ if (!check_builtin_idle_enabled())
goto prev_cpu;
- }
if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
goto prev_cpu;
@@ -533,10 +540,8 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
*/
__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
+ if (!check_builtin_idle_enabled())
return cpu_none_mask;
- }
#ifdef CONFIG_SMP
return idle_masks.cpu;
@@ -554,10 +559,8 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
*/
__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
+ if (!check_builtin_idle_enabled())
return cpu_none_mask;
- }
#ifdef CONFIG_SMP
if (sched_smt_active())
@@ -595,10 +598,8 @@ __bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
*/
__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
+ if (!check_builtin_idle_enabled())
return false;
- }
if (ops_cpu_valid(cpu, NULL))
return test_and_clear_cpu_idle(cpu);
@@ -628,10 +629,8 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
__bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
u64 flags)
{
- if (!static_branch_likely(&scx_builtin_idle_enabled)) {
- scx_ops_error("built-in idle tracking is disabled");
+ if (!check_builtin_idle_enabled())
return -EBUSY;
- }
return scx_pick_idle_cpu(cpus_allowed, flags);
}
--
2.47.1
* [PATCH 04/10] sched_ext: idle: use assign_cpu() to update the idle cpumask
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
` (2 preceding siblings ...)
2024-12-20 15:11 ` [PATCH 03/10] sched_ext: idle: introduce check_builtin_idle_enabled() helper Andrea Righi
@ 2024-12-20 15:11 ` Andrea Righi
2024-12-23 22:26 ` Yury Norov
2024-12-20 15:11 ` [PATCH 05/10] sched_ext: idle: clarify comments Andrea Righi
` (5 subsequent siblings)
9 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2024-12-20 15:11 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, bpf, linux-kernel
Use the assign_cpu() helper to set or clear the CPU in the idle mask,
based on the idle condition.
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext_idle.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 0e57830072e4..dedd39febc88 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -460,10 +460,7 @@ void __scx_update_idle(struct rq *rq, bool idle)
return;
}
- if (idle)
- cpumask_set_cpu(cpu, idle_masks.cpu);
- else
- cpumask_clear_cpu(cpu, idle_masks.cpu);
+ assign_cpu(cpu, idle_masks.cpu, idle);
#ifdef CONFIG_SCHED_SMT
if (sched_smt_active()) {
--
2.47.1
* [PATCH 05/10] sched_ext: idle: clarify comments
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
` (3 preceding siblings ...)
2024-12-20 15:11 ` [PATCH 04/10] sched_ext: idle: use assign_cpu() to update the idle cpumask Andrea Righi
@ 2024-12-20 15:11 ` Andrea Righi
2024-12-23 22:28 ` Yury Norov
2024-12-20 15:11 ` [PATCH 06/10] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
` (4 subsequent siblings)
9 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2024-12-20 15:11 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, bpf, linux-kernel
Add a comment to clarify the usage of cpumask_intersects().
Moreover, update the scx_select_cpu_dfl() description, clarifying that the
final step of the idle selection logic involves searching for any idle
CPU in the system that the task can use.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext_idle.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index dedd39febc88..4952e2793304 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -52,6 +52,10 @@ static bool test_and_clear_cpu_idle(int cpu)
* scx_pick_idle_cpu() can get caught in an infinite loop as
* @cpu is never cleared from idle_masks.smt. Ensure that @cpu
* is eventually cleared.
+ *
+ * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
+ * reduce memory writes, which may help alleviate cache
+ * coherence pressure.
*/
if (cpumask_intersects(smt, idle_masks.smt))
cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
@@ -280,6 +284,8 @@ static void update_selcpu_topology(void)
* 4. Pick a CPU within the same NUMA node, if enabled:
* - choose a CPU from the same NUMA node to reduce memory access latency.
*
+ * 5. Pick any idle CPU usable by the task.
+ *
* Step 3 and 4 are performed only if the system has, respectively, multiple
* LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
* scx_selcpu_topo_numa).
--
2.47.1
* [PATCH 06/10] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
` (4 preceding siblings ...)
2024-12-20 15:11 ` [PATCH 05/10] sched_ext: idle: clarify comments Andrea Righi
@ 2024-12-20 15:11 ` Andrea Righi
2024-12-20 15:11 ` [PATCH 07/10] sched_ext: Introduce per-node idle cpumasks Andrea Righi
` (3 subsequent siblings)
9 siblings, 0 replies; 30+ messages in thread
From: Andrea Righi @ 2024-12-20 15:11 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, bpf, linux-kernel
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows each
scx scheduler to select between using a single global flat idle cpumask or
multiple per-node idle cpumasks.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 769e43fdea1e..148ec04d4a0a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -124,6 +124,12 @@ enum scx_ops_flags {
*/
SCX_OPS_SWITCH_PARTIAL = 1LLU << 3,
+ /*
+ * If set, enable per-node idle cpumasks. If clear, use a single global
+ * flat idle cpumask.
+ */
+ SCX_OPS_BUILTIN_IDLE_PER_NODE = 1LLU << 4,
+
/*
* CPU cgroup support flags
*/
@@ -133,6 +139,7 @@ enum scx_ops_flags {
SCX_OPS_ENQ_LAST |
SCX_OPS_ENQ_EXITING |
SCX_OPS_SWITCH_PARTIAL |
+ SCX_OPS_BUILTIN_IDLE_PER_NODE |
SCX_OPS_HAS_CGROUP_WEIGHT,
};
@@ -4974,6 +4981,16 @@ static int validate_ops(const struct sched_ext_ops *ops)
return -EINVAL;
}
+ /*
+ * SCX_OPS_BUILTIN_IDLE_PER_NODE requires the built-in CPU idle
+ * selection policy to be enabled.
+ */
+ if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
+ (ops->update_idle && !(ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE))) {
+ scx_ops_error("SCX_OPS_BUILTIN_IDLE_PER_NODE requires CPU idle selection enabled");
+ return -EINVAL;
+ }
+
return 0;
}
--
2.47.1
* [PATCH 07/10] sched_ext: Introduce per-node idle cpumasks
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
` (5 preceding siblings ...)
2024-12-20 15:11 ` [PATCH 06/10] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
@ 2024-12-20 15:11 ` Andrea Righi
2024-12-24 4:05 ` Yury Norov
2024-12-20 15:11 ` [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE Andrea Righi
` (2 subsequent siblings)
9 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2024-12-20 15:11 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, bpf, linux-kernel
Using a single global idle mask can lead to inefficiencies and heavy
stress on the cache coherency protocol on large systems with multiple
NUMA nodes, since all CPUs generate intense read/write activity on the
single global cpumask.
Therefore, split the global cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on such large systems.
Each cpumask tracks only the idle CPUs within its corresponding NUMA
node, treating CPUs in other NUMA nodes as busy. In this way, concurrent
access to the idle cpumasks is confined within each NUMA node.
NOTE: if a scheduler enables the per-node idle cpumasks (via
SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
trigger an scx error, since there are no system-wide cpumasks.
By default (when SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled), only the
cpumask of node 0 is used as a single global flat CPU mask, maintaining
the previous behavior.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 7 +-
kernel/sched/ext_idle.c | 258 +++++++++++++++++++++++++++++++---------
2 files changed, 208 insertions(+), 57 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 148ec04d4a0a..143938e935f1 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3228,7 +3228,7 @@ static void handle_hotplug(struct rq *rq, bool online)
atomic_long_inc(&scx_hotplug_seq);
if (scx_enabled())
- update_selcpu_topology();
+ update_selcpu_topology(&scx_ops);
if (online && SCX_HAS_OP(cpu_online))
SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu);
@@ -5107,7 +5107,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
check_hotplug_seq(ops);
#ifdef CONFIG_SMP
- update_selcpu_topology();
+ update_selcpu_topology(ops);
#endif
cpus_read_unlock();
@@ -5800,8 +5800,7 @@ void __init init_sched_ext_class(void)
BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
#ifdef CONFIG_SMP
- BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
- BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
+ idle_masks_init();
#endif
scx_kick_cpus_pnt_seqs =
__alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 4952e2793304..444f2a15f1d4 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -10,7 +10,14 @@
* Copyright (c) 2024 Andrea Righi <arighi@nvidia.com>
*/
+/*
+ * If NUMA awareness is disabled, consider only node 0 as a single
+ * global NUMA node.
+ */
+#define NUMA_FLAT_NODE 0
+
static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
+static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
static bool check_builtin_idle_enabled(void)
{
@@ -22,22 +29,82 @@ static bool check_builtin_idle_enabled(void)
}
#ifdef CONFIG_SMP
-#ifdef CONFIG_CPUMASK_OFFSTACK
-#define CL_ALIGNED_IF_ONSTACK
-#else
-#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
-#endif
-
-static struct {
+struct idle_cpumask {
cpumask_var_t cpu;
cpumask_var_t smt;
-} idle_masks CL_ALIGNED_IF_ONSTACK;
+};
+
+/*
+ * cpumasks to track idle CPUs within each NUMA node.
+ *
+ * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
+ * from node 0 is used to track all idle CPUs system-wide.
+ */
+static struct idle_cpumask **scx_idle_masks;
+
+static struct idle_cpumask *get_idle_mask(int node)
+{
+ if (node == NUMA_NO_NODE)
+ node = numa_node_id();
+ else if (WARN_ON_ONCE(node < 0 || node >= nr_node_ids))
+ return NULL;
+ return scx_idle_masks[node];
+}
+
+static struct cpumask *get_idle_cpumask(int node)
+{
+ struct idle_cpumask *mask = get_idle_mask(node);
+
+ return mask ? mask->cpu : cpu_none_mask;
+}
+
+static struct cpumask *get_idle_smtmask(int node)
+{
+ struct idle_cpumask *mask = get_idle_mask(node);
+
+ return mask ? mask->smt : cpu_none_mask;
+}
+
+static void idle_masks_init(void)
+{
+ int node;
+
+ scx_idle_masks = kcalloc(num_possible_nodes(), sizeof(*scx_idle_masks), GFP_KERNEL);
+ BUG_ON(!scx_idle_masks);
+
+ for_each_node_state(node, N_POSSIBLE) {
+ scx_idle_masks[node] = kzalloc_node(sizeof(**scx_idle_masks), GFP_KERNEL, node);
+ BUG_ON(!scx_idle_masks[node]);
+
+ BUG_ON(!alloc_cpumask_var_node(&scx_idle_masks[node]->cpu, GFP_KERNEL, node));
+ BUG_ON(!alloc_cpumask_var_node(&scx_idle_masks[node]->smt, GFP_KERNEL, node));
+ }
+}
static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
+/*
+ * Return the node id associated to a target idle CPU (used to determine
+ * the proper idle cpumask).
+ */
+static int idle_cpu_to_node(int cpu)
+{
+ int node;
+
+ if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
+ node = cpu_to_node(cpu);
+ else
+ node = NUMA_FLAT_NODE;
+
+ return node;
+}
+
static bool test_and_clear_cpu_idle(int cpu)
{
+ int node = idle_cpu_to_node(cpu);
+ struct cpumask *idle_cpus = get_idle_cpumask(node);
+
#ifdef CONFIG_SCHED_SMT
/*
* SMT mask should be cleared whether we can claim @cpu or not. The SMT
@@ -46,33 +113,37 @@ static bool test_and_clear_cpu_idle(int cpu)
*/
if (sched_smt_active()) {
const struct cpumask *smt = cpu_smt_mask(cpu);
+ struct cpumask *idle_smts = get_idle_smtmask(node);
/*
* If offline, @cpu is not its own sibling and
* scx_pick_idle_cpu() can get caught in an infinite loop as
- * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
- * is eventually cleared.
+ * @cpu is never cleared from the idle SMT mask. Ensure that
+ * @cpu is eventually cleared.
*
* NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
* reduce memory writes, which may help alleviate cache
* coherence pressure.
*/
- if (cpumask_intersects(smt, idle_masks.smt))
- cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
- else if (cpumask_test_cpu(cpu, idle_masks.smt))
- __cpumask_clear_cpu(cpu, idle_masks.smt);
+ if (cpumask_intersects(smt, idle_smts))
+ cpumask_andnot(idle_smts, idle_smts, smt);
+ else if (cpumask_test_cpu(cpu, idle_smts))
+ __cpumask_clear_cpu(cpu, idle_smts);
}
#endif
- return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
+ return cpumask_test_and_clear_cpu(cpu, idle_cpus);
}
-static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
+/*
+ * Pick an idle CPU in a specific NUMA node.
+ */
+static s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags)
{
int cpu;
retry:
if (sched_smt_active()) {
- cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
+ cpu = cpumask_any_and_distribute(get_idle_smtmask(node), cpus_allowed);
if (cpu < nr_cpu_ids)
goto found;
@@ -80,15 +151,57 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
return -EBUSY;
}
- cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
+ cpu = cpumask_any_and_distribute(get_idle_cpumask(node), cpus_allowed);
if (cpu >= nr_cpu_ids)
return -EBUSY;
found:
if (test_and_clear_cpu_idle(cpu))
return cpu;
- else
- goto retry;
+ goto retry;
+}
+
+/*
+ * Find the best idle CPU in the system, relative to @node.
+ *
+ * If @node is NUMA_NO_NODE, start from the current node.
+ */
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
+{
+ nodemask_t hop_nodes = NODE_MASK_NONE;
+ s32 cpu = -EBUSY;
+
+ if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
+ return pick_idle_cpu_from_node(cpus_allowed, NUMA_FLAT_NODE, flags);
+
+ /*
+ * If a NUMA node was not specified, start with the current one.
+ */
+ if (node == NUMA_NO_NODE)
+ node = numa_node_id();
+
+ /*
+ * Traverse all nodes in order of increasing distance, starting
+ * from prev_cpu's node.
+ *
+ * This loop is O(N^2), with N being the number of NUMA nodes,
+ * which might be quite expensive on large NUMA systems. However,
+ * this complexity comes into play only when a scheduler enables
+ * SCX_OPS_BUILTIN_IDLE_PER_NODE and requests an idle CPU without
+ * specifying a target NUMA node, so it shouldn't be a bottleneck
+ * in most cases.
+ *
+ * As a future optimization we may want to cache the list of hop
+ * nodes in a per-node array, instead of actually traversing them
+ * every time.
+ */
+ for_each_numa_hop_node(n, node, hop_nodes, N_POSSIBLE) {
+ cpu = pick_idle_cpu_from_node(cpus_allowed, n, flags);
+ if (cpu >= 0)
+ break;
+ }
+
+ return cpu;
}
/*
@@ -208,7 +321,7 @@ static bool llc_numa_mismatch(void)
* CPU belongs to a single LLC domain, and that each LLC domain is entirely
* contained within a single NUMA node.
*/
-static void update_selcpu_topology(void)
+static void update_selcpu_topology(struct sched_ext_ops *ops)
{
bool enable_llc = false, enable_numa = false;
unsigned int nr_cpus;
@@ -298,6 +411,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
{
const struct cpumask *llc_cpus = NULL;
const struct cpumask *numa_cpus = NULL;
+ int node = idle_cpu_to_node(prev_cpu);
s32 cpu;
*found = false;
@@ -355,9 +469,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* piled up on it even if there is an idle core elsewhere on
* the system.
*/
- if (!cpumask_empty(idle_masks.cpu) &&
- !(current->flags & PF_EXITING) &&
- cpu_rq(cpu)->scx.local_dsq.nr == 0) {
+ if (!(current->flags & PF_EXITING) &&
+ cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
+ !cpumask_empty(get_idle_cpumask(idle_cpu_to_node(cpu)))) {
if (cpumask_test_cpu(cpu, p->cpus_ptr))
goto cpu_found;
}
@@ -371,7 +485,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
/*
* Keep using @prev_cpu if it's part of a fully idle core.
*/
- if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
+ if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(node)) &&
test_and_clear_cpu_idle(prev_cpu)) {
cpu = prev_cpu;
goto cpu_found;
@@ -381,7 +495,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* Search for any fully idle core in the same LLC domain.
*/
if (llc_cpus) {
- cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
+ cpu = pick_idle_cpu_from_node(llc_cpus, node, SCX_PICK_IDLE_CORE);
if (cpu >= 0)
goto cpu_found;
}
@@ -390,15 +504,19 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* Search for any fully idle core in the same NUMA node.
*/
if (numa_cpus) {
- cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
+ cpu = scx_pick_idle_cpu(numa_cpus, node, SCX_PICK_IDLE_CORE);
if (cpu >= 0)
goto cpu_found;
}
/*
* Search for any full idle core usable by the task.
+ *
+ * If NUMA aware idle selection is enabled, the search will
+ * begin in prev_cpu's node and proceed to other nodes in
+ * order of increasing distance.
*/
- cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, node, SCX_PICK_IDLE_CORE);
if (cpu >= 0)
goto cpu_found;
}
@@ -415,7 +533,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* Search for any idle CPU in the same LLC domain.
*/
if (llc_cpus) {
- cpu = scx_pick_idle_cpu(llc_cpus, 0);
+ cpu = pick_idle_cpu_from_node(llc_cpus, node, 0);
if (cpu >= 0)
goto cpu_found;
}
@@ -424,7 +542,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* Search for any idle CPU in the same NUMA node.
*/
if (numa_cpus) {
- cpu = scx_pick_idle_cpu(numa_cpus, 0);
+ cpu = pick_idle_cpu_from_node(numa_cpus, node, 0);
if (cpu >= 0)
goto cpu_found;
}
@@ -432,7 +550,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
/*
* Search for any idle CPU usable by the task.
*/
- cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, node, 0);
if (cpu >= 0)
goto cpu_found;
@@ -448,17 +566,33 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
static void reset_idle_masks(void)
{
+ int node;
+
+ if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
+ cpumask_copy(get_idle_cpumask(NUMA_FLAT_NODE), cpu_online_mask);
+ cpumask_copy(get_idle_smtmask(NUMA_FLAT_NODE), cpu_online_mask);
+ return;
+ }
+
/*
* Consider all online cpus idle. Should converge to the actual state
* quickly.
*/
- cpumask_copy(idle_masks.cpu, cpu_online_mask);
- cpumask_copy(idle_masks.smt, cpu_online_mask);
+ for_each_node_state(node, N_POSSIBLE) {
+ const struct cpumask *node_mask = cpumask_of_node(node);
+ struct cpumask *idle_cpu = get_idle_cpumask(node);
+ struct cpumask *idle_smt = get_idle_smtmask(node);
+
+ cpumask_and(idle_cpu, cpu_online_mask, node_mask);
+ cpumask_copy(idle_smt, idle_cpu);
+ }
}
void __scx_update_idle(struct rq *rq, bool idle)
{
int cpu = cpu_of(rq);
+ int node = idle_cpu_to_node(cpu);
+ struct cpumask *idle_cpu = get_idle_cpumask(node);
if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
@@ -466,24 +600,25 @@ void __scx_update_idle(struct rq *rq, bool idle)
return;
}
- assign_cpu(cpu, idle_masks.cpu, idle);
+ assign_cpu(cpu, idle_cpu, idle);
#ifdef CONFIG_SCHED_SMT
if (sched_smt_active()) {
const struct cpumask *smt = cpu_smt_mask(cpu);
+ struct cpumask *idle_smt = get_idle_smtmask(node);
if (idle) {
/*
- * idle_masks.smt handling is racy but that's fine as
- * it's only for optimization and self-correcting.
+ * idle_smt handling is racy but that's fine as it's
+ * only for optimization and self-correcting.
*/
for_each_cpu(cpu, smt) {
- if (!cpumask_test_cpu(cpu, idle_masks.cpu))
+ if (!cpumask_test_cpu(cpu, idle_cpu))
return;
}
- cpumask_or(idle_masks.smt, idle_masks.smt, smt);
+ cpumask_or(idle_smt, idle_smt, smt);
} else {
- cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
+ cpumask_andnot(idle_smt, idle_smt, smt);
}
}
#endif
@@ -491,8 +626,23 @@ void __scx_update_idle(struct rq *rq, bool idle)
#else /* !CONFIG_SMP */
+static struct cpumask *get_idle_cpumask(int node)
+{
+ return cpu_none_mask;
+}
+
+static struct cpumask *get_idle_smtmask(int node)
+{
+ return cpu_none_mask;
+}
+
static bool test_and_clear_cpu_idle(int cpu) { return false; }
-static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; }
+
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
+{
+ return -EBUSY;
+}
+
static void reset_idle_masks(void) {}
#endif /* CONFIG_SMP */
@@ -546,11 +696,12 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
if (!check_builtin_idle_enabled())
return cpu_none_mask;
-#ifdef CONFIG_SMP
- return idle_masks.cpu;
-#else
- return cpu_none_mask;
-#endif
+ if (static_branch_unlikely(&scx_builtin_idle_per_node)) {
+ scx_ops_error("SCX_OPS_BUILTIN_IDLE_PER_NODE enabled");
+ return cpu_none_mask;
+ }
+
+ return get_idle_cpumask(NUMA_FLAT_NODE);
}
/**
@@ -565,14 +716,15 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
if (!check_builtin_idle_enabled())
return cpu_none_mask;
-#ifdef CONFIG_SMP
+ if (static_branch_unlikely(&scx_builtin_idle_per_node)) {
+ scx_ops_error("SCX_OPS_BUILTIN_IDLE_PER_NODE enabled");
+ return cpu_none_mask;
+ }
+
if (sched_smt_active())
- return idle_masks.smt;
+ return get_idle_smtmask(NUMA_FLAT_NODE);
else
- return idle_masks.cpu;
-#else
- return cpu_none_mask;
-#endif
+ return get_idle_cpumask(NUMA_FLAT_NODE);
}
/**
@@ -635,7 +787,7 @@ __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
if (!check_builtin_idle_enabled())
return -EBUSY;
- return scx_pick_idle_cpu(cpus_allowed, flags);
+ return scx_pick_idle_cpu(cpus_allowed, NUMA_NO_NODE, flags);
}
/**
@@ -658,7 +810,7 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
s32 cpu;
if (static_branch_likely(&scx_builtin_idle_enabled)) {
- cpu = scx_pick_idle_cpu(cpus_allowed, flags);
+ cpu = scx_pick_idle_cpu(cpus_allowed, NUMA_NO_NODE, flags);
if (cpu >= 0)
return cpu;
}
--
2.47.1
* [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
` (6 preceding siblings ...)
2024-12-20 15:11 ` [PATCH 07/10] sched_ext: Introduce per-node idle cpumasks Andrea Righi
@ 2024-12-20 15:11 ` Andrea Righi
2024-12-24 2:48 ` Yury Norov
2024-12-20 15:11 ` [PATCH 09/10] sched_ext: idle: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-20 15:11 ` [PATCH 10/10] sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
9 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2024-12-20 15:11 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, bpf, linux-kernel
Introduce a flag to restrict the selection of an idle CPU to a specific
NUMA node.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 1 +
kernel/sched/ext_idle.c | 11 +++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 143938e935f1..da5c15bd3c56 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -773,6 +773,7 @@ enum scx_deq_flags {
enum scx_pick_idle_cpu_flags {
SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */
+ SCX_PICK_IDLE_NODE = 1LLU << 1, /* pick a CPU in the same target NUMA node */
};
enum scx_kick_flags {
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 444f2a15f1d4..013deaa08f12 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -199,6 +199,12 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 f
cpu = pick_idle_cpu_from_node(cpus_allowed, n, flags);
if (cpu >= 0)
break;
+ /*
+ * Stop the search if it is restricted to the target node.
+ */
+ if (flags & SCX_PICK_IDLE_NODE)
+ break;
}
return cpu;
@@ -495,7 +501,8 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* Search for any fully idle core in the same LLC domain.
*/
if (llc_cpus) {
- cpu = pick_idle_cpu_from_node(llc_cpus, node, SCX_PICK_IDLE_CORE);
+ cpu = scx_pick_idle_cpu(llc_cpus, node,
+ SCX_PICK_IDLE_CORE | SCX_PICK_IDLE_NODE);
if (cpu >= 0)
goto cpu_found;
}
@@ -533,7 +540,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* Search for any idle CPU in the same LLC domain.
*/
if (llc_cpus) {
- cpu = pick_idle_cpu_from_node(llc_cpus, node, 0);
+ cpu = scx_pick_idle_cpu(llc_cpus, node, SCX_PICK_IDLE_NODE);
if (cpu >= 0)
goto cpu_found;
}
--
2.47.1
* [PATCH 09/10] sched_ext: idle: Get rid of the scx_selcpu_topo_numa logic
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
` (7 preceding siblings ...)
2024-12-20 15:11 ` [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE Andrea Righi
@ 2024-12-20 15:11 ` Andrea Righi
2024-12-23 23:39 ` Yury Norov
2024-12-20 15:11 ` [PATCH 10/10] sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
9 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2024-12-20 15:11 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, bpf, linux-kernel
With the introduction of separate per-NUMA node cpumasks, we
automatically track idle CPUs within each NUMA node.
This makes the special logic for determining idle CPUs in each NUMA node
redundant, so we can get rid of it.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext_idle.c | 93 ++++++++++-------------------------------
1 file changed, 23 insertions(+), 70 deletions(-)
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 013deaa08f12..b36e93da1b75 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -82,7 +82,6 @@ static void idle_masks_init(void)
}
static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
-static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
/*
* Return the node id associated to a target idle CPU (used to determine
@@ -259,25 +258,6 @@ static unsigned int numa_weight(s32 cpu)
return sg->group_weight;
}
-/*
- * Return the cpumask representing the NUMA domain of @cpu (or NULL if the NUMA
- * domain is not defined).
- */
-static struct cpumask *numa_span(s32 cpu)
-{
- struct sched_domain *sd;
- struct sched_group *sg;
-
- sd = rcu_dereference(per_cpu(sd_numa, cpu));
- if (!sd)
- return NULL;
- sg = sd->groups;
- if (!sg)
- return NULL;
-
- return sched_group_span(sg);
-}
-
/*
* Return true if the LLC domains do not perfectly overlap with the NUMA
* domains, false otherwise.
@@ -329,7 +309,7 @@ static bool llc_numa_mismatch(void)
*/
static void update_selcpu_topology(struct sched_ext_ops *ops)
{
- bool enable_llc = false, enable_numa = false;
+ bool enable_llc = false;
unsigned int nr_cpus;
s32 cpu = cpumask_first(cpu_online_mask);
@@ -348,41 +328,34 @@ static void update_selcpu_topology(struct sched_ext_ops *ops)
if (nr_cpus > 0) {
if (nr_cpus < num_online_cpus())
enable_llc = true;
+ /*
+ * When per-node cpumasks are enabled, there is no need to enable
+ * the LLC optimization if the LLC domains perfectly overlap with
+ * the NUMA domains.
+ */
+ if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
+ !llc_numa_mismatch())
+ enable_llc = false;
pr_debug("sched_ext: LLC=%*pb weight=%u\n",
cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
}
-
- /*
- * Enable NUMA optimization only when there are multiple NUMA domains
- * among the online CPUs and the NUMA domains don't perfectly overlaps
- * with the LLC domains.
- *
- * If all CPUs belong to the same NUMA node and the same LLC domain,
- * enabling both NUMA and LLC optimizations is unnecessary, as checking
- * for an idle CPU in the same domain twice is redundant.
- */
- nr_cpus = numa_weight(cpu);
- if (nr_cpus > 0) {
- if (nr_cpus < num_online_cpus() && llc_numa_mismatch())
- enable_numa = true;
- pr_debug("sched_ext: NUMA=%*pb weight=%u\n",
- cpumask_pr_args(numa_span(cpu)), numa_weight(cpu));
- }
rcu_read_unlock();
pr_debug("sched_ext: LLC idle selection %s\n",
enable_llc ? "enabled" : "disabled");
- pr_debug("sched_ext: NUMA idle selection %s\n",
- enable_numa ? "enabled" : "disabled");
if (enable_llc)
static_branch_enable_cpuslocked(&scx_selcpu_topo_llc);
else
static_branch_disable_cpuslocked(&scx_selcpu_topo_llc);
- if (enable_numa)
- static_branch_enable_cpuslocked(&scx_selcpu_topo_numa);
+
+ /*
+ * Check if we need to enable per-node cpumasks.
+ */
+ if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
+ static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
else
- static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
+ static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
}
/*
@@ -405,9 +378,8 @@ static void update_selcpu_topology(struct sched_ext_ops *ops)
*
* 5. Pick any idle CPU usable by the task.
*
- * Step 3 and 4 are performed only if the system has, respectively, multiple
- * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
- * scx_selcpu_topo_numa).
+ * Step 3 is performed only if the system has multiple LLC domains that are not
+ * perfectly overlapping with the NUMA domains (see scx_selcpu_topo_llc).
*
* NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
* we never call ops.select_cpu() for them, see select_task_rq().
@@ -416,7 +388,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
u64 wake_flags, bool *found)
{
const struct cpumask *llc_cpus = NULL;
- const struct cpumask *numa_cpus = NULL;
int node = idle_cpu_to_node(prev_cpu);
s32 cpu;
@@ -438,13 +409,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
* CPU affinity), the task will simply use the flat scheduling domain
* defined by user-space.
*/
- if (p->nr_cpus_allowed >= num_possible_cpus()) {
- if (static_branch_maybe(CONFIG_NUMA, &scx_selcpu_topo_numa))
- numa_cpus = numa_span(prev_cpu);
-
+ if (p->nr_cpus_allowed >= num_possible_cpus())
if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
llc_cpus = llc_span(prev_cpu);
- }
/*
* If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
@@ -507,15 +474,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
goto cpu_found;
}
- /*
- * Search for any fully idle core in the same NUMA node.
- */
- if (numa_cpus) {
- cpu = scx_pick_idle_cpu(numa_cpus, node, SCX_PICK_IDLE_CORE);
- if (cpu >= 0)
- goto cpu_found;
- }
-
/*
* Search for any full idle core usable by the task.
*
@@ -545,17 +503,12 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
goto cpu_found;
}
- /*
- * Search for any idle CPU in the same NUMA node.
- */
- if (numa_cpus) {
- cpu = pick_idle_cpu_from_node(numa_cpus, node, 0);
- if (cpu >= 0)
- goto cpu_found;
- }
-
/*
* Search for any idle CPU usable by the task.
+ *
+ * If NUMA aware idle selection is enabled, the search will begin
+ * in prev_cpu's node and proceed to other nodes in order of
+ * increasing distance.
*/
cpu = scx_pick_idle_cpu(p->cpus_ptr, node, 0);
if (cpu >= 0)
--
2.47.1
* [PATCH 10/10] sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers
2024-12-20 15:11 [PATCHSET v8 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
` (8 preceding siblings ...)
2024-12-20 15:11 ` [PATCH 09/10] sched_ext: idle: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
@ 2024-12-20 15:11 ` Andrea Righi
2024-12-24 0:57 ` Yury Norov
9 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2024-12-20 15:11 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, bpf, linux-kernel
Add the following kfuncs to give scx schedulers direct access to
per-node idle cpumask information:
const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
int node, u64 flags)
int scx_bpf_cpu_to_node(s32 cpu)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext_idle.c | 163 ++++++++++++++++++++---
tools/sched_ext/include/scx/common.bpf.h | 4 +
tools/sched_ext/include/scx/compat.bpf.h | 19 +++
3 files changed, 170 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index b36e93da1b75..0f8ccc1e290e 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -28,6 +28,60 @@ static bool check_builtin_idle_enabled(void)
return false;
}
+static bool check_builtin_idle_per_node_enabled(void)
+{
+ if (static_branch_likely(&scx_builtin_idle_per_node))
+ return true;
+
+ scx_ops_error("per-node idle tracking is disabled");
+ return false;
+}
+
+/*
+ * Validate and resolve a NUMA node.
+ *
+ * Return the resolved node ID on success or a negative value otherwise.
+ */
+static int validate_node(int node)
+{
+ if (!check_builtin_idle_per_node_enabled())
+ return -EINVAL;
+
+ /* If no node is specified, use the current one */
+ if (node == NUMA_NO_NODE)
+ return numa_node_id();
+
+ /* Make sure node is in a valid range */
+ if (node < 0 || node >= nr_node_ids) {
+ scx_ops_error("invalid node %d", node);
+ return -ENOENT;
+ }
+
+ /* Make sure the node is part of the set of possible nodes */
+ if (!node_possible(node)) {
+ scx_ops_error("unavailable node %d", node);
+ return -EINVAL;
+ }
+
+ return node;
+}
+
+/*
+ * Return the node id associated to a target idle CPU (used to determine
+ * the proper idle cpumask).
+ */
+static int idle_cpu_to_node(int cpu)
+{
+ int node;
+
+ if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
+ node = cpu_to_node(cpu);
+ else
+ node = NUMA_FLAT_NODE;
+
+ return node;
+}
+
#ifdef CONFIG_SMP
struct idle_cpumask {
cpumask_var_t cpu;
@@ -83,22 +137,6 @@ static void idle_masks_init(void)
static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
-/*
- * Return the node id associated to a target idle CPU (used to determine
- * the proper idle cpumask).
- */
-static int idle_cpu_to_node(int cpu)
-{
- int node;
-
- if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
- node = cpu_to_node(cpu);
- else
- node = NUMA_FLAT_NODE;
-
- return node;
-}
-
static bool test_and_clear_cpu_idle(int cpu)
{
int node = idle_cpu_to_node(cpu);
@@ -613,6 +651,17 @@ static void reset_idle_masks(void) {}
*/
__bpf_kfunc_start_defs();
+/**
+ * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to
+ */
+__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu)
+{
+ if (cpu < 0 || cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ return idle_cpu_to_node(cpu);
+}
+
/**
* scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
* @p: task_struct to select a CPU for
@@ -645,6 +694,28 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
return prev_cpu;
}
+/**
+ * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
+ * per-CPU cpumask of a target NUMA node.
+ *
+ * NUMA_NO_NODE is interpreted as the current node.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, if @node is
+ * not valid, or when running on a UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
+{
+ node = validate_node(node);
+ if (node < 0)
+ return cpu_none_mask;
+
+#ifdef CONFIG_SMP
+ return get_idle_cpumask(node);
+#else
+ return cpu_none_mask;
+#endif
+}
+
/**
* scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
* per-CPU cpumask.
@@ -664,6 +735,32 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
return get_idle_cpumask(NUMA_FLAT_NODE);
}
+/**
+ * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
+ * per-physical-core cpumask of a target NUMA node. Can be used to determine
+ * if an entire physical core is free.
+ *
+ * NUMA_NO_NODE is interpreted as the current node.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, if @node is
+ * not valid, or when running on a UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
+{
+ node = validate_node(node);
+ if (node < 0)
+ return cpu_none_mask;
+
+#ifdef CONFIG_SMP
+ if (sched_smt_active())
+ return get_idle_smtmask(node);
+ else
+ return get_idle_cpumask(node);
+#else
+ return cpu_none_mask;
+#endif
+}
+
/**
* scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
* per-physical-core cpumask. Can be used to determine if an entire physical
@@ -722,6 +819,36 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
return false;
}
+/**
+ * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
+ * @cpus_allowed: Allowed cpumask
+ * @node: target NUMA node
+ * @flags: %SCX_PICK_IDLE_CPU_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
+ * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu
+ * was found.
+ *
+ * If @node is NUMA_NO_NODE, the search is restricted to the current NUMA
+ * node. Otherwise, the search starts from @node and proceeds to other
+ * online NUMA nodes in order of increasing distance (unless
+ * SCX_PICK_IDLE_NODE is specified, in which case the search is limited to
+ * the target @node).
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(const struct cpumask *cpus_allowed,
+ int node, u64 flags)
+{
+ node = validate_node(node);
+ if (node < 0)
+ return node;
+
+ return scx_pick_idle_cpu(cpus_allowed, node, flags);
+}
+
/**
* scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
* @cpus_allowed: Allowed cpumask
@@ -785,11 +912,15 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
+BTF_ID_FLAGS(func, scx_bpf_cpu_to_node)
BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 858ba1f438f6..fe0433f7c4d9 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -63,13 +63,17 @@ u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
+int scx_bpf_cpu_to_node(s32 cpu) __ksym __weak;
const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak;
+const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) __ksym __weak;
const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym;
+const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) __ksym __weak;
const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym;
void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym;
bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
+s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags) __ksym __weak;
s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
bool scx_bpf_task_running(const struct task_struct *p) __ksym;
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index d56520100a26..dfc329d5a91e 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -125,6 +125,25 @@ bool scx_bpf_dispatch_vtime_from_dsq___compat(struct bpf_iter_scx_dsq *it__iter,
false; \
})
+#define __COMPAT_scx_bpf_cpu_to_node(cpu) \
+ (bpf_ksym_exists(scx_bpf_cpu_to_node) ? \
+ scx_bpf_cpu_to_node(cpu) : 0)
+
+#define __COMPAT_scx_bpf_get_idle_cpumask_node(node) \
+ (bpf_ksym_exists(scx_bpf_get_idle_cpumask_node) ? \
+ scx_bpf_get_idle_cpumask_node(node) : \
+ scx_bpf_get_idle_cpumask())
+
+#define __COMPAT_scx_bpf_get_idle_smtmask_node(node) \
+ (bpf_ksym_exists(scx_bpf_get_idle_smtmask_node) ? \
+ scx_bpf_get_idle_smtmask_node(node) : \
+ scx_bpf_get_idle_smtmask())
+
+#define __COMPAT_scx_bpf_pick_idle_cpu_node(cpus_allowed, node, flags) \
+ (bpf_ksym_exists(scx_bpf_pick_idle_cpu_node) ? \
+ scx_bpf_pick_idle_cpu_node(cpus_allowed, node, flags) : \
+ scx_bpf_pick_idle_cpu(cpus_allowed, flags))
+
/*
* Define sched_ext_ops. This may be expanded to define multiple variants for
* backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
--
2.47.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
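The __COMPAT_* wrappers in the hunk above all follow the same detect-and-fall-back shape: probe for the new kfunc with bpf_ksym_exists() and degrade to the flat (non-NUMA-aware) call when it is absent. As a loose Python analogy of that pattern (every name below is invented for the sketch, not part of any real API):

```python
def make_pick_idle_cpu(kernel):
    """Build a picker that prefers the NUMA-aware entry point when the
    running kernel exposes it (analogue of bpf_ksym_exists() + fallback)."""
    new_api = getattr(kernel, "pick_idle_cpu_node", None)
    if new_api is not None:
        return new_api
    # Old kernel: ignore the node hint and use the flat picker.
    return lambda cpus, node, flags: kernel.pick_idle_cpu(cpus, flags)

class OldKernel:
    def pick_idle_cpu(self, cpus, flags):
        return min(cpus) if cpus else -16            # -EBUSY

class NewKernel(OldKernel):
    def pick_idle_cpu_node(self, cpus, node, flags):
        local = [c for c in cpus if c // 4 == node]  # pretend 4 CPUs per node
        return min(local) if local else self.pick_idle_cpu(cpus, flags)

old_pick = make_pick_idle_cpu(OldKernel())
new_pick = make_pick_idle_cpu(NewKernel())
print(old_pick({1, 5}, 1, 0))   # 1: old kernel ignores the node hint
print(new_pick({1, 5}, 1, 0))   # 5: new kernel prefers CPUs on node 1
```

The compile-time counterpart in the patch keeps the same BPF object loadable on older kernels, which is the whole point of compat.bpf.h.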
* Re: [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
2024-12-20 15:11 ` [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
@ 2024-12-23 21:18 ` Yury Norov
2024-12-24 7:54 ` Andrea Righi
0 siblings, 1 reply; 30+ messages in thread
From: Yury Norov @ 2024-12-23 21:18 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Fri, Dec 20, 2024 at 04:11:33PM +0100, Andrea Righi wrote:
> Introduce for_each_numa_hop_node() and sched_numa_hop_node() to iterate
> over node IDs in order of increasing NUMA distance from a given starting
> node.
>
> These iterator functions are similar to for_each_numa_hop_mask() and
> sched_numa_hop_mask(), but instead of providing a cpumask at each
> iteration, they provide a node ID.
>
> Example usage:
>
> nodemask_t hop_nodes = NODE_MASK_NONE;
> int start = cpu_to_node(smp_processor_id());
>
> for_each_numa_hop_node(node, start, hop_nodes, N_ONLINE)
> 		pr_info("node (%d, %d) -> %d\n",
> 			start, node, node_distance(start, node));
This iterates nodes, not hops. The hop is a set of equidistant nodes,
and the iterator (the first argument) should be a nodemask. I'm OK with
that as long as you find it practical. But then you shouldn't mention
hops in the patch.
Also, can you check that it works correctly against a configuration with
equidistant nodes?
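One way to sanity-check both cases outside the kernel is to model the selection loop in a few lines of Python (the second distance table below is invented for the example). Because for_each_node_state() scans node IDs in ascending order and only a strictly smaller distance replaces the current minimum, equidistant nodes come out in node-ID order:

```python
def numa_visit_order(start, dist, nodes):
    """Model repeated sched_numa_hop_node() calls: pick the nearest
    not-yet-visited node each time, skipping @start itself."""
    visited, order = {start}, []
    while True:
        best, best_dist = None, float("inf")
        for n in sorted(nodes):              # ascending, like for_each_node_state()
            if n in visited:
                continue
            if dist[start][n] < best_dist:   # strict '<': first of a tie wins
                best, best_dist = n, dist[start][n]
        if best is None:
            return order
        visited.add(best)
        order.append(best)

# Distance table from the virtme-ng topology in the changelog.
dist = [[10, 51, 31, 41],
        [51, 10, 21, 61],
        [31, 21, 10, 11],
        [41, 61, 11, 10]]
print(numa_visit_order(0, dist, range(4)))   # [2, 3, 1], as in the quoted log

# Equidistant case: nodes 1 and 2 are both 20 away from node 0.
eq = [[10, 20, 20, 40],
      [20, 10, 20, 40],
      [20, 20, 10, 40],
      [40, 40, 40, 10]]
print(numa_visit_order(0, eq, range(4)))     # [1, 2, 3]
```

(The macro itself yields @start first, which is why the kernel log in the changelog prints node 0 before these.)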
> Simulating the following NUMA topology in virtme-ng:
>
> $ numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1
> node 0 size: 1006 MB
> node 0 free: 928 MB
> node 1 cpus: 2 3
> node 1 size: 1007 MB
> node 1 free: 986 MB
> node 2 cpus: 4 5
> node 2 size: 889 MB
> node 2 free: 862 MB
> node 3 cpus: 6 7
> node 3 size: 1006 MB
> node 3 free: 983 MB
> node distances:
> node 0 1 2 3
> 0: 10 51 31 41
> 1: 51 10 21 61
> 2: 31 21 10 11
> 3: 41 61 11 10
>
> The output of the example above (on node 0) is the following:
>
> [ 84.953644] node (0, 0) -> 10
> [ 84.953712] node (0, 2) -> 31
> [ 84.953764] node (0, 3) -> 41
> [ 84.953817] node (0, 1) -> 51
>
> Cc: Yury Norov <yury.norov@gmail.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> include/linux/topology.h | 28 ++++++++++++++++++++++-
> kernel/sched/topology.c | 49 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 76 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 52f5850730b3..d9014d90580d 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -248,12 +248,18 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
> #ifdef CONFIG_NUMA
> int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
> extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops);
> -#else
> +extern int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state);
> +#else /* !CONFIG_NUMA */
> static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> {
> return cpumask_nth_and(cpu, cpus, cpu_online_mask);
> }
>
> +static inline int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
> +{
> + return NUMA_NO_NODE;
> +}
> +
> static inline const struct cpumask *
> sched_numa_hop_mask(unsigned int node, unsigned int hops)
> {
> @@ -261,6 +267,26 @@ sched_numa_hop_mask(unsigned int node, unsigned int hops)
> }
> #endif /* CONFIG_NUMA */
>
> +/**
> + * for_each_numa_hop_node - iterate over NUMA nodes at increasing hop distances
> + * from a given starting node.
> + * @__node: the iteration variable, representing the current NUMA node.
> + * @__start: the NUMA node to start the iteration from.
> + * @__hop_nodes: a nodemask_t to track the visited nodes.
> + * @__state: state of NUMA nodes to iterate.
> + *
> + * Requires rcu_lock to be held.
> + *
> + * This macro iterates over NUMA nodes in increasing distance from
> + * @start_node.
> + *
> + * Yields NUMA_NO_NODE when all the nodes have been visited.
> + */
> +#define for_each_numa_hop_node(__node, __start, __hop_nodes, __state) \
Since this is not the hops iterator, the proper name would be just
for_each_numa_node().
And because the 'numa' prefix here doesn't look like a prefix, I think
it would be nice to comment what this 'numa' means and what's the
difference between this and for_each_node() iterator, especially in
terms of complexity.
Also you don't need underscores in macro declarations unless
absolutely necessary.
> + for (int __node = __start; \
The __node declared in the for() statement shadows the __node provided
to the macro.
> + __node != NUMA_NO_NODE; \
> + __node = sched_numa_hop_node(&(__hop_nodes), __start, __state))
> +
> /**
> * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance
> * from a given node.
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9748a4c8d668..8e77c235ad9a 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2185,6 +2185,55 @@ int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> }
> EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
>
> +/**
> + * sched_numa_hop_node - Find the NUMA node at the closest hop distance
> + * from node @start.
> + *
> + * @hop_nodes: a pointer to a nodemask_t representing the visited nodes.
> + * @start: the NUMA node to start the hop search from.
> + * @state: the node state to filter nodes by.
> + *
> + * This function iterates over all NUMA nodes in the given state and
> + * calculates the hop distance to the starting node. It returns the NUMA
> + * node that is the closest in terms of hop distance
> + * that has not already been considered (not set in @hop_nodes). If the
> + * node is found, it is marked as considered in the @hop_nodes bitmask.
> + *
> + * The function checks if the node is not the start node and ensures it is
> + * not already part of the hop_nodes set. It then computes the distance to
> + * the start node using the node_distance() function. The closest node is
> + * chosen based on the minimum distance.
> + *
> + * Returns the NUMA node ID closest in terms of hop distance from the
> + * @start node, or NUMA_NO_NODE if no node is found (or all nodes have been
> + * visited).
for_each_node_state() returns MAX_NUMNODES when it finishes
traversing. I think you should do the same here.
> + */
> +int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
> +{
> + int dist, n, min_node, min_dist;
> +
> + if (state >= NR_NODE_STATES)
> + return NUMA_NO_NODE;
-EINVAL. But, do we need to check the parameter at all?
> +
> + min_node = NUMA_NO_NODE;
> + min_dist = INT_MAX;
> +
> + for_each_node_state(n, state) {
> + if (n == start || node_isset(n, *hop_nodes))
> + continue;
> + dist = node_distance(start, n);
> + if (dist < min_dist) {
> + min_dist = dist;
> + min_node = n;
> + }
> + }
This is a version of numa_nearest_node(). The only difference is that
you add 'hop_nodes' mask, which in fact is 'visited' nodes.
I think it should be like:
int numa_nearest_unvisited_node(nodemask_t *visited, int start, unsigned int state)
{
	for_each_node_andnot(node, node_states[state], visited) {
		...
	}
}
Thanks,
Yury
> + if (min_node != NUMA_NO_NODE)
> + node_set(min_node, *hop_nodes);
> +
> + return min_node;
> +}
> +EXPORT_SYMBOL_GPL(sched_numa_hop_node);
> +
> /**
> * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away from
> * @node
> --
> 2.47.1
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 04/10] sched_ext: idle: use assign_cpu() to update the idle cpumask
2024-12-20 15:11 ` [PATCH 04/10] sched_ext: idle: use assign_cpu() to update the idle cpumask Andrea Righi
@ 2024-12-23 22:26 ` Yury Norov
0 siblings, 0 replies; 30+ messages in thread
From: Yury Norov @ 2024-12-23 22:26 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Fri, Dec 20, 2024 at 04:11:36PM +0100, Andrea Righi wrote:
> Use the assign_cpu() helper to set or clear the CPU in the idle mask,
> based on the idle condition.
>
> Cc: Yury Norov <yury.norov@gmail.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
> ---
> kernel/sched/ext_idle.c | 5 +----
> 1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index 0e57830072e4..dedd39febc88 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -460,10 +460,7 @@ void __scx_update_idle(struct rq *rq, bool idle)
> return;
> }
>
> - if (idle)
> - cpumask_set_cpu(cpu, idle_masks.cpu);
> - else
> - cpumask_clear_cpu(cpu, idle_masks.cpu);
> + assign_cpu(cpu, idle_masks.cpu, idle);
>
> #ifdef CONFIG_SCHED_SMT
> if (sched_smt_active()) {
> --
> 2.47.1
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 05/10] sched_ext: idle: clarify comments
2024-12-20 15:11 ` [PATCH 05/10] sched_ext: idle: clarify comments Andrea Righi
@ 2024-12-23 22:28 ` Yury Norov
0 siblings, 0 replies; 30+ messages in thread
From: Yury Norov @ 2024-12-23 22:28 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Fri, Dec 20, 2024 at 04:11:37PM +0100, Andrea Righi wrote:
> Add a comment to clarify the usage of cpumask_intersects().
>
> Moreover, update scx_select_cpu_dfl() description clarifying that the
> final step of the idle selection logic involves searching for any idle
> CPU in the system that the task can use.
>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
> ---
> kernel/sched/ext_idle.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index dedd39febc88..4952e2793304 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -52,6 +52,10 @@ static bool test_and_clear_cpu_idle(int cpu)
> * scx_pick_idle_cpu() can get caught in an infinite loop as
> * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> * is eventually cleared.
> + *
> + * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
> + * reduce memory writes, which may help alleviate cache
> + * coherence pressure.
> */
> if (cpumask_intersects(smt, idle_masks.smt))
> cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> @@ -280,6 +284,8 @@ static void update_selcpu_topology(void)
> * 4. Pick a CPU within the same NUMA node, if enabled:
> * - choose a CPU from the same NUMA node to reduce memory access latency.
> *
> + * 5. Pick any idle CPU usable by the task.
> + *
> * Step 3 and 4 are performed only if the system has, respectively, multiple
> * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
> * scx_selcpu_topo_numa).
> --
> 2.47.1
^ permalink raw reply [flat|nested] 30+ messages in thread
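The cpumask_intersects() note being added above captures a general read-before-write idiom: check a shared mask read-only first, and only dirty its cache line when the update would actually change something. A toy Python model of the idea (the write counter exists only for the sketch):

```python
writes = 0                                 # counts simulated cache-line writes

def clear_siblings(idle_smt, smt):
    """Model of:
         if (cpumask_intersects(smt, idle_masks.smt))
             cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
    i.e. skip the store entirely when no sibling bit is set."""
    global writes
    if idle_smt & smt:                     # cpumask_intersects(): read-only
        writes += 1                        # only now pay for the write
        idle_smt -= smt                    # cpumask_andnot()
    return idle_smt

mask = {0, 1, 4, 5}
mask = clear_siblings(mask, {0, 1})        # siblings were idle: one write
mask = clear_siblings(mask, {2, 3})        # nothing to clear: no write at all
print(sorted(mask), writes)                # [4, 5] 1
```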
* Re: [PATCH 09/10] sched_ext: idle: Get rid of the scx_selcpu_topo_numa logic
2024-12-20 15:11 ` [PATCH 09/10] sched_ext: idle: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
@ 2024-12-23 23:39 ` Yury Norov
2024-12-24 8:58 ` Andrea Righi
0 siblings, 1 reply; 30+ messages in thread
From: Yury Norov @ 2024-12-23 23:39 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Fri, Dec 20, 2024 at 04:11:41PM +0100, Andrea Righi wrote:
> With the introduction of separate per-NUMA node cpumasks, we
> automatically track idle CPUs within each NUMA node.
>
> This makes the special logic for determining idle CPUs in each NUMA node
> redundant and unnecessary, so we can get rid of it.
But it looks like you do more than that...
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/ext_idle.c | 93 ++++++++++-------------------------------
> 1 file changed, 23 insertions(+), 70 deletions(-)
>
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index 013deaa08f12..b36e93da1b75 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -82,7 +82,6 @@ static void idle_masks_init(void)
> }
>
> static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
> -static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
>
> /*
> * Return the node id associated to a target idle CPU (used to determine
> @@ -259,25 +258,6 @@ static unsigned int numa_weight(s32 cpu)
> return sg->group_weight;
> }
>
> -/*
> - * Return the cpumask representing the NUMA domain of @cpu (or NULL if the NUMA
> - * domain is not defined).
> - */
> -static struct cpumask *numa_span(s32 cpu)
> -{
> - struct sched_domain *sd;
> - struct sched_group *sg;
> -
> - sd = rcu_dereference(per_cpu(sd_numa, cpu));
> - if (!sd)
> - return NULL;
> - sg = sd->groups;
> - if (!sg)
> - return NULL;
> -
> - return sched_group_span(sg);
I didn't find llc_span() and node_span() in the vanilla kernel. Does this
series have prerequisites?
> -}
> -
> /*
> * Return true if the LLC domains do not perfectly overlap with the NUMA
> * domains, false otherwise.
> @@ -329,7 +309,7 @@ static bool llc_numa_mismatch(void)
> */
> static void update_selcpu_topology(struct sched_ext_ops *ops)
> {
> - bool enable_llc = false, enable_numa = false;
> + bool enable_llc = false;
> unsigned int nr_cpus;
> s32 cpu = cpumask_first(cpu_online_mask);
>
> @@ -348,41 +328,34 @@ static void update_selcpu_topology(struct sched_ext_ops *ops)
> if (nr_cpus > 0) {
> if (nr_cpus < num_online_cpus())
> enable_llc = true;
> + /*
> + * No need to enable LLC optimization if the LLC domains are
> + * perfectly overlapping with the NUMA domains when per-node
> + * cpumasks are enabled.
> + */
> + if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
> + !llc_numa_mismatch())
> + enable_llc = false;
This doesn't sound like redundancy removal. I may be wrong, but this
looks like a sort of optimization. If so, it deserves to be a separate
patch.
> pr_debug("sched_ext: LLC=%*pb weight=%u\n",
> cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
> }
> -
> - /*
> - * Enable NUMA optimization only when there are multiple NUMA domains
> - * among the online CPUs and the NUMA domains don't perfectly overlaps
> - * with the LLC domains.
> - *
> - * If all CPUs belong to the same NUMA node and the same LLC domain,
> - * enabling both NUMA and LLC optimizations is unnecessary, as checking
> - * for an idle CPU in the same domain twice is redundant.
> - */
> - nr_cpus = numa_weight(cpu);
Nor did I find numa_weight()...
> - if (nr_cpus > 0) {
> - if (nr_cpus < num_online_cpus() && llc_numa_mismatch())
> - enable_numa = true;
> - pr_debug("sched_ext: NUMA=%*pb weight=%u\n",
> - cpumask_pr_args(numa_span(cpu)), numa_weight(cpu));
> - }
This calls numa_weight() twice. Get rid of it for good.
> rcu_read_unlock();
>
> pr_debug("sched_ext: LLC idle selection %s\n",
> enable_llc ? "enabled" : "disabled");
> - pr_debug("sched_ext: NUMA idle selection %s\n",
> - enable_numa ? "enabled" : "disabled");
>
> if (enable_llc)
> static_branch_enable_cpuslocked(&scx_selcpu_topo_llc);
> else
> static_branch_disable_cpuslocked(&scx_selcpu_topo_llc);
> - if (enable_numa)
> - static_branch_enable_cpuslocked(&scx_selcpu_topo_numa);
> +
> + /*
> + * Check if we need to enable per-node cpumasks.
> + */
> + if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
> + static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
This is the key from the whole series!
> else
> - static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
> + static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
> }
This knob enables the whole new machinery, and it definitely deserves to
be a separate, very last patch. Now it looks like a sneaky replacement of
scx_selcpu_topo_numa with scx_builtin_idle_per_node, and this is wrong.
Are you sure you need a comment on top of it? To me, the code is quite
self-explanatory...
>
> /*
> @@ -405,9 +378,8 @@ static void update_selcpu_topology(struct sched_ext_ops *ops)
> *
> * 5. Pick any idle CPU usable by the task.
> *
> - * Step 3 and 4 are performed only if the system has, respectively, multiple
> - * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
> - * scx_selcpu_topo_numa).
> + * Step 3 is performed only if the system has multiple LLC domains that are not
> + * perfectly overlapping with the NUMA domains (see scx_selcpu_topo_llc).
> *
> * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
> * we never call ops.select_cpu() for them, see select_task_rq().
> @@ -416,7 +388,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> u64 wake_flags, bool *found)
> {
> const struct cpumask *llc_cpus = NULL;
> - const struct cpumask *numa_cpus = NULL;
> int node = idle_cpu_to_node(prev_cpu);
> s32 cpu;
>
> @@ -438,13 +409,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * CPU affinity), the task will simply use the flat scheduling domain
> * defined by user-space.
> */
> - if (p->nr_cpus_allowed >= num_possible_cpus()) {
> - if (static_branch_maybe(CONFIG_NUMA, &scx_selcpu_topo_numa))
> - numa_cpus = numa_span(prev_cpu);
> -
> + if (p->nr_cpus_allowed >= num_possible_cpus())
> if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
> llc_cpus = llc_span(prev_cpu);
> - }
I'd keep the curly braces. That would minimize your patch and preserve
more history.
>
> /*
> * If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
> @@ -507,15 +474,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> goto cpu_found;
> }
>
> - /*
> - * Search for any fully idle core in the same NUMA node.
> - */
> - if (numa_cpus) {
> - cpu = scx_pick_idle_cpu(numa_cpus, node, SCX_PICK_IDLE_CORE);
> - if (cpu >= 0)
> - goto cpu_found;
> - }
> -
> /*
> * Search for any full idle core usable by the task.
> *
> @@ -545,17 +503,12 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> goto cpu_found;
> }
>
> - /*
> - * Search for any idle CPU in the same NUMA node.
> - */
> - if (numa_cpus) {
> - cpu = pick_idle_cpu_from_node(numa_cpus, node, 0);
> - if (cpu >= 0)
> - goto cpu_found;
> - }
> -
> /*
> * Search for any idle CPU usable by the task.
> + *
> + * If NUMA aware idle selection is enabled, the search will begin
> + * in prev_cpu's node and proceed to other nodes in order of
> + * increasing distance.
Again, this is definitely not a redundancy removal. This is a description
of new behavior, and it should go with the implementation of that
behavior.
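For reference, what the quoted comment describes amounts to a distance-ordered scan over per-node idle masks. A hypothetical Python model (node layout, CPU numbers, and the -EBUSY convention mirror the thread; nothing here is real kernel API):

```python
def pick_idle_cpu(idle_per_node, start, dist):
    """Model of scx_pick_idle_cpu() with per-node masks: scan @start's
    node first, then the others in order of increasing distance."""
    for node in sorted(idle_per_node, key=lambda n: dist[start][n]):
        if idle_per_node[node]:
            cpu = min(idle_per_node[node])
            idle_per_node[node].discard(cpu)   # test-and-clear: claim the CPU
            return cpu
    return -16                                 # -EBUSY: no idle CPU anywhere

# Same distance table as the virtme-ng example earlier in the thread.
dist = [[10, 51, 31, 41],
        [51, 10, 21, 61],
        [31, 21, 10, 11],
        [41, 61, 11, 10]]
idle = {0: set(), 1: {2, 3}, 2: {5}, 3: set()}
print(pick_idle_cpu(idle, 0, dist))   # 5: node 0 empty, node 2 is nearest
print(pick_idle_cpu(idle, 0, dist))   # 2: nodes 0/2/3 empty, fall back to node 1
```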
> */
> cpu = scx_pick_idle_cpu(p->cpus_ptr, node, 0);
> if (cpu >= 0)
> --
> 2.47.1
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 10/10] sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers
2024-12-20 15:11 ` [PATCH 10/10] sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
@ 2024-12-24 0:57 ` Yury Norov
2024-12-24 9:32 ` Andrea Righi
0 siblings, 1 reply; 30+ messages in thread
From: Yury Norov @ 2024-12-24 0:57 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Fri, Dec 20, 2024 at 04:11:42PM +0100, Andrea Righi wrote:
> Add the following kfunc's to provide scx schedulers direct access to
> per-node idle cpumasks information:
>
> const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
> int node, u64 flags)
> int scx_bpf_cpu_to_node(s32 cpu)
>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/ext_idle.c | 163 ++++++++++++++++++++---
> tools/sched_ext/include/scx/common.bpf.h | 4 +
> tools/sched_ext/include/scx/compat.bpf.h | 19 +++
> 3 files changed, 170 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index b36e93da1b75..0f8ccc1e290e 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -28,6 +28,60 @@ static bool check_builtin_idle_enabled(void)
> return false;
> }
>
> +static bool check_builtin_idle_per_node_enabled(void)
> +{
> + if (static_branch_likely(&scx_builtin_idle_per_node))
> + return true;
return 0;
> +
> + scx_ops_error("per-node idle tracking is disabled");
> + return false;
return -ENOTSUP;
> +}
> +
> +/*
> + * Validate and resolve a NUMA node.
> + *
> + * Return the resolved node ID on success or a negative value otherwise.
> + */
> +static int validate_node(int node)
> +{
> + if (!check_builtin_idle_per_node_enabled())
> + return -EINVAL;
So the node may be valid, but this validator may fail. EINVAL is a
misleading error code for that. You need ENOTSUP.
> +
> + /* If no node is specified, use the current one */
> + if (node == NUMA_NO_NODE)
> + return numa_node_id();
> +
> + /* Make sure node is in a valid range */
> + if (node < 0 || node >= nr_node_ids) {
> + scx_ops_error("invalid node %d", node);
> + return -ENOENT;
No such file or directory? Hmm...
This should be EINVAL. I would join this one with node_possible()
check. We probably need bpf_node_possible() or something...
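For reference, the decision ladder of the posted validate_node() looks like this when modeled in Python (the error numbers are the ones in the patch; the review above argues for -ENOTSUP / -EINVAL instead, as noted in the comments):

```python
EINVAL, ENOENT = 22, 2
NUMA_NO_NODE = -1

def validate_node(node, *, per_node_enabled, current_node, nr_node_ids,
                  possible_nodes):
    """Python model of the posted validate_node() hunk."""
    if not per_node_enabled:
        return -EINVAL            # review: arguably -ENOTSUP
    if node == NUMA_NO_NODE:
        return current_node       # "no node" resolves to the caller's node
    if node < 0 or node >= nr_node_ids:
        return -ENOENT            # review: arguably -EINVAL
    if node not in possible_nodes:
        return -EINVAL
    return node

kw = dict(per_node_enabled=True, current_node=1, nr_node_ids=4,
          possible_nodes={0, 1, 2, 3})
print(validate_node(NUMA_NO_NODE, **kw))  # 1: resolved to the current node
print(validate_node(9, **kw))             # -2: out of range
print(validate_node(2, **{**kw, "per_node_enabled": False}))  # -22
```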
> + }
> +
> + /* Make sure the node is part of the set of possible nodes */
> + if (!node_possible(node)) {
> + scx_ops_error("unavailable node %d", node);
Not that it's unavailable. It just doesn't exist... I'd say:
scx_ops_error("Non-existing node %d. The existing nodes are: %pbl",
node, nodemask_pr_args(node_states[N_POSSIBLE]));
> + return -EINVAL;
> + }
What if the user provides an offline or CPU-less node? Is that normal
usage? If not, it would be nice to print a warning, or even return an error...
> +
> + return node;
> +}
> +
> +/*
> + * Return the node id associated to a target idle CPU (used to determine
> + * the proper idle cpumask).
> + */
> +static int idle_cpu_to_node(int cpu)
> +{
> + int node;
> +
> + if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> + node = cpu_to_node(cpu);
> + else
> + node = NUMA_FLAT_NODE;
> +
> + return node;
> +}
> +
> #ifdef CONFIG_SMP
> struct idle_cpumask {
> cpumask_var_t cpu;
> @@ -83,22 +137,6 @@ static void idle_masks_init(void)
>
> static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
>
> -/*
> - * Return the node id associated to a target idle CPU (used to determine
> - * the proper idle cpumask).
> - */
> -static int idle_cpu_to_node(int cpu)
> -{
> - int node;
> -
> - if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> - node = cpu_to_node(cpu);
> - else
> - node = NUMA_FLAT_NODE;
> -
> - return node;
> -}
> -
> static bool test_and_clear_cpu_idle(int cpu)
> {
> int node = idle_cpu_to_node(cpu);
> @@ -613,6 +651,17 @@ static void reset_idle_masks(void) {}
> */
> __bpf_kfunc_start_defs();
>
> +/**
> + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to
> + */
> +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu)
> +{
> + if (cpu < 0 || cpu >= nr_cpu_ids)
> + return -EINVAL;
> +
> + return idle_cpu_to_node(cpu);
> +}
> +
> /**
> * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
> * @p: task_struct to select a CPU for
> @@ -645,6 +694,28 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> return prev_cpu;
> }
>
> +/**
> + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
> + * per-CPU cpumask of a target NUMA node.
> + *
> + * NUMA_NO_NODE is interpreted as the current node.
> + *
> + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> + * valid, or running on a UP kernel.
> + */
> +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> +{
> + node = validate_node(node);
> + if (node < 0)
> + return cpu_none_mask;
I think I commented this in v7. This simply hides an error. You need to
return ERR_PTR(node). And your user should check it with IS_ERR_VALUE().
This should be consistent with scx_bpf_pick_idle_cpu_node(), where you
return an actual error.
> +
> +#ifdef CONFIG_SMP
> + return get_idle_cpumask(node);
> +#else
> + return cpu_none_mask;
> +#endif
> +}
> +
> /**
> * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> * per-CPU cpumask.
> @@ -664,6 +735,32 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> return get_idle_cpumask(NUMA_FLAT_NODE);
> }
>
> +/**
> + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
> + * per-physical-core cpumask of a target NUMA node. Can be used to determine
> + * if an entire physical core is free.
If it goes to the docs, it should have a parameters section.
> + *
> + * NUMA_NO_NODE is interpreted as the current node.
> + *
> + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> + * valid, or running on a UP kernel.
> + */
> +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> +{
> + node = validate_node(node);
> + if (node < 0)
> + return cpu_none_mask;
> +
> +#ifdef CONFIG_SMP
> + if (sched_smt_active())
> + return get_idle_smtmask(node);
> + else
> + return get_idle_cpumask(node);
> +#else
> + return cpu_none_mask;
> +#endif
> +}
> +
> /**
> * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
> * per-physical-core cpumask. Can be used to determine if an entire physical
> @@ -722,6 +819,36 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
> return false;
> }
>
> +/**
> + * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
> + * @cpus_allowed: Allowed cpumask
> + * @node: target NUMA node
> + * @flags: %SCX_PICK_IDLE_CPU_* flags
> + *
> + * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
> + * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu
> + * was found.
validate_node() returns more errors.
> + *
> + * If @node is NUMA_NO_NODE, the search is restricted to the current NUMA
> + * node. Otherwise, the search starts from @node and proceeds to other
> + * online NUMA nodes in order of increasing distance (unless
> + * SCX_PICK_IDLE_NODE is specified, in which case the search is limited to
> + * the target @node).
Can you reorder statements, like:
Restricted to current node if NUMA_NO_NODE.
Restricted to @node if SCX_PICK_IDLE_NODE is specified
Otherwise ...
What if NUMA_NO_NODE + SCX_PICK_IDLE_NODE? Seems to be OK, but looks
redundant and non-intuitive. Why not, when NUMA_NO_NODE is provided,
start from the current node without restricting the search to it?
> + *
> + * Unavailable if ops.update_idle() is implemented and
> + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
> + */
> +__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(const struct cpumask *cpus_allowed,
> + int node, u64 flags)
> +{
> + node = validate_node(node);
Hold on! This validate_node() replaces NUMA_NO_NODE with the current
node but doesn't touch the flags. It means that scx_pick_idle_cpu()
will never see NUMA_NO_NODE and will not be able to restrict the search
to the current node. The comment above is incorrect, right?
> + if (node < 0)
> + return node;
> +
> + return scx_pick_idle_cpu(cpus_allowed, node, flags);
> +}
> +
> /**
> * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
> * @cpus_allowed: Allowed cpumask
> @@ -785,11 +912,15 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
> __bpf_kfunc_end_defs();
>
> BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
> +BTF_ID_FLAGS(func, scx_bpf_cpu_to_node)
> BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
> BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
> +BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
> BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
> diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
> index 858ba1f438f6..fe0433f7c4d9 100644
> --- a/tools/sched_ext/include/scx/common.bpf.h
> +++ b/tools/sched_ext/include/scx/common.bpf.h
> @@ -63,13 +63,17 @@ u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
> u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
> void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
> u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
> +int scx_bpf_cpu_to_node(s32 cpu) __ksym __weak;
> const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
> const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
> void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak;
> +const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) __ksym __weak;
> const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym;
> +const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) __ksym __weak;
> const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym;
> void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym;
> bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
> +s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags) __ksym __weak;
> s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
> s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
> bool scx_bpf_task_running(const struct task_struct *p) __ksym;
> diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
> index d56520100a26..dfc329d5a91e 100644
> --- a/tools/sched_ext/include/scx/compat.bpf.h
> +++ b/tools/sched_ext/include/scx/compat.bpf.h
> @@ -125,6 +125,25 @@ bool scx_bpf_dispatch_vtime_from_dsq___compat(struct bpf_iter_scx_dsq *it__iter,
> false; \
> })
>
> +#define __COMPAT_scx_bpf_cpu_to_node(cpu) \
> + (bpf_ksym_exists(scx_bpf_cpu_to_node) ? \
> + scx_bpf_cpu_to_node(cpu) : 0)
> +
> +#define __COMPAT_scx_bpf_get_idle_cpumask_node(node) \
> + (bpf_ksym_exists(scx_bpf_get_idle_cpumask_node) ? \
> + scx_bpf_get_idle_cpumask_node(node) : \
> + scx_bpf_get_idle_cpumask()) \
> +
> +#define __COMPAT_scx_bpf_get_idle_smtmask_node(node) \
> + (bpf_ksym_exists(scx_bpf_get_idle_smtmask_node) ? \
> + scx_bpf_get_idle_smtmask_node(node) : \
> + scx_bpf_get_idle_smtmask())
> +
> +#define __COMPAT_scx_bpf_pick_idle_cpu_node(cpus_allowed, node, flags) \
> + (bpf_ksym_exists(scx_bpf_pick_idle_cpu_node) ? \
> + scx_bpf_pick_idle_cpu_node(cpus_allowed, node, flags) : \
> + scx_bpf_pick_idle_cpu(cpus_allowed, flags))
> +
> /*
> * Define sched_ext_ops. This may be expanded to define multiple variants for
> * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
> --
> 2.47.1
* Re: [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE
2024-12-20 15:11 ` [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE Andrea Righi
@ 2024-12-24 2:48 ` Yury Norov
2024-12-24 3:53 ` Yury Norov
2024-12-24 8:22 ` Andrea Righi
0 siblings, 2 replies; 30+ messages in thread
From: Yury Norov @ 2024-12-24 2:48 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Fri, Dec 20, 2024 at 04:11:40PM +0100, Andrea Righi wrote:
> Introduce a flag to restrict the selection of an idle CPU to a specific
> NUMA node.
>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/ext.c | 1 +
> kernel/sched/ext_idle.c | 11 +++++++++--
> 2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 143938e935f1..da5c15bd3c56 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -773,6 +773,7 @@ enum scx_deq_flags {
>
> enum scx_pick_idle_cpu_flags {
> SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */
> + SCX_PICK_IDLE_NODE = 1LLU << 1, /* pick a CPU in the same target NUMA node */
SCX_FORCE_NODE or SCX_FIX_NODE?
> };
>
> enum scx_kick_flags {
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index 444f2a15f1d4..013deaa08f12 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -199,6 +199,12 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 f
> cpu = pick_idle_cpu_from_node(cpus_allowed, n, flags);
> if (cpu >= 0)
> break;
> + /*
> + * Check if the search is restricted to the same core or
> + * the same node.
> + */
> + if (flags & SCX_PICK_IDLE_NODE)
> + break;
Yeah, if you give the flag a better name, you won't have to comment
the code.
> }
>
> return cpu;
> @@ -495,7 +501,8 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * Search for any fully idle core in the same LLC domain.
> */
> if (llc_cpus) {
> - cpu = pick_idle_cpu_from_node(llc_cpus, node, SCX_PICK_IDLE_CORE);
> + cpu = scx_pick_idle_cpu(llc_cpus, node,
> + SCX_PICK_IDLE_CORE | SCX_PICK_IDLE_NODE);
You change it from scx_pick_idle_cpu() to pick_idle_cpu_from_node()
in patch 7 just to revert it back in patch 8...
You can use scx_pick_idle_cpu() in patch 7 already because
scx_builtin_idle_per_node is always disabled, and you always
follow the NUMA_FLAT_NODE path. Here you will just add the
SCX_PICK_IDLE_NODE flag.
That's the point of separating functionality and control patches. In
patch 7 you may need to mention explicitly that your new per-node
idle masks are unconditionally disabled, and will be enabled in the
last patch of the series, so some following patches will detail the
implementation.
> if (cpu >= 0)
> goto cpu_found;
> }
> @@ -533,7 +540,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * Search for any idle CPU in the same LLC domain.
> */
> if (llc_cpus) {
> - cpu = pick_idle_cpu_from_node(llc_cpus, node, 0);
> + cpu = scx_pick_idle_cpu(llc_cpus, node, SCX_PICK_IDLE_NODE);
> if (cpu >= 0)
> goto cpu_found;
> }
> --
> 2.47.1
* Re: [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE
2024-12-24 2:48 ` Yury Norov
@ 2024-12-24 3:53 ` Yury Norov
2024-12-24 8:37 ` Andrea Righi
2024-12-24 8:22 ` Andrea Righi
1 sibling, 1 reply; 30+ messages in thread
From: Yury Norov @ 2024-12-24 3:53 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Mon, Dec 23, 2024 at 06:48:48PM -0800, Yury Norov wrote:
> On Fri, Dec 20, 2024 at 04:11:40PM +0100, Andrea Righi wrote:
> > Introduce a flag to restrict the selection of an idle CPU to a specific
> > NUMA node.
> >
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/ext.c | 1 +
> > kernel/sched/ext_idle.c | 11 +++++++++--
> > 2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 143938e935f1..da5c15bd3c56 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -773,6 +773,7 @@ enum scx_deq_flags {
> >
> > enum scx_pick_idle_cpu_flags {
> > SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */
> > + SCX_PICK_IDLE_NODE = 1LLU << 1, /* pick a CPU in the same target NUMA node */
>
> SCX_FORCE_NODE or SCX_FIX_NODE?
>
> > };
> >
> > enum scx_kick_flags {
> > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> > index 444f2a15f1d4..013deaa08f12 100644
> > --- a/kernel/sched/ext_idle.c
> > +++ b/kernel/sched/ext_idle.c
> > @@ -199,6 +199,12 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 f
This function begins with:
static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
{
nodemask_t hop_nodes = NODE_MASK_NONE;
s32 cpu = -EBUSY;
if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
return pick_idle_cpu_from_node(cpus_allowed, NUMA_FLAT_NODE, flags);
...
So if I disable scx_builtin_idle_per_node and then call:
scx_pick_idle_cpu(some_cpus, numa_node_id(), SCX_PICK_IDLE_NODE)
I may get a CPU from any non-local node, right? I think we need to honor
the user's request:

	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
		return pick_idle_cpu_from_node(cpus_allowed,
				flags & SCX_PICK_IDLE_NODE ? node : NUMA_FLAT_NODE, flags);

That way the code will be coherent: if you enable per-node idle cpumasks,
you can follow the whole NUMA hierarchy. If you disable them, you at
least honor the user's request to return a CPU from a given node when
they are explicit about their intention.
You can be even nicer:

	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
		cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags);
		if (cpu < 0 && !(flags & SCX_PICK_IDLE_NODE))
			cpu = pick_idle_cpu_from_node(cpus_allowed, NUMA_FLAT_NODE, flags);
		return cpu;
	}
> > cpu = pick_idle_cpu_from_node(cpus_allowed, n, flags);
> > if (cpu >= 0)
> > break;
> > + /*
> > + * Check if the search is restricted to the same core or
> > + * the same node.
> > + */
> > + if (flags & SCX_PICK_IDLE_NODE)
> > + break;
>
> Yeah, if you give the flag a better name, you won't have to comment
> the code.
>
> > }
> >
> > return cpu;
> > @@ -495,7 +501,8 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > * Search for any fully idle core in the same LLC domain.
> > */
> > if (llc_cpus) {
> > - cpu = pick_idle_cpu_from_node(llc_cpus, node, SCX_PICK_IDLE_CORE);
> > + cpu = scx_pick_idle_cpu(llc_cpus, node,
> > + SCX_PICK_IDLE_CORE | SCX_PICK_IDLE_NODE);
>
> You change it from scx_pick_idle_cpu() to pick_idle_cpu_from_node()
> in patch 7 just to revert it back in patch 8...
>
> You can use scx_pick_idle_cpu() in patch 7 already because
> scx_builtin_idle_per_node is always disabled, and you always
> follow the NUMA_FLAT_NODE path. Here you will just add the
> SCX_PICK_IDLE_NODE flag.
>
> That's the point of separating functionality and control patches. In
> patch 7 you may need to mention explicitly that your new per-node
> idle masks are unconditionally disabled, and will be enabled in the
> last patch of the series, so some following patches will detail the
> implementation.
>
> > if (cpu >= 0)
> > goto cpu_found;
> > }
> > @@ -533,7 +540,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > * Search for any idle CPU in the same LLC domain.
> > */
> > if (llc_cpus) {
> > - cpu = pick_idle_cpu_from_node(llc_cpus, node, 0);
> > + cpu = scx_pick_idle_cpu(llc_cpus, node, SCX_PICK_IDLE_NODE);
> > if (cpu >= 0)
> > goto cpu_found;
> > }
> > --
> > 2.47.1
* Re: [PATCH 07/10] sched_ext: Introduce per-node idle cpumasks
2024-12-20 15:11 ` [PATCH 07/10] sched_ext: Introduce per-node idle cpumasks Andrea Righi
@ 2024-12-24 4:05 ` Yury Norov
2024-12-24 8:18 ` Andrea Righi
0 siblings, 1 reply; 30+ messages in thread
From: Yury Norov @ 2024-12-24 4:05 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Fri, Dec 20, 2024 at 04:11:39PM +0100, Andrea Righi wrote:
> Using a single global idle mask can lead to inefficiencies and a lot of
> stress on the cache coherency protocol on large systems with multiple
> NUMA nodes, since all the CPUs can create a really intense read/write
> activity on the single global cpumask.
>
> Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> to improve scalability and performance on large systems.
>
> The concept is that each cpumask will track only the idle CPUs within
> its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> In this way concurrent access to the idle cpumask will be restricted
> within each NUMA node.
>
> NOTE: if a scheduler enables the per-node idle cpumasks (via
> SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
> trigger an scx error, since there are no system-wide cpumasks.
>
> By default (when SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled), only the
> cpumask of node 0 is used as a single global flat CPU mask, maintaining
> the previous behavior.
>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
This is a rather big patch... Can you split it somehow? Maybe
introduce new functions in a separate patch, and use them in the
following patch(es)?
> ---
> kernel/sched/ext.c | 7 +-
> kernel/sched/ext_idle.c | 258 +++++++++++++++++++++++++++++++---------
> 2 files changed, 208 insertions(+), 57 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 148ec04d4a0a..143938e935f1 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3228,7 +3228,7 @@ static void handle_hotplug(struct rq *rq, bool online)
> atomic_long_inc(&scx_hotplug_seq);
>
> if (scx_enabled())
> - update_selcpu_topology();
> + update_selcpu_topology(&scx_ops);
>
> if (online && SCX_HAS_OP(cpu_online))
> SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu);
> @@ -5107,7 +5107,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>
> check_hotplug_seq(ops);
> #ifdef CONFIG_SMP
> - update_selcpu_topology();
> + update_selcpu_topology(ops);
> #endif
> cpus_read_unlock();
>
> @@ -5800,8 +5800,7 @@ void __init init_sched_ext_class(void)
>
> BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
> #ifdef CONFIG_SMP
> - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
> - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
> + idle_masks_init();
> #endif
> scx_kick_cpus_pnt_seqs =
> __alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index 4952e2793304..444f2a15f1d4 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -10,7 +10,14 @@
> * Copyright (c) 2024 Andrea Righi <arighi@nvidia.com>
> */
>
> +/*
> + * If NUMA awareness is disabled consider only node 0 as a single global
> + * NUMA node.
> + */
> +#define NUMA_FLAT_NODE 0
If it's a global idle node maybe
#define GLOBAL_IDLE_NODE 0
This actually bypasses NUMA, so it's weird to mention NUMA here.
> +
> static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
> +static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
>
> static bool check_builtin_idle_enabled(void)
> {
> @@ -22,22 +29,82 @@ static bool check_builtin_idle_enabled(void)
> }
>
> #ifdef CONFIG_SMP
> -#ifdef CONFIG_CPUMASK_OFFSTACK
> -#define CL_ALIGNED_IF_ONSTACK
> -#else
> -#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
> -#endif
> -
> -static struct {
> +struct idle_cpumask {
> cpumask_var_t cpu;
> cpumask_var_t smt;
> -} idle_masks CL_ALIGNED_IF_ONSTACK;
> +};
We already have struct cpumask, and this struct idle_cpumask may
mislead. Maybe struct idle_cpus or something?
> +
> +/*
> + * cpumasks to track idle CPUs within each NUMA node.
> + *
> + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> + * from node 0 is used to track all idle CPUs system-wide.
> + */
> +static struct idle_cpumask **scx_idle_masks;
> +
> +static struct idle_cpumask *get_idle_mask(int node)
Didn't we agree to drop this 'get' thing?
> +{
> + if (node == NUMA_NO_NODE)
> + node = numa_node_id();
> + else if (WARN_ON_ONCE(node < 0 || node >= nr_node_ids))
> + return NULL;
Kernel users always provide correct parameters. I don't even think you
need to check for NUMA_NO_NODE, because if I, as a user of your API,
need to pass the current node, I can use numa_node_id() just as well.
If you drop all that sanity-check bloat, your function becomes a
one-liner, and the question is: do you need it at all?
We usually need such wrappers to apply a 'const' qualifier or do some
housekeeping before dereferencing. But in this case you just return a
pointer, and I don't understand why local users can't do it themselves.
The following idle_masks_init() happily ignores the just-added accessor...
> + return scx_idle_masks[node];
> +}
> +
> +static struct cpumask *get_idle_cpumask(int node)
> +{
> + struct idle_cpumask *mask = get_idle_mask(node);
> +
> + return mask ? mask->cpu : cpu_none_mask;
> +}
> +
> +static struct cpumask *get_idle_smtmask(int node)
> +{
> + struct idle_cpumask *mask = get_idle_mask(node);
> +
> + return mask ? mask->smt : cpu_none_mask;
> +}
For those two guys... I think you agreed with Tejun that you don't
need them. To me the following is more expressive:
	idle_cpus(node)->smt;
> +
> +static void idle_masks_init(void)
> +{
> + int node;
> +
> + scx_idle_masks = kcalloc(num_possible_nodes(), sizeof(*scx_idle_masks), GFP_KERNEL);
> + BUG_ON(!scx_idle_masks);
> +
> + for_each_node_state(node, N_POSSIBLE) {
> + scx_idle_masks[node] = kzalloc_node(sizeof(**scx_idle_masks), GFP_KERNEL, node);
> + BUG_ON(!scx_idle_masks[node]);
> +
> + BUG_ON(!alloc_cpumask_var_node(&scx_idle_masks[node]->cpu, GFP_KERNEL, node));
> + BUG_ON(!alloc_cpumask_var_node(&scx_idle_masks[node]->smt, GFP_KERNEL, node));
> + }
> +}
>
> static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
> static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
>
> +/*
> + * Return the node id associated to a target idle CPU (used to determine
> + * the proper idle cpumask).
> + */
> +static int idle_cpu_to_node(int cpu)
> +{
> + int node;
> +
> + if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> + node = cpu_to_node(cpu);
Nit: can you just return cpu_to_node(cpu) directly? That will save 3 LOCs.
> + else
> + node = NUMA_FLAT_NODE;
> +
> + return node;
> +}
> +
> static bool test_and_clear_cpu_idle(int cpu)
> {
> + int node = idle_cpu_to_node(cpu);
> + struct cpumask *idle_cpus = get_idle_cpumask(node);
> +
> #ifdef CONFIG_SCHED_SMT
> /*
> * SMT mask should be cleared whether we can claim @cpu or not. The SMT
> @@ -46,33 +113,37 @@ static bool test_and_clear_cpu_idle(int cpu)
> */
> if (sched_smt_active()) {
> const struct cpumask *smt = cpu_smt_mask(cpu);
> + struct cpumask *idle_smts = get_idle_smtmask(node);
>
> /*
> * If offline, @cpu is not its own sibling and
> * scx_pick_idle_cpu() can get caught in an infinite loop as
> - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> - * is eventually cleared.
> + * @cpu is never cleared from the idle SMT mask. Ensure that
> + * @cpu is eventually cleared.
> *
> * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
> * reduce memory writes, which may help alleviate cache
> * coherence pressure.
> */
> - if (cpumask_intersects(smt, idle_masks.smt))
> - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> - else if (cpumask_test_cpu(cpu, idle_masks.smt))
> - __cpumask_clear_cpu(cpu, idle_masks.smt);
> + if (cpumask_intersects(smt, idle_smts))
> + cpumask_andnot(idle_smts, idle_smts, smt);
> + else if (cpumask_test_cpu(cpu, idle_smts))
> + __cpumask_clear_cpu(cpu, idle_smts);
> }
> #endif
> - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> + return cpumask_test_and_clear_cpu(cpu, idle_cpus);
> }
>
> -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> +/*
> + * Pick an idle CPU in a specific NUMA node.
> + */
> +static s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags)
> {
> int cpu;
>
> retry:
> if (sched_smt_active()) {
> - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> + cpu = cpumask_any_and_distribute(get_idle_smtmask(node), cpus_allowed);
> if (cpu < nr_cpu_ids)
> goto found;
>
> @@ -80,15 +151,57 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> return -EBUSY;
> }
>
> - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> + cpu = cpumask_any_and_distribute(get_idle_cpumask(node), cpus_allowed);
> if (cpu >= nr_cpu_ids)
> return -EBUSY;
>
> found:
> if (test_and_clear_cpu_idle(cpu))
> return cpu;
> - else
> - goto retry;
> + goto retry;
> +}
Yes, I see this too. But to me minimizing your patch and preserving as
much history as you can is more important.
After all, newcomers should have room to practice :)
> +
> +/*
> + * Find the best idle CPU in the system, relative to @node.
> + *
> + * If @node is NUMA_NO_NODE, start from the current node.
> + */
And if you don't invent this rule for kernel users, you don't need to
explain it everywhere.
> +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
> +{
> + nodemask_t hop_nodes = NODE_MASK_NONE;
> + s32 cpu = -EBUSY;
> +
> + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> + return pick_idle_cpu_from_node(cpus_allowed, NUMA_FLAT_NODE, flags);
> +
> + /*
> + * If a NUMA node was not specified, start with the current one.
> + */
> + if (node == NUMA_NO_NODE)
> + node = numa_node_id();
And you enforce it here too...
> +
> + /*
> + * Traverse all nodes in order of increasing distance, starting
> + * from prev_cpu's node.
> + *
> + * This loop is O(N^2), with N being the amount of NUMA nodes,
> + * which might be quite expensive in large NUMA systems. However,
> + * this complexity comes into play only when a scheduler enables
> + * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU
> + * without specifying a target NUMA node, so it shouldn't be a
> + * bottleneck in most cases.
> + *
> + * As a future optimization we may want to cache the list of hop
> + * nodes in a per-node array, instead of actually traversing them
> + * every time.
> + */
> + for_each_numa_hop_node(n, node, hop_nodes, N_POSSIBLE) {
> + cpu = pick_idle_cpu_from_node(cpus_allowed, n, flags);
> + if (cpu >= 0)
> + break;
> + }
> +
> + return cpu;
> }
>
> /*
> @@ -208,7 +321,7 @@ static bool llc_numa_mismatch(void)
> * CPU belongs to a single LLC domain, and that each LLC domain is entirely
> * contained within a single NUMA node.
> */
> -static void update_selcpu_topology(void)
> +static void update_selcpu_topology(struct sched_ext_ops *ops)
> {
> bool enable_llc = false, enable_numa = false;
> unsigned int nr_cpus;
> @@ -298,6 +411,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> {
> const struct cpumask *llc_cpus = NULL;
> const struct cpumask *numa_cpus = NULL;
> + int node = idle_cpu_to_node(prev_cpu);
> s32 cpu;
>
> *found = false;
> @@ -355,9 +469,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * piled up on it even if there is an idle core elsewhere on
> * the system.
> */
> - if (!cpumask_empty(idle_masks.cpu) &&
> - !(current->flags & PF_EXITING) &&
> - cpu_rq(cpu)->scx.local_dsq.nr == 0) {
> + if (!(current->flags & PF_EXITING) &&
> + cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
> + !cpumask_empty(get_idle_cpumask(idle_cpu_to_node(cpu)))) {
> if (cpumask_test_cpu(cpu, p->cpus_ptr))
> goto cpu_found;
> }
> @@ -371,7 +485,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> /*
> * Keep using @prev_cpu if it's part of a fully idle core.
> */
> - if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
> + if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(node)) &&
> test_and_clear_cpu_idle(prev_cpu)) {
> cpu = prev_cpu;
> goto cpu_found;
> @@ -381,7 +495,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * Search for any fully idle core in the same LLC domain.
> */
> if (llc_cpus) {
> - cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
> + cpu = pick_idle_cpu_from_node(llc_cpus, node, SCX_PICK_IDLE_CORE);
> if (cpu >= 0)
> goto cpu_found;
> }
> @@ -390,15 +504,19 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * Search for any fully idle core in the same NUMA node.
> */
> if (numa_cpus) {
> - cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
> + cpu = scx_pick_idle_cpu(numa_cpus, node, SCX_PICK_IDLE_CORE);
> if (cpu >= 0)
> goto cpu_found;
> }
>
> /*
> * Search for any full idle core usable by the task.
> + *
> + * If NUMA aware idle selection is enabled, the search will
> + * begin in prev_cpu's node and proceed to other nodes in
> + * order of increasing distance.
> */
> - cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
> + cpu = scx_pick_idle_cpu(p->cpus_ptr, node, SCX_PICK_IDLE_CORE);
> if (cpu >= 0)
> goto cpu_found;
> }
> @@ -415,7 +533,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * Search for any idle CPU in the same LLC domain.
> */
> if (llc_cpus) {
> - cpu = scx_pick_idle_cpu(llc_cpus, 0);
> + cpu = pick_idle_cpu_from_node(llc_cpus, node, 0);
> if (cpu >= 0)
> goto cpu_found;
> }
> @@ -424,7 +542,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * Search for any idle CPU in the same NUMA node.
> */
> if (numa_cpus) {
> - cpu = scx_pick_idle_cpu(numa_cpus, 0);
> + cpu = pick_idle_cpu_from_node(numa_cpus, node, 0);
> if (cpu >= 0)
> goto cpu_found;
> }
> @@ -432,7 +550,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> /*
> * Search for any idle CPU usable by the task.
> */
> - cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
> + cpu = scx_pick_idle_cpu(p->cpus_ptr, node, 0);
> if (cpu >= 0)
> goto cpu_found;
>
> @@ -448,17 +566,33 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>
> static void reset_idle_masks(void)
> {
> + int node;
> +
> + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> + cpumask_copy(get_idle_cpumask(NUMA_FLAT_NODE), cpu_online_mask);
> + cpumask_copy(get_idle_smtmask(NUMA_FLAT_NODE), cpu_online_mask);
> + return;
> + }
> +
> /*
> * Consider all online cpus idle. Should converge to the actual state
> * quickly.
> */
> - cpumask_copy(idle_masks.cpu, cpu_online_mask);
> - cpumask_copy(idle_masks.smt, cpu_online_mask);
> + for_each_node_state(node, N_POSSIBLE) {
> + const struct cpumask *node_mask = cpumask_of_node(node);
> + struct cpumask *idle_cpu = get_idle_cpumask(node);
> + struct cpumask *idle_smt = get_idle_smtmask(node);
> +
> + cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> + cpumask_copy(idle_smt, idle_cpu);
Tejun asked you to use cpumask_and() in both cases, didn't he?
> + }
> }
>
> void __scx_update_idle(struct rq *rq, bool idle)
> {
> int cpu = cpu_of(rq);
> + int node = idle_cpu_to_node(cpu);
> + struct cpumask *idle_cpu = get_idle_cpumask(node);
>
> if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
> SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
> @@ -466,24 +600,25 @@ void __scx_update_idle(struct rq *rq, bool idle)
> return;
> }
>
> - assign_cpu(cpu, idle_masks.cpu, idle);
> + assign_cpu(cpu, idle_cpu, idle);
>
> #ifdef CONFIG_SCHED_SMT
> if (sched_smt_active()) {
> const struct cpumask *smt = cpu_smt_mask(cpu);
> + struct cpumask *idle_smt = get_idle_smtmask(node);
>
> if (idle) {
> /*
> - * idle_masks.smt handling is racy but that's fine as
> - * it's only for optimization and self-correcting.
> + * idle_smt handling is racy but that's fine as it's
> + * only for optimization and self-correcting.
> */
> for_each_cpu(cpu, smt) {
> - if (!cpumask_test_cpu(cpu, idle_masks.cpu))
> + if (!cpumask_test_cpu(cpu, idle_cpu))
> return;
> }
> - cpumask_or(idle_masks.smt, idle_masks.smt, smt);
> + cpumask_or(idle_smt, idle_smt, smt);
> } else {
> - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> + cpumask_andnot(idle_smt, idle_smt, smt);
> }
> }
> #endif
> @@ -491,8 +626,23 @@ void __scx_update_idle(struct rq *rq, bool idle)
>
> #else /* !CONFIG_SMP */
>
> +static struct cpumask *get_idle_cpumask(int node)
> +{
> + return cpu_none_mask;
> +}
> +
> +static struct cpumask *get_idle_smtmask(int node)
> +{
> + return cpu_none_mask;
> +}
> +
> static bool test_and_clear_cpu_idle(int cpu) { return false; }
> -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; }
> +
> +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
> +{
> + return -EBUSY;
> +}
> +
> static void reset_idle_masks(void) {}
>
> #endif /* CONFIG_SMP */
> @@ -546,11 +696,12 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> if (!check_builtin_idle_enabled())
> return cpu_none_mask;
>
> -#ifdef CONFIG_SMP
> - return idle_masks.cpu;
> -#else
> - return cpu_none_mask;
> -#endif
> + if (static_branch_unlikely(&scx_builtin_idle_per_node)) {
> + scx_ops_error("SCX_OPS_BUILTIN_IDLE_PER_NODE enabled");
> + return cpu_none_mask;
> + }
> +
> + return get_idle_cpumask(NUMA_FLAT_NODE);
> }
>
> /**
> @@ -565,14 +716,15 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
> if (!check_builtin_idle_enabled())
> return cpu_none_mask;
>
> -#ifdef CONFIG_SMP
> + if (static_branch_unlikely(&scx_builtin_idle_per_node)) {
> + scx_ops_error("SCX_OPS_BUILTIN_IDLE_PER_NODE enabled");
> + return cpu_none_mask;
> + }
> +
> if (sched_smt_active())
> - return idle_masks.smt;
> + return get_idle_smtmask(NUMA_FLAT_NODE);
> else
> - return idle_masks.cpu;
> -#else
> - return cpu_none_mask;
> -#endif
> + return get_idle_cpumask(NUMA_FLAT_NODE);
> }
>
> /**
> @@ -635,7 +787,7 @@ __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
> if (!check_builtin_idle_enabled())
> return -EBUSY;
>
> - return scx_pick_idle_cpu(cpus_allowed, flags);
> + return scx_pick_idle_cpu(cpus_allowed, NUMA_NO_NODE, flags);
> }
>
> /**
> @@ -658,7 +810,7 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
> s32 cpu;
>
> if (static_branch_likely(&scx_builtin_idle_enabled)) {
> - cpu = scx_pick_idle_cpu(cpus_allowed, flags);
> + cpu = scx_pick_idle_cpu(cpus_allowed, NUMA_NO_NODE, flags);
> if (cpu >= 0)
> return cpu;
> }
> --
> 2.47.1
* Re: [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
2024-12-23 21:18 ` Yury Norov
@ 2024-12-24 7:54 ` Andrea Righi
2024-12-24 17:33 ` Yury Norov
0 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2024-12-24 7:54 UTC (permalink / raw)
To: Yury Norov
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
Hi Yury,
On Mon, Dec 23, 2024 at 01:18:10PM -0800, Yury Norov wrote:
> On Fri, Dec 20, 2024 at 04:11:33PM +0100, Andrea Righi wrote:
> > Introduce for_each_numa_hop_node() and sched_numa_hop_node() to iterate
> > over node IDs in order of increasing NUMA distance from a given starting
> > node.
> >
> > These iterator functions are similar to for_each_numa_hop_mask() and
> > sched_numa_hop_mask(), but instead of providing a cpumask at each
> > iteration, they provide a node ID.
> >
> > Example usage:
> >
> > nodemask_t hop_nodes = NODE_MASK_NONE;
> > int start = cpu_to_node(smp_processor_id());
> >
> > for_each_numa_hop_node(node, start, hop_nodes, N_ONLINE)
> > pr_info("node (%d, %d) -> %d\n",
> > start, node, node_distance(start, node));
>
> This iterates nodes, not hops. The hop is a set of equidistant nodes,
> and the iterator (the first argument) should be a nodemask. I'm OK with
> that as soon as you find it practical. But then you shouldn't mention
> hops in the patch.
>
> Also, can you check that it works correctly against a configuration with
> equidistant nodes?
Ok, and yes, it should mention nodes, not hops.
>
> > Simulating the following NUMA topology in virtme-ng:
> >
> > $ numactl -H
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1
> > node 0 size: 1006 MB
> > node 0 free: 928 MB
> > node 1 cpus: 2 3
> > node 1 size: 1007 MB
> > node 1 free: 986 MB
> > node 2 cpus: 4 5
> > node 2 size: 889 MB
> > node 2 free: 862 MB
> > node 3 cpus: 6 7
> > node 3 size: 1006 MB
> > node 3 free: 983 MB
> > node distances:
> > node 0 1 2 3
> > 0: 10 51 31 41
> > 1: 51 10 21 61
> > 2: 31 21 10 11
> > 3: 41 61 11 10
> >
> > The output of the example above (on node 0) is the following:
> >
> > [ 84.953644] node (0, 0) -> 10
> > [ 84.953712] node (0, 2) -> 31
> > [ 84.953764] node (0, 3) -> 41
> > [ 84.953817] node (0, 1) -> 51
> >
> > Cc: Yury Norov <yury.norov@gmail.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > include/linux/topology.h | 28 ++++++++++++++++++++++-
> > kernel/sched/topology.c | 49 ++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 76 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index 52f5850730b3..d9014d90580d 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -248,12 +248,18 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
> > #ifdef CONFIG_NUMA
> > int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
> > extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops);
> > -#else
> > +extern int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state);
> > +#else /* !CONFIG_NUMA */
> > static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> > {
> > return cpumask_nth_and(cpu, cpus, cpu_online_mask);
> > }
> >
> > +static inline int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
> > +{
> > + return NUMA_NO_NODE;
> > +}
> > +
> > static inline const struct cpumask *
> > sched_numa_hop_mask(unsigned int node, unsigned int hops)
> > {
> > @@ -261,6 +267,26 @@ sched_numa_hop_mask(unsigned int node, unsigned int hops)
> > }
> > #endif /* CONFIG_NUMA */
> >
> > +/**
> > + * for_each_numa_hop_node - iterate over NUMA nodes at increasing hop distances
> > + * from a given starting node.
> > + * @__node: the iteration variable, representing the current NUMA node.
> > + * @__start: the NUMA node to start the iteration from.
> > + * @__hop_nodes: a nodemask_t to track the visited nodes.
> > + * @__state: state of NUMA nodes to iterate.
> > + *
> > + * Requires rcu_lock to be held.
> > + *
> > + * This macro iterates over NUMA nodes in increasing distance from
> > + * @start_node.
> > + *
> > + * Yields NUMA_NO_NODE when all the nodes have been visited.
> > + */
> > +#define for_each_numa_hop_node(__node, __start, __hop_nodes, __state) \
>
> As soon as this is not the hops iterator, the proper name would be just
>
> for_each_numa_node()
>
> And because the 'numa' prefix here doesn't look like a prefix, I think
> it would be nice to comment what this 'numa' means and what's the
> difference between this and for_each_node() iterator, especially in
> terms of complexity.
How about renaming this iterator for_each_node_by_distance()?
>
> Also you don't need underscores in macro declarations unless
> absolutely necessary.
Ok.
>
> > + for (int __node = __start; \
>
> The __node declared in for() masks out the __node provided in the
> macro.
Ok will fix this.
>
> > + __node != NUMA_NO_NODE; \
> > + __node = sched_numa_hop_node(&(__hop_nodes), __start, __state))
> > +
> > /**
> > * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance
> > * from a given node.
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 9748a4c8d668..8e77c235ad9a 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2185,6 +2185,55 @@ int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> > }
> > EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
> >
> > +/**
> > + * sched_numa_hop_node - Find the NUMA node at the closest hop distance
> > + * from node @start.
> > + *
> > + * @hop_nodes: a pointer to a nodemask_t representing the visited nodes.
> > + * @start: the NUMA node to start the hop search from.
> > + * @state: the node state to filter nodes by.
> > + *
> > + * This function iterates over all NUMA nodes in the given state and
> > + * calculates the hop distance to the starting node. It returns the NUMA
> > + * node that is the closest in terms of hop distance
> > + * that has not already been considered (not set in @hop_nodes). If the
> > + * node is found, it is marked as considered in the @hop_nodes bitmask.
> > + *
> > + * The function checks if the node is not the start node and ensures it is
> > + * not already part of the hop_nodes set. It then computes the distance to
> > + * the start node using the node_distance() function. The closest node is
> > + * chosen based on the minimum distance.
> > + *
> > + * Returns the NUMA node ID closest in terms of hop distance from the
> > + * @start node, or NUMA_NO_NODE if no node is found (or all nodes have been
> > + * visited).
>
> for_each_node_state() returns MAX_NUMNODES when it finishes
> traversing. I think you should do the same here.
Ok.
>
> > + */
> > +int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
> > +{
> > + int dist, n, min_node, min_dist;
> > +
> > + if (state >= NR_NODE_STATES)
> > + return NUMA_NO_NODE;
>
> -EINVAL. But, do we need to check the parameter at all?
numa_nearest_node() has the same check (returning -EINVAL), it seems sane
to do this check here as well to prevent out-of-bounds access to
node_states[state].
>
> > +
> > + min_node = NUMA_NO_NODE;
> > + min_dist = INT_MAX;
> > +
> > + for_each_node_state(n, state) {
> > + if (n == start || node_isset(n, *hop_nodes))
> > + continue;
> > + dist = node_distance(start, n);
> > + if (dist < min_dist) {
> > + min_dist = dist;
> > + min_node = n;
> > + }
> > + }
>
> This is a version of numa_nearest_node(). The only difference is that
> you add 'hop_nodes' mask, which in fact is 'visited' nodes.
>
> I think it should be like:
>
> int numa_nearest_unvisited_node(nodemask_t *visited, int start, unsigned int state)
> {
> for_each_node_andnot(node_states[state], visited) (
> ...
> }
> }
Makes sense. I'll change this and at this point I'll move this code to
mm/mempolicy.c, close to numa_nearest_node().
Thanks!
-Andrea
* Re: [PATCH 07/10] sched_ext: Introduce per-node idle cpumasks
2024-12-24 4:05 ` Yury Norov
@ 2024-12-24 8:18 ` Andrea Righi
2024-12-24 17:59 ` Yury Norov
0 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2024-12-24 8:18 UTC (permalink / raw)
To: Yury Norov
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Mon, Dec 23, 2024 at 08:05:53PM -0800, Yury Norov wrote:
> On Fri, Dec 20, 2024 at 04:11:39PM +0100, Andrea Righi wrote:
> > Using a single global idle mask can lead to inefficiencies and a lot of
> > stress on the cache coherency protocol on large systems with multiple
> > NUMA nodes, since all the CPUs can create a really intense read/write
> > activity on the single global cpumask.
> >
> > Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> > to improve scalability and performance on large systems.
> >
> > The concept is that each cpumask will track only the idle CPUs within
> > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> > In this way concurrent access to the idle cpumask will be restricted
> > within each NUMA node.
> >
> > NOTE: if a scheduler enables the per-node idle cpumasks (via
> > SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
> > trigger an scx error, since there are no system-wide cpumasks.
> >
> > By default (when SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled), only the
> > cpumask of node 0 is used as a single global flat CPU mask, maintaining
> > the previous behavior.
> >
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
>
> This is a rather big patch... Can you split it somehow? Maybe
> introduce new functions in a separate patch, and use them in the
> following patch(es)?
Yes, it's quite big unfortunately. I've tried to split it more, but it's
not trivial to have self-consistent changes that don't break things or
introduce unused functions...
>
> > ---
> > kernel/sched/ext.c | 7 +-
> > kernel/sched/ext_idle.c | 258 +++++++++++++++++++++++++++++++---------
> > 2 files changed, 208 insertions(+), 57 deletions(-)
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 148ec04d4a0a..143938e935f1 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -3228,7 +3228,7 @@ static void handle_hotplug(struct rq *rq, bool online)
> > atomic_long_inc(&scx_hotplug_seq);
> >
> > if (scx_enabled())
> > - update_selcpu_topology();
> > + update_selcpu_topology(&scx_ops);
> >
> > if (online && SCX_HAS_OP(cpu_online))
> > SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu);
> > @@ -5107,7 +5107,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
> >
> > check_hotplug_seq(ops);
> > #ifdef CONFIG_SMP
> > - update_selcpu_topology();
> > + update_selcpu_topology(ops);
> > #endif
> > cpus_read_unlock();
> >
> > @@ -5800,8 +5800,7 @@ void __init init_sched_ext_class(void)
> >
> > BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
> > #ifdef CONFIG_SMP
> > - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
> > - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
> > + idle_masks_init();
> > #endif
> > scx_kick_cpus_pnt_seqs =
> > __alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
> > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> > index 4952e2793304..444f2a15f1d4 100644
> > --- a/kernel/sched/ext_idle.c
> > +++ b/kernel/sched/ext_idle.c
> > @@ -10,7 +10,14 @@
> > * Copyright (c) 2024 Andrea Righi <arighi@nvidia.com>
> > */
> >
> > +/*
> > + * If NUMA awareness is disabled consider only node 0 as a single global
> > + * NUMA node.
> > + */
> > +#define NUMA_FLAT_NODE 0
>
> If it's a global idle node maybe
>
> #define GLOBAL_IDLE_NODE 0
>
> This actually bypasses NUMA, so it's weird to mention NUMA here.
Ok.
>
> > +
> > static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
> > +static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
> >
> > static bool check_builtin_idle_enabled(void)
> > {
> > @@ -22,22 +29,82 @@ static bool check_builtin_idle_enabled(void)
> > }
> >
> > #ifdef CONFIG_SMP
> > -#ifdef CONFIG_CPUMASK_OFFSTACK
> > -#define CL_ALIGNED_IF_ONSTACK
> > -#else
> > -#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
> > -#endif
> > -
> > -static struct {
> > +struct idle_cpumask {
> > cpumask_var_t cpu;
> > cpumask_var_t smt;
> > -} idle_masks CL_ALIGNED_IF_ONSTACK;
> > +};
>
> We already have struct cpumask, and this struct idle_cpumask may
> mislead. Maybe struct idle_cpus or something?
Ok, I like struct idle_cpus.
>
> > +
> > +/*
> > + * cpumasks to track idle CPUs within each NUMA node.
> > + *
> > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> > + * from node 0 is used to track all idle CPUs system-wide.
> > + */
> > +static struct idle_cpumask **scx_idle_masks;
> > +
> > +static struct idle_cpumask *get_idle_mask(int node)
>
> Didn't we agree to drop this 'get' thing?
Hm... no? :)
The analogy you pointed out was with get_parity8(), which implements a pure
function, so we should just use parity8(), as the "get" is implicit.
This case is a bit different IMHO, because it's getting a reference to the
object (so not a pure function) and we may even have a put_idle_mask()
potentially.
Personally I like to have the "get" here, but if you think it's confusing
or it makes the code less readable I'm ok to drop it.
>
> > +{
> > + if (node == NUMA_NO_NODE)
> > + node = numa_node_id();
> > + else if (WARN_ON_ONCE(node < 0 || node >= nr_node_ids))
> > + return NULL;
>
> Kernel users always provide correct parameters. I don't even think you
> need to check for NO_NODE, because if I as user of your API need to
> provide current node, I can use numa_node_id() just as well.
>
> If you drop all that sanity bloating, your function will be a
> one-liner, and the question is: do you need it at all?
>
> We usually need such wrappers to apply 'const' qualifier or do some
> housekeeping before dereferencing. But in this case you just return
> a pointer, and I don't understand why local users can't do it
> themself.
>
> The following idle_mask_init() happily ignores just added accessor...
Ok, makes sense. I'll drop the sanity checks here.
>
> > + return scx_idle_masks[node];
> > +}
>
> > +
> > +static struct cpumask *get_idle_cpumask(int node)
> > +{
> > + struct idle_cpumask *mask = get_idle_mask(node);
> > +
> > + return mask ? mask->cpu : cpu_none_mask;
> > +}
> > +
> > +static struct cpumask *get_idle_smtmask(int node)
> > +{
> > + struct idle_cpumask *mask = get_idle_mask(node);
> > +
> > + return mask ? mask->smt : cpu_none_mask;
> > +}
>
> For those two guys... I think you agreed with Tejun that you don't
> need them. To me the following is more verbose:
>
> idle_cpus(node)->smt;
Ok.
>
> > +
> > +static void idle_masks_init(void)
> > +{
> > + int node;
> > +
> > + scx_idle_masks = kcalloc(num_possible_nodes(), sizeof(*scx_idle_masks), GFP_KERNEL);
> > + BUG_ON(!scx_idle_masks);
> > +
> > + for_each_node_state(node, N_POSSIBLE) {
> > + scx_idle_masks[node] = kzalloc_node(sizeof(**scx_idle_masks), GFP_KERNEL, node);
> > + BUG_ON(!scx_idle_masks[node]);
> > +
> > + BUG_ON(!alloc_cpumask_var_node(&scx_idle_masks[node]->cpu, GFP_KERNEL, node));
> > + BUG_ON(!alloc_cpumask_var_node(&scx_idle_masks[node]->smt, GFP_KERNEL, node));
> > + }
> > +}
> >
> > static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
> > static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
> >
> > +/*
> > + * Return the node id associated to a target idle CPU (used to determine
> > + * the proper idle cpumask).
> > + */
> > +static int idle_cpu_to_node(int cpu)
> > +{
> > + int node;
> > +
> > + if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > + node = cpu_to_node(cpu);
>
> Nit: can you just return cpu_to_node(cpu). This will save 3 LOCs
Ok.
>
> > + else
> > + node = NUMA_FLAT_NODE;
> > +
> > + return node;
> > +}
> > +
> > static bool test_and_clear_cpu_idle(int cpu)
> > {
> > + int node = idle_cpu_to_node(cpu);
> > + struct cpumask *idle_cpus = get_idle_cpumask(node);
> > +
> > #ifdef CONFIG_SCHED_SMT
> > /*
> > * SMT mask should be cleared whether we can claim @cpu or not. The SMT
> > @@ -46,33 +113,37 @@ static bool test_and_clear_cpu_idle(int cpu)
> > */
> > if (sched_smt_active()) {
> > const struct cpumask *smt = cpu_smt_mask(cpu);
> > + struct cpumask *idle_smts = get_idle_smtmask(node);
> >
> > /*
> > * If offline, @cpu is not its own sibling and
> > * scx_pick_idle_cpu() can get caught in an infinite loop as
> > - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> > - * is eventually cleared.
> > + * @cpu is never cleared from the idle SMT mask. Ensure that
> > + * @cpu is eventually cleared.
> > *
> > * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
> > * reduce memory writes, which may help alleviate cache
> > * coherence pressure.
> > */
> > - if (cpumask_intersects(smt, idle_masks.smt))
> > - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> > - else if (cpumask_test_cpu(cpu, idle_masks.smt))
> > - __cpumask_clear_cpu(cpu, idle_masks.smt);
> > + if (cpumask_intersects(smt, idle_smts))
> > + cpumask_andnot(idle_smts, idle_smts, smt);
> > + else if (cpumask_test_cpu(cpu, idle_smts))
> > + __cpumask_clear_cpu(cpu, idle_smts);
> > }
> > #endif
> > - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> > + return cpumask_test_and_clear_cpu(cpu, idle_cpus);
> > }
> >
> > -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> > +/*
> > + * Pick an idle CPU in a specific NUMA node.
> > + */
> > +static s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags)
> > {
> > int cpu;
> >
> > retry:
> > if (sched_smt_active()) {
> > - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> > + cpu = cpumask_any_and_distribute(get_idle_smtmask(node), cpus_allowed);
> > if (cpu < nr_cpu_ids)
> > goto found;
> >
> > @@ -80,15 +151,57 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> > return -EBUSY;
> > }
> >
> > - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> > + cpu = cpumask_any_and_distribute(get_idle_cpumask(node), cpus_allowed);
> > if (cpu >= nr_cpu_ids)
> > return -EBUSY;
> >
> > found:
> > if (test_and_clear_cpu_idle(cpu))
> > return cpu;
> > - else
> > - goto retry;
> > + goto retry;
> > +}
>
> Yes, I see this too. But to me minimizing your patch and preserving as
> much history as you can is more important.
>
> After all, newcomers should have a room to practice :)
checkpatch.pl is bugging me. :)
But ok, makes sense, I'll ignore that for now.
>
> > +
> > +/*
> > + * Find the best idle CPU in the system, relative to @node.
> > + *
> > + * If @node is NUMA_NO_NODE, start from the current node.
> > + */
>
> And if you don't invent this rule for kernel users, you don't need to
> explain it everywhere.
I think we mentioned treating NUMA_NO_NODE as current node, but I might
have misunderstood. In an earlier patch set I was just ignoring
NUMA_NO_NODE. Should we return an error instead?
>
> > +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
> > +{
> > + nodemask_t hop_nodes = NODE_MASK_NONE;
> > + s32 cpu = -EBUSY;
> > +
> > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > + return pick_idle_cpu_from_node(cpus_allowed, NUMA_FLAT_NODE, flags);
> > +
> > + /*
> > + * If a NUMA node was not specified, start with the current one.
> > + */
> > + if (node == NUMA_NO_NODE)
> > + node = numa_node_id();
>
> And enforce too...
>
> > +
> > + /*
> > + * Traverse all nodes in order of increasing distance, starting
> > + * from prev_cpu's node.
> > + *
> > + * This loop is O(N^2), with N being the amount of NUMA nodes,
> > + * which might be quite expensive in large NUMA systems. However,
> > + * this complexity comes into play only when a scheduler enables
> > + * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU
> > + * without specifying a target NUMA node, so it shouldn't be a
> > + * bottleneck in most cases.
> > + *
> > + * As a future optimization we may want to cache the list of hop
> > + * nodes in a per-node array, instead of actually traversing them
> > + * every time.
> > + */
> > + for_each_numa_hop_node(n, node, hop_nodes, N_POSSIBLE) {
> > + cpu = pick_idle_cpu_from_node(cpus_allowed, n, flags);
> > + if (cpu >= 0)
> > + break;
> > + }
> > +
> > + return cpu;
> > }
> >
> > /*
> > @@ -208,7 +321,7 @@ static bool llc_numa_mismatch(void)
> > * CPU belongs to a single LLC domain, and that each LLC domain is entirely
> > * contained within a single NUMA node.
> > */
> > -static void update_selcpu_topology(void)
> > +static void update_selcpu_topology(struct sched_ext_ops *ops)
> > {
> > bool enable_llc = false, enable_numa = false;
> > unsigned int nr_cpus;
> > @@ -298,6 +411,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > {
> > const struct cpumask *llc_cpus = NULL;
> > const struct cpumask *numa_cpus = NULL;
> > + int node = idle_cpu_to_node(prev_cpu);
> > s32 cpu;
> >
> > *found = false;
> > @@ -355,9 +469,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > * piled up on it even if there is an idle core elsewhere on
> > * the system.
> > */
> > - if (!cpumask_empty(idle_masks.cpu) &&
> > - !(current->flags & PF_EXITING) &&
> > - cpu_rq(cpu)->scx.local_dsq.nr == 0) {
> > + if (!(current->flags & PF_EXITING) &&
> > + cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
> > + !cpumask_empty(get_idle_cpumask(idle_cpu_to_node(cpu)))) {
> > if (cpumask_test_cpu(cpu, p->cpus_ptr))
> > goto cpu_found;
> > }
> > @@ -371,7 +485,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > /*
> > * Keep using @prev_cpu if it's part of a fully idle core.
> > */
> > - if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
> > + if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(node)) &&
> > test_and_clear_cpu_idle(prev_cpu)) {
> > cpu = prev_cpu;
> > goto cpu_found;
> > @@ -381,7 +495,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > * Search for any fully idle core in the same LLC domain.
> > */
> > if (llc_cpus) {
> > - cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
> > + cpu = pick_idle_cpu_from_node(llc_cpus, node, SCX_PICK_IDLE_CORE);
> > if (cpu >= 0)
> > goto cpu_found;
> > }
> > @@ -390,15 +504,19 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > * Search for any fully idle core in the same NUMA node.
> > */
> > if (numa_cpus) {
> > - cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
> > + cpu = scx_pick_idle_cpu(numa_cpus, node, SCX_PICK_IDLE_CORE);
> > if (cpu >= 0)
> > goto cpu_found;
> > }
> >
> > /*
> > * Search for any full idle core usable by the task.
> > + *
> > + * If NUMA aware idle selection is enabled, the search will
> > + * begin in prev_cpu's node and proceed to other nodes in
> > + * order of increasing distance.
> > */
> > - cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
> > + cpu = scx_pick_idle_cpu(p->cpus_ptr, node, SCX_PICK_IDLE_CORE);
> > if (cpu >= 0)
> > goto cpu_found;
> > }
> > @@ -415,7 +533,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > * Search for any idle CPU in the same LLC domain.
> > */
> > if (llc_cpus) {
> > - cpu = scx_pick_idle_cpu(llc_cpus, 0);
> > + cpu = pick_idle_cpu_from_node(llc_cpus, node, 0);
> > if (cpu >= 0)
> > goto cpu_found;
> > }
> > @@ -424,7 +542,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > * Search for any idle CPU in the same NUMA node.
> > */
> > if (numa_cpus) {
> > - cpu = scx_pick_idle_cpu(numa_cpus, 0);
> > + cpu = pick_idle_cpu_from_node(numa_cpus, node, 0);
> > if (cpu >= 0)
> > goto cpu_found;
> > }
> > @@ -432,7 +550,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > /*
> > * Search for any idle CPU usable by the task.
> > */
> > - cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
> > + cpu = scx_pick_idle_cpu(p->cpus_ptr, node, 0);
> > if (cpu >= 0)
> > goto cpu_found;
> >
> > @@ -448,17 +566,33 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> >
> > static void reset_idle_masks(void)
> > {
> > + int node;
> > +
> > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> > + cpumask_copy(get_idle_cpumask(NUMA_FLAT_NODE), cpu_online_mask);
> > + cpumask_copy(get_idle_smtmask(NUMA_FLAT_NODE), cpu_online_mask);
> > + return;
> > + }
> > +
> > /*
> > * Consider all online cpus idle. Should converge to the actual state
> > * quickly.
> > */
> > - cpumask_copy(idle_masks.cpu, cpu_online_mask);
> > - cpumask_copy(idle_masks.smt, cpu_online_mask);
> > + for_each_node_state(node, N_POSSIBLE) {
> > + const struct cpumask *node_mask = cpumask_of_node(node);
> > + struct cpumask *idle_cpu = get_idle_cpumask(node);
> > + struct cpumask *idle_smt = get_idle_smtmask(node);
> > +
> > + cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> > + cpumask_copy(idle_smt, idle_cpu);
>
> Tejun asked you to use cpumask_and() in both cases, didn't he?
Ah! My bad, I missed this change during the last rebase. Will fix it.
-Andrea
* Re: [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE
2024-12-24 2:48 ` Yury Norov
2024-12-24 3:53 ` Yury Norov
@ 2024-12-24 8:22 ` Andrea Righi
2024-12-24 21:29 ` Tejun Heo
1 sibling, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2024-12-24 8:22 UTC (permalink / raw)
To: Yury Norov
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Mon, Dec 23, 2024 at 06:48:45PM -0800, Yury Norov wrote:
> On Fri, Dec 20, 2024 at 04:11:40PM +0100, Andrea Righi wrote:
> > Introduce a flag to restrict the selection of an idle CPU to a specific
> > NUMA node.
> >
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/ext.c | 1 +
> > kernel/sched/ext_idle.c | 11 +++++++++--
> > 2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 143938e935f1..da5c15bd3c56 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -773,6 +773,7 @@ enum scx_deq_flags {
> >
> > enum scx_pick_idle_cpu_flags {
> > SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */
> > + SCX_PICK_IDLE_NODE = 1LLU << 1, /* pick a CPU in the same target NUMA node */
>
> SCX_FORCE_NODE or SCX_FIX_NODE?
Ok, I like SCX_FORCE_NODE.
>
> > };
> >
> > enum scx_kick_flags {
> > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> > index 444f2a15f1d4..013deaa08f12 100644
> > --- a/kernel/sched/ext_idle.c
> > +++ b/kernel/sched/ext_idle.c
> > @@ -199,6 +199,12 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 f
> > cpu = pick_idle_cpu_from_node(cpus_allowed, n, flags);
> > if (cpu >= 0)
> > break;
> > + /*
> > + * Check if the search is restricted to the same core or
> > + * the same node.
> > + */
> > + if (flags & SCX_PICK_IDLE_NODE)
> > + break;
>
> Yeah, if you will give a better name for the flag, you'll not have to
> comment the code.
>
> > }
> >
> > return cpu;
> > @@ -495,7 +501,8 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > * Search for any fully idle core in the same LLC domain.
> > */
> > if (llc_cpus) {
> > - cpu = pick_idle_cpu_from_node(llc_cpus, node, SCX_PICK_IDLE_CORE);
> > + cpu = scx_pick_idle_cpu(llc_cpus, node,
> > + SCX_PICK_IDLE_CORE | SCX_PICK_IDLE_NODE);
>
> You change it from scx_pick_idle_cpu() to pick_idle_cpu_from_node()
> in patch 7 just to revert it back in patch 8...
>
> You can use scx_pick_idle_cpu() in patch 7 already because
> scx_builtin_idle_per_node is always disabled, and you always
> follow the NUMA_FLAT_NODE path. Here you will just add the
> SCX_PICK_IDLE_NODE flag.
>
> That's the point of separating functionality and control patches. In
> patch 7 you may need to mention explicitly that your new per-node
> idle masks are unconditionally disabled, and will be enabled in the
> last patch of the series, so some following patches will detail the
> implementation.
Ok.
Thanks,
-Andrea
* Re: [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE
2024-12-24 3:53 ` Yury Norov
@ 2024-12-24 8:37 ` Andrea Righi
2024-12-24 18:15 ` Yury Norov
0 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2024-12-24 8:37 UTC (permalink / raw)
To: Yury Norov
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Mon, Dec 23, 2024 at 07:53:21PM -0800, Yury Norov wrote:
> On Mon, Dec 23, 2024 at 06:48:48PM -0800, Yury Norov wrote:
> > On Fri, Dec 20, 2024 at 04:11:40PM +0100, Andrea Righi wrote:
> > > Introduce a flag to restrict the selection of an idle CPU to a specific
> > > NUMA node.
> > >
> > > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > > ---
> > > kernel/sched/ext.c | 1 +
> > > kernel/sched/ext_idle.c | 11 +++++++++--
> > > 2 files changed, 10 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > > index 143938e935f1..da5c15bd3c56 100644
> > > --- a/kernel/sched/ext.c
> > > +++ b/kernel/sched/ext.c
> > > @@ -773,6 +773,7 @@ enum scx_deq_flags {
> > >
> > > enum scx_pick_idle_cpu_flags {
> > > SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */
> > > + SCX_PICK_IDLE_NODE = 1LLU << 1, /* pick a CPU in the same target NUMA node */
> >
> > SCX_FORCE_NODE or SCX_FIX_NODE?
> >
> > > };
> > >
> > > enum scx_kick_flags {
> > > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> > > index 444f2a15f1d4..013deaa08f12 100644
> > > --- a/kernel/sched/ext_idle.c
> > > +++ b/kernel/sched/ext_idle.c
> > > @@ -199,6 +199,12 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 f
>
> This function begins with:
>
> static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
> {
> nodemask_t hop_nodes = NODE_MASK_NONE;
> s32 cpu = -EBUSY;
>
> if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> return pick_idle_cpu_from_node(cpus_allowed, NUMA_FLAT_NODE, flags);
>
> ...
>
> So if I disable scx_builtin_idle_per_node and then call:
>
> scx_pick_idle_cpu(some_cpus, numa_node_id(), SCX_PICK_IDLE_NODE)
>
> I may get a CPU from any non-local node, right? I think we need to honor user's
> request:
>
> if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> return pick_idle_cpu_from_node(cpus_allowed,
> flags & SCX_PICK_IDLE_NODE ? node : NUMA_FLAT_NODE, flags);
>
> That way the code will be coherent: if you enable idle cpumasks, you
> will be able to follow all the NUMA hierarchy. If you disable them, at
> least you honor user's request to return a CPU from a given node, if
> he's very explicit about his intention.
>
> You can be even nicer:
>
> if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> node = pick_idle_cpu_from_node(cpus, node, flags);
> if (node == MAX_NUM_NODES && flags & SCX_PICK_IDLE_NODE == 0)
> node = pick_idle_cpu_from_node(cpus, NUMA_FLAT_NODE, flags);
>
> return node;
> }
>
Sorry, I'm not following: if scx_builtin_idle_per_node is disabled, we're
only tracking idle CPUs in the single NUMA_FLAT_NODE mask (node 0). All the
other cpumasks are just empty, so we would always return -EBUSY if we
honored the user's request.
Maybe we should just return an error if scx_builtin_idle_per_node is
disabled and the user is requesting an idle CPU in a specific node?
-Andrea
* Re: [PATCH 09/10] sched_ext: idle: Get rid of the scx_selcpu_topo_numa logic
2024-12-23 23:39 ` Yury Norov
@ 2024-12-24 8:58 ` Andrea Righi
0 siblings, 0 replies; 30+ messages in thread
From: Andrea Righi @ 2024-12-24 8:58 UTC (permalink / raw)
To: Yury Norov
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Mon, Dec 23, 2024 at 03:39:56PM -0800, Yury Norov wrote:
> On Fri, Dec 20, 2024 at 04:11:41PM +0100, Andrea Righi wrote:
> > With the introduction of separate per-NUMA node cpumasks, we
> > automatically track idle CPUs within each NUMA node.
> >
> > This makes the special logic for determining idle CPUs in each NUMA node
> > redundant and unnecessary, so we can get rid of it.
>
> But it looks like you do more than that...
>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/ext_idle.c | 93 ++++++++++-------------------------------
> > 1 file changed, 23 insertions(+), 70 deletions(-)
> >
> > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> > index 013deaa08f12..b36e93da1b75 100644
> > --- a/kernel/sched/ext_idle.c
> > +++ b/kernel/sched/ext_idle.c
> > @@ -82,7 +82,6 @@ static void idle_masks_init(void)
> > }
> >
> > static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
> > -static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
> >
> > /*
> > * Return the node id associated to a target idle CPU (used to determine
> > @@ -259,25 +258,6 @@ static unsigned int numa_weight(s32 cpu)
> > return sg->group_weight;
> > }
> >
> > -/*
> > - * Return the cpumask representing the NUMA domain of @cpu (or NULL if the NUMA
> > - * domain is not defined).
> > - */
> > -static struct cpumask *numa_span(s32 cpu)
> > -{
> > - struct sched_domain *sd;
> > - struct sched_group *sg;
> > -
> > - sd = rcu_dereference(per_cpu(sd_numa, cpu));
> > - if (!sd)
> > - return NULL;
> > - sg = sd->groups;
> > - if (!sg)
> > - return NULL;
> > -
> > - return sched_group_span(sg);
>
> I didn't find llc_span() and node_span() in vanilla kernel. Does this series
> have prerequisites?
This patch set is based on the sched_ext/for-6.14 branch:
https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/
I put sched_ext/for-6.14 in the cover email, but maybe that wasn't very
clear; I should have mentioned the git repo explicitly in the email.
>
> > -}
> > -
> > /*
> > * Return true if the LLC domains do not perfectly overlap with the NUMA
> > * domains, false otherwise.
> > @@ -329,7 +309,7 @@ static bool llc_numa_mismatch(void)
> > */
> > static void update_selcpu_topology(struct sched_ext_ops *ops)
> > {
> > - bool enable_llc = false, enable_numa = false;
> > + bool enable_llc = false;
> > unsigned int nr_cpus;
> > s32 cpu = cpumask_first(cpu_online_mask);
> >
> > @@ -348,41 +328,34 @@ static void update_selcpu_topology(struct sched_ext_ops *ops)
> > if (nr_cpus > 0) {
> > if (nr_cpus < num_online_cpus())
> > enable_llc = true;
> > + /*
> > + * No need to enable LLC optimization if the LLC domains are
> > + * perfectly overlapping with the NUMA domains when per-node
> > + * cpumasks are enabled.
> > + */
> > + if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
> > + !llc_numa_mismatch())
> > + enable_llc = false;
>
> This doesn't sound like redundancy removal. I may be wrong, but this
> looks like a sort of optimization. If so, it deserves to be a separate
> patch.
So, the initial idea was to replace the current NUMA awareness logic with
the per-node cpumasks.
But in fact, we're doing this change:
- before:
- NUMA-awareness logic implicitly enabled if the node CPUs don't overlap
with LLC CPUs (as it would be redundant)
- after :
- NUMA-awareness logic explicitly enabled when the scx scheduler sets
SCX_OPS_BUILTIN_IDLE_PER_NODE in .flags (and in this case implicitly
disable LLC awareness if the node/llc CPUs are overlapping)
Maybe a better approach would be to keep the old NUMA/LLC logic exactly as
it is in sched_ext/for-6.14 if SCX_OPS_BUILTIN_IDLE_PER_NODE is not
specified, otherwise use the new logic (and implicitly disable
scx_selcpu_topo_numa).
In this way this "removal" patch would only implement the logic to disable
scx_selcpu_topo_numa when SCX_OPS_BUILTIN_IDLE_PER_NODE is used.
-Andrea
* Re: [PATCH 10/10] sched_ext: idle: Introduce NUMA aware idle cpu kfunc helpers
2024-12-24 0:57 ` Yury Norov
@ 2024-12-24 9:32 ` Andrea Righi
0 siblings, 0 replies; 30+ messages in thread
From: Andrea Righi @ 2024-12-24 9:32 UTC (permalink / raw)
To: Yury Norov
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Mon, Dec 23, 2024 at 04:57:36PM -0800, Yury Norov wrote:
> On Fri, Dec 20, 2024 at 04:11:42PM +0100, Andrea Righi wrote:
> > Add the following kfunc's to provide scx schedulers direct access to
> > per-node idle cpumasks information:
> >
> > const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> > const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> > s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
> > int node, u64 flags)
> > int scx_bpf_cpu_to_node(s32 cpu)
> >
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/ext_idle.c | 163 ++++++++++++++++++++---
> > tools/sched_ext/include/scx/common.bpf.h | 4 +
> > tools/sched_ext/include/scx/compat.bpf.h | 19 +++
> > 3 files changed, 170 insertions(+), 16 deletions(-)
> >
> > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> > index b36e93da1b75..0f8ccc1e290e 100644
> > --- a/kernel/sched/ext_idle.c
> > +++ b/kernel/sched/ext_idle.c
> > @@ -28,6 +28,60 @@ static bool check_builtin_idle_enabled(void)
> > return false;
> > }
> >
> > +static bool check_builtin_idle_per_node_enabled(void)
> > +{
> > + if (static_branch_likely(&scx_builtin_idle_per_node))
> > + return true;
>
> return 0;
>
> > +
> > + scx_ops_error("per-node idle tracking is disabled");
> > + return false;
>
> return -ENOTSUP;
Ok.
>
> > +}
> > +
> > +/*
> > + * Validate and resolve a NUMA node.
> > + *
> > + * Return the resolved node ID on success or a negative value otherwise.
> > + */
> > +static int validate_node(int node)
> > +{
> > + if (!check_builtin_idle_per_node_enabled())
> > + return -EINVAL;
>
> So the node may be valid, but this validator may fail. EINVAL is a
> misleading error code for that. You need ENOTSUP.
Ok.
>
> > +
> > + /* If no node is specified, use the current one */
> > + if (node == NUMA_NO_NODE)
> > + return numa_node_id();
> > +
> > + /* Make sure node is in a valid range */
> > + if (node < 0 || node >= nr_node_ids) {
> > + scx_ops_error("invalid node %d", node);
> > + return -ENOENT;
>
> No such file or directory? Hmm...
>
> This should be EINVAL. I would join this one with node_possible()
> check. We probably need bpf_node_possible() or something...
Ok about EINVAL.
About bpf_node_possible() I'm not sure, it'd be convenient to have a kfunc
for the BPF code to validate a node, but then we may also need to introduce
bpf_node_online(), or even bpf_node_state(), ...?
This can be probably addressed in a separate patch.
>
> > + }
> > +
> > + /* Make sure the node is part of the set of possible nodes */
> > + if (!node_possible(node)) {
> > + scx_ops_error("unavailable node %d", node);
>
> Not that it's unavailable. It just doesn't exist... I'd say:
>
> scx_ops_error("Non-existing node %d. The existing nodes are: %pbl",
> node, nodemask_pr_args(node_states[N_POSSIBLE]));
>
> > + return -EINVAL;
> > + }
>
> What if user provides offline or cpu-less nodes? Is that a normal usage?
> If not, it would be nice to print warning, or even return an error...
I think we're returning -EBUSY in this case, which might be a reasonable
error already. Triggering an scx_ops_error() seems a bit too aggressive.
>
> > +
> > + return node;
> > +}
> > +
> > +/*
> > + * Return the node id associated to a target idle CPU (used to determine
> > + * the proper idle cpumask).
> > + */
> > +static int idle_cpu_to_node(int cpu)
> > +{
> > + int node;
> > +
> > + if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > + node = cpu_to_node(cpu);
> > + else
> > + node = NUMA_FLAT_NODE;
> > +
> > + return node;
> > +}
> > +
> > #ifdef CONFIG_SMP
> > struct idle_cpumask {
> > cpumask_var_t cpu;
> > @@ -83,22 +137,6 @@ static void idle_masks_init(void)
> >
> > static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
> >
> > -/*
> > - * Return the node id associated to a target idle CPU (used to determine
> > - * the proper idle cpumask).
> > - */
> > -static int idle_cpu_to_node(int cpu)
> > -{
> > - int node;
> > -
> > - if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > - node = cpu_to_node(cpu);
> > - else
> > - node = NUMA_FLAT_NODE;
> > -
> > - return node;
> > -}
> > -
> > static bool test_and_clear_cpu_idle(int cpu)
> > {
> > int node = idle_cpu_to_node(cpu);
> > @@ -613,6 +651,17 @@ static void reset_idle_masks(void) {}
> > */
> > __bpf_kfunc_start_defs();
> >
> > +/**
> > + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to
> > + */
> > +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu)
> > +{
> > + if (cpu < 0 || cpu >= nr_cpu_ids)
> > + return -EINVAL;
> > +
> > + return idle_cpu_to_node(cpu);
> > +}
> > +
> > /**
> > * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
> > * @p: task_struct to select a CPU for
> > @@ -645,6 +694,28 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > return prev_cpu;
> > }
> >
> > +/**
> > + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
> > + * per-CPU cpumask of a target NUMA node.
> > + *
> > + * NUMA_NO_NODE is interpreted as the current node.
> > + *
> > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> > + * valid, or running on a UP kernel.
> > + */
> > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
> > +{
> > + node = validate_node(node);
> > + if (node < 0)
> > + return cpu_none_mask;
>
> I think I commented this in v7. This simply hides an error. You need to
> return ERR_PTR(node). And your user should check it with IS_ERR_VALUE().
>
> This should be consistent with scx_bpf_pick_idle_cpu_node(), where you
> return an actual error.
I think I changed it... somewhere, but it looks like I missed this part. :)
Will change this as well, thanks!
>
> > +
> > +#ifdef CONFIG_SMP
> > + return get_idle_cpumask(node);
> > +#else
> > + return cpu_none_mask;
> > +#endif
> > +}
> > +
> > /**
> > * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> > * per-CPU cpumask.
> > @@ -664,6 +735,32 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> > return get_idle_cpumask(NUMA_FLAT_NODE);
> > }
> >
> > +/**
> > + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
> > + * per-physical-core cpumask of a target NUMA node. Can be used to determine
> > + * if an entire physical core is free.
>
> If it goes to DOCs, it should have parameters section.
Ok.
>
> > + *
> > + * NUMA_NO_NODE is interpreted as the current node.
> > + *
> > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> > + * valid, or running on a UP kernel.
> > + */
> > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> > +{
> > + node = validate_node(node);
> > + if (node < 0)
> > + return cpu_none_mask;
> > +
> > +#ifdef CONFIG_SMP
> > + if (sched_smt_active())
> > + return get_idle_smtmask(node);
> > + else
> > + return get_idle_cpumask(node);
> > +#else
> > + return cpu_none_mask;
> > +#endif
> > +}
> > +
> > /**
> > * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
> > * per-physical-core cpumask. Can be used to determine if an entire physical
> > @@ -722,6 +819,36 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
> > return false;
> > }
> >
> > +/**
> > + * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
> > + * @cpus_allowed: Allowed cpumask
> > + * @node: target NUMA node
> > + * @flags: %SCX_PICK_IDLE_CPU_* flags
> > + *
> > + * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
> > + * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu
> > + * was found.
>
> validate_node() returns more errors.
>
> > + *
> > + * If @node is NUMA_NO_NODE, the search is restricted to the current NUMA
> > + * node. Otherwise, the search starts from @node and proceeds to other
> > + * online NUMA nodes in order of increasing distance (unless
> > + * SCX_PICK_IDLE_NODE is specified, in which case the search is limited to
> > + * the target @node).
>
> Can you reorder statements, like:
>
> Restricted to current node if NUMA_NO_NODE.
> Restricted to @node if SCX_PICK_IDLE_NODE is specified
> Otherwise ...
>
> What if NUMA_NO_NODE + SCX_PICK_IDLE_NODE? Seems to be OK, but looks
> redundant and non-intuitive. Why not if NUMA_NO_NODE provided, start
> from current node, but not restrict with it?
The more I think about the NUMA_NO_NODE behavior, the more I'm convinced we
should just return -EBUSY (or a similar error). Implicitly assuming
NUMA_NO_NODE == current node seems a bit confusing in some cases.
Moreover, BPF already has the bpf_get_numa_node_id() helper, so there's
no reason to introduce this NUMA_NO_NODE == current node assumption.
>
> > + *
> > + * Unavailable if ops.update_idle() is implemented and
> > + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set or if %SCX_OPS_KEEP_BUILTIN_IDLE is
> > + * not set.
> > + */
> > +__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(const struct cpumask *cpus_allowed,
> > + int node, u64 flags)
> > +{
> > + node = validate_node(node);
>
> Hold on! This validate_node() replaces NO_NODE with current node but
> doesn't touch flags. It means that scx_pick_idle_cpu() will never see
> NO_NODE, and will not be able to restrict to current node. The comment
> above is incorrect, right?
Yes, the comment is incorrect: the logic here was simply to replace
NUMA_NO_NODE with the current node; the restriction is only determined by
SCX_PICK_IDLE_NODE.
However, as mentioned above, I think we should just get rid of this
NO_NODE == current node assumption: this is yet another place where it adds
unnecessary complexity and makes the code harder to follow.
Thanks,
-Andrea
* Re: [PATCH 01/10] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node()
2024-12-24 7:54 ` Andrea Righi
@ 2024-12-24 17:33 ` Yury Norov
0 siblings, 0 replies; 30+ messages in thread
From: Yury Norov @ 2024-12-24 17:33 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
> > > + */
> > > +int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
> > > +{
> > > + int dist, n, min_node, min_dist;
> > > +
> > > + if (state >= NR_NODE_STATES)
> > > + return NUMA_NO_NODE;
> >
> > -EINVAL. But, do we need to check the parameter at all?
>
> numa_nearest_node() has the same check (returning -EINVAL), it seems sane
> to do this check here as well to prevent out-of-bounds access to
> node_states[state].
And I don't think we need to check state there.
For numa_nearest_node() the check can probably be justified because it's an
exported function, but your sched_numa_hop_node() is an entirely in-kernel
thing. Kernel functions don't check parameters.
* Re: [PATCH 07/10] sched_ext: Introduce per-node idle cpumasks
2024-12-24 8:18 ` Andrea Righi
@ 2024-12-24 17:59 ` Yury Norov
0 siblings, 0 replies; 30+ messages in thread
From: Yury Norov @ 2024-12-24 17:59 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Tue, Dec 24, 2024 at 09:18:28AM +0100, Andrea Righi wrote:
> On Mon, Dec 23, 2024 at 08:05:53PM -0800, Yury Norov wrote:
> > On Fri, Dec 20, 2024 at 04:11:39PM +0100, Andrea Righi wrote:
...
> > > + * cpumasks to track idle CPUs within each NUMA node.
> > > + *
> > > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> > > + * from node 0 is used to track all idle CPUs system-wide.
> > > + */
> > > +static struct idle_cpumask **scx_idle_masks;
> > > +
> > > +static struct idle_cpumask *get_idle_mask(int node)
> >
> > Didn't we agree to drop this 'get' thing?
>
> Hm... no? :)
>
> The analogy you pointed out was with get_parity8() which implements a pure
> function, so we should just use parity8() as "get" is implicit.
>
> This case is a bit different IMHO, because it's getting a reference to the
> object (so not a pure function) and we may even have a put_idle_mask()
> potentially.
>
> Personally I like to have the "get" here, but if you think it's confusing
> or it makes the code less readable I'm ok to drop it.
OK, whatever
...
> > > + * Find the best idle CPU in the system, relative to @node.
> > > + *
> > > + * If @node is NUMA_NO_NODE, start from the current node.
> > > + */
> >
> > And if you don't invent this rule for kernel users, you don't need to
> > explain it everywhere.
>
> I think we mentioned treating NUMA_NO_NODE as current node, but I might
> have misunderstood.
That's for userspace to maybe save a syscall. For in-kernel users it's
unneeded for sure.
> In an earlier patch set I was just ignoring
> NUMA_NO_NODE. Should we return an error instead?
Kernel users will never give you NUMA_NO_NODE if you ask them not to
do that. Resolving NO_NODE to current node for kernel users is useless.
They can call numa_node_id() just as well.
Userspace can do everything, so you have to check what they pass. For
userspace, it's up to you either to resolve NO_NODE to current node,
random node, simulate disabled per-numa idlemasks, do anything else or
return an error.
* Re: [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE
2024-12-24 8:37 ` Andrea Righi
@ 2024-12-24 18:15 ` Yury Norov
0 siblings, 0 replies; 30+ messages in thread
From: Yury Norov @ 2024-12-24 18:15 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Tue, Dec 24, 2024 at 09:37:35AM +0100, Andrea Righi wrote:
> On Mon, Dec 23, 2024 at 07:53:21PM -0800, Yury Norov wrote:
> > On Mon, Dec 23, 2024 at 06:48:48PM -0800, Yury Norov wrote:
> > > On Fri, Dec 20, 2024 at 04:11:40PM +0100, Andrea Righi wrote:
> > > > Introduce a flag to restrict the selection of an idle CPU to a specific
> > > > NUMA node.
> > > >
> > > > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > > > ---
> > > > kernel/sched/ext.c | 1 +
> > > > kernel/sched/ext_idle.c | 11 +++++++++--
> > > > 2 files changed, 10 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > > > index 143938e935f1..da5c15bd3c56 100644
> > > > --- a/kernel/sched/ext.c
> > > > +++ b/kernel/sched/ext.c
> > > > @@ -773,6 +773,7 @@ enum scx_deq_flags {
> > > >
> > > > enum scx_pick_idle_cpu_flags {
> > > > SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */
> > > > + SCX_PICK_IDLE_NODE = 1LLU << 1, /* pick a CPU in the same target NUMA node */
> > >
> > > SCX_FORCE_NODE or SCX_FIX_NODE?
> > >
> > > > };
> > > >
> > > > enum scx_kick_flags {
> > > > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> > > > index 444f2a15f1d4..013deaa08f12 100644
> > > > --- a/kernel/sched/ext_idle.c
> > > > +++ b/kernel/sched/ext_idle.c
> > > > @@ -199,6 +199,12 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 f
> >
> > This function begins with:
> >
> > static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
> > {
> > nodemask_t hop_nodes = NODE_MASK_NONE;
> > s32 cpu = -EBUSY;
> >
> > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > return pick_idle_cpu_from_node(cpus_allowed, NUMA_FLAT_NODE, flags);
> >
> > ...
> >
> > So if I disable scx_builtin_idle_per_node and then call:
> >
> > scx_pick_idle_cpu(some_cpus, numa_node_id(), SCX_PICK_IDLE_NODE)
> >
> > I may get a CPU from any non-local node, right? I think we need to honor user's
> > request:
> >
> > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > return pick_idle_cpu_from_node(cpus_allowed,
> > flags & SCX_PICK_IDLE_NODE ? node : NUMA_FLAT_NODE, flags);
> >
> > That way the code will be coherent: if you enable idle cpumasks, you
> > will be able to follow all the NUMA hierarchy. If you disable them, at
> > least you honor user's request to return a CPU from a given node, if
> > he's very explicit about his intention.
> >
> > You can be even nicer:
> >
> > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> > node = pick_idle_cpu_from_node(cpus, node, flags);
> > if (node == MAX_NUM_NODES && (flags & SCX_PICK_IDLE_NODE) == 0)
> > node = pick_idle_cpu_from_node(cpus, NUMA_FLAT_NODE, flags);
> >
> > return node;
> > }
> >
>
> Sorry, I'm not following. If scx_builtin_idle_per_node is disabled, we're
> only tracking idle CPUs in the single NUMA_FLAT_NODE cpumask (which is
> node 0). All the other cpumasks are just empty, so we would always return
> -EBUSY if we honored the user's request.
You're right. We can still do that like this:
if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
cpumask_and(tmp, cpus, cpumask_of_node(node));
node = pick_idle_cpu_from_node(tmp, NUMA_FLAT_NODE, flags);
if (node == MAX_NUM_NODES && (flags & SCX_PICK_IDLE_NODE) == 0)
node = pick_idle_cpu_from_node(cpus, NUMA_FLAT_NODE, flags);
return node;
}
But I'm not sure we need this complication. Maybe later...
>
> Maybe we should just return an error if scx_builtin_idle_per_node is
> disabled and the user is requesting an idle CPU in a specific node?
The problem is that NUMA_FLAT_NODE is 0, and you can't distinguish it
from node #0. You can drop NUMA_FLAT_NODE and ask users to always
provide NUMA_NO_NODE if idle_per_node is disabled, or you can ignore
the node entirely. You just need to describe it explicitly.
* Re: [PATCH 02/10] sched_ext: Move built-in idle CPU selection policy to a separate file
2024-12-20 15:11 ` [PATCH 02/10] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
@ 2024-12-24 21:21 ` Tejun Heo
0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2024-12-24 21:21 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Yury Norov, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
Hello,
On Fri, Dec 20, 2024 at 04:11:34PM +0100, Andrea Righi wrote:
...
> +/* Built-in idle CPU selection policy */
> +#include "ext_idle.c"
While it's true that sched is built by inlining all the source files for
build performance, I think it'd be better to keep source files compatible
with independent compilation units as much as possible - ie. put interface
in .h and pretend that .c's are built separately so that we can maintain
some semblance to code organization people are more used to if for nothing
else.
Thanks.
--
tejun
* Re: [PATCH 08/10] sched_ext: idle: introduce SCX_PICK_IDLE_NODE
2024-12-24 8:22 ` Andrea Righi
@ 2024-12-24 21:29 ` Tejun Heo
0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2024-12-24 21:29 UTC (permalink / raw)
To: Andrea Righi
Cc: Yury Norov, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, bpf,
linux-kernel
On Tue, Dec 24, 2024 at 09:22:49AM +0100, Andrea Righi wrote:
> > > enum scx_pick_idle_cpu_flags {
> > > SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */
> > > + SCX_PICK_IDLE_NODE = 1LLU << 1, /* pick a CPU in the same target NUMA node */
> >
> > SCX_FORCE_NODE or SCX_FIX_NODE?
>
> Ok, I like SCX_FORCE_NODE.
How about SCX_PICK_IDLE_IN_NODE?
Thanks.
--
tejun