* [PATCHSET v5 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks
@ 2024-12-09 10:40 Andrea Righi
From: Andrea Righi @ 2024-12-09 10:40 UTC (permalink / raw)
To: Tejun Heo, David Vernet; +Cc: Yury Norov, Changwoo Min, linux-kernel
= Overview =
As discussed during the sched_ext office hours, using a single global cpumask
to track idle CPUs can be inefficient and may not scale well on large NUMA
systems.
Therefore, split the idle cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on such large systems.
Scalability issues seem to be more noticeable on Intel Sapphire Rapids
dual-socket architectures.
= Test =
Hardware:
- System: DGX B200
- CPUs: 224 SMT threads (112 physical cores)
- Processor: INTEL(R) XEON(R) PLATINUM 8570
- 2 NUMA nodes
Scheduler:
- scx_simple [1] (so that we can focus on the built-in idle selection
  policy rather than on the scheduling policy itself)
Test:
- Run a parallel kernel build `make -j $(nproc)` and measure the average
elapsed time over 10 runs:
          avg time | stdev
          ---------+------
  before: 52.431s  | 2.895
  after:  50.342s  | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks yielded a
speedup of approximately +4% with this particular architecture and test
case.
I've repeated the same test on a DGX-1 (40 physical cores, Intel Xeon
E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) and I didn't observe any measurable
difference.
In general, on smaller systems, I haven't noticed any measurable
regressions or improvements with the same test (parallel kernel build)
and scheduler (scx_simple).
NOTE: splitting the global cpumask into multiple cpumasks may increase
the overhead of scx_bpf_pick_idle_cpu() or ops.select_cpu() (for
schedulers relying on the built-in CPU idle selection policy) in the
presence of multiple NUMA nodes, particularly under high system load,
since we may have to scan multiple cpumasks to find an idle CPU.
However, this increased overhead appears to be more than offset by the
lower overhead of updating the idle state (__scx_update_idle()) and by
the fact that CPUs are more likely to operate within their local idle
cpumask, reducing the stress on the cache coherency protocol.
= References =
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
ChangeLog v4 -> v5:
- introduce new scx_bpf_cpu_to_node() kfunc
- provide __COMPAT_*() helpers for the new kfuncs
ChangeLog v3 -> v4:
- introduce SCX_OPS_NODE_BUILTIN_IDLE to select multiple per-node
cpumasks or single flat cpumask
- introduce new kfuncs to access per-node idle cpumasks information
- use for_each_numa_hop_mask() to traverse NUMA nodes in increasing
distance
- dropped nodemask helpers (not needed anymore)
- rebase to sched_ext/for-6.14
ChangeLog v2 -> v3:
- introduce for_each_online_node_wrap()
- re-introduce cpumask_intersects() in test_and_clear_cpu_idle() (to
reduce memory writes / cache coherence pressure)
- get rid of the redundant scx_selcpu_topo_numa logic
[test results are pretty much identical, so I haven't updated them from v2]
ChangeLog v1 -> v2:
- renamed for_each_node_mask|state_from() -> for_each_node_mask|state_wrap()
- misc cpumask optimizations (thanks to Yury)
Andrea Righi (4):
sched_ext: Introduce per-NUMA idle cpumasks
sched_ext: Get rid of the scx_selcpu_topo_numa logic
sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE
sched_ext: Introduce NUMA aware idle cpu kfunc helpers
kernel/sched/ext.c | 438 ++++++++++++++++++++++---------
tools/sched_ext/include/scx/common.bpf.h | 4 +
tools/sched_ext/include/scx/compat.bpf.h | 19 ++
3 files changed, 332 insertions(+), 129 deletions(-)
* [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks
From: Andrea Righi @ 2024-12-09 10:40 UTC (permalink / raw)
To: Tejun Heo, David Vernet; +Cc: Yury Norov, Changwoo Min, linux-kernel

Using a single global idle mask can lead to inefficiencies and a lot
of stress on the cache coherency protocol on large systems with
multiple NUMA nodes, since all the CPUs can generate intense
read/write activity on the single global cpumask.

Therefore, split the global cpumask into multiple per-NUMA node
cpumasks to improve scalability and performance on large systems.

The concept is that each cpumask will track only the idle CPUs within
its corresponding NUMA node, treating CPUs in other NUMA nodes as
busy. In this way concurrent access to the idle cpumask will be
restricted within each NUMA node.

NOTE: the scx_bpf_get_idle_cpu/smtmask() kfuncs, which are supposed to
return a single cpumask for all the CPUs, have been changed to report
only the cpumask of the current NUMA node (using the current CPU).

This breaks the old behavior, but it will be addressed in the next
commits, which introduce a new flag to switch between the old single
global flat idle cpumask and the multiple per-node cpumasks.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 221 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 166 insertions(+), 55 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 71342f3719c1..7e7f2869826f 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -933,8 +933,60 @@ static struct delayed_work scx_watchdog_work;
 static struct {
 	cpumask_var_t cpu;
 	cpumask_var_t smt;
-} idle_masks CL_ALIGNED_IF_ONSTACK;
+} **idle_masks CL_ALIGNED_IF_ONSTACK;
 
+static struct cpumask *get_idle_cpumask_node(int node)
+{
+	return idle_masks[node]->cpu;
+}
+
+static struct cpumask *get_idle_smtmask_node(int node)
+{
+	return idle_masks[node]->smt;
+}
+
+static struct cpumask *get_curr_idle_cpumask(void)
+{
+	int node = cpu_to_node(smp_processor_id());
+
+	return get_idle_cpumask_node(node);
+}
+
+static struct cpumask *get_curr_idle_smtmask(void)
+{
+	int node = cpu_to_node(smp_processor_id());
+
+	if (sched_smt_active())
+		return get_idle_smtmask_node(node);
+	else
+		return get_idle_cpumask_node(node);
+}
+
+static void idle_masks_init(void)
+{
+	int node;
+
+	idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
+	BUG_ON(!idle_masks);
+
+	for_each_node_state(node, N_POSSIBLE) {
+		idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
+		BUG_ON(!idle_masks[node]);
+
+		BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
+		BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
+	}
+}
+#else	/* !CONFIG_SMP */
+static struct cpumask *get_curr_idle_cpumask(void)
+{
+	return cpu_none_mask;
+}
+
+static struct cpumask *get_curr_idle_smtmask(void)
+{
+	return cpu_none_mask;
+}
 #endif	/* CONFIG_SMP */
 
 /* for %SCX_KICK_WAIT */
@@ -3166,6 +3218,9 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
 
 static bool test_and_clear_cpu_idle(int cpu)
 {
+	int node = cpu_to_node(cpu);
+	struct cpumask *idle_cpu = get_idle_cpumask_node(node);
+
 #ifdef CONFIG_SCHED_SMT
 	/*
 	 * SMT mask should be cleared whether we can claim @cpu or not. The SMT
@@ -3174,29 +3229,34 @@ static bool test_and_clear_cpu_idle(int cpu)
 	 */
 	if (sched_smt_active()) {
 		const struct cpumask *smt = cpu_smt_mask(cpu);
+		struct cpumask *idle_smt = get_idle_smtmask_node(node);
 
 		/*
 		 * If offline, @cpu is not its own sibling and
 		 * scx_pick_idle_cpu() can get caught in an infinite loop as
-		 * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
-		 * is eventually cleared.
+		 * @cpu is never cleared from the idle SMT mask. Ensure that
+		 * @cpu is eventually cleared.
+		 *
+		 * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
+		 * reduce memory writes, which may help alleviate cache
+		 * coherence pressure.
		 */
-		if (cpumask_intersects(smt, idle_masks.smt))
-			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
-		else if (cpumask_test_cpu(cpu, idle_masks.smt))
-			__cpumask_clear_cpu(cpu, idle_masks.smt);
+		if (cpumask_intersects(smt, idle_smt))
+			cpumask_andnot(idle_smt, idle_smt, smt);
+		else if (cpumask_test_cpu(cpu, idle_smt))
+			__cpumask_clear_cpu(cpu, idle_smt);
 	}
 #endif
-	return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
+	return cpumask_test_and_clear_cpu(cpu, idle_cpu);
 }
 
-static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
+static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
 {
 	int cpu;
 
 retry:
 	if (sched_smt_active()) {
-		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
+		cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
 		if (cpu < nr_cpu_ids)
 			goto found;
 
@@ -3204,15 +3264,66 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
 			return -EBUSY;
 	}
 
-	cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
-	if (cpu >= nr_cpu_ids)
-		return -EBUSY;
+	cpu = cpumask_any_and_distribute(get_idle_cpumask_node(node), cpus_allowed);
+	if (cpu < nr_cpu_ids)
+		goto found;
+
+	return -EBUSY;
 
 found:
 	if (test_and_clear_cpu_idle(cpu))
 		return cpu;
 	else
 		goto retry;
+
+}
+
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
+{
+	const struct cpumask *node_mask;
+	s32 cpu;
+
+	/*
+	 * Only node 0 is used if per-node idle cpumasks are disabled.
+	 */
+	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
+		return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags);
+
+	/*
+	 * Traverse all nodes in order of increasing distance, starting from
+	 * prev_cpu's node.
+	 */
+	rcu_read_lock();
+	for_each_numa_hop_mask(node_mask, cpu_to_node(prev_cpu)) {
+		/*
+		 * scx_pick_idle_cpu_from_node() can be expensive and redundant
+		 * if none of the CPUs in the NUMA node can be used (according
+		 * to cpus_allowed).
+		 *
+		 * Therefore, check if the NUMA node is usable in advance to
+		 * save some CPU cycles.
+		 */
+		if (!cpumask_intersects(node_mask, cpus_allowed))
+			continue;
+
+		/*
+		 * It would be nice to have a "node" iterator, instead of the
+		 * cpumask, to get rid of the cpumask_first() to determine the
+		 * node.
+		 */
+		cpu = cpumask_first(node_mask);
+		if (cpu >= nr_cpu_ids)
+			continue;
+
+		cpu = scx_pick_idle_cpu_from_node(cpu_to_node(cpu), cpus_allowed, flags);
+		if (cpu >= 0)
+			goto out_unlock;
+	}
+	cpu = -EBUSY;
+
+out_unlock:
+	rcu_read_unlock();
+	return cpu;
 }
 
 /*
@@ -3420,6 +3531,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 {
 	const struct cpumask *llc_cpus = NULL;
 	const struct cpumask *numa_cpus = NULL;
+	int node = cpu_to_node(prev_cpu);
 	s32 cpu;
 
 	*found = false;
@@ -3477,9 +3589,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 		 * piled up on it even if there is an idle core elsewhere on
 		 * the system.
		 */
-		if (!cpumask_empty(idle_masks.cpu) &&
-		    !(current->flags & PF_EXITING) &&
-		    cpu_rq(cpu)->scx.local_dsq.nr == 0) {
+		if (!(current->flags & PF_EXITING) &&
+		    cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
+		    !cpumask_empty(get_idle_cpumask_node(cpu_to_node(cpu)))) {
 			if (cpumask_test_cpu(cpu, p->cpus_ptr))
 				goto cpu_found;
 		}
@@ -3493,7 +3605,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 		/*
 		 * Keep using @prev_cpu if it's part of a fully idle core.
		 */
-		if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
+		if (cpumask_test_cpu(prev_cpu, get_idle_smtmask_node(node)) &&
 		    test_and_clear_cpu_idle(prev_cpu)) {
 			cpu = prev_cpu;
 			goto cpu_found;
@@ -3503,7 +3615,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 		 * Search for any fully idle core in the same LLC domain.
		 */
		if (llc_cpus) {
-			cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
+			cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, SCX_PICK_IDLE_CORE);
 			if (cpu >= 0)
 				goto cpu_found;
 		}
@@ -3512,7 +3624,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 		 * Search for any fully idle core in the same NUMA node.
		 */
		if (numa_cpus) {
-			cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
+			cpu = scx_pick_idle_cpu_from_node(node, numa_cpus, SCX_PICK_IDLE_CORE);
 			if (cpu >= 0)
 				goto cpu_found;
 		}
@@ -3520,7 +3632,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 		 * Search for any full idle core usable by the task.
		 */
-		cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
+		cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, SCX_PICK_IDLE_CORE);
 		if (cpu >= 0)
 			goto cpu_found;
 	}
@@ -3537,7 +3649,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 	 * Search for any idle CPU in the same LLC domain.
	 */
	if (llc_cpus) {
-		cpu = scx_pick_idle_cpu(llc_cpus, 0);
+		cpu = scx_pick_idle_cpu_from_node(node, llc_cpus, 0);
 		if (cpu >= 0)
 			goto cpu_found;
 	}
@@ -3546,7 +3658,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 	 * Search for any idle CPU in the same NUMA node.
	 */
	if (numa_cpus) {
-		cpu = scx_pick_idle_cpu(numa_cpus, 0);
+		cpu = scx_pick_idle_cpu_from_node(node, numa_cpus, 0);
 		if (cpu >= 0)
 			goto cpu_found;
 	}
@@ -3554,7 +3666,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 	 * Search for any idle CPU usable by the task.
	 */
-	cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
+	cpu = scx_pick_idle_cpu(p->cpus_ptr, prev_cpu, 0);
 	if (cpu >= 0)
 		goto cpu_found;
 
@@ -3636,17 +3748,27 @@ static void set_cpus_allowed_scx(struct task_struct *p,
 
 static void reset_idle_masks(void)
 {
+	int node;
+
 	/*
 	 * Consider all online cpus idle. Should converge to the actual state
 	 * quickly.
	 */
-	cpumask_copy(idle_masks.cpu, cpu_online_mask);
-	cpumask_copy(idle_masks.smt, cpu_online_mask);
+	for_each_node_state(node, N_POSSIBLE) {
+		const struct cpumask *node_mask = cpumask_of_node(node);
+		struct cpumask *idle_cpu = get_idle_cpumask_node(node);
+		struct cpumask *idle_smt = get_idle_smtmask_node(node);
+
+		cpumask_and(idle_cpu, cpu_online_mask, node_mask);
+		cpumask_copy(idle_smt, idle_cpu);
+	}
 }
 
 void __scx_update_idle(struct rq *rq, bool idle)
 {
 	int cpu = cpu_of(rq);
+	int node = cpu_to_node(cpu);
+	struct cpumask *idle_cpu = get_idle_cpumask_node(node);
 
 	if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
 		SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
@@ -3654,27 +3776,25 @@ void __scx_update_idle(struct rq *rq, bool idle)
 			return;
 	}
 
-	if (idle)
-		cpumask_set_cpu(cpu, idle_masks.cpu);
-	else
-		cpumask_clear_cpu(cpu, idle_masks.cpu);
+	assign_cpu(cpu, idle_cpu, idle);
 
 #ifdef CONFIG_SCHED_SMT
 	if (sched_smt_active()) {
 		const struct cpumask *smt = cpu_smt_mask(cpu);
+		struct cpumask *idle_smt = get_idle_smtmask_node(node);
 
 		if (idle) {
 			/*
-			 * idle_masks.smt handling is racy but that's fine as
-			 * it's only for optimization and self-correcting.
+			 * idle_smt handling is racy but that's fine as it's
+			 * only for optimization and self-correcting.
			 */
			for_each_cpu(cpu, smt) {
-				if (!cpumask_test_cpu(cpu, idle_masks.cpu))
+				if (!cpumask_test_cpu(cpu, idle_cpu))
 					return;
 			}
-			cpumask_or(idle_masks.smt, idle_masks.smt, smt);
+			cpumask_or(idle_smt, idle_smt, smt);
 		} else {
-			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
+			cpumask_andnot(idle_smt, idle_smt, smt);
 		}
 	}
 #endif
@@ -3722,7 +3842,10 @@ static void rq_offline_scx(struct rq *rq)
 #else	/* CONFIG_SMP */
 
 static bool test_and_clear_cpu_idle(int cpu) { return false; }
-static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; }
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
+{
+	return -EBUSY;
+}
 static void reset_idle_masks(void) {}
 
 #endif	/* CONFIG_SMP */
@@ -6255,8 +6378,7 @@ void __init init_sched_ext_class(void)
 
 	BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
 #ifdef CONFIG_SMP
-	BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
-	BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
+	idle_masks_init();
 #endif
 	scx_kick_cpus_pnt_seqs =
 		__alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
@@ -7402,7 +7524,7 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
 
 /**
  * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
- * per-CPU cpumask.
+ * per-CPU cpumask of the current NUMA node.
  *
  * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
  */
@@ -7413,17 +7535,13 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
 		return cpu_none_mask;
 	}
 
-#ifdef CONFIG_SMP
-	return idle_masks.cpu;
-#else
-	return cpu_none_mask;
-#endif
+	return get_curr_idle_cpumask();
 }
 
 /**
  * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
- * per-physical-core cpumask. Can be used to determine if an entire physical
- * core is free.
+ * per-physical-core cpumask of the current NUMA node. Can be used to determine
+ * if an entire physical core is free.
  *
  * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
  */
@@ -7434,14 +7552,7 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
 		return cpu_none_mask;
 	}
 
-#ifdef CONFIG_SMP
-	if (sched_smt_active())
-		return idle_masks.smt;
-	else
-		return idle_masks.cpu;
-#else
-	return cpu_none_mask;
-#endif
+	return get_curr_idle_smtmask();
 }
 
 /**
@@ -7508,7 +7619,7 @@ __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
 		return -EBUSY;
 	}
 
-	return scx_pick_idle_cpu(cpus_allowed, flags);
+	return scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
 }
 
 /**
@@ -7531,7 +7642,7 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
 	s32 cpu;
 
 	if (static_branch_likely(&scx_builtin_idle_enabled)) {
-		cpu = scx_pick_idle_cpu(cpus_allowed, flags);
+		cpu = scx_pick_idle_cpu(cpus_allowed, smp_processor_id(), flags);
 		if (cpu >= 0)
 			return cpu;
 	}
-- 
2.47.1
* Re: [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks
From: Yury Norov @ 2024-12-09 19:32 UTC (permalink / raw)
To: Andrea Righi; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel

On Mon, Dec 09, 2024 at 11:40:55AM +0100, Andrea Righi wrote:

[quoted patch trimmed to the parts commented on]

> NOTE: the scx_bpf_get_idle_cpu/smtmask() kfuncs, which are supposed to
> return a single cpumask for all the CPUs, have been changed to report
> only the cpumask of the current NUMA node (using the current CPU).
>
> This breaks the old behavior, but it will be addressed in the next
> commits, which introduce a new flag to switch between the old single
> global flat idle cpumask and the multiple per-node cpumasks.

Why don't you change the order of commits such that you first introduce
the flag and then add the new feature? That way you won't have to explain
yourself.

Also, kernel/sched/ext.c is already 7k+ LOCs. Can you move the per-node
idle masks to a separate file? You could also make this feature
configurable, so those who don't care (pretty much everyone except the
PLATINUM 8570 victims, right?) won't even have to compile it. I'd like
to see it enabled only for those who can really benefit from it.

> +static struct cpumask *get_idle_cpumask_node(int node)
> +{
> +	return idle_masks[node]->cpu;
> +}
> +
> +static struct cpumask *get_idle_smtmask_node(int node)
> +{
> +	return idle_masks[node]->smt;
> +}

This implies that NUMA_NO_NODE, which is -1, will never be passed. Is
that indeed impossible?

In idle_masks_init():

> +	for_each_node_state(node, N_POSSIBLE) {

for_each_node(node)

In scx_pick_idle_cpu_from_node():

> +	cpu = cpumask_any_and_distribute(get_idle_cpumask_node(node), cpus_allowed);
> +	if (cpu < nr_cpu_ids)
> +		goto found;
> +
> +	return -EBUSY;

Why did you change the error handling logic? Can you keep the original?

> +	rcu_read_lock();
> +	for_each_numa_hop_mask(node_mask, cpu_to_node(prev_cpu)) {

This 'node_mask' misleads. This is not a nodemask. This is a cpumask --
all the CPUs in the hop. Can you name it 'hop_cpus', or similar?

> +		/*
> +		 * It would be nice to have a "node" iterator, instead of the
> +		 * cpumask, to get rid of the cpumask_first() to determine the
> +		 * node.
> +		 */

I'm not aware of such a case. And this one doesn't look like that,
because you filter against a cpumask (cpus_allowed). Right?

> +		cpu = cpumask_first(node_mask);
> +		if (cpu >= nr_cpu_ids)
> +			continue;

for_each_numa_hop_mask() doesn't traverse per-node CPUs. It traverses
per-hop CPUs. The difference is that the hop may include more than one
node, if the distances from a given node to those belonging to the hop
are the same.

So, imagine we have nodes 1 and 2 in the same hop, but cpus_allowed has
only CPUs from node 2. With this configuration you pass the
cpumask_intersects() check, but when picking a CPU, you ignore
cpus_allowed and pick node 1. This is wrong, I guess.

Instead, I would make a single check:

	cpu = cpumask_first_and(node_mask, cpus_allowed);
	if (cpu >= nr_cpu_ids)
		continue;

> +		cpu = scx_pick_idle_cpu_from_node(cpu_to_node(cpu), cpus_allowed, flags);
> +		if (cpu >= 0)
> +			goto out_unlock;

The code in scx_pick_idle_cpu_from_node() is written such that the CPUs
it picks are well distributed across the nodes. Your code above always
picks the 1st node in the hop. I think here you should use
cpumask_any_and_distribute(), like scx_pick_idle_cpu_from_node() does
with the idle masks.

Another problem is that higher-level hops include CPUs from lower-level
ones, which means you will traverse the same lower-level CPUs again for
nothing. So, you need to mask them out.

Another-another problem: if your hop has more than one node, you should
not jump to the next hop until you have tried every node from the
current hop. This brings another loop, but doesn't increase complexity
if you carefully mask out all visited nodes.

> +	}
> +	cpu = -EBUSY;

This hides the actual error returned from scx_pick_idle_cpu_from_node().

And altogether this should look like:

	int scx_pick_idle_cpu_from_hop(struct cpumask *hop_cpus, struct cpumask *cpus_allowed)
	{
		int node, cpu, random_cpu;

		do {
			/* Pick a 'random' CPU in the hop */
			random_cpu = cpumask_any_and_distribute(hop_cpus, cpus_allowed);
			if (random_cpu >= nr_cpu_ids)
				continue;

			node = cpu_to_node(random_cpu);

			/* Find an idle CPU in the same node */
			cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
			if (cpu >= 0)
				break;

			/* No luck? Try other nodes */
		} while (cpumask_andnot(hop_cpus, hop_cpus, cpumask_of_node(node)));

		return cpu;
	}

	static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
	{
		const struct cpumask *next, *prev = cpu_none_mask;
		int prev_node = cpu_to_node(prev_cpu);
		...
		for_each_numa_hop_mask(next, prev_node) {
			cpumask_andnot(hop_cpus, next, prev);
			cpu = scx_pick_idle_cpu_from_hop(hop_cpus, cpus_allowed);
			prev = next;
		}
		...
	}

Not tested, but should work.

In reset_idle_masks():

> +	for_each_node_state(node, N_POSSIBLE) {

for_each_node()
* Re: [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks 2024-12-09 19:32 ` Yury Norov @ 2024-12-09 20:40 ` Andrea Righi 2024-12-10 0:14 ` Andrea Righi 1 sibling, 0 replies; 20+ messages in thread From: Andrea Righi @ 2024-12-09 20:40 UTC (permalink / raw) To: Yury Norov; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel Hi Yury, On Mon, Dec 09, 2024 at 11:32:56AM -0800, Yury Norov wrote: > On Mon, Dec 09, 2024 at 11:40:55AM +0100, Andrea Righi wrote: > > Using a single global idle mask can lead to inefficiencies and a lot of > > stress on the cache coherency protocol on large systems with multiple > > NUMA nodes, since all the CPUs can create a really intense read/write > > activity on the single global cpumask. > > > > Therefore, split the global cpumask into multiple per-NUMA node cpumasks > > to improve scalability and performance on large systems. > > > > The concept is that each cpumask will track only the idle CPUs within > > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. > > In this way concurrent access to the idle cpumask will be restricted > > within each NUMA node. > > > > NOTE: the scx_bpf_get_idle_cpu/smtmask() kfunc's, that are supposed to > > return a single cpumask for all the CPUs, have been changed to report > > only the cpumask of the current NUMA node (using the current CPU). > > > > This is breaking the old behavior, but it will be addressed in the next > > commits, introducing a new flag to switch between the old single global > > flat idle cpumask or the multiple per-node cpumasks. > > Why don't you change the order of commits such that you first > introduce the flag and then add new feature? That way you'll not have > to explain yourself. Good point! I'll refactor the patch set. > > Also, the kernel/sched/ext.c is already 7k+ LOCs. Can you move the > per-node idle masks to a separate file? 
You can also make this feature > configurable, and those who don't care (pretty much everyone except > for PLATINUM 8570 victims, right?) will not have to even compile it. > > I'd like to see it enabled only for those who can really benefit from it. Ok about moving it to a separate file. I'm not completely convinced about making it a config option, I think it'd be nice to allow individual scx schedulers to decide whether to use the NUMA-aware idle selection or the flat idle selection logic. This can also pave the way for future enhancements (i.e., introducing generic sched domains). Moreover, in terms of overhead, there's not much difference between a scheduler that doesn't set SCX_OPS_BUILTIN_IDLE_PER_NODE and having a single statically-built idle cpumask (in both cases we will still use a single global cpumask). Thanks, -Andrea ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks 2024-12-09 19:32 ` Yury Norov 2024-12-09 20:40 ` Andrea Righi @ 2024-12-10 0:14 ` Andrea Righi 2024-12-10 2:10 ` Yury Norov 1 sibling, 1 reply; 20+ messages in thread From: Andrea Righi @ 2024-12-10 0:14 UTC (permalink / raw) To: Yury Norov; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel On Mon, Dec 09, 2024 at 11:32:56AM -0800, Yury Norov wrote: ... > > +static struct cpumask *get_idle_cpumask_node(int node) > > +{ > > + return idle_masks[node]->cpu; > > +} > > + > > +static struct cpumask *get_idle_smtmask_node(int node) > > +{ > > + return idle_masks[node]->smt; > > +} > > This implies that NUMA_NO_NODE, which is -1, will never be passed. > Is that indeed impossible? In PATCH 4/4 I added some checks to make sure "node" is valid and return NULL otherwise, but thinking more about it, it seems better to return cpu_none_mask. > > > + > > +static struct cpumask *get_curr_idle_cpumask(void) > > +{ > > + int node = cpu_to_node(smp_processor_id()); > > + > > + return get_idle_cpumask_node(node); > > +} > > + > > +static struct cpumask *get_curr_idle_smtmask(void) > > +{ > > + int node = cpu_to_node(smp_processor_id()); > > + > > + if (sched_smt_active()) > > + return get_idle_smtmask_node(node); > > + else > > + return get_idle_cpumask_node(node); > > +} > > + > > +static void idle_masks_init(void) > > +{ > > + int node; > > + > > + idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL); > > + BUG_ON(!idle_masks); > > + > > + for_each_node_state(node, N_POSSIBLE) { > > for_each_node(node) Ok. 
> > > + idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node); > > + BUG_ON(!idle_masks[node]); > > + > > + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node)); > > + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node)); > > + } > > +} > > +#else /* !CONFIG_SMP */ > > +static struct cpumask *get_curr_idle_cpumask(void) > > +{ > > + return cpu_none_mask; > > +} > > + > > +static struct cpumask *get_curr_idle_smtmask(void) > > +{ > > + return cpu_none_mask; > > +} > > #endif /* CONFIG_SMP */ > > > > /* for %SCX_KICK_WAIT */ > > @@ -3166,6 +3218,9 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, > > > > static bool test_and_clear_cpu_idle(int cpu) > > { > > + int node = cpu_to_node(cpu); > > + struct cpumask *idle_cpu = get_idle_cpumask_node(node); > > + > > #ifdef CONFIG_SCHED_SMT > > /* > > * SMT mask should be cleared whether we can claim @cpu or not. The SMT > > @@ -3174,29 +3229,34 @@ static bool test_and_clear_cpu_idle(int cpu) > > */ > > if (sched_smt_active()) { > > const struct cpumask *smt = cpu_smt_mask(cpu); > > + struct cpumask *idle_smt = get_idle_smtmask_node(node); > > > > /* > > * If offline, @cpu is not its own sibling and > > * scx_pick_idle_cpu() can get caught in an infinite loop as > > - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu > > - * is eventually cleared. > > + * @cpu is never cleared from the idle SMT mask. Ensure that > > + * @cpu is eventually cleared. > > + * > > + * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to > > + * reduce memory writes, which may help alleviate cache > > + * coherence pressure. 
> > */ > > - if (cpumask_intersects(smt, idle_masks.smt)) > > - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); > > - else if (cpumask_test_cpu(cpu, idle_masks.smt)) > > - __cpumask_clear_cpu(cpu, idle_masks.smt); > > + if (cpumask_intersects(smt, idle_smt)) > > + cpumask_andnot(idle_smt, idle_smt, smt); > > + else if (cpumask_test_cpu(cpu, idle_smt)) > > + __cpumask_clear_cpu(cpu, idle_smt); > > } > > #endif > > - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu); > > + return cpumask_test_and_clear_cpu(cpu, idle_cpu); > > } > > > > -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > > +static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags) > > { > > int cpu; > > > > retry: > > if (sched_smt_active()) { > > - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed); > > + cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed); > > if (cpu < nr_cpu_ids) > > goto found; > > > > @@ -3204,15 +3264,66 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > > return -EBUSY; > > } > > > > - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed); > > - if (cpu >= nr_cpu_ids) > > - return -EBUSY; > > + cpu = cpumask_any_and_distribute(get_idle_cpumask_node(node), cpus_allowed); > > + if (cpu < nr_cpu_ids) > > + goto found; > > + > > + return -EBUSY; > > What for did you change the error handling logic? Can you keep the > original? I changed that to follow the same pattern as the one inside sched_smt_active(), it should be equivalent, but I can keep the original. > > > > > found: > > if (test_and_clear_cpu_idle(cpu)) > > return cpu; > > else > > goto retry; > > + > > +} > > + > > +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags) > > +{ > > + const struct cpumask *node_mask; > > + s32 cpu; > > + > > + /* > > + * Only node 0 is used if per-node idle cpumasks are disabled. 
> > + */ > > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > + return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags); > > + > > + /* > > + * Traverse all nodes in order of increasing distance, starting from > > + * prev_cpu's node. > > + */ > > + rcu_read_lock(); > > + for_each_numa_hop_mask(node_mask, cpu_to_node(prev_cpu)) { > > This 'node_mask' misleads. This is not a nodemask. This is a cpumask - > all cpus in the hop. Can you name it 'hop_cpus', or similar? Ack. > > > + /* > > + * scx_pick_idle_cpu_from_node() can be expensive and redundant > > + * if none of the CPUs in the NUMA node can be used (according > > + * to cpus_allowed). > > + * > > + * Therefore, check if the NUMA node is usable in advance to > > + * save some CPU cycles. > > + */ > > + if (!cpumask_intersects(node_mask, cpus_allowed)) > > + continue; > > + > > + /* > > + * It would be nice to have a "node" iterator, instead of the > > + * cpumask, to get rid of the cpumask_first() to determine the > > + * node. > > + */ > > I'm not aware about such case. And this one doesn't look like that because you > filter against a cpumask (cpus_allowed). Right? I'm interested to get the node id here to use it with scx_pick_idle_cpu_from_node(). > > > + cpu = cpumask_first(node_mask); > > + if (cpu >= nr_cpu_ids) > > + continue; > > for_each_numa_hop_mask() doesn't traverse per-node CPUs. It traverses > per-hop CPUs. The difference is that the hop may include more than one > node if distances from a given node to those belonging the hop are the > same. Hm... I was missing this aspect. Thanks for clarifying this. > > So, imagine we have nodes 1 and 2 in the same hop, but cpus_allowed > has only cpus from node 2. With this configuration you pass the > cpumask_intersects() check, but when picking a CPU, you ignore the > cpus_allowed and pick node 1. This is wrong, I guess. Yeah, that's not what I would like to do. 
> > Instead, I would make a single check: > > cpu = cpumask_first_and(node_mask, cpus_allowed); > if (cpu >= nr_cpu_ids) > continue; Ok, but at this point I'm not sure that for_each_numa_hop_mask() is the most efficient way to iterate NUMA nodes. > > > + > > + cpu = scx_pick_idle_cpu_from_node(cpu_to_node(cpu), cpus_allowed, flags); > > + if (cpu >= 0) > > + goto out_unlock; > > The code in scx_pick_idle_cpu_from_node() is written such that the CPUs > that you pick are well distributed across the nodes. Your code above > always picks the 1st node in the hop. I think here you should use > > cpumask_any_and_distribute() > > like scx_pick_idle_cpu_from_node() does with idle_masks. Right, that's because I was (incorrectly) assuming that each hop was a single node. > > Another problem is that high-level hops include CPUs from lower-level > ones. It means that you will traverse the same lower-level CPUs again > for nothing. So, you need to mask them out. > > Another-another problem: if your hop has more than one node, you should > not jump to the next hop until you have tried every node from the current hop. > This brings another loop, but doesn't increase complexity if you > carefully mask out all visited nodes. > > > + } > > + cpu = -EBUSY; > > This hides the actual error returned from scx_pick_idle_cpu_from_node(). Right, scx_pick_idle_cpu_from_node() can only return -EBUSY at the moment, but it'd be nicer to pass the returned error. 
> > > + > > +out_unlock: > > + rcu_read_unlock(); > > + return cpu; > > } > > And altogether this should look like: > > int scx_pick_idle_cpu_from_hop(struct cpumask *hop_cpus, struct cpumask *cpus_allowed) > { > int node, cpu, random_cpu; > > do { > > /* Pick a 'random' CPU in the hop */ > random_cpu = cpumask_any_and_distribute(hop_cpus, cpus_allowed); > if (random_cpu >= nr_cpu_ids) > continue; > > node = cpu_to_node(random_cpu); > > /* Find an idle CPU in the same node */ > cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags); > if (cpu >= 0) > break; > > /* No luck? Try other nodes */ > } while (cpumask_andnot(hop_cpus, hop_cpus, cpumask_of_node(node))); > > return cpu; > } > > static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags) > { > const struct cpumask *next, *prev = cpu_none_mask; > int prev_node = cpu_to_node(prev_cpu); > ... > for_each_numa_hop_mask(next, prev_node) { > cpumask_andnot(hop_cpus, next, prev); > cpu = scx_pick_idle_cpu_from_hop(hop_cpus, cpus_allowed); > prev = next; > } > ... > } > > Not tested, but should work. Makes sense to me, I'll do some testing with this. > > @@ -3636,17 +3748,27 @@ static void set_cpus_allowed_scx(struct task_struct *p, > > > > static void reset_idle_masks(void) > > { > > + int node; > > + > > /* > > * Consider all online cpus idle. Should converge to the actual state > > * quickly. > > */ > > - cpumask_copy(idle_masks.cpu, cpu_online_mask); > > - cpumask_copy(idle_masks.smt, cpu_online_mask); > > + for_each_node_state(node, N_POSSIBLE) { > > for_each_node() > Ok. And I think I haven't missed any comment this time, sorry about that and thanks again for the review! -Andrea ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks 2024-12-10 0:14 ` Andrea Righi @ 2024-12-10 2:10 ` Yury Norov 2024-12-14 6:05 ` Andrea Righi 0 siblings, 1 reply; 20+ messages in thread From: Yury Norov @ 2024-12-10 2:10 UTC (permalink / raw) To: Andrea Righi; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel On Tue, Dec 10, 2024 at 01:14:43AM +0100, Andrea Righi wrote: > > And altogether this should look like: > > > > int scx_pick_idle_cpu_from_hop(struct cpumask *hop_cpus, struct cpumask *cpus_allowed) > > { > > int node, cpu, random_cpu; > > > > do { > > > > /* Pick a 'random' CPU in the hop */ > > random_cpu = cpumask_any_and_distribute(hop_cpus, cpus_allowed); > > if (random_cpu >= nr_cpu_ids) > > continue; > > > > node = cpu_to_node(random_cpu); > > > > /* Find an idle CPU in the same node */ > > cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags); > > if (cpu >= 0) > > break; > > > > /* No luck? Try other nodes */ > > } while (cpumask_andnot(hop_cpus, hop_cpus, cpumask_of_node(node))); > > > > return cpu; > > } > > > > static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags) > > { > > const struct cpumask *next, *prev = cpu_none_mask; > > int prev_node = cpu_to_node(prev_cpu); > > ... > > for_each_numa_hop_mask(next, prev_node) { > > cpumask_andnot(hop_cpus, next, prev); > > cpu = scx_pick_idle_cpu_from_hop(hop_cpus, cpus_allowed); > > prev = next; > > } > > ... > > } > > > > Not tested, but should work. > > Makes sense to me, I'll do some testing with this. 
One thing you can do to optimize it is introducing a function that pulls nodes from the hop_cpus: void sched_get_hop_nodes(nodemask_t *hop_nodes, const struct cpumask *hop_cpus) { int cpu; for_each_cpu(cpu, hop_cpus) { node_set(cpu_to_node(cpu);, hop_nodes); cpu = cpumask_next_zero(cpu, cpumask_of_node(node) } } This should be O(N), but it will let you to avoid O(N*M) in the loop condition inside scx_pick_idle_cpu_from_hop(): int scx_pick_idle_cpu_from_hop(nodemask_t *hop_nodes, struct cpumask *cpus_allowed) { int node, idle_cpu, random_cpu; for_each_node_mask(node, &hop_nodes) { /* Pick a 'random' CPU in the node */ random_cpu = cpumask_any_and_distribute(cpumask_of_node(node), cpus_allowed); if (random_cpu >= nr_cpu_ids) continue; /* Find an idle CPU in the same node */ idle_cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags); if (idle_cpu >= 0) break; } return cpu; } And at this point I'd also compare the above with non-randomized version: static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags) { const struct cpumask *next, *prev = cpu_none_mask; int prev_node = cpu_to_node(prev_cpu); nodemask_t hop_nodes; ... for_each_numa_hop_mask(next, prev_node) { if (!cpumask_and_andnot(hop_cpus, next, cpus_allow, prev)) goto cont; sched_get_hop_nodes(hop_nodes, hop_cpus); for_each_node_mask(node, hop_nodes) { cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags); if (cpu >= 0) goto found; } cont: prev = next; } ... } Don't know how it works, but it looks really good. Thanks, Yury ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks 2024-12-10 2:10 ` Yury Norov @ 2024-12-14 6:05 ` Andrea Righi 0 siblings, 0 replies; 20+ messages in thread From: Andrea Righi @ 2024-12-14 6:05 UTC (permalink / raw) To: Yury Norov; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel On Mon, Dec 09, 2024 at 06:10:45PM -0800, Yury Norov wrote: > On Tue, Dec 10, 2024 at 01:14:43AM +0100, Andrea Righi wrote: > > > And altogether this should look like: > > > > > > int scx_pick_idle_cpu_from_hop(struct cpumask *hop_cpus, struct cpumask *cpus_allowed) > > > { > > > int node, cpu, random_cpu; > > > > > > do { > > > > > > /* Pick a 'random' CPU in the hop */ > > > random_cpu = cpumask_any_and_distribute(hop_cpus, cpus_allowed); > > > if (random_cpu >= nr_cpu_ids) > > > continue; > > > > > > node = cpu_to_node(random_cpu); > > > > > > /* Find an idle CPU in the same node */ > > > cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags); > > > if (cpu >= 0) > > > break; > > > > > > /* No luck? Try other nodes */ > > > } while (cpumask_andnot(hop_cpus, hop_cpus, cpumask_of_node(node))); > > > > > > return cpu; > > > } > > > > > > static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags) > > > { > > > const struct cpumask *next, *prev = cpu_none_mask; > > > int prev_node = cpu_to_node(prev_cpu); > > > ... > > > for_each_numa_hop_mask(next, prev_node) { > > > cpumask_andnot(hop_cpus, next, prev); > > > cpu = scx_pick_idle_cpu_from_hop(hop_cpus, cpus_allowed); > > > prev = next; > > > } > > > ... > > > } > > > > > > Not tested, but should work. > > > > Makes sense to me, I'll do some testing with this. 
> > One thing you can do to optimize it is introducing a function that > pulls nodes from the hop_cpus: > > void sched_get_hop_nodes(nodemask_t *hop_nodes, const struct cpumask *hop_cpus) > { > int cpu; > > for_each_cpu(cpu, hop_cpus) { > node_set(cpu_to_node(cpu);, hop_nodes); > cpu = cpumask_next_zero(cpu, cpumask_of_node(node) > } > } > > This should be O(N), but it will let you to avoid O(N*M) in the loop > condition inside scx_pick_idle_cpu_from_hop(): > > int scx_pick_idle_cpu_from_hop(nodemask_t *hop_nodes, struct cpumask *cpus_allowed) > { > int node, idle_cpu, random_cpu; > > for_each_node_mask(node, &hop_nodes) { > /* Pick a 'random' CPU in the node */ > random_cpu = cpumask_any_and_distribute(cpumask_of_node(node), cpus_allowed); > if (random_cpu >= nr_cpu_ids) > continue; > > /* Find an idle CPU in the same node */ > idle_cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags); > if (idle_cpu >= 0) > break; > > } > > return cpu; > } > > And at this point I'd also compare the above with non-randomized > version: > > static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags) > { > const struct cpumask *next, *prev = cpu_none_mask; > int prev_node = cpu_to_node(prev_cpu); > nodemask_t hop_nodes; > ... > for_each_numa_hop_mask(next, prev_node) { > if (!cpumask_and_andnot(hop_cpus, next, cpus_allow, prev)) > goto cont; > > sched_get_hop_nodes(hop_nodes, hop_cpus); > for_each_node_mask(node, hop_nodes) { > cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags); > if (cpu >= 0) > goto found; > } > > cont: > prev = next; > } > ... > } > > Don't know how it works, but it looks really good. I've done some tests with multiple variations of this. I think one of the issues is that we need to allocate hop_cpus and it seems to introduce some overhead (or it could be the operations on the cpumasks), but it seems to pretty much nullify the benefits of the per-node idle cpumasks. 
However, I came up with this one, which doesn't seem to introduce much overhead:

int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
{
	int dist, n, min_node, min_dist;

	if (state >= NR_NODE_STATES)
		return NUMA_NO_NODE;

	min_node = NUMA_NO_NODE;
	min_dist = INT_MAX;

	for_each_node_state(n, state) {
		if (n == start || node_isset(n, *hop_nodes))
			continue;
		dist = node_distance(start, n);
		if (dist < min_dist) {
			min_dist = dist;
			min_node = n;
		}
	}
	if (min_node != NUMA_NO_NODE)
		node_set(min_node, *hop_nodes);

	return min_node;
}

#define for_each_numa_hop_node(__node, __start, __hop_nodes, __state)	\
	for (int __node = __start;					\
	     __node != NUMA_NO_NODE;					\
	     __node = sched_numa_hop_node(&(__hop_nodes), __start, __state))

It's O(N^2) with the number of nodes, but it shouldn't be too bad, except maybe on MAXSMP systems, and even in this case scx schedulers have the option to use scx_bpf_pick_idle_cpu_node(), instead of the generic scx_pick_idle_cpu(), and define their own criteria to iterate over the NUMA nodes. I'll do more testing; if we don't find any issues I think I'll go with this one in the next patch set. -Andrea ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks 2024-12-09 10:40 ` [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi 2024-12-09 19:32 ` Yury Norov @ 2024-12-11 17:46 ` Yury Norov 1 sibling, 0 replies; 20+ messages in thread From: Yury Norov @ 2024-12-11 17:46 UTC (permalink / raw) To: Andrea Righi; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel > +static struct cpumask *get_curr_idle_cpumask(void) > +{ > + int node = cpu_to_node(smp_processor_id()); numa_node_id() > + return get_idle_cpumask_node(node); > +} ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 2/4] sched_ext: Get rid of the scx_selcpu_topo_numa logic 2024-12-09 10:40 [PATCHSET v5 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi 2024-12-09 10:40 ` [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi @ 2024-12-09 10:40 ` Andrea Righi 2024-12-11 8:05 ` Changwoo Min 2024-12-09 10:40 ` [PATCH 3/4] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi 2024-12-09 10:40 ` [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi 3 siblings, 1 reply; 20+ messages in thread From: Andrea Righi @ 2024-12-09 10:40 UTC (permalink / raw) To: Tejun Heo, David Vernet; +Cc: Yury Norov, Changwoo Min, linux-kernel With the introduction of separate per-NUMA node cpumasks, we automatically track idle CPUs within each NUMA node. This makes the special logic for determining idle CPUs in each NUMA node redundant and unnecessary, so we can get rid of it. Signed-off-by: Andrea Righi <arighi@nvidia.com> --- kernel/sched/ext.c | 77 +++------------------------------------------- 1 file changed, 5 insertions(+), 72 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 7e7f2869826f..ed2f0d13915c 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -886,7 +886,6 @@ static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); #ifdef CONFIG_SMP static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc); -static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa); #endif static struct static_key_false scx_has_op[SCX_OPI_END] = @@ -3375,25 +3374,6 @@ static unsigned int numa_weight(s32 cpu) return sg->group_weight; } -/* - * Return the cpumask representing the NUMA domain of @cpu (or NULL if the NUMA - * domain is not defined). 
- */ -static struct cpumask *numa_span(s32 cpu) -{ - struct sched_domain *sd; - struct sched_group *sg; - - sd = rcu_dereference(per_cpu(sd_numa, cpu)); - if (!sd) - return NULL; - sg = sd->groups; - if (!sg) - return NULL; - - return sched_group_span(sg); -} - /* * Return true if the LLC domains do not perfectly overlap with the NUMA * domains, false otherwise. @@ -3445,7 +3425,7 @@ static bool llc_numa_mismatch(void) */ static void update_selcpu_topology(void) { - bool enable_llc = false, enable_numa = false; + bool enable_llc = false; unsigned int nr_cpus; s32 cpu = cpumask_first(cpu_online_mask); @@ -3462,43 +3442,20 @@ static void update_selcpu_topology(void) rcu_read_lock(); nr_cpus = llc_weight(cpu); if (nr_cpus > 0) { - if (nr_cpus < num_online_cpus()) + if ((nr_cpus < num_online_cpus()) && llc_numa_mismatch()) enable_llc = true; pr_debug("sched_ext: LLC=%*pb weight=%u\n", cpumask_pr_args(llc_span(cpu)), llc_weight(cpu)); } - - /* - * Enable NUMA optimization only when there are multiple NUMA domains - * among the online CPUs and the NUMA domains don't perfectly overlaps - * with the LLC domains. - * - * If all CPUs belong to the same NUMA node and the same LLC domain, - * enabling both NUMA and LLC optimizations is unnecessary, as checking - * for an idle CPU in the same domain twice is redundant. - */ - nr_cpus = numa_weight(cpu); - if (nr_cpus > 0) { - if (nr_cpus < num_online_cpus() && llc_numa_mismatch()) - enable_numa = true; - pr_debug("sched_ext: NUMA=%*pb weight=%u\n", - cpumask_pr_args(numa_span(cpu)), numa_weight(cpu)); - } rcu_read_unlock(); pr_debug("sched_ext: LLC idle selection %s\n", enable_llc ? "enabled" : "disabled"); - pr_debug("sched_ext: NUMA idle selection %s\n", - enable_numa ? 
"enabled" : "disabled"); if (enable_llc) static_branch_enable_cpuslocked(&scx_selcpu_topo_llc); else static_branch_disable_cpuslocked(&scx_selcpu_topo_llc); - if (enable_numa) - static_branch_enable_cpuslocked(&scx_selcpu_topo_numa); - else - static_branch_disable_cpuslocked(&scx_selcpu_topo_numa); } /* @@ -3519,9 +3476,8 @@ static void update_selcpu_topology(void) * 4. Pick a CPU within the same NUMA node, if enabled: * - choose a CPU from the same NUMA node to reduce memory access latency. * - * Step 3 and 4 are performed only if the system has, respectively, multiple - * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and - * scx_selcpu_topo_numa). + * Step 3 is performed only if the system has multiple LLC domains that are not + * perfectly overlapping with the NUMA domains (see scx_selcpu_topo_llc). * * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because * we never call ops.select_cpu() for them, see select_task_rq(). @@ -3530,7 +3486,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *found) { const struct cpumask *llc_cpus = NULL; - const struct cpumask *numa_cpus = NULL; int node = cpu_to_node(prev_cpu); s32 cpu; @@ -3552,13 +3507,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, * CPU affinity), the task will simply use the flat scheduling domain * defined by user-space. */ - if (p->nr_cpus_allowed >= num_possible_cpus()) { - if (static_branch_maybe(CONFIG_NUMA, &scx_selcpu_topo_numa)) - numa_cpus = numa_span(prev_cpu); - + if (p->nr_cpus_allowed >= num_possible_cpus()) if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc)) llc_cpus = llc_span(prev_cpu); - } /* * If WAKE_SYNC, try to migrate the wakee to the waker's CPU. @@ -3620,15 +3571,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, goto cpu_found; } - /* - * Search for any fully idle core in the same NUMA node. 
- */ - if (numa_cpus) { - cpu = scx_pick_idle_cpu_from_node(node, numa_cpus, SCX_PICK_IDLE_CORE); - if (cpu >= 0) - goto cpu_found; - } - /* * Search for any full idle core usable by the task. */ @@ -3654,15 +3596,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, goto cpu_found; } - /* - * Search for any idle CPU in the same NUMA node. - */ - if (numa_cpus) { - cpu = scx_pick_idle_cpu_from_node(node, numa_cpus, 0); - if (cpu >= 0) - goto cpu_found; - } - /* * Search for any idle CPU usable by the task. */ -- 2.47.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 2/4] sched_ext: Get rid of the scx_selcpu_topo_numa logic 2024-12-09 10:40 ` [PATCH 2/4] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi @ 2024-12-11 8:05 ` Changwoo Min 2024-12-11 12:22 ` Andrea Righi 0 siblings, 1 reply; 20+ messages in thread From: Changwoo Min @ 2024-12-11 8:05 UTC (permalink / raw) To: Andrea Righi, Tejun Heo, David Vernet; +Cc: Yury Norov, linux-kernel Hello Andrea, On 24. 12. 9. 19:40, Andrea Righi wrote: > /* > @@ -3519,9 +3476,8 @@ static void update_selcpu_topology(void) > * 4. Pick a CPU within the same NUMA node, if enabled: > * - choose a CPU from the same NUMA node to reduce memory access latency. > * > - * Step 3 and 4 are performed only if the system has, respectively, multiple > - * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and > - * scx_selcpu_topo_numa). > + * Step 3 is performed only if the system has multiple LLC domains that are not > + * perfectly overlapping with the NUMA domains (see scx_selcpu_topo_llc). > * > * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because > * we never call ops.select_cpu() for them, see select_task_rq(). > @@ -3530,7 +3486,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, > u64 wake_flags, bool *found) Adding Step 5 to the comment describing how it works if there is no idle CPU within a NUMA node would be nice. For example, 5. Pick any idle CPU usable by the task. Regards, Changwoo Min ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/4] sched_ext: Get rid of the scx_selcpu_topo_numa logic
  2024-12-11  8:05   ` Changwoo Min
@ 2024-12-11 12:22     ` Andrea Righi
  0 siblings, 0 replies; 20+ messages in thread
From: Andrea Righi @ 2024-12-11 12:22 UTC (permalink / raw)
  To: Changwoo Min; +Cc: Tejun Heo, David Vernet, Yury Norov, linux-kernel

On Wed, Dec 11, 2024 at 05:05:35PM +0900, Changwoo Min wrote:
> Hello Andrea,
> 
> On 24. 12. 9. 19:40, Andrea Righi wrote:
> >   /*
> > @@ -3519,9 +3476,8 @@ static void update_selcpu_topology(void)
> >    * 4. Pick a CPU within the same NUMA node, if enabled:
> >    *   - choose a CPU from the same NUMA node to reduce memory access latency.
> >    *
> > - * Step 3 and 4 are performed only if the system has, respectively, multiple
> > - * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
> > - * scx_selcpu_topo_numa).
> > + * Step 3 is performed only if the system has multiple LLC domains that are not
> > + * perfectly overlapping with the NUMA domains (see scx_selcpu_topo_llc).
> >    *
> >    * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
> >    * we never call ops.select_cpu() for them, see select_task_rq().
> > @@ -3530,7 +3486,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> >   			   u64 wake_flags, bool *found)
> 
> Adding Step 5 to the comment describing how it works if there is no idle
> CPU within a NUMA node would be nice. For example,
> 
>     5. Pick any idle CPU usable by the task.

Good idea, will add that, thanks!

-Andrea

^ permalink raw reply	[flat|nested] 20+ messages in thread
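[Editor's note] The five-step preference order discussed in the thread above can be illustrated as a plain cascade over candidate CPU sets. The sketch below is a userspace model, not kernel code: `select_cpu()`, `pick_first()`, and the 64-bit bitmask representation are invented for the example, and the real scx_select_cpu_dfl() handles SMT and task affinity details omitted here.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: each uint64_t is a CPU bitmask (up to 64 CPUs). */
static int pick_first(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (1ULL << cpu))
			return cpu;
	return -1; /* no candidate */
}

/*
 * Cascade mirroring the commented selection order:
 *  1. prev_cpu, if its whole physical core is idle,
 *  2. an idle core within the same LLC,
 *  3. an idle core anywhere (model simplification),
 *  4. an idle CPU within the same NUMA node,
 *  5. any idle CPU usable by the task.
 */
static int select_cpu(int prev_cpu, uint64_t idle_smt, uint64_t llc,
		      uint64_t node, uint64_t idle_cpu)
{
	int cpu;

	if (idle_smt & (1ULL << prev_cpu))
		return prev_cpu;                 /* step 1 */
	if ((cpu = pick_first(idle_smt & llc)) >= 0)
		return cpu;                      /* step 2 */
	if ((cpu = pick_first(idle_smt)) >= 0)
		return cpu;                      /* step 3 */
	if ((cpu = pick_first(idle_cpu & node)) >= 0)
		return cpu;                      /* step 4 */
	return pick_first(idle_cpu);             /* step 5 */
}
```

Each step only runs when every more-local set is empty, which is exactly why Changwoo's suggested "step 5" comment completes the description.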
* [PATCH 3/4] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE
  2024-12-09 10:40 [PATCHSET v5 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
  2024-12-09 10:40 ` [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
  2024-12-09 10:40 ` [PATCH 2/4] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
@ 2024-12-09 10:40 ` Andrea Righi
  2024-12-11 18:21   ` Yury Norov
  2024-12-09 10:40 ` [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
  3 siblings, 1 reply; 20+ messages in thread
From: Andrea Righi @ 2024-12-09 10:40 UTC (permalink / raw)
  To: Tejun Heo, David Vernet; +Cc: Yury Norov, Changwoo Min, linux-kernel

Add a flag (SCX_OPS_NODE_BUILTIN_IDLE) to toggle between a global flat
idle cpumask and multiple per-node CPU masks.

This allows each sched_ext scheduler to choose between the new or old
model, preserving backward compatibility and preventing disruptions to
existing behavior.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 56 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 50 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ed2f0d13915c..d0d57323bcfc 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -122,6 +122,12 @@ enum scx_ops_flags {
 	 */
 	SCX_OPS_SWITCH_PARTIAL	= 1LLU << 3,
 
+	/*
+	 * If set, enable per-node idle cpumasks. If clear, use a single global
+	 * flat idle cpumask.
+	 */
+	SCX_OPS_BUILTIN_IDLE_PER_NODE = 1LLU << 4,
+
 	/*
 	 * CPU cgroup support flags
 	 */
@@ -131,6 +137,7 @@ enum scx_ops_flags {
 				  SCX_OPS_ENQ_LAST |
 				  SCX_OPS_ENQ_EXITING |
 				  SCX_OPS_SWITCH_PARTIAL |
+				  SCX_OPS_BUILTIN_IDLE_PER_NODE |
 				  SCX_OPS_HAS_CGROUP_WEIGHT,
 };
 
@@ -886,6 +893,7 @@ static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
 
 #ifdef CONFIG_SMP
 static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
+static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
 #endif
 
 static struct static_key_false scx_has_op[SCX_OPI_END] =
@@ -929,18 +937,32 @@ static struct delayed_work scx_watchdog_work;
 #define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
 #endif
 
-static struct {
+struct idle_cpumask {
 	cpumask_var_t cpu;
 	cpumask_var_t smt;
-} **idle_masks CL_ALIGNED_IF_ONSTACK;
+};
+
+/*
+ * cpumasks to track idle CPUs within each NUMA node.
+ *
+ * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
+ * from node 0 is used to track all idle CPUs system-wide.
+ */
+static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
 
 static struct cpumask *get_idle_cpumask_node(int node)
 {
+	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
+		return idle_masks[0]->cpu;
+
 	return idle_masks[node]->cpu;
 }
 
 static struct cpumask *get_idle_smtmask_node(int node)
 {
+	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
+		return idle_masks[0]->smt;
+
 	return idle_masks[node]->smt;
 }
 
@@ -3423,7 +3445,7 @@ static bool llc_numa_mismatch(void)
  * CPU belongs to a single LLC domain, and that each LLC domain is entirely
  * contained within a single NUMA node.
  */
-static void update_selcpu_topology(void)
+static void update_selcpu_topology(struct sched_ext_ops *ops)
 {
 	bool enable_llc = false;
 	unsigned int nr_cpus;
@@ -3442,8 +3464,16 @@ static void update_selcpu_topology(void)
 	rcu_read_lock();
 	nr_cpus = llc_weight(cpu);
 	if (nr_cpus > 0) {
-		if ((nr_cpus < num_online_cpus()) && llc_numa_mismatch())
+		if (nr_cpus < num_online_cpus())
 			enable_llc = true;
+		/*
+		 * No need to enable LLC optimization if the LLC domains are
+		 * perfectly overlapping with the NUMA domains when per-node
+		 * cpumasks are enabled.
+		 */
+		if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
+		    !llc_numa_mismatch())
+			enable_llc = false;
 		pr_debug("sched_ext: LLC=%*pb weight=%u\n",
 			 cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
 	}
@@ -3456,6 +3486,14 @@ static void update_selcpu_topology(void)
 		static_branch_enable_cpuslocked(&scx_selcpu_topo_llc);
 	else
 		static_branch_disable_cpuslocked(&scx_selcpu_topo_llc);
+
+	/*
+	 * Check if we need to enable per-node cpumasks.
+	 */
+	if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
+		static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
+	else
+		static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
 }
 
 /*
@@ -3683,6 +3721,12 @@ static void reset_idle_masks(void)
 {
 	int node;
 
+	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
+		cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
+		cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
+		return;
+	}
+
 	/*
 	 * Consider all online cpus idle. Should converge to the actual state
 	 * quickly.
@@ -3740,7 +3784,7 @@ static void handle_hotplug(struct rq *rq, bool online)
 	atomic_long_inc(&scx_hotplug_seq);
 
 	if (scx_enabled())
-		update_selcpu_topology();
+		update_selcpu_topology(&scx_ops);
 
 	if (online && SCX_HAS_OP(cpu_online))
 		SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu);
@@ -5618,7 +5662,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 
 	check_hotplug_seq(ops);
 #ifdef CONFIG_SMP
-	update_selcpu_topology();
+	update_selcpu_topology(ops);
 #endif
 	cpus_read_unlock();

-- 
2.47.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread
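[Editor's note] The node-0 fallback this patch adds to get_idle_cpumask_node() can be modeled in plain C. In the sketch below, a plain `bool` stands in for the `scx_builtin_idle_per_node` static key and a `uint64_t` per node stands in for `cpumask_var_t`; the names mirror the patch, but the code is illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 4

/* Toy stand-ins for the kernel's per-node idle masks. */
static uint64_t idle_cpu_mask[MAX_NODES];

/* Models the scx_builtin_idle_per_node static branch. */
static bool idle_per_node;

static uint64_t *get_idle_cpumask_node(int node)
{
	/*
	 * When per-node tracking is off, node 0 holds the single
	 * system-wide mask, regardless of which node was asked for.
	 */
	if (!idle_per_node)
		return &idle_cpu_mask[0];
	return &idle_cpu_mask[node];
}
```

The point of the static-key construction in the real code is that schedulers not setting SCX_OPS_BUILTIN_IDLE_PER_NODE pay only a patched no-op branch before falling through to the flat mask.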
* Re: [PATCH 3/4] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE 2024-12-09 10:40 ` [PATCH 3/4] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi @ 2024-12-11 18:21 ` Yury Norov 2024-12-11 19:59 ` Andrea Righi 0 siblings, 1 reply; 20+ messages in thread From: Yury Norov @ 2024-12-11 18:21 UTC (permalink / raw) To: Andrea Righi; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel On Mon, Dec 09, 2024 at 11:40:57AM +0100, Andrea Righi wrote: > Add a flag (SCX_OPS_NODE_BUILTIN_IDLE) to toggle between a global flat > idle cpumask and multiple per-node CPU masks. > > This allows each sched_ext scheduler to choose between the new or old > model, preserving backward compatibility and preventing disruptions to > existing behavior. > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > --- > kernel/sched/ext.c | 56 +++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 50 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index ed2f0d13915c..d0d57323bcfc 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -122,6 +122,12 @@ enum scx_ops_flags { > */ > SCX_OPS_SWITCH_PARTIAL = 1LLU << 3, > > + /* > + * If set, enable per-node idle cpumasks. If clear, use a single global > + * flat idle cpumask. 
> + */ > + SCX_OPS_BUILTIN_IDLE_PER_NODE = 1LLU << 4, > + > /* > * CPU cgroup support flags > */ > @@ -131,6 +137,7 @@ enum scx_ops_flags { > SCX_OPS_ENQ_LAST | > SCX_OPS_ENQ_EXITING | > SCX_OPS_SWITCH_PARTIAL | > + SCX_OPS_BUILTIN_IDLE_PER_NODE | > SCX_OPS_HAS_CGROUP_WEIGHT, > }; > > @@ -886,6 +893,7 @@ static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); > > #ifdef CONFIG_SMP > static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc); > +static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node); > #endif > > static struct static_key_false scx_has_op[SCX_OPI_END] = > @@ -929,18 +937,32 @@ static struct delayed_work scx_watchdog_work; > #define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp > #endif > > -static struct { > +struct idle_cpumask { > cpumask_var_t cpu; > cpumask_var_t smt; > -} **idle_masks CL_ALIGNED_IF_ONSTACK; > +}; > + > +/* > + * cpumasks to track idle CPUs within each NUMA node. > + * > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask > + * from node 0 is used to track all idle CPUs system-wide. > + */ > +static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK; > > static struct cpumask *get_idle_cpumask_node(int node) > { > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > + return idle_masks[0]->cpu; > + > return idle_masks[node]->cpu; > } > > static struct cpumask *get_idle_smtmask_node(int node) > { > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > + return idle_masks[0]->smt; > + > return idle_masks[node]->smt; > } > > @@ -3423,7 +3445,7 @@ static bool llc_numa_mismatch(void) > * CPU belongs to a single LLC domain, and that each LLC domain is entirely > * contained within a single NUMA node. 
> */ > -static void update_selcpu_topology(void) > +static void update_selcpu_topology(struct sched_ext_ops *ops) > { > bool enable_llc = false; > unsigned int nr_cpus; > @@ -3442,8 +3464,16 @@ static void update_selcpu_topology(void) > rcu_read_lock(); > nr_cpus = llc_weight(cpu); > if (nr_cpus > 0) { > - if ((nr_cpus < num_online_cpus()) && llc_numa_mismatch()) > + if (nr_cpus < num_online_cpus()) > enable_llc = true; > + /* > + * No need to enable LLC optimization if the LLC domains are > + * perfectly overlapping with the NUMA domains when per-node > + * cpumasks are enabled. > + */ > + if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) && > + !llc_numa_mismatch()) > + enable_llc = false; > pr_debug("sched_ext: LLC=%*pb weight=%u\n", > cpumask_pr_args(llc_span(cpu)), llc_weight(cpu)); > } > @@ -3456,6 +3486,14 @@ static void update_selcpu_topology(void) > static_branch_enable_cpuslocked(&scx_selcpu_topo_llc); > else > static_branch_disable_cpuslocked(&scx_selcpu_topo_llc); > + > + /* > + * Check if we need to enable per-node cpumasks. > + */ > + if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) > + static_branch_enable_cpuslocked(&scx_builtin_idle_per_node); > + else > + static_branch_disable_cpuslocked(&scx_builtin_idle_per_node); > } The patch that introduces the flag should go the very first in the series, but should unconditionally disable scx_builtin_idle_per_node. The following patches should add all the machinery you need. The machinery should be conditional on the scx_builtin_idle_per_node, i.e. disabled for a while. Doing that, you'll be able to introduce your functionality as a whole: static struct cpumask *get_idle_cpumask_node(int node) { if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) return idle_masks[0]->cpu; return idle_masks[node]->cpu; } Much better than patching just introduced code, right? The very last patch should only be a chunk that enables scx_builtin_idle_per_node based on SCX_OPS_BUILTIN_IDLE_PER_NODE. 
This way, when your feature will get merged, from git-bisect perspective it will be enabled atomically by the very last patch, but those interested in internals will have nice coherent history. Thanks, Yury > > /* > @@ -3683,6 +3721,12 @@ static void reset_idle_masks(void) > { > int node; > > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) { > + cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask); > + cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask); > + return; > + } > + > /* > * Consider all online cpus idle. Should converge to the actual state > * quickly. > @@ -3740,7 +3784,7 @@ static void handle_hotplug(struct rq *rq, bool online) > atomic_long_inc(&scx_hotplug_seq); > > if (scx_enabled()) > - update_selcpu_topology(); > + update_selcpu_topology(&scx_ops); > > if (online && SCX_HAS_OP(cpu_online)) > SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu); > @@ -5618,7 +5662,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link) > > check_hotplug_seq(ops); > #ifdef CONFIG_SMP > - update_selcpu_topology(); > + update_selcpu_topology(ops); > #endif > cpus_read_unlock(); > > -- > 2.47.1 ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE
  2024-12-11 18:21   ` Yury Norov
@ 2024-12-11 19:59     ` Andrea Righi
  0 siblings, 0 replies; 20+ messages in thread
From: Andrea Righi @ 2024-12-11 19:59 UTC (permalink / raw)
  To: Yury Norov; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel

On Wed, Dec 11, 2024 at 10:21:49AM -0800, Yury Norov wrote:
...
> > +	/*
> > +	 * Check if we need to enable per-node cpumasks.
> > +	 */
> > +	if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
> > +		static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
> > +	else
> > +		static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
> > }
> 
> The patch that introduces the flag should go the very first in the series,
> but should unconditionally disable scx_builtin_idle_per_node.

Ack, that's a good idea.

> 
> The following patches should add all the machinery you need. The machinery
> should be conditional on the scx_builtin_idle_per_node, i.e. disabled for
> a while.
> 
> Doing that, you'll be able to introduce your functionality as a whole:
> 
> static struct cpumask *get_idle_cpumask_node(int node)
> {
> 	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> 		return idle_masks[0]->cpu;
> 
> 	return idle_masks[node]->cpu;
> }
> 
> Much better than patching just introduced code, right?

Agreed.

> 
> The very last patch should only be a chunk that enables scx_builtin_idle_per_node
> based on SCX_OPS_BUILTIN_IDLE_PER_NODE.
> 
> This way, when your feature will get merged, from git-bisect perspective
> it will be enabled atomically by the very last patch, but those interested
> in internals will have nice coherent history.

Makes sense, I'll refactor this in the next version, thanks!

-Andrea

^ permalink raw reply	[flat|nested] 20+ messages in thread
* [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers 2024-12-09 10:40 [PATCHSET v5 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi ` (2 preceding siblings ...) 2024-12-09 10:40 ` [PATCH 3/4] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi @ 2024-12-09 10:40 ` Andrea Righi 2024-12-11 17:43 ` Yury Norov 3 siblings, 1 reply; 20+ messages in thread From: Andrea Righi @ 2024-12-09 10:40 UTC (permalink / raw) To: Tejun Heo, David Vernet; +Cc: Yury Norov, Changwoo Min, linux-kernel Add the following kfunc's to provide scx schedulers direct access to per-node idle cpumasks information: const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) s32 scx_bpf_pick_idle_cpu_node(int node, const cpumask_t *cpus_allowed, u64 flags) int scx_bpf_cpu_to_node(s32 cpu) Signed-off-by: Andrea Righi <arighi@nvidia.com> --- kernel/sched/ext.c | 96 +++++++++++++++++++++++- tools/sched_ext/include/scx/common.bpf.h | 4 + tools/sched_ext/include/scx/compat.bpf.h | 19 +++++ 3 files changed, 117 insertions(+), 2 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index d0d57323bcfc..ea7cc481782c 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -433,6 +433,7 @@ struct sched_ext_ops { * - scx_bpf_select_cpu_dfl() * - scx_bpf_test_and_clear_cpu_idle() * - scx_bpf_pick_idle_cpu() + * - scx_bpf_pick_idle_cpu_node() * * The user also must implement ops.select_cpu() as the default * implementation relies on scx_bpf_select_cpu_dfl(). 
@@ -955,6 +956,8 @@ static struct cpumask *get_idle_cpumask_node(int node) if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) return idle_masks[0]->cpu; + if (node < 0 || node >= num_possible_nodes()) + return NULL; return idle_masks[node]->cpu; } @@ -963,6 +966,8 @@ static struct cpumask *get_idle_smtmask_node(int node) if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) return idle_masks[0]->smt; + if (node < 0 || node >= num_possible_nodes()) + return NULL; return idle_masks[node]->smt; } @@ -7469,6 +7474,16 @@ __bpf_kfunc u32 scx_bpf_nr_cpu_ids(void) return nr_cpu_ids; } +/** + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to + */ +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu) +{ + if (cpu < 0 || cpu >= nr_cpu_ids) + return -EINVAL; + return cpu_to_node(cpu); +} + /** * scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask */ @@ -7499,11 +7514,32 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask) */ } +/** + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking + * per-CPU cpumask of a target NUMA node. + * + * Returns an empty cpumask if idle tracking is not enabled, if @node is not + * valid, or running on a UP kernel. + */ +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) +{ + if (!static_branch_likely(&scx_builtin_idle_enabled)) { + scx_ops_error("built-in idle tracking is disabled"); + return cpu_none_mask; + } + if (!static_branch_likely(&scx_builtin_idle_per_node)) { + scx_ops_error("per-node idle tracking is disabled"); + return cpu_none_mask; + } + + return get_idle_cpumask_node(node) ? : cpu_none_mask; +} /** * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking * per-CPU cpumask of the current NUMA node. * - * Returns NULL if idle tracking is not enabled, or running on a UP kernel. + * Returns an emtpy cpumask if idle tracking is not enabled, or running on a UP + * kernel. 
*/ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) { @@ -7515,12 +7551,35 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) return get_curr_idle_cpumask(); } +/** + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking, + * per-physical-core cpumask of a target NUMA node. Can be used to determine + * if an entire physical core is free. + * + * Returns an empty cpumask if idle tracking is not enabled, if @node is not + * valid, or running on a UP kernel. + */ +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) +{ + if (!static_branch_likely(&scx_builtin_idle_enabled)) { + scx_ops_error("built-in idle tracking is disabled"); + return cpu_none_mask; + } + if (!static_branch_likely(&scx_builtin_idle_per_node)) { + scx_ops_error("per-node idle tracking is disabled"); + return cpu_none_mask; + } + + return get_idle_smtmask_node(node) ? : cpu_none_mask; +} + /** * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking, * per-physical-core cpumask of the current NUMA node. Can be used to determine * if an entire physical core is free. * - * Returns NULL if idle tracking is not enabled, or running on a UP kernel. + * Returns an empty cumask if idle tracking is not enabled, or running on a UP + * kernel. */ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void) { @@ -7569,6 +7628,35 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) return false; } +/** + * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node + * @node: target NUMA node + * @cpus_allowed: Allowed cpumask + * @flags: %SCX_PICK_IDLE_CPU_* flags + * + * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node. + * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu + * was found. + * + * Unavailable if ops.update_idle() is implemented and + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set or if %SCX_OPS_KEEP_BUILTIN_IDLE is + * not set. 
+ */ +__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(int node, const struct cpumask *cpus_allowed, + u64 flags) +{ + if (!static_branch_likely(&scx_builtin_idle_enabled)) { + scx_ops_error("built-in idle tracking is disabled"); + return -EBUSY; + } + if (!static_branch_likely(&scx_builtin_idle_per_node)) { + scx_ops_error("per-node idle tracking is disabled"); + return -EBUSY; + } + + return scx_pick_idle_cpu_from_node(node, cpus_allowed, flags); +} + /** * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu * @cpus_allowed: Allowed cpumask @@ -7705,14 +7793,18 @@ BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap) BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur) BTF_ID_FLAGS(func, scx_bpf_cpuperf_set) BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids) +BTF_ID_FLAGS(func, scx_bpf_cpu_to_node) BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE) BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE) BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE) BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE) +BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE) BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE) +BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE) BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE) BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle) BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU) diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index 625f5b046776..9bbf6d5083b5 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -59,14 +59,18 @@ u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak; u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak; void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak; u32 scx_bpf_nr_cpu_ids(void) __ksym __weak; +int 
scx_bpf_cpu_to_node(s32 cpu) __ksym __weak; const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak; const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak; void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak; const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym; +const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) __ksym __weak; const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym; +const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) __ksym __weak; void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym; bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym; s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym; +s32 scx_bpf_pick_idle_cpu_node(int node, const cpumask_t *cpus_allowed, u64 flags) __ksym __weak; s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym; bool scx_bpf_task_running(const struct task_struct *p) __ksym; s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym; diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h index d56520100a26..587650490743 100644 --- a/tools/sched_ext/include/scx/compat.bpf.h +++ b/tools/sched_ext/include/scx/compat.bpf.h @@ -125,6 +125,25 @@ bool scx_bpf_dispatch_vtime_from_dsq___compat(struct bpf_iter_scx_dsq *it__iter, false; \ }) +#define __COMPAT_scx_bpf_cpu_to_node(cpu) \ + (bpf_ksym_exists(scx_bpf_cpu_to_node) ? \ + scx_bpf_cpu_to_node(cpu) : 0) + +#define __COMPAT_scx_bpf_get_idle_cpumask_node(node) \ + (bpf_ksym_exists(scx_bpf_get_idle_cpumask_node) ? \ + scx_bpf_get_idle_cpumask_node(node) : \ + scx_bpf_get_idle_cpumask()) \ + +#define __COMPAT_scx_bpf_get_idle_smtmask_node(node) \ + (bpf_ksym_exists(scx_bpf_get_idle_smtmask_node) ? \ + scx_bpf_get_idle_smtmask_node(node) : \ + scx_bpf_get_idle_smtmask()) \ + +#define __COMPAT_scx_bpf_pick_idle_cpu_node(node, cpus_allowed, flags) \ + (bpf_ksym_exists(scx_bpf_pick_idle_cpu_node) ? 
\ + scx_bpf_pick_idle_cpu_node(node, cpus_allowed, flags) : \ + scx_bpf_pick_idle_cpu(cpus_allowed, flags)) + /* * Define sched_ext_ops. This may be expanded to define multiple variants for * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH(). -- 2.47.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
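[Editor's note] The `__COMPAT_*` macros in this patch all follow one dispatch pattern: call the per-node kfunc when the running kernel exports it, otherwise fall back to the flat variant. Below is a plain-C sketch of that pattern, with a possibly-NULL function pointer standing in for `bpf_ksym_exists()`; `fake_node_picker` and its node-to-CPU mapping are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Flat (always available) picker: returns the lowest set bit. */
static int pick_idle_cpu(unsigned long long allowed)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (allowed & (1ULL << cpu))
			return cpu;
	return -16; /* mimic -EBUSY */
}

/* Per-node picker; NULL models a kernel without the new kfunc. */
static int (*pick_idle_cpu_node)(int node, unsigned long long allowed);

/* Mirrors the shape of __COMPAT_scx_bpf_pick_idle_cpu_node(). */
static int compat_pick_idle_cpu_node(int node, unsigned long long allowed)
{
	if (pick_idle_cpu_node)
		return pick_idle_cpu_node(node, allowed);
	/* Old kernel: ignore the node argument entirely. */
	return pick_idle_cpu(allowed);
}

/* Invented topology: node 1 owns CPUs 32-63, node 0 owns 0-31. */
static int fake_node_picker(int node, unsigned long long allowed)
{
	unsigned long long node_mask =
		node == 1 ? ~0ULL << 32 : 0xffffffffULL;

	return pick_idle_cpu(allowed & node_mask);
}
```

The design choice matters for BPF schedulers shipped as binaries: the `__weak` ksym declarations let the same object load on kernels with and without the new helpers, degrading gracefully to the flat search.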
* Re: [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers 2024-12-09 10:40 ` [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi @ 2024-12-11 17:43 ` Yury Norov 2024-12-11 20:20 ` Andrea Righi 0 siblings, 1 reply; 20+ messages in thread From: Yury Norov @ 2024-12-11 17:43 UTC (permalink / raw) To: Andrea Righi; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel On Mon, Dec 09, 2024 at 11:40:58AM +0100, Andrea Righi wrote: > Add the following kfunc's to provide scx schedulers direct access to > per-node idle cpumasks information: > > const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) > const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) > s32 scx_bpf_pick_idle_cpu_node(int node, > const cpumask_t *cpus_allowed, u64 flags) > int scx_bpf_cpu_to_node(s32 cpu) > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > --- > kernel/sched/ext.c | 96 +++++++++++++++++++++++- > tools/sched_ext/include/scx/common.bpf.h | 4 + > tools/sched_ext/include/scx/compat.bpf.h | 19 +++++ > 3 files changed, 117 insertions(+), 2 deletions(-) > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index d0d57323bcfc..ea7cc481782c 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -433,6 +433,7 @@ struct sched_ext_ops { > * - scx_bpf_select_cpu_dfl() > * - scx_bpf_test_and_clear_cpu_idle() > * - scx_bpf_pick_idle_cpu() > + * - scx_bpf_pick_idle_cpu_node() > * > * The user also must implement ops.select_cpu() as the default > * implementation relies on scx_bpf_select_cpu_dfl(). > @@ -955,6 +956,8 @@ static struct cpumask *get_idle_cpumask_node(int node) > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > return idle_masks[0]->cpu; > > + if (node < 0 || node >= num_possible_nodes()) > + return NULL; 1. This sanity should go before the check above. 2. In-kernel users don't need to do sanity checks. BPF users should, but for them you need to move it in BPF wrapper. 3. 
-1 is a valid parameter, means NUMA_NO_NODE. > return idle_masks[node]->cpu; > } > > @@ -963,6 +966,8 @@ static struct cpumask *get_idle_smtmask_node(int node) > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > return idle_masks[0]->smt; > > + if (node < 0 || node >= num_possible_nodes()) > + return NULL; > return idle_masks[node]->smt; > } > > @@ -7469,6 +7474,16 @@ __bpf_kfunc u32 scx_bpf_nr_cpu_ids(void) > return nr_cpu_ids; > } > > +/** > + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to > + */ > +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu) > +{ > + if (cpu < 0 || cpu >= nr_cpu_ids) > + return -EINVAL; > + return cpu_to_node(cpu); > +} I believe this wrapper should be declared somewhere in kernel/sched/topology.c, and better be a separate patch. > + > /** > * scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask > */ > @@ -7499,11 +7514,32 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask) > */ > } > > +/** > + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking > + * per-CPU cpumask of a target NUMA node. > + * > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not > + * valid, or running on a UP kernel. > + */ > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) > +{ > + if (!static_branch_likely(&scx_builtin_idle_enabled)) { > + scx_ops_error("built-in idle tracking is disabled"); > + return cpu_none_mask; > + } > + if (!static_branch_likely(&scx_builtin_idle_per_node)) { > + scx_ops_error("per-node idle tracking is disabled"); > + return cpu_none_mask; > + } Nub question: is it possible that scx_builtin_idle_per_node is enable, but scx_builtin_idle_enabled not? From my naive perspective, we can't enable per-node idle masks without enabling general idle masks. Or I mislead it? > + > + return get_idle_cpumask_node(node) ? 
: cpu_none_mask;
> +}
> /**
>  * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
>  * per-CPU cpumask of the current NUMA node.
>  *
> - * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
> + * Returns an empty cpumask if idle tracking is not enabled, or running on a UP
> + * kernel.
>  */
> __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> {
> @@ -7515,12 +7551,35 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
>         return get_curr_idle_cpumask();
> }
>
> +/**
> + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
> + * per-physical-core cpumask of a target NUMA node. Can be used to determine
> + * if an entire physical core is free.
> + *
> + * Returns an empty cpumask if idle tracking is not enabled, if @node is not
> + * valid, or running on a UP kernel.
> + */
> +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
> +{
> +        if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> +                scx_ops_error("built-in idle tracking is disabled");
> +                return cpu_none_mask;
> +        }
> +        if (!static_branch_likely(&scx_builtin_idle_per_node)) {
> +                scx_ops_error("per-node idle tracking is disabled");
> +                return cpu_none_mask;
> +        }

Can you add vertical spacing between blocks?

Also, because you use this construction more than once, I think it makes
sense to make it a helper.

> +
> +        return get_idle_smtmask_node(node) ? : cpu_none_mask;
> +}
> +
> /**
>  * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
>  * per-physical-core cpumask of the current NUMA node. Can be used to determine
>  * if an entire physical core is free.
>  *
> - * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
> + * Returns an empty cpumask if idle tracking is not enabled, or running on a UP
> + * kernel.
>  */
> __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
> {
> @@ -7569,6 +7628,35 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
>         return false;
> }
>
> +/**
> + * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
> + * @node: target NUMA node
> + * @cpus_allowed: Allowed cpumask
> + * @flags: %SCX_PICK_IDLE_CPU_* flags
> + *
> + * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
> + * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu
> + * was found.
> + *
> + * Unavailable if ops.update_idle() is implemented and
> + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
> + */
> +__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(int node, const struct cpumask *cpus_allowed,
> +                                           u64 flags)
> +{

Sanity checks here?

> +        if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> +                scx_ops_error("built-in idle tracking is disabled");
> +                return -EBUSY;
> +        }
> +        if (!static_branch_likely(&scx_builtin_idle_per_node)) {
> +                scx_ops_error("per-node idle tracking is disabled");
> +                return -EBUSY;
> +        }
> +
> +        return scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> +}
> +
> /**
>  * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
>  * @cpus_allowed: Allowed cpumask
> @@ -7705,14 +7793,18 @@ BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
> BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
> BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
> BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
> +BTF_ID_FLAGS(func, scx_bpf_cpu_to_node)
> BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
> BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
> BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
> BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
> diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
> index 625f5b046776..9bbf6d5083b5 100644
> --- a/tools/sched_ext/include/scx/common.bpf.h
> +++ b/tools/sched_ext/include/scx/common.bpf.h
> @@ -59,14 +59,18 @@ u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
> u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
> void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
> u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
> +int scx_bpf_cpu_to_node(s32 cpu) __ksym __weak;
> const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
> const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
> void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak;
> const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym;
> +const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) __ksym __weak;
> const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym;
> +const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) __ksym __weak;
> void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym;
> bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
> s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
> +s32 scx_bpf_pick_idle_cpu_node(int node, const cpumask_t *cpus_allowed, u64 flags) __ksym __weak;
> s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
> bool scx_bpf_task_running(const struct task_struct *p) __ksym;
> s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
> diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
> index d56520100a26..587650490743 100644
> --- a/tools/sched_ext/include/scx/compat.bpf.h
> +++ b/tools/sched_ext/include/scx/compat.bpf.h
> @@ -125,6 +125,25 @@ bool scx_bpf_dispatch_vtime_from_dsq___compat(struct bpf_iter_scx_dsq *it__iter,
>                 false;                                                  \
>         })
>
> +#define __COMPAT_scx_bpf_cpu_to_node(cpu)                              \
> +       (bpf_ksym_exists(scx_bpf_cpu_to_node) ?                         \
> +        scx_bpf_cpu_to_node(cpu) : 0)
> +
> +#define __COMPAT_scx_bpf_get_idle_cpumask_node(node)                   \
> +       (bpf_ksym_exists(scx_bpf_get_idle_cpumask_node) ?               \
> +        scx_bpf_get_idle_cpumask_node(node) :                          \
> +        scx_bpf_get_idle_cpumask())                                    \
> +
> +#define __COMPAT_scx_bpf_get_idle_smtmask_node(node)                   \
> +       (bpf_ksym_exists(scx_bpf_get_idle_smtmask_node) ?               \
> +        scx_bpf_get_idle_smtmask_node(node) :                          \
> +        scx_bpf_get_idle_smtmask())                                    \
> +
> +#define __COMPAT_scx_bpf_pick_idle_cpu_node(node, cpus_allowed, flags) \
> +       (bpf_ksym_exists(scx_bpf_pick_idle_cpu_node) ?                  \
> +        scx_bpf_pick_idle_cpu_node(node, cpus_allowed, flags) :        \
> +        scx_bpf_pick_idle_cpu(cpus_allowed, flags))
> +
> /*
>  * Define sched_ext_ops. This may be expanded to define multiple variants for
>  * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
> --
> 2.47.1

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers 2024-12-11 17:43 ` Yury Norov @ 2024-12-11 20:20 ` Andrea Righi 2024-12-11 20:47 ` Yury Norov 0 siblings, 1 reply; 20+ messages in thread From: Andrea Righi @ 2024-12-11 20:20 UTC (permalink / raw) To: Yury Norov; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel On Wed, Dec 11, 2024 at 09:43:26AM -0800, Yury Norov wrote: > On Mon, Dec 09, 2024 at 11:40:58AM +0100, Andrea Righi wrote: > > Add the following kfunc's to provide scx schedulers direct access to > > per-node idle cpumasks information: > > > > const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) > > const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) > > s32 scx_bpf_pick_idle_cpu_node(int node, > > const cpumask_t *cpus_allowed, u64 flags) > > int scx_bpf_cpu_to_node(s32 cpu) > > > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > > --- > > kernel/sched/ext.c | 96 +++++++++++++++++++++++- > > tools/sched_ext/include/scx/common.bpf.h | 4 + > > tools/sched_ext/include/scx/compat.bpf.h | 19 +++++ > > 3 files changed, 117 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > > index d0d57323bcfc..ea7cc481782c 100644 > > --- a/kernel/sched/ext.c > > +++ b/kernel/sched/ext.c > > @@ -433,6 +433,7 @@ struct sched_ext_ops { > > * - scx_bpf_select_cpu_dfl() > > * - scx_bpf_test_and_clear_cpu_idle() > > * - scx_bpf_pick_idle_cpu() > > + * - scx_bpf_pick_idle_cpu_node() > > * > > * The user also must implement ops.select_cpu() as the default > > * implementation relies on scx_bpf_select_cpu_dfl(). > > @@ -955,6 +956,8 @@ static struct cpumask *get_idle_cpumask_node(int node) > > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > return idle_masks[0]->cpu; > > > > + if (node < 0 || node >= num_possible_nodes()) > > + return NULL; > > 1. This sanity should go before the check above. > 2. In-kernel users don't need to do sanity checks. 
BPF users should, > but for them you need to move it in BPF wrapper. > 3. -1 is a valid parameter, means NUMA_NO_NODE. Ok, but what would you return with NUMA_NO_NODE, in theory we should return a global system-wide cpumask, that doesn't exist with the per-node cpumasks. Maybe just return cpu_none_mask? That's what I've done in the next version, that seems safer than returning NULL. > > > return idle_masks[node]->cpu; > > } > > > > @@ -963,6 +966,8 @@ static struct cpumask *get_idle_smtmask_node(int node) > > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > return idle_masks[0]->smt; > > > > + if (node < 0 || node >= num_possible_nodes()) > > + return NULL; > > return idle_masks[node]->smt; > > } > > > > @@ -7469,6 +7474,16 @@ __bpf_kfunc u32 scx_bpf_nr_cpu_ids(void) > > return nr_cpu_ids; > > } > > > > +/** > > + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to > > + */ > > +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu) > > +{ > > + if (cpu < 0 || cpu >= nr_cpu_ids) > > + return -EINVAL; > > + return cpu_to_node(cpu); > > +} > > I believe this wrapper should be declared somewhere in > kernel/sched/topology.c, and better be a separate patch. Maybe kernel/bpf/helpers.c? And name it bpf_cpu_to_node()? > > > + > > /** > > * scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask > > */ > > @@ -7499,11 +7514,32 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask) > > */ > > } > > > > +/** > > + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking > > + * per-CPU cpumask of a target NUMA node. > > + * > > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not > > + * valid, or running on a UP kernel. 
> > + */ > > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) > > +{ > > + if (!static_branch_likely(&scx_builtin_idle_enabled)) { > > + scx_ops_error("built-in idle tracking is disabled"); > > + return cpu_none_mask; > > + } > > + if (!static_branch_likely(&scx_builtin_idle_per_node)) { > > + scx_ops_error("per-node idle tracking is disabled"); > > + return cpu_none_mask; > > + } > > Nub question: is it possible that scx_builtin_idle_per_node is enable, > but scx_builtin_idle_enabled not? From my naive perspective, we can't > enable per-node idle masks without enabling general idle masks. Or I > mislead it? In theory a BPF scheduler could set SCX_OPS_BUILTIN_IDLE_PER_NODE (without SCX_OPS_KEEP_BUILTIN_IDLE) in .flags while implementing ops.update_idle(). In this way we would have scx_builtin_idle_enabled==false and scx_builtin_idle_per_node==true, which doesn't make much sense, so we should probably handle this case in validate_ops() and trigger an error. Good catch! > > > + > > + return get_idle_cpumask_node(node) ? : cpu_none_mask; > > +} > > /** > > * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking > > * per-CPU cpumask of the current NUMA node. > > * > > - * Returns NULL if idle tracking is not enabled, or running on a UP kernel. > > + * Returns an emtpy cpumask if idle tracking is not enabled, or running on a UP > > + * kernel. > > */ > > __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) > > { > > @@ -7515,12 +7551,35 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) > > return get_curr_idle_cpumask(); > > } > > > > +/** > > + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking, > > + * per-physical-core cpumask of a target NUMA node. Can be used to determine > > + * if an entire physical core is free. > > + * > > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not > > + * valid, or running on a UP kernel. 
> > + */ > > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) > > +{ > > + if (!static_branch_likely(&scx_builtin_idle_enabled)) { > > + scx_ops_error("built-in idle tracking is disabled"); > > + return cpu_none_mask; > > + } > > + if (!static_branch_likely(&scx_builtin_idle_per_node)) { > > + scx_ops_error("per-node idle tracking is disabled"); > > + return cpu_none_mask; > > + } > > Can you add vertical spacing between blocks? You mean a blank between the two blocks, right? Anyway, ... > > Also, because you use this construction more than once, I think it > makes sense to make it a helper. With a proper error check in validate_ops() we can just get rid of the scx_builtin_idle_enabled block and simply check scx_builtin_idle_per_node. > > > + > > + return get_idle_smtmask_node(node) ? : cpu_none_mask; > > +} > > + > > /** > > * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking, > > * per-physical-core cpumask of the current NUMA node. Can be used to determine > > * if an entire physical core is free. > > * > > - * Returns NULL if idle tracking is not enabled, or running on a UP kernel. > > + * Returns an empty cumask if idle tracking is not enabled, or running on a UP > > + * kernel. > > */ > > __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void) > > { > > @@ -7569,6 +7628,35 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) > > return false; > > } > > > > +/** > > + * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node > > + * @node: target NUMA node > > + * @cpus_allowed: Allowed cpumask > > + * @flags: %SCX_PICK_IDLE_CPU_* flags > > + * > > + * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node. > > + * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu > > + * was found. 
> > + * > > + * Unavailable if ops.update_idle() is implemented and > > + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set or if %SCX_OPS_KEEP_BUILTIN_IDLE is > > + * not set. > > + */ > > +__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(int node, const struct cpumask *cpus_allowed, > > + u64 flags) > > +{ > > Sanity checks here? Indeed, thanks! -Andrea ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers 2024-12-11 20:20 ` Andrea Righi @ 2024-12-11 20:47 ` Yury Norov 2024-12-11 20:55 ` Andrea Righi 0 siblings, 1 reply; 20+ messages in thread From: Yury Norov @ 2024-12-11 20:47 UTC (permalink / raw) To: Andrea Righi; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel On Wed, Dec 11, 2024 at 09:20:56PM +0100, Andrea Righi wrote: > On Wed, Dec 11, 2024 at 09:43:26AM -0800, Yury Norov wrote: > > On Mon, Dec 09, 2024 at 11:40:58AM +0100, Andrea Righi wrote: > > > Add the following kfunc's to provide scx schedulers direct access to > > > per-node idle cpumasks information: > > > > > > const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) > > > const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) > > > s32 scx_bpf_pick_idle_cpu_node(int node, > > > const cpumask_t *cpus_allowed, u64 flags) > > > int scx_bpf_cpu_to_node(s32 cpu) > > > > > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > > > --- > > > kernel/sched/ext.c | 96 +++++++++++++++++++++++- > > > tools/sched_ext/include/scx/common.bpf.h | 4 + > > > tools/sched_ext/include/scx/compat.bpf.h | 19 +++++ > > > 3 files changed, 117 insertions(+), 2 deletions(-) > > > > > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > > > index d0d57323bcfc..ea7cc481782c 100644 > > > --- a/kernel/sched/ext.c > > > +++ b/kernel/sched/ext.c > > > @@ -433,6 +433,7 @@ struct sched_ext_ops { > > > * - scx_bpf_select_cpu_dfl() > > > * - scx_bpf_test_and_clear_cpu_idle() > > > * - scx_bpf_pick_idle_cpu() > > > + * - scx_bpf_pick_idle_cpu_node() > > > * > > > * The user also must implement ops.select_cpu() as the default > > > * implementation relies on scx_bpf_select_cpu_dfl(). 
> > > @@ -955,6 +956,8 @@ static struct cpumask *get_idle_cpumask_node(int node) > > > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > > return idle_masks[0]->cpu; > > > > > > + if (node < 0 || node >= num_possible_nodes()) > > > + return NULL; > > > > 1. This sanity should go before the check above. > > 2. In-kernel users don't need to do sanity checks. BPF users should, > > but for them you need to move it in BPF wrapper. > > 3. -1 is a valid parameter, means NUMA_NO_NODE. > > Ok, but what would you return with NUMA_NO_NODE, in theory we should return > a global system-wide cpumask, that doesn't exist with the per-node > cpumasks. Maybe just return cpu_none_mask? That's what I've done in the > next version, that seems safer than returning NULL. To begin with, you can just disallow NUMA_NO_NODE for this interface. Put a corresponding comment or warning, and you're done. On the other hand, you can treat it as 'I don't care' hint and return a cpumask for any node that has idle CPUs. Returning cpu_none_mask?.. OK, it's possible, but what does that bring? User will have to traverse empty mask just to find nothing. I'd rather disallow NUMA_NO_NODE than returning something useless. 
> > > return idle_masks[node]->cpu; > > > } > > > > > > @@ -963,6 +966,8 @@ static struct cpumask *get_idle_smtmask_node(int node) > > > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > > return idle_masks[0]->smt; > > > > > > + if (node < 0 || node >= num_possible_nodes()) > > > + return NULL; > > > return idle_masks[node]->smt; > > > } > > > > > > @@ -7469,6 +7474,16 @@ __bpf_kfunc u32 scx_bpf_nr_cpu_ids(void) > > > return nr_cpu_ids; > > > } > > > > > > +/** > > > + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to > > > + */ > > > +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu) > > > +{ > > > + if (cpu < 0 || cpu >= nr_cpu_ids) > > > + return -EINVAL; > > > + return cpu_to_node(cpu); > > > +} > > > > I believe this wrapper should be declared somewhere in > > kernel/sched/topology.c, and better be a separate patch. > > Maybe kernel/bpf/helpers.c? And name it bpf_cpu_to_node()? Sure, even better > > > + > > > /** > > > * scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask > > > */ > > > @@ -7499,11 +7514,32 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask) > > > */ > > > } > > > > > > +/** > > > + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking > > > + * per-CPU cpumask of a target NUMA node. > > > + * > > > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not > > > + * valid, or running on a UP kernel. 
> > > + */ > > > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) > > > +{ > > > + if (!static_branch_likely(&scx_builtin_idle_enabled)) { > > > + scx_ops_error("built-in idle tracking is disabled"); > > > + return cpu_none_mask; > > > + } > > > + if (!static_branch_likely(&scx_builtin_idle_per_node)) { > > > + scx_ops_error("per-node idle tracking is disabled"); > > > + return cpu_none_mask; > > > + } > > > > Nub question: is it possible that scx_builtin_idle_per_node is enable, > > but scx_builtin_idle_enabled not? From my naive perspective, we can't > > enable per-node idle masks without enabling general idle masks. Or I > > mislead it? > > In theory a BPF scheduler could set SCX_OPS_BUILTIN_IDLE_PER_NODE (without > SCX_OPS_KEEP_BUILTIN_IDLE) in .flags while implementing ops.update_idle(). > > In this way we would have scx_builtin_idle_enabled==false and > scx_builtin_idle_per_node==true, which doesn't make much sense, so we > should probably handle this case in validate_ops() and trigger an error. > > Good catch! > > > > > > + > > > + return get_idle_cpumask_node(node) ? : cpu_none_mask; > > > +} > > > /** > > > * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking > > > * per-CPU cpumask of the current NUMA node. > > > * > > > - * Returns NULL if idle tracking is not enabled, or running on a UP kernel. > > > + * Returns an emtpy cpumask if idle tracking is not enabled, or running on a UP > > > + * kernel. > > > */ > > > __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) > > > { > > > @@ -7515,12 +7551,35 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) > > > return get_curr_idle_cpumask(); > > > } > > > > > > +/** > > > + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking, > > > + * per-physical-core cpumask of a target NUMA node. Can be used to determine > > > + * if an entire physical core is free. 
> > > + * > > > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not > > > + * valid, or running on a UP kernel. > > > + */ > > > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) > > > +{ > > > + if (!static_branch_likely(&scx_builtin_idle_enabled)) { > > > + scx_ops_error("built-in idle tracking is disabled"); > > > + return cpu_none_mask; > > > + } > > > + if (!static_branch_likely(&scx_builtin_idle_per_node)) { > > > + scx_ops_error("per-node idle tracking is disabled"); > > > + return cpu_none_mask; > > > + } > > > > Can you add vertical spacing between blocks? > > You mean a blank between the two blocks, right? Yes > Anyway, ... > > > > > Also, because you use this construction more than once, I think it > > makes sense to make it a helper. > > With a proper error check in validate_ops() we can just get rid of the > scx_builtin_idle_enabled block and simply check scx_builtin_idle_per_node. But still, having a helper is better than opencoding the same 4-lines pattern again and again > > > + > > > + return get_idle_smtmask_node(node) ? : cpu_none_mask; > > > +} > > > + > > > /** > > > * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking, > > > * per-physical-core cpumask of the current NUMA node. Can be used to determine > > > * if an entire physical core is free. > > > * > > > - * Returns NULL if idle tracking is not enabled, or running on a UP kernel. > > > + * Returns an empty cumask if idle tracking is not enabled, or running on a UP > > > + * kernel. 
> > > */ > > > __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void) > > > { > > > @@ -7569,6 +7628,35 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) > > > return false; > > > } > > > > > > +/** > > > + * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node > > > + * @node: target NUMA node > > > + * @cpus_allowed: Allowed cpumask > > > + * @flags: %SCX_PICK_IDLE_CPU_* flags > > > + * > > > + * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node. > > > + * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu > > > + * was found. > > > + * > > > + * Unavailable if ops.update_idle() is implemented and > > > + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set or if %SCX_OPS_KEEP_BUILTIN_IDLE is > > > + * not set. > > > + */ > > > +__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(int node, const struct cpumask *cpus_allowed, > > > + u64 flags) > > > +{ > > > > Sanity checks here? > > Indeed, thanks! > > -Andrea ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers 2024-12-11 20:47 ` Yury Norov @ 2024-12-11 20:55 ` Andrea Righi 0 siblings, 0 replies; 20+ messages in thread From: Andrea Righi @ 2024-12-11 20:55 UTC (permalink / raw) To: Yury Norov; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel On Wed, Dec 11, 2024 at 12:47:22PM -0800, Yury Norov wrote: > On Wed, Dec 11, 2024 at 09:20:56PM +0100, Andrea Righi wrote: > > On Wed, Dec 11, 2024 at 09:43:26AM -0800, Yury Norov wrote: > > > On Mon, Dec 09, 2024 at 11:40:58AM +0100, Andrea Righi wrote: > > > > Add the following kfunc's to provide scx schedulers direct access to > > > > per-node idle cpumasks information: > > > > > > > > const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) > > > > const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) > > > > s32 scx_bpf_pick_idle_cpu_node(int node, > > > > const cpumask_t *cpus_allowed, u64 flags) > > > > int scx_bpf_cpu_to_node(s32 cpu) > > > > > > > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > > > > --- > > > > kernel/sched/ext.c | 96 +++++++++++++++++++++++- > > > > tools/sched_ext/include/scx/common.bpf.h | 4 + > > > > tools/sched_ext/include/scx/compat.bpf.h | 19 +++++ > > > > 3 files changed, 117 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > > > > index d0d57323bcfc..ea7cc481782c 100644 > > > > --- a/kernel/sched/ext.c > > > > +++ b/kernel/sched/ext.c > > > > @@ -433,6 +433,7 @@ struct sched_ext_ops { > > > > * - scx_bpf_select_cpu_dfl() > > > > * - scx_bpf_test_and_clear_cpu_idle() > > > > * - scx_bpf_pick_idle_cpu() > > > > + * - scx_bpf_pick_idle_cpu_node() > > > > * > > > > * The user also must implement ops.select_cpu() as the default > > > > * implementation relies on scx_bpf_select_cpu_dfl(). 
> > > > @@ -955,6 +956,8 @@ static struct cpumask *get_idle_cpumask_node(int node) > > > > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > > > return idle_masks[0]->cpu; > > > > > > > > + if (node < 0 || node >= num_possible_nodes()) > > > > + return NULL; > > > > > > 1. This sanity should go before the check above. > > > 2. In-kernel users don't need to do sanity checks. BPF users should, > > > but for them you need to move it in BPF wrapper. > > > 3. -1 is a valid parameter, means NUMA_NO_NODE. > > > > Ok, but what would you return with NUMA_NO_NODE, in theory we should return > > a global system-wide cpumask, that doesn't exist with the per-node > > cpumasks. Maybe just return cpu_none_mask? That's what I've done in the > > next version, that seems safer than returning NULL. > > To begin with, you can just disallow NUMA_NO_NODE for this interface. > Put a corresponding comment or warning, and you're done. Ok. > > On the other hand, you can treat it as 'I don't care' hint and return > a cpumask for any node that has idle CPUs. > > Returning cpu_none_mask?.. OK, it's possible, but what does that > bring? User will have to traverse empty mask just to find nothing. > I'd rather disallow NUMA_NO_NODE than returning something useless. I like the idea of returning a "random" node, or maybe idle_masks[numa_node_id()]? 
> > > > > return idle_masks[node]->cpu; > > > > } > > > > > > > > @@ -963,6 +966,8 @@ static struct cpumask *get_idle_smtmask_node(int node) > > > > if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > > > return idle_masks[0]->smt; > > > > > > > > + if (node < 0 || node >= num_possible_nodes()) > > > > + return NULL; > > > > return idle_masks[node]->smt; > > > > } > > > > > > > > @@ -7469,6 +7474,16 @@ __bpf_kfunc u32 scx_bpf_nr_cpu_ids(void) > > > > return nr_cpu_ids; > > > > } > > > > > > > > +/** > > > > + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to > > > > + */ > > > > +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu) > > > > +{ > > > > + if (cpu < 0 || cpu >= nr_cpu_ids) > > > > + return -EINVAL; > > > > + return cpu_to_node(cpu); > > > > +} > > > > > > I believe this wrapper should be declared somewhere in > > > kernel/sched/topology.c, and better be a separate patch. > > > > Maybe kernel/bpf/helpers.c? And name it bpf_cpu_to_node()? > > Sure, even better > > > > > + > > > > /** > > > > * scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask > > > > */ > > > > @@ -7499,11 +7514,32 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask) > > > > */ > > > > } > > > > > > > > +/** > > > > + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking > > > > + * per-CPU cpumask of a target NUMA node. > > > > + * > > > > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not > > > > + * valid, or running on a UP kernel. 
> > > > + */ > > > > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) > > > > +{ > > > > + if (!static_branch_likely(&scx_builtin_idle_enabled)) { > > > > + scx_ops_error("built-in idle tracking is disabled"); > > > > + return cpu_none_mask; > > > > + } > > > > + if (!static_branch_likely(&scx_builtin_idle_per_node)) { > > > > + scx_ops_error("per-node idle tracking is disabled"); > > > > + return cpu_none_mask; > > > > + } > > > > > > Nub question: is it possible that scx_builtin_idle_per_node is enable, > > > but scx_builtin_idle_enabled not? From my naive perspective, we can't > > > enable per-node idle masks without enabling general idle masks. Or I > > > mislead it? > > > > In theory a BPF scheduler could set SCX_OPS_BUILTIN_IDLE_PER_NODE (without > > SCX_OPS_KEEP_BUILTIN_IDLE) in .flags while implementing ops.update_idle(). > > > > In this way we would have scx_builtin_idle_enabled==false and > > scx_builtin_idle_per_node==true, which doesn't make much sense, so we > > should probably handle this case in validate_ops() and trigger an error. > > > > Good catch! > > > > > > > > > + > > > > + return get_idle_cpumask_node(node) ? : cpu_none_mask; > > > > +} > > > > /** > > > > * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking > > > > * per-CPU cpumask of the current NUMA node. > > > > * > > > > - * Returns NULL if idle tracking is not enabled, or running on a UP kernel. > > > > + * Returns an emtpy cpumask if idle tracking is not enabled, or running on a UP > > > > + * kernel. > > > > */ > > > > __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) > > > > { > > > > @@ -7515,12 +7551,35 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) > > > > return get_curr_idle_cpumask(); > > > > } > > > > > > > > +/** > > > > + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking, > > > > + * per-physical-core cpumask of a target NUMA node. 
Can be used to determine > > > > + * if an entire physical core is free. > > > > + * > > > > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not > > > > + * valid, or running on a UP kernel. > > > > + */ > > > > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) > > > > +{ > > > > + if (!static_branch_likely(&scx_builtin_idle_enabled)) { > > > > + scx_ops_error("built-in idle tracking is disabled"); > > > > + return cpu_none_mask; > > > > + } > > > > + if (!static_branch_likely(&scx_builtin_idle_per_node)) { > > > > + scx_ops_error("per-node idle tracking is disabled"); > > > > + return cpu_none_mask; > > > > + } > > > > > > Can you add vertical spacing between blocks? > > > > You mean a blank between the two blocks, right? > > Yes > > > Anyway, ... > > > > > > > > Also, because you use this construction more than once, I think it > > > makes sense to make it a helper. > > > > With a proper error check in validate_ops() we can just get rid of the > > scx_builtin_idle_enabled block and simply check scx_builtin_idle_per_node. > > But still, having a helper is better than opencoding the same 4-lines > pattern again and again Yep, makes sense. Will do that. -Andrea ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCHSET v4 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks
@ 2024-12-05 21:00 Andrea Righi
2024-12-05 21:00 ` [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
0 siblings, 1 reply; 20+ messages in thread
From: Andrea Righi @ 2024-12-05 21:00 UTC (permalink / raw)
To: Tejun Heo, David Vernet; +Cc: Yury Norov, linux-kernel
= Overview =
As discussed during the sched_ext office hours, using a global cpumask
to keep track of the idle CPUs can be inefficient and may not scale
well on large NUMA systems.
Therefore, split the idle cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on such large systems.
Scalability issues seem to be more noticeable on Intel Sapphire Rapids
dual-socket architectures.
= Test =
Hardware:
- System: DGX B200
- CPUs: 224 SMT threads (112 physical cores)
- Processor: INTEL(R) XEON(R) PLATINUM 8570
- 2 NUMA nodes
Scheduler:
- scx_simple [1] (so that we can focus on the built-in idle selection
policy and not on the scheduling policy itself)
Test:
- Run a parallel kernel build `make -j $(nproc)` and measure the average
elapsed time over 10 runs:
         avg time | stdev
         ---------+------
before:  52.431s  | 2.895
after:   50.342s  | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks helped to
achieve a speedup of approximately +4% with this particular architecture
and test case.
I've repeated the same test on a DGX-1 (40 physical cores, Intel Xeon
E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) and I didn't observe any measurable
difference.
In general, on smaller systems, I haven't noticed any measurable
regressions or improvements with the same test (parallel kernel build)
and scheduler (scx_simple).
NOTE: splitting the global cpumask into multiple cpumasks may increase
the overhead of scx_bpf_pick_idle_cpu() or ops.select_cpu() (for
schedulers relying on the built-in CPU idle selection policy) in
presence of multiple NUMA nodes, particularly under high system load,
since we may have to access multiple cpumasks to find an idle CPU.
However, this increased overhead seems to be more than compensated for
by the lower overhead when updating the idle state (__scx_update_idle())
and by the fact that CPUs are more likely to operate within their local
idle cpumask, reducing the stress on the cache coherency protocol.
= References =
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
ChangeLog v3 -> v4:
- introduce SCX_OPS_NODE_BUILTIN_IDLE to select multiple per-node
cpumasks or single flat cpumask
- introduce new kfuncs to access per-node idle cpumasks information
- use for_each_numa_hop_mask() to traverse NUMA nodes in order of
increasing distance
- dropped nodemask helpers (not needed anymore)
- rebase to sched_ext/for-6.14
ChangeLog v2 -> v3:
- introduce for_each_online_node_wrap()
- re-introduce cpumask_intersects() in test_and_clear_cpu_idle() (to
reduce memory writes / cache coherence pressure)
- get rid of the redundant scx_selcpu_topo_numa logic
[test results are pretty much identical, so I haven't updated them from v2]
ChangeLog v1 -> v2:
- renamed for_each_node_mask|state_from() -> for_each_node_mask|state_wrap()
- misc cpumask optimizations (thanks to Yury)
Andrea Righi (4):
sched_ext: Introduce per-NUMA idle cpumasks
sched_ext: Get rid of the scx_selcpu_topo_numa logic
sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE
sched_ext: Introduce NUMA aware idle cpu kfunc helpers
kernel/sched/ext.c | 427 +++++++++++++++++++++----------
tools/sched_ext/include/scx/common.bpf.h | 3 +
2 files changed, 301 insertions(+), 129 deletions(-)
* [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers
  2024-12-05 21:00 [PATCHSET v4 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
@ 2024-12-05 21:00 ` Andrea Righi
  0 siblings, 0 replies; 20+ messages in thread
From: Andrea Righi @ 2024-12-05 21:00 UTC (permalink / raw)
To: Tejun Heo, David Vernet; +Cc: Yury Norov, linux-kernel

Add the following kfuncs to give scx schedulers direct access to the
per-node idle cpumask information:

  const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
  const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
  s32 scx_bpf_pick_idle_cpu_node(int node,
                                 const cpumask_t *cpus_allowed, u64 flags)

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c                       | 85 +++++++++++++++++++++++-
 tools/sched_ext/include/scx/common.bpf.h |  3 +
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d0d57323bcfc..b3d20e396b43 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -433,6 +433,7 @@ struct sched_ext_ops {
 	 * - scx_bpf_select_cpu_dfl()
 	 * - scx_bpf_test_and_clear_cpu_idle()
 	 * - scx_bpf_pick_idle_cpu()
+	 * - scx_bpf_pick_idle_cpu_node()
 	 *
 	 * The user also must implement ops.select_cpu() as the default
 	 * implementation relies on scx_bpf_select_cpu_dfl().
@@ -955,6 +956,8 @@ static struct cpumask *get_idle_cpumask_node(int node)
 	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
 		return idle_masks[0]->cpu;

+	if (node < 0 || node >= num_possible_nodes())
+		return NULL;
 	return idle_masks[node]->cpu;
 }

@@ -963,6 +966,8 @@ static struct cpumask *get_idle_smtmask_node(int node)
 	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
 		return idle_masks[0]->smt;

+	if (node < 0 || node >= num_possible_nodes())
+		return NULL;
 	return idle_masks[node]->smt;
 }

@@ -7499,11 +7504,32 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
 	 */
 }

+/**
+ * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking
+ * per-CPU cpumask of a target NUMA node.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, if @node is not
+ * valid, or running on a UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
+{
+	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+		scx_ops_error("built-in idle tracking is disabled");
+		return cpu_none_mask;
+	}
+
+	if (!static_branch_likely(&scx_builtin_idle_per_node)) {
+		scx_ops_error("per-node idle tracking is disabled");
+		return cpu_none_mask;
+	}
+
+	return get_idle_cpumask_node(node) ? : cpu_none_mask;
+}
+
 /**
  * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
  * per-CPU cpumask of the current NUMA node.
  *
- * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
+ * Returns an empty cpumask if idle tracking is not enabled, or running on a UP
+ * kernel.
  */
 __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
 {
@@ -7515,12 +7541,35 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
 	return get_curr_idle_cpumask();
 }

+/**
+ * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking,
+ * per-physical-core cpumask of a target NUMA node. Can be used to determine
+ * if an entire physical core is free.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, if @node is not
+ * valid, or running on a UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
+{
+	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+		scx_ops_error("built-in idle tracking is disabled");
+		return cpu_none_mask;
+	}
+
+	if (!static_branch_likely(&scx_builtin_idle_per_node)) {
+		scx_ops_error("per-node idle tracking is disabled");
+		return cpu_none_mask;
+	}
+
+	return get_idle_smtmask_node(node) ? : cpu_none_mask;
+}
+
 /**
  * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
  * per-physical-core cpumask of the current NUMA node. Can be used to determine
  * if an entire physical core is free.
  *
- * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
+ * Returns an empty cpumask if idle tracking is not enabled, or running on a UP
+ * kernel.
  */
 __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
 {
@@ -7569,6 +7618,35 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
 	return false;
 }

+/**
+ * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
+ * @node: target NUMA node
+ * @cpus_allowed: Allowed cpumask
+ * @flags: %SCX_PICK_IDLE_CPU_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
+ * Returns the picked idle cpu number on success, -%EBUSY if no matching cpu
+ * was found.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set, or if %SCX_OPS_NODE_BUILTIN_IDLE is
+ * not set.
+ */
+__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(int node, const struct cpumask *cpus_allowed,
+					   u64 flags)
+{
+	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+		scx_ops_error("built-in idle tracking is disabled");
+		return -EBUSY;
+	}
+
+	if (!static_branch_likely(&scx_builtin_idle_per_node)) {
+		scx_ops_error("per-node idle tracking is disabled");
+		return -EBUSY;
+	}
+
+	return scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
+}
+
 /**
  * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
  * @cpus_allowed: Allowed cpumask
@@ -7709,10 +7787,13 @@ BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
 BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
 BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
 BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 625f5b046776..41b12f9c6b36 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -63,10 +63,13 @@ const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
 const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
 void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak;
 const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym;
+const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) __ksym;
 const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym;
+const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) __ksym;
 void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym;
 bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
 s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
+s32 scx_bpf_pick_idle_cpu_node(int node, const cpumask_t *cpus_allowed, u64 flags) __ksym;
 s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
 bool scx_bpf_task_running(const struct task_struct *p) __ksym;
 s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
-- 
2.47.1
end of thread, other threads:[~2024-12-14  6:06 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --)
2024-12-09 10:40 [PATCHSET v5 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-09 10:40 ` [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
2024-12-09 19:32 ` Yury Norov
2024-12-09 20:40 ` Andrea Righi
2024-12-10  0:14 ` Andrea Righi
2024-12-10  2:10 ` Yury Norov
2024-12-14  6:05 ` Andrea Righi
2024-12-11 17:46 ` Yury Norov
2024-12-09 10:40 ` [PATCH 2/4] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-11  8:05 ` Changwoo Min
2024-12-11 12:22 ` Andrea Righi
2024-12-09 10:40 ` [PATCH 3/4] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
2024-12-11 18:21 ` Yury Norov
2024-12-11 19:59 ` Andrea Righi
2024-12-09 10:40 ` [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-11 17:43 ` Yury Norov
2024-12-11 20:20 ` Andrea Righi
2024-12-11 20:47 ` Yury Norov
2024-12-11 20:55 ` Andrea Righi

-- strict thread matches above, loose matches on Subject: below --
2024-12-05 21:00 [PATCHSET v4 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-05 21:00 ` [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi